Mean time to recovery is one of the DORA metrics, a set of five measures that gauge the productivity, velocity, and efficiency of your software development teams.
In this article, we’ll discuss how to measure your mean time to recovery, what kinds of deployments count as a failure, and how to reduce your rate of failure by instituting best practices.
What is mean time to recovery?
Mean time to recovery most often refers to the average amount of time it takes to resolve a failure in your software. Along with other DORA metrics like change failure rate and lead time for changes, mean time to recovery (MTTR) reflects the reliability and expected throughput of your platform, making it a key metric for service-level agreements and other common software expectations.
However, the “R” in MTTR can refer to one of four potential metrics, each of which carries a different definition:
- Repair: The amount of time it takes to push a repair to production
- Recovery: The amount of time to fully restore the service
- Respond: The amount of time it takes someone on your team to respond to the incident
- Resolve: The amount of time it takes, in business hours, to remedy one incident or the cause of repeated incidents
As we’ll discuss shortly, which of the four Rs you use in your definition will affect the metrics you gather, so it’s important to decide on a definition collectively with your team.
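To make the four definitions concrete, here's a minimal sketch in Python that models one incident's timeline and derives three of the Rs from its timestamps. The field names are hypothetical, not drawn from any particular incident management tool:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical timeline fields -- one way to capture the
    # timestamps each "R" needs, not a standard schema.
    started_at: datetime   # when the outage began (or was first reported)
    notified_at: datetime  # when the monitoring tool alerted the team
    repaired_at: datetime  # when the fix was pushed to production
    restored_at: datetime  # when the service was fully restored

    @property
    def time_to_respond(self) -> timedelta:
        return self.notified_at - self.started_at

    @property
    def time_to_repair(self) -> timedelta:
        return self.repaired_at - self.notified_at

    @property
    def time_to_recovery(self) -> timedelta:
        return self.restored_at - self.started_at

    # Time to resolve is omitted here: counting business hours (and
    # grouping repeated incidents by root cause) needs calendar logic
    # of its own.
```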
Why is it important to measure MTTR?
MTTR, when it refers to recovery, is a useful metric for gauging how quickly, on average, your team handles incidents and brings software back to full operational status. As a result, MTTR sometimes appears in service-level agreements (SLAs) to indicate software quality and to guarantee that any service outages or incidents will be repaired or restored within a given timeframe.
The average duration of a service outage or interruption can be helpful information for smaller or newer teams trying to reach platform stability, as well as for bigger teams managing large service loads.
At companies with large engineering teams and thousands of developers, site reliability engineers (SREs) can use MTTR as a KPI to track performance and suggest improvements to code quality, documentation, or service management workflows. Tracking MTTR also makes it easier to bring multiple teams in line with a set of standard expectations for software quality and development.
A shorter MTTR typically indicates a stable platform and a nimble development team that can manage incidents quickly. Shorter MTTRs often correlate with a reliable experience for end users, which in turn correlates with repeat usage and lower customer acquisition costs. A longer MTTR may indicate that your team needs better alerting in place or other improvements at the code level. We’ll discuss these factors later in this article.
How to measure MTTR
As with most DORA metrics, there is some subjectivity in determining what constitutes an outage. One company’s definition of an outage may differ significantly from another’s, and each comes with its own set of consequences. Gather your team to define outages before proceeding; some definitions our team has seen in the industry include:
- Incidents pulled from incident management software
- Alerts from observability tools
- Urgent bugs from project tracking software (e.g., Jira)
- Internal escalations
Mean time to recovery is an average across several incidents in a given period of time, so it is key to track your time to recovery for each incident before measuring MTTR. To calculate time to recovery for each incident, you’ll want:
- The exact time of the reported outage
- The time of notification from your monitoring tool
- The time of full service restoration
Let’s say our outage began at 3 pm, we were notified at 3:02 pm, and service was restored at 6:15 pm. Our total time to recover from this incident is three hours and 15 minutes.
We can break this down further using the other definitions we discussed as well:
- Our time to repair spans from when the team received the outage notification to when the repair was completed — a total of three hours and 12 minutes
- Our time to respond is relatively brief — a total of two minutes — so we can conclude that our example company has an effective monitoring and alerting system for incidents
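To make this arithmetic concrete, here's a small Python sketch that computes time to recovery per incident and averages the results into MTTR. The first pair of timestamps matches the example above; the other two incidents are invented purely to illustrate the averaging step:

```python
from datetime import datetime, timedelta

# (outage start, full restoration) pairs; the first pair is the
# 3 pm -> 6:15 pm example above, the other two are invented.
incidents = [
    (datetime(2024, 5, 1, 15, 0), datetime(2024, 5, 1, 18, 15)),   # 3 h 15 m
    (datetime(2024, 5, 9, 9, 30), datetime(2024, 5, 9, 10, 0)),    # 30 m
    (datetime(2024, 5, 20, 22, 5), datetime(2024, 5, 21, 0, 5)),   # 2 h
]

# Time to recovery for each incident, then the mean across all of them.
recovery_times = [restored - started for started, restored in incidents]
mttr = sum(recovery_times, timedelta()) / len(recovery_times)

print(mttr)  # 1:55:00, i.e., an MTTR of one hour and 55 minutes
```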
To measure your total time to resolve an incident or outage, you may also need to account for external factors, such as the time it takes a third-party service to repair an issue or the time required to update your dependencies. Though these factors may not reflect the quality of your own team’s software, which is half the goal of DORA metrics, they can help you identify problematic tools or workflows that need to be rethought.
Improve your mean time to recovery
Recovery encompasses several potential areas for improvement depending on your team’s performance:
- Issue detection, whether manual or automated, and whether you rely on customers to report issues
- Your alerting systems
- The quality of your code, testing, and deployment pipeline
- Communication channels between your clients and on-call developers
Software solutions exist for issue detection and alerting, but even if you have these in place, it may be worth revisiting your systems to ensure notifications aren’t muted or routed to disengaged users. Routing alerts to the right people is key, but you may also want to create a ranking system that indicates the severity of each failure or issue.
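As one sketch of that idea, the snippet below pairs severity tiers with notification routes. The tier names and channels are hypothetical, not tied to any particular alerting product:

```python
from enum import Enum

class Severity(Enum):
    SEV1 = "full outage"       # page the on-call engineer immediately
    SEV2 = "degraded service"  # notify the owning team's channel
    SEV3 = "minor defect"      # queue a ticket for the next triage

# Illustrative routing table: the higher the severity, the noisier the route.
ROUTES = {
    Severity.SEV1: ["on-call pager", "incident channel"],
    Severity.SEV2: ["team channel"],
    Severity.SEV3: ["ticket queue"],
}

def notify(severity: Severity, message: str) -> None:
    for route in ROUTES[severity]:
        # Stand-in for a real integration (PagerDuty, Slack, etc.)
        print(f"[{route}] {severity.name}: {message}")

notify(Severity.SEV1, "checkout service returning 500s")
```

Tagging each alert with a tier like this keeps low-severity noise from drowning out the pages that actually drive your time to respond.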
Code quality, though not directly part of mean time to recovery, plays a huge role in new failures and outages. If code isn’t properly tested or reviewed before release, it can lead to rollbacks, failures, and other incidents. Similarly, if core dependencies aren’t managed appropriately, they can lead to failures that are more difficult to detect. Improving the quality of your code and creating standards around software development expectations can make a positive impact across your MTTR measurements.
Use an internal developer portal to track your MTTR
There are many ways to surface data related to failures, and using an internal developer portal can make it easy to aggregate and study data from multiple sources.
With Port, your entire software stack is accessible through a single pane of glass. This makes it easy to draw on many sources during your research for a more comprehensive picture. Port also reduces the need to switch between multiple contexts to gather data, and you can track all of your metrics in custom views by project, service, or persona.
In the below three screenshots (and in this view in our live demo), you can see a custom dashboard built for on-call developers, with dedicated panes showing current incidents, who is on call, and options for communicating and managing the incident:

What Port makes even easier is reconciling your performance against your standards. With all of your data living in one place, you can adopt holistic measurement and improvement practices using Scorecards, which allow you to set standards for things like minimum pull request reviews or incident response times, and gauge your team’s performance against them.
But this is not the only benefit of using a portal: your development team also gains the autonomy to act on incidents independently. Each option for communicating the incident is a self-service action, built by platform or DevOps engineers to abstract away the complexity of putting in tickets to perform common tasks like triggering incidents, opening new cases, or making announcements:

Then, lower down on the page, the on-call developer can choose from another set of self-service actions, also known as day-2 operations, which include requesting access to new services, rolling back a problematic deployment, or re-syncing an application after a change has been made:

This saves developers time and tickets — they no longer have to wait for a response from their SRE or DevOps engineer, but can begin work immediately, shortening not only their time to respond to an incident, but their overall mean time to recovery.
Port also allows you to create custom dashboards for developers that help them plan their days and monitor their own work performance. In the below screenshot is a view of a developer’s Home page, which shows the scorecards they are responsible for, among other things:

Engaged developers are developers who can easily see where their efforts go and how much impact they have on the development of new software — with scorecards easily available to view, developers gain:
- More context on the results of their work
- More clarity on where they need to improve and where they are successfully meeting expectations
- More autonomy, via self-service actions, to address performance issues or gaps independently
To learn more about scorecards, self-service actions, or how an internal developer portal can help you reduce MTTR, book a demo with us.