
What Is MTTR? A Simple Definition That Will Help Your Team

By: Ben Munat  |  December 13, 2024

Mean time to resolution (MTTR) is an essential (and arguably the most important) metric to watch in a modern, DevOps-based software development lifecycle. It tells you how quickly your company can bounce back from an unplanned outage: the total amount of time that service was interrupted, divided by the number of individual incidents.

Calculating MTTR and keeping it low will help your company make more money and keep customers happier. It’s as simple as this: take the total amount of time a system or service was unavailable in a given period and then divide that by the total number of incidents in that time frame. This gives you one number to track that applies whether you have one long downtime or lots of short ones.

Ideally, you’ll measure the downtime in minutes.

MTTR = total downtime / number of incidents

That is, unless you blacked out the eastern seaboard for weeks!

Let’s say you look at your stats for a given month and your app was down for 20 minutes on one day, an hour a few days later, and then almost made it through the month but was down 70 minutes a couple of weeks later. That’s 150 minutes over three incidents, or an MTTR of 50 minutes. That’s a very good MTTR, by the way. But the closer to zero you can get, the better.
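In code, the calculation really is that simple. Here’s a minimal Python sketch using the example numbers above (the 20-, 60-, and 70-minute outages):

```python
# Minimal sketch of the MTTR formula: total downtime divided by the number of incidents.
downtime_minutes = [20, 60, 70]  # three incidents in one month

mttr = sum(downtime_minutes) / len(downtime_minutes)
print(f"MTTR: {mttr:.0f} minutes")  # -> MTTR: 50 minutes
```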

Today we’ll learn more about MTTR, and how it will help your team deliver rock-solid reliability and performance.


Measuring MTTR Correctly

Of course, the problem with statistics is that it’s easy to make them lie, or at least mislead. You need to compare apples to apples. Was one incident simply an error affecting a percentage of users in one part of the app, while another took the whole site down? Some customer frustration is better than an hour without income.

Considering this, it’s reasonable to have separate MTTR groups for different levels of seriousness. Your defect tracking approach should include priority levels. These can map to or work in tandem with MTTR priorities.
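One hypothetical way to keep those buckets separate is to tag every incident with its priority and compute an MTTR per priority (the priority labels and numbers here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical incidents, each tagged with a priority level and downtime in minutes.
incidents = [
    {"priority": "P1", "downtime": 70},
    {"priority": "P1", "downtime": 20},
    {"priority": "P3", "downtime": 60},
]

by_priority = defaultdict(list)
for incident in incidents:
    by_priority[incident["priority"]].append(incident["downtime"])

for priority, downtimes in sorted(by_priority.items()):
    mttr = sum(downtimes) / len(downtimes)
    print(f"{priority}: MTTR {mttr:.0f} minutes across {len(downtimes)} incidents")
```

This way, a flood of minor blips can’t hide a slow response to the one outage that actually took the site down.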

It’s also possible to resolve an issue in a temporary or hacky way. But what if that fix adds technical debt or doesn’t actually address the underlying issue? Is the incident truly resolved? This is a hard question, and the answer will vary for different applications and development approaches. The most important thing is to be consistent and not compare apples to oranges.

Resolution Factors

Obviously, the most important part of fixing a problem is knowing you have one. You need to set up good alerts, watch the right things, and automatically notify the right people. This means an email or text message if the problem is serious enough.

But you also don’t want to go throwing in alerts for every possible scenario. This could result in alert noise and lead to missing serious issues because people have stopped paying attention. Things that can turn into problems down the road should not be ignored. They should be aggregated and reported at intervals. Serious issues should wake the dead.


Next, you need to alert the right people—and only the right people. Having clearly defined roles is crucial: maybe someone directing the effort, someone primarily working on the solution, and someone responsible for communicating with others. And these people need to understand their roles in advance and have communication strategies in place.

Redundancy is another important part of alerting. If the alert goes unacknowledged, is there a backup person? Rapid acknowledgment is also important so that the backup person isn’t bothered unnecessarily and everyone knows the issue is being worked on. That even has its own metric: mean time to acknowledgment.
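The escalation logic itself doesn’t need to be fancy. Here’s a hypothetical sketch of the acknowledge-or-escalate flow described above; the notify and is_acknowledged functions are placeholders for whatever paging and incident-tracking tools you actually use:

```python
import time

ACK_TIMEOUT_SECONDS = 300  # how long to wait for an acknowledgment before escalating

def notify(person, alert):
    # Placeholder: send an email, text message, or page via your alerting tool.
    print(f"Paging {person}: {alert}")

def is_acknowledged(alert):
    # Placeholder: ask your incident tracker whether someone has acknowledged the alert.
    return False

def escalate(alert, primary, backup):
    """Page the primary on-call person, then the backup if nobody acknowledges in time."""
    notify(primary, alert)
    deadline = time.time() + ACK_TIMEOUT_SECONDS
    while time.time() < deadline:
        if is_acknowledged(alert):
            return  # someone is on it, so don't bother the backup
        time.sleep(10)
    notify(backup, alert)
```

Rapid acknowledgment keeps this loop short, which is exactly what mean time to acknowledgment measures.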

The MTTR Feedback Loop

Finally, here’s a key strategy in reducing your MTTR: break things as little as possible!

Okay, that sounds ridiculously obvious. But it’s deceptively complex, and there’s a lot you can do about it. If you are deploying massive changes once a month, it’s more likely that something will break.

If instead, you are rolling out very small changes many times a day, it’s less likely that any of these will break something and even less likely it will be catastrophic. Similarly, if only certain people are allowed to deploy or fix things, it creates an artificial bottleneck and pushes you closer to a single point of failure.

Basically, the point of measuring MTTR is to help you fix things more rapidly. The secret to a sane and profitable development lifecycle is having solid processes in place for all parts of it. This means accepting that things break and then putting in a solid system for addressing that situation.

As we’ve seen, monitoring is the key to ensuring that your application is working the way you expect and letting you know when it isn’t. These days there are several services that do this for you. Retrace is a good one and worth checking out!

The Money Engine

Let’s face it, we spend phenomenal amounts of money developing software projects because they make so much money for us. Whether you’re working for a nimble startup, shooting for the next disruptive technology, or banging on one of many internal apps for an industrial behemoth, your work exists only because it will make money for the company.

Therefore, once your hard work is up and running, it’s like a money engine, pumping the green stuff into your company. As long as this works, everyone’s happy. Bonus time! When it stops, everyone’s sad, if not downright panicking. Time to update your resume?

For an important piece of software, you could be losing a significant amount of money every minute it’s down. Even for a relatively small web startup with a million dollars per year in revenue, being down for an hour means a loss of over one hundred dollars. Do you want to cough that up?
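The arithmetic behind that estimate is straightforward: a year has 8,760 hours, so even modest annual revenue translates into real money per hour of downtime.

```python
# Rough back-of-the-envelope downtime cost for the example above.
annual_revenue = 1_000_000      # dollars per year
hours_per_year = 365 * 24       # 8,760 hours

revenue_per_hour = annual_revenue / hours_per_year
print(f"Roughly ${revenue_per_hour:.0f} of revenue lost per hour of downtime")  # ~ $114
```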

But that’s a very tame example. A survey done several years ago found that an hour of downtime can cost more than $300,000! An expensive outage could lead to cutbacks, or even to the failure of your company. You want to keep your job, right?


Lessons Learned

Back in the pre-cloud, pre-DevOps days, MTTR meant something else. There were actually a number of metrics (MTBF, MTTF, etc.) that applied pretty much exclusively to hardware. Back then, calculating MTTR was more about fixing servers. For example, how fast can you get a new drive up and running in your RAID array?

The way that traditional SysAdmins kept their MTTR low was a mixture of black magic and siege warfare. Hardware manufacturers would publish failure statistics for their equipment, and admins would keep track of how long a given device had been in use. They would generally remove a device before it failed, rather than risk a downtime.

Redundancy was another key tool for survival. Instead of relying on a single drive, put a group of drives in a RAID array. Mimic this approach for any key server or device. Master and slave machines everywhere.

The most interesting aspect of this old-school approach is that it was all about avoiding downtime. That might sound logical. After all, no one wants to have an outage and lose money. But this also led to the “don’t wake the baby” syndrome: a pervasive mentality of not making changes as long as things were running.

This live-in-fear attitude went hand-in-hand with other crippling philosophies such as the waterfall development approach, monolithic architectures, and infrequent, massive deployments. Oh, and the SysAdmin was the grand wizard that kept things running and the only one allowed to touch anything!

When Clouds Are Sunshine

The advent of cloud computing brought a myriad of benefits to the modern software world. Probably the most immediately obvious of these is that your SysAdmin no longer has to keep track of hard drive life and shiver in the network operations center.

The commoditization of information technology services moved the hardware maintenance responsibilities out of your company and consolidated them in the mammoth data centers of a few tech giants. Say what you will about the loss of control this meant—anyone who would argue for tracking drive life and managing periodic server upgrades is not a reasonable human being.

Cloud computing meant that bringing another server online was as simple as running a script or even just clicking a few links on a web page. The smart players did this by using code and checking that code into version control. This increased visibility into how the company’s infrastructure worked and allowed more people to get involved. Behold the glory of DevOps!
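For instance, provisioning a new server can be a short script that lives in version control next to the application code. Here’s a hypothetical sketch using AWS’s boto3 library; the region, AMI ID, and instance type are placeholders:

```python
import boto3

# Hypothetical sketch: launch one small EC2 instance. Keeping scripts like this
# (or a declarative tool built on the same APIs) in version control makes it
# visible to the whole team how infrastructure gets created.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-00000000000000000",  # placeholder AMI ID
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)

print("Launched instance:", response["Instances"][0]["InstanceId"])
```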

MTTR as a DevOps Driver

Whichever sort of infrastructure your team deploys to, modern monitoring, agile software, and DevOps practices can help you deliver outstanding service.

As mentioned earlier, traditional system administration practices often drove teams to a culture of fear. Don’t wake the sleeping baby! But just as the trend toward agile processes allows for experimentation and course-correction in software development, so has DevOps given developers the tools they need to move quickly and efficiently.


An ideal software development lifecycle consists of small, well-measured goals broken into reasonable chunks. This should include absolute transparency about direction and process. Modern continuous integration and continuous delivery further guard against missteps. Monitoring and metrics should tell you that things are on the right track. DevOps should give you the tools to experiment and verify.

This includes embracing a blameless culture, consistently calculating MTTR, and incorporating that into a feedback loop that lets you deliver new features and recover from incidents at an extremely high rate. DORA’s State of DevOps research found that “elite” performers recovered from an incident 2,604 times faster than “low” performers. Wow!

Measure Twice, Cut Once

On the surface, monitoring is as simple as keeping an eye on something. Is this computer still running? Is this application still processing requests? How’s our CPU usage? Disk space? Memory? Determining how things are going is largely about measuring: recording metrics. But you don’t measure only your systems and applications. You also measure how your team performs its work.

You can measure a lot of development and deployment statistics. How often do you deploy new code? How many bugs make it into production? Is the app throwing a ton of errors? All these numbers tell you how your team is doing. You can even combine metrics to tell you if your customers are happy.

It’s easy to get overwhelmed with all these numbers to consider. How do you really know how you’re doing? Isn’t there one number that tells the tale?

Well, things such as the speed of the app or the customer satisfaction index tell you how you’re doing at any given moment, and you can track them over time. But the history of software is a history of mistakes. A history of breaking things.

If we accept that some failure is inevitable, then it becomes important to address and fix each failure as quickly as possible. When something breaks, how quickly do we hear about it and get it fixed? That’s what MTTR is all about.
