Mean time to resolution (MTTR) is the total amount of time that service was interrupted divided by the number of individual incidents:

total downtime / # of incidents

The unit of measurement is some quantity of time. Ideally, you’d use minutes. That is, unless you blacked out the eastern seaboard for weeks!
Let’s say you look at your stats for a given month and your app was down for 20 minutes on one day, an hour a few days later, and then almost made it through the month but was down 70 minutes a couple of weeks later. That’s 150 minutes over three incidents, or an MTTR of 50 minutes. That’s a very good MTTR, by the way. But the closer to zero you can get, the better.
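If it helps to see that arithmetic spelled out, here’s a minimal sketch in Python using the numbers from the example above:

```python
# A minimal sketch of the MTTR calculation from the example above.
# Durations are in minutes.
incident_durations = [20, 60, 70]  # three incidents: 20 min, 1 hour, 70 min

total_downtime = sum(incident_durations)          # 150 minutes
mttr = total_downtime / len(incident_durations)   # 150 / 3 = 50 minutes

print(f"Total downtime: {total_downtime} minutes")
print(f"MTTR: {mttr:.0f} minutes")
```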
Today we’ll learn more about MTTR, and how it will help your team deliver rock-solid reliability and performance.
Of course, the problem with statistics is that it’s easy to make them lie, or at least make them misleading. You need to compare apples to apples. Is one incident simply an error affecting a percentage of users in one part of the app, while another takes the whole site down? Some amount of customer frustration is better than a total lack of income for an hour.
Considering this, it’s reasonable to track separate MTTR numbers for different severity levels. Your defect tracking approach should already include priority levels; these can map to, or work in tandem with, your MTTR groupings.
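As a rough illustration, tracking MTTR per priority level could look something like the sketch below (the P1/P2 labels and the durations are hypothetical, not from any particular tracker):

```python
from collections import defaultdict

# A hypothetical month of incidents, each tagged with a priority level.
# The priority names and durations here are purely illustrative.
incidents = [
    {"priority": "P1", "downtime_minutes": 70},  # full outage
    {"priority": "P2", "downtime_minutes": 60},  # degraded for some users
    {"priority": "P2", "downtime_minutes": 20},
]

# Group downtime by priority, then compute a separate MTTR per group.
downtime_by_priority = defaultdict(list)
for incident in incidents:
    downtime_by_priority[incident["priority"]].append(incident["downtime_minutes"])

for priority, durations in sorted(downtime_by_priority.items()):
    mttr = sum(durations) / len(durations)
    print(f"{priority}: MTTR = {mttr:.0f} minutes over {len(durations)} incident(s)")
```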
It’s also possible to resolve an issue in a temporary or hacky way. But what if that fix adds technical debt or doesn’t really address the underlying issue? Is the incident truly resolved? This is a hard question, and the answer will vary for different applications and development approaches. The most important thing is to be consistent and not compare apples to oranges.
Obviously, the most important part of fixing a problem is knowing you have one. You need to set up good alerts, watch the right things, and automatically notify the right people. This means an email or text message if the problem is serious enough.
But you also don’t want to throw in alerts for every possible scenario. That creates alert noise, and people stop paying attention, which means serious issues get missed. Things that could turn into problems down the road shouldn’t be ignored; they should be aggregated and reported at intervals. Serious issues should wake the dead.
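As an illustration only, here’s one way that routing logic could be sketched; the severity names, the digest, and the page_on_call helper are all assumptions made up for this example:

```python
from enum import Enum

# A hedged sketch of routing alerts by severity: serious issues page someone
# immediately, minor ones get batched into a periodic digest.

class Severity(Enum):
    CRITICAL = 1  # wake the dead
    WARNING = 2   # aggregate and report at intervals

digest = []

def page_on_call(message: str) -> None:
    # Stand-in for a real paging integration (SMS, phone call, etc.).
    print(f"PAGE: {message}")

def handle_alert(message: str, severity: Severity) -> None:
    if severity is Severity.CRITICAL:
        page_on_call(message)
    else:
        digest.append(message)  # reported later, at a regular interval

def send_digest() -> None:
    if digest:
        print("Periodic digest:\n- " + "\n- ".join(digest))
        digest.clear()
```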
Next, you need to alert the right people—and only the right people. Having clearly defined roles is crucial: maybe someone directing the effort, someone primarily working on the solution, and someone responsible for communicating with others. And these people need to understand their roles in advance and have communication strategies in place.
Redundancy is another important part of alerting. If the alert goes unacknowledged, is there a backup person? Rapid acknowledgment is also important so that the backup person isn’t bothered unnecessarily and everyone knows the issue is being worked on. That even has its own metric: mean time to acknowledgment.
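To make that concrete, here’s a small illustrative sketch of computing MTTA and checking an escalation window; the timestamps and the five-minute window are invented:

```python
from datetime import datetime, timedelta
from typing import Optional

# A minimal sketch of mean time to acknowledgment (MTTA) plus a simple
# escalation rule. All values are invented for illustration.
ESCALATION_WINDOW = timedelta(minutes=5)

alerts = [
    {"fired": datetime(2024, 5, 1, 2, 0),   "acked": datetime(2024, 5, 1, 2, 4)},
    {"fired": datetime(2024, 5, 9, 14, 30), "acked": datetime(2024, 5, 9, 14, 33)},
]

def needs_escalation(fired: datetime, acked: Optional[datetime], now: datetime) -> bool:
    """If the alert is still unacknowledged past the window, notify the backup."""
    return acked is None and (now - fired) > ESCALATION_WINDOW

# MTTA: the average gap between an alert firing and someone acknowledging it.
ack_gaps = [a["acked"] - a["fired"] for a in alerts]
mtta = sum(ack_gaps, timedelta()) / len(ack_gaps)
print(f"MTTA: {mtta.total_seconds() / 60:.1f} minutes")

# An unacknowledged alert from seven minutes ago should go to the backup.
print(needs_escalation(datetime(2024, 5, 20, 3, 0), None, datetime(2024, 5, 20, 3, 7)))
```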
Finally, here’s a key strategy in reducing your MTTR: break things as little as possible!
Okay, that sounds ridiculously obvious. But it’s deceptively complex to put into practice. If you’re deploying massive changes once a month, it’s more likely that something will break.
If instead, you are rolling out very small changes many times a day, it’s less likely that any of these will break something and even less likely it will be catastrophic. Similarly, if only certain people are allowed to deploy or fix things, it creates an artificial bottleneck and pushes you closer to a single point of failure.
Basically, the point of measuring MTTR is to help you fix things more rapidly. The secret to a sane and profitable development lifecycle is having solid processes in place for all parts of it. This means accepting that things break and then putting in a solid system for addressing that situation.
As we’ve seen, monitoring is the key to ensuring that your application is working the way you expect and letting you know when it isn’t. These days there are several services that do this for you. Retrace is a good one and worth checking out!
Let’s face it, we spend phenomenal amounts of money developing software projects because they make so much money for us. Whether you’re working for a nimble startup, shooting for the next disruptive technology, or banging on one of many internal apps for an industrial behemoth, your work exists only because it will make money for the company.
Therefore, once your hard work is up and running, it’s like a money engine, pumping the green stuff into your company. As long as this works, everyone’s happy. Bonus time! When it stops, everyone’s sad, if not downright panicking. Time to update your resume?
For an important piece of software, you could be losing a significant amount of money every minute it’s down. Even for a relatively small web startup with a million dollars per year in revenue, being down for an hour means a loss of over one hundred dollars. Do you want to cough that up?
But that’s a very tame example. A survey done several years ago found that an hour of downtime can cost more than $300,000! An expensive outage could lead to cutbacks, or even to the failure of your company. You want to keep your job, right?
The history of software engineering goes back about seventy years—less than an average human lifetime. We have learned a lot of valuable lessons in that time.
We learned forty years ago that building software isn’t like building a bridge. And yet it was still a couple more decades before we really started to understand this. Along the way, we also realized that it’s important to fully test everything we do, both with automated tests and with humans actually using the application.
And the most recent revolution, DevOps, has helped us realize that deploying and maintaining our software is every bit as essential as writing the software itself and must be treated as such. Setting up our systems and keeping them running should be done with carefully written, reviewed, and tested code just like the code it supports.
DevOps grew out of traditional systems administration, and monitoring systems has long been an important part of administering them. No one wants nasty surprises at 3 AM. If you have proper monitoring set up, you should know about a problem as soon as it happens. In fact, the best monitoring will tell you about a problem before it happens. This culture of monitoring is every bit as important in the world of DevOps.
On the surface, monitoring is as simple as keeping an eye on something. Is this computer still running? Is this application still processing requests? How’s our CPU usage? Disk space? Memory? Determining how things are going is largely about measuring, about recording metrics. But you don’t measure only your systems and applications. You also measure how your team performs its work.
You can measure a lot of development and deployment statistics. How often do you deploy new code? How many bugs make it into production? Is the app throwing a ton of errors? All these numbers tell you how your team is doing. You can even combine metrics to tell you if your customers are happy.
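For a sense of what those numbers look like side by side, here’s a toy sketch; every value is invented, so plug in your own from your tooling:

```python
# An illustrative sketch of a few team-level metrics over a month.
days_in_month = 30
deploys = 42
bugs_escaped_to_production = 3
total_requests = 1_200_000
error_responses = 1_800

deploy_frequency = deploys / days_in_month            # deploys per day
escape_rate = bugs_escaped_to_production / deploys    # bugs per deploy
error_rate = error_responses / total_requests         # errors per request

print(f"Deploys per day: {deploy_frequency:.1f}")
print(f"Bugs escaped per deploy: {escape_rate:.2f}")
print(f"Error rate: {error_rate:.3%}")
```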
It’s easy to get overwhelmed with all these numbers to consider. How do you really know how you’re doing? Isn’t there one number that tells the tale?
Well, things such as the speed of the app or the customer satisfaction index tell you how you’re doing at any given moment, and you can track them over time. But the history of software is a history of mistakes. A history of breaking things.
If we accept that some failure is inevitable, then what matters is addressing and fixing that failure as quickly as possible. When something breaks, how quickly do we hear about it and get it fixed? That’s what MTTR is all about.