Calculating MTTR: An Evolution Driven by the Rise of DevOps

Ben Munat Developer Tips, Tricks & Resources

The shift to cloud computing and the DevOps revolution have fueled some important changes in the way we think about software development and monitoring. Together they have delivered huge benefits to the companies that have fully embraced them.

In fact, the DevOps Research and Assessment (DORA) 2018 industry survey identified a new, small group of “elite” performers that deploy code far more often and have a far better mean time to resolution (MTTR) than the next closest group. These companies are moving at light speed compared to the dinosaurs stuck with traditional sysops on their own self-managed hardware.

MTTR is an essential (and arguably the most important) metric to watch in a modern, DevOps-based software development lifecycle. It essentially tells you how quickly your company can bounce back from an unplanned outage.

Calculating MTTR and keeping it low will help your company make more money and keep customers happier. The calculation is simple: take the total amount of time a system or service was unavailable in a given period and divide it by the total number of incidents in that period. This gives you one number to track, whether you have one long outage or lots of short ones.
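To make the arithmetic concrete, here is a minimal Python sketch of that calculation. The Incident class, its field names, and the sample timestamps are all made up for illustration; a real incident tracker will have its own data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime   # when the service became unavailable
    resolved: datetime  # when service was restored

    @property
    def downtime(self) -> timedelta:
        return self.resolved - self.started


def mean_time_to_resolution(incidents: list[Incident]) -> timedelta:
    """Total downtime in the period divided by the number of incidents."""
    if not incidents:
        # Assumption: with no incidents, report an MTTR of zero.
        return timedelta(0)
    total_downtime = sum((i.downtime for i in incidents), timedelta(0))
    return total_downtime / len(incidents)


# Three made-up outages in one month: 30, 10, and 20 minutes long.
incidents = [
    Incident(datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    Incident(datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 10)),
    Incident(datetime(2024, 1, 20, 22, 0), datetime(2024, 1, 20, 22, 20)),
]

print(mean_time_to_resolution(incidents))  # 0:20:00
```

Sixty minutes of total downtime across three incidents works out to an MTTR of 20 minutes. A single 60-minute outage in the same month would give an MTTR of 60 minutes, which is why it helps to look at the incident count alongside the average.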

In this post, we’ll discuss how DevOps and cloud computing have changed the way we measure MTTR, and how these new approaches have gone hand-in-hand with measuring MTTR to deliver solid performance, high reliability, and rapid deployment of new features.

The bad old days of MTTR

Back in the pre-cloud, pre-DevOps days, MTTR meant something else. There were actually a number of metrics (MTBF, MTTF, etc.) that applied pretty much exclusively to hardware. Back then, calculating MTTR was more about fixing servers. For example, how fast can you get a new drive up and running in your RAID array?

The way that traditional SysAdmins kept their MTTR low was a mixture of black magic and siege warfare. Hardware manufacturers would publish failure statistics for their equipment, and admins would keep track of how long a given device had been in use. They would generally remove a device before it failed, rather than risk a downtime.

Redundancy was another key tool for survival. Instead of relying on a single drive, put a group of drives in a RAID array. Mimic this approach for any key server or device. Master and slave machines everywhere.

The most interesting aspect of this old-school approach is that it was all about avoiding downtime. That might sound logical. After all, no one wants to have an outage and lose money. But this also led to the “don’t wake the baby” syndrome: a pervasive mentality of not making changes as long as things were running.

This live-in-fear attitude went hand-in-hand with other crippling philosophies such as the waterfall development approach, monolithic architectures, and infrequent, massive deployments. Oh, and the SysAdmin was the grand wizard that kept things running and the only one allowed to touch anything!


When clouds are sunshine

The advent of cloud computing brought a myriad of benefits to the modern software world. Probably the most immediately obvious of these is that your SysAdmin no longer has to keep track of hard drive life and shiver in the network operations center.

The commoditization of information technology services moved the hardware maintenance responsibilities out of your company and consolidated them in the mammoth data centers of a few tech giants. Say what you will about the loss of control this meant—anyone who would argue for tracking drive life and managing periodic server upgrades is not a reasonable human being.

Cloud computing meant that bringing another server online was as simple as running a script or even just clicking a few links on a web page. The smart players did this by using code and checking that code into version control. This increased visibility into how the company’s infrastructure worked and allowed more people to get involved. Behold the glory of DevOps!


A rainbow of services

As the cloud computing revolution spread across the IT landscape, it naturally congealed around a few traditional services. At its most fundamental, providing infrastructure as a service (IaaS) is the obvious competitor to a traditional information services department.

Instead of racking up your own machines in a data center somewhere and being totally on the hook for their functioning, you can sign up with AWS, Azure, or Google Cloud Platform and just provision virtual machines with the specs you want. You don’t need to worry about how long that hard drive will last anymore. Also, if you have a sudden surge of interest in your app, just spin up another server.

However, you still need to monitor your cluster carefully. In this scenario, MTTR may not be about drive replacement, but there is still plenty that can go wrong. You are responsible for everything from the operating system on up. Besides bugs in your code, things that could break include logs filling up disks, job queues backing up, and network services going down.
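As one small illustration of that kind of check, the sketch below uses Python's standard library to flag a volume that is nearly full, which is how runaway logs often show up. The function names and the 90% threshold are arbitrary choices for this example, not recommendations; a real setup would more likely rely on a monitoring agent.

```python
import shutil


def disk_usage_fraction(path: str = "/") -> float:
    """Return the used fraction (0.0 to 1.0) of the volume holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        # Defensive guard for unusual filesystems that report no capacity.
        return 0.0
    return usage.used / usage.total


def is_disk_nearly_full(path: str = "/", threshold: float = 0.9) -> bool:
    """True when usage meets or exceeds `threshold` (illustrative default)."""
    return disk_usage_fraction(path) >= threshold


if __name__ == "__main__":
    fraction = disk_usage_fraction(".")
    print(f"Disk usage: {fraction:.0%}")
    if is_disk_nearly_full("."):
        print("Warning: disk is nearly full; check for runaway logs.")
```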

The platform as a service (PaaS) model offloads more responsibility onto the vendor. You shouldn’t have to worry about OS issues (though you’ll have to update your version of the vendor’s product now and then). Instead, you should be able to just worry about your application code and your data.

This means that keeping your MTTR under control is almost exclusively a matter of tracking down bugs in your own application. That could mean an app logic bug causing errors or poor performance, but it could also show up as slow database queries or misbehaving code filling logs or backing up queue workers.

Further service options

The next step in the cloud services “giving up control” pyramid is the software as a service (SaaS) model. This one isn’t very interesting for our discussion, because a SaaS product is just an end-user application. You’re not going to be deploying your code to it. Still, your company may rely heavily on certain SaaS products (hello GitHub!), so you may very well wind up monitoring their availability. You should also have a plan in place for handling these services being down.

The advent of serverless or functions as a service (FaaS) is more relevant. This interesting new wrinkle in the world of infrastructure options throws everything out the window except your logic. You don’t need to know anything about the hardware or what OS your code runs on. You just fire off your function and get your result back when it’s done.

This obviously makes monitoring harder. You can’t monitor a server that you don’t have access to. You can, however, get information from the FaaS provider about how long the request took and the resources it used. Calculating MTTR for a serverless app is going to mean ensuring the function completed, and in an acceptable amount of time.
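Here is a rough Python sketch of what that check might look like. The Invocation record and its fields are hypothetical stand-ins for whatever your FaaS provider actually reports, and the one-second limit is an arbitrary example value.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    """Hypothetical per-call record; real providers expose their own fields."""
    succeeded: bool      # did the function complete without error?
    duration_ms: float   # how long the call took, in milliseconds


def is_healthy(inv: Invocation, max_duration_ms: float = 1000.0) -> bool:
    """A call is healthy if it completed and did so within the time limit."""
    return inv.succeeded and inv.duration_ms <= max_duration_ms


def unhealthy_rate(invocations: list[Invocation],
                   max_duration_ms: float = 1000.0) -> float:
    """Fraction of calls that failed or ran longer than the limit."""
    if not invocations:
        return 0.0
    bad = sum(1 for inv in invocations
              if not is_healthy(inv, max_duration_ms))
    return bad / len(invocations)


# Made-up sample: one slow call and one failed call out of four.
sample = [
    Invocation(succeeded=True, duration_ms=120.0),
    Invocation(succeeded=True, duration_ms=2500.0),   # completed, but too slow
    Invocation(succeeded=False, duration_ms=90.0),    # failed outright
    Invocation(succeeded=True, duration_ms=300.0),
]

print(unhealthy_rate(sample))  # 0.5
```

An incident would then start when this rate crosses whatever threshold your team picks and end when it drops back below it. Those start and end times are what feed into the MTTR calculation.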

MTTR as a DevOps driver

Whichever sort of infrastructure your team deploys to, modern monitoring, agile software, and DevOps practices can help you deliver outstanding service.

As mentioned earlier, traditional system administration practices often drove teams to a culture of fear. Don’t wake the sleeping baby! But just as the trend toward agile processes allows for experimentation and course-correction in software development, so has DevOps given developers the tools they need to move quickly and efficiently.

An ideal software development lifecycle consists of small, well-measured goals broken into reasonable chunks. This should include absolute transparency about direction and process. Modern continuous integration and continuous delivery further guard against missteps. Monitoring and metrics should tell you that things are on the right track. DevOps should give you the tools to experiment and verify.

This includes embracing a blameless culture, consistently calculating MTTR, and incorporating that into a feedback loop that lets you deliver new features and recover from incidents at an extremely high rate. The DORA report I mentioned earlier found that “elite” performers recovered from an incident 2,604 times faster than “low” performers. Wow!

SDLC heaven

The software industry has come a long way in the last half century. But the improvements have really taken off in the last decade or so.

The agile movement brought stability to our process, with less waste and hair-pulling. Cloud computing allowed us to stop worrying about buying and upgrading servers and instead to focus on running our business. DevOps brought the benefits of version control, code reviews, and repeatability to provisioning and maintaining infrastructure.

And application performance monitoring has given us essential tools for making sure our applications are working correctly and making our customers happy. Calculating MTTR and keeping an eye on it is one of the essential drivers of keeping this humming along.

All these tools should empower the modern development team to take ownership of their product lifecycle and rapidly iterate on delivering value. Other than that, everyone and everything should stay out of their way!

Stackify’s Retrace is one excellent option for application performance monitoring. With it, you can measure your MTTR, keep an eye on errors, look for performance issues, and a lot more.

About Ben Munat

This post was written by Ben Munat. Ben started with computers in the eighties but took a long detour through the nineties indie-rock scene. He started programming professionally in 2004, working as a consultant and for startups. Over the years, he’s used Ruby, Elixir, Java, and JavaScript. He’s worked with approximately a zillion libraries, frameworks, and APIs; solved hard problems; kept high-traffic sites running; and stomped countless bugs. He is a big fan of TDD, agile, code reviews, shared ownership, work/life balance, and having fun.