What Is Site Reliability Engineering and Why You Should Embrace It

Matt Watson

Software developers spend a lot of time chasing bugs and putting out production fires. I’ve been a software developer for over 15 years and it has always just been part of the job. Thanks to agile development, we are constantly shipping new code. The by-product of constant change is a steady stream of performance problems, software defects, and other issues that eat up our time.

Web applications that receive even a modest amount of traffic require constant care and feeding. This includes overseeing deployments, monitoring overall performance, reviewing error logs, and troubleshooting software defects.

These tasks have traditionally been handled by a mixture of lead developers, development management, system administrators, and, more often than not, nobody. The problem is that these critical tasks lacked a clear owner … until now.

What is site reliability engineering?

Site reliability engineering (SRE) empowers software developers to own the ongoing daily operation of their applications in production. The goal is to bridge the gap between the development team that wants to ship things as fast as possible and the operations team that doesn’t want anything to blow up in production.

In many organizations, you could argue that site reliability engineering eliminates much of the IT operations workload related to application monitoring. It shifts that responsibility to the development team itself.

“Fundamentally, it’s what happens when you ask a software engineer to design an operations function.” – Niall Murphy, Google

Site reliability engineers typically spend up to 50% of their time dealing with the daily care and feeding of software applications. They spend the rest of their time writing code like any other software developer would.

A key strength of site reliability engineers is their deep understanding of the application: the code, how it runs, how it is configured, and how it scales. That knowledge is what makes them so valuable when it comes to monitoring and supporting the application in production.

Some of the typical responsibilities of a site reliability engineer:

  • Proactively monitor and review application performance
  • Handle on-call and emergency support
  • Ensure software has good logging and diagnostics
  • Create and maintain operational runbooks
  • Help triage escalated support tickets
  • Work on feature requests, defects and other development tasks
  • Contribute to overall product roadmap
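To make the first responsibility a little more concrete, here is a minimal Python sketch of a proactive health check over a batch of recent requests. Everything in it is an assumption for illustration: the `Request` shape, the `health_report` function, and the thresholds (500 ms for the 95th percentile latency, 1% for the error rate) are hypothetical and are not taken from Retrace or any other monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    status: int


def p95(values):
    """Return the 95th percentile using the nearest-rank method."""
    ordered = sorted(values)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def health_report(requests, max_p95_ms=500.0, max_error_rate=0.01):
    """Summarize latency and error rate, and list any breached thresholds.

    The default thresholds are assumptions; pick values that fit your app.
    """
    if not requests:
        return {"p95_ms": None, "error_rate": None, "alerts": []}

    latency = p95([r.latency_ms for r in requests])
    # Count server-side failures (HTTP 5xx) as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    error_rate = errors / len(requests)

    alerts = []
    if latency > max_p95_ms:
        alerts.append("p95 latency above threshold")
    if error_rate > max_error_rate:
        alerts.append("error rate above threshold")

    return {"p95_ms": latency, "error_rate": error_rate, "alerts": alerts}


if __name__ == "__main__":
    sample = [Request(latency_ms=120, status=200), Request(latency_ms=900, status=503)]
    print(health_report(sample))
```

In practice a monitoring product does this work for you. The point of the sketch is only that someone on the team has to decide what "healthy" means and look at the numbers before users complain.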

History of site reliability engineering


The concept of site reliability engineering started in 2003 within Google. As Google grew into the massive company it is today, it ran into plenty of growing pains. Its challenge was how to support large-scale systems while also introducing new features continuously.

To accomplish the goal, they created a new role that had the dual purpose of developing new features while also ensuring that production systems ran smoothly. Site reliability engineering has grown significantly within Google and most projects have site reliability engineers as part of the team. Google now has over 1,500 site reliability engineers.

Site reliability engineering vs DevOps

So, I know what you are thinking … how does site reliability engineering compare to DevOps?

Traditionally, DevOps has been more about collaboration between developers and operations. It has also focused more on deployments. Site reliability engineering is more focused on operations and monitoring. Depending on how you define DevOps, the two could be closely related or not.

At Stackify, we have hundreds of servers and we don’t even have an IT operations team. So when I think of DevOps, I actually think about the functions of site reliability engineering. For other companies like us who were born in the cloud and heavily use PaaS services, I believe they will also see site reliability engineering as the missing element to their development team success. We effectively operate as a NoOps team.

For larger companies or companies who don’t use the cloud, I could see them using both DevOps and site reliability engineering. DevOps practices can help ensure IT helps rack, stack, configure, and deploy the servers and applications. The site reliability engineers can then handle the daily operation of the applications. They also work as a fast feedback loop to the entire team about how the application is performing and running in production.

Site reliability engineering skills

The type of skills needed will vary wildly based on your type of application, how and where it is deployed, and how it is monitored. At Stackify, most of our applications are deployed to Azure PaaS with a little PowerShell. In-depth knowledge of Windows or Linux systems management isn’t much of a priority for us. We live in a pretty serverless world at Stackify. However, that systems knowledge may be really critical to your team depending on how your applications are deployed.

The other key skills for a good site reliability engineer are more focused on application monitoring and diagnostics. You want to hire people who are good problem solvers and have a knack for finding problems. Experience with application performance management tools like Retrace, New Relic, and others would be really valuable. They should be well versed in application logging best practices and exception handling.
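As a rough illustration of what good logging and exception handling can look like, here is a small Python sketch using the standard `logging` module. The `process_order` function, the `charge` callback, and the `orders` logger name are hypothetical and exist only for this example. The idea is to log enough context to diagnose a failure, record the full stack trace once, and let the exception propagate instead of swallowing it.

```python
import logging

# Hypothetical logger name; use one per module or component.
logger = logging.getLogger("orders")


def process_order(order_id, charge):
    """Run a (hypothetical) charge step with diagnosable logging.

    `charge` is any callable that takes the order id and may raise.
    """
    logger.info("processing order order_id=%s", order_id)
    try:
        result = charge(order_id)
    except Exception:
        # logger.exception logs at ERROR level and attaches the stack trace.
        logger.exception("order failed order_id=%s", order_id)
        # Re-raise so callers and monitoring tools still see the failure.
        raise
    logger.info("order completed order_id=%s", order_id)
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    process_order("A1", lambda order_id: "ok")
```

Including the order id in every message is the important design choice here: when something breaks in production, you want to be able to search the logs for one identifier and see the whole story.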

The future of site reliability engineering

Software developers are increasingly taking a larger role in deployments, production operations, and application monitoring. The tools available today make it extremely easy to deploy our applications and monitor them. Things like PaaS and application monitoring solutions like Retrace make it easy for developers to own their projects from ideation all the way to production.

I believe that IT operations will always exist in most medium to large enterprises. But I believe their type of work will continue to change because of the cloud, PaaS, containers, and other technologies. I previously wrote about divvying up the development and operations tasks.

Summary

As a developer who has been writing code for over 15 years, I feel like I have always been a site reliability engineer, but I just didn’t have the job title. In the future, I think every team will have site reliability engineers who take ownership of production operations. Thanks to the cloud and application monitoring tools like Retrace, there has never been a better time to be a software developer.

If you want to learn more about site reliability engineering, you can check out the free online book from the Google team.

About Matt Watson

Matt is the Founder & CEO of Stackify. He has been a developer/hacker for over 15 years and loves solving hard problems with code. While working in IT management, he realized how much of his time was wasted trying to put out production fires without the right tools. He founded Stackify in 2012 to create an easy-to-use set of tools for developers.