Your application is perfect, flawless in every way and always works… right?
It’s more likely you spend a fair portion of your time troubleshooting problematic behavior in your applications, probably more than you’d care to. That’s not necessarily because your code is buggy; it’s because applications have become far more complex than they used to be. Your company may have moved all its servers to the cloud, so the machines hosting your apps are one more step removed from you, and the server hosting your app yesterday may be gone today. Security and privacy concerns are understandably greater than ever, making IT hesitant to give anyone in dev the ability to retrieve key information, such as log files, from the servers hosting their apps. To complicate things further, apps are far more distributed than in years past, resulting in problems that can span multiple servers, multiple data centers, even multiple continents.
So how can you simplify the process and make troubleshooting application issues easier? In this first of three articles on simplifying and evolving your application troubleshooting, I want to call out three simple steps you can take to improve your troubleshooting game.
- Gain access – In most companies today, developers get either no access to production servers or too much. Finding the middle road, one that lets them retrieve the logs, errors and other information they need without the risk of direct login, is key. This doesn’t relieve IT and developers of establishing clear responsibility and accountability; rather, it helps them work together instead of pointing fingers at each other.
- Proactively monitor – Watch your apps and their resources 24/7 so that the historical (contextual) data you need when responding to issues is there waiting for you. Error monitoring provides ‘after the fact’ notification, but constant logging of everything (with the proper tools) gives a more complete picture and helps developers find issues faster.
- Rethink app health – Your application’s behavior depends on many things: the infrastructure, whether local or in the cloud, and the software itself, where the cause could be the DB, web pages, performance or a variety of other elements. Many things can hurt your app health, including (but not limited to) a change in the software, configuration problems, a database that is down, a third-party service that is down, a spike in the number of users, bad data, or general performance problems. Are you tracking each of these elements separately, as well as the combination of them?
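To make that last question concrete, here is a minimal Python sketch of tracking each element separately while still reporting one combined status. Everything in it is an assumption for illustration: the check functions (`database_reachable`, `third_party_service_up`, `config_valid`) are hypothetical stand-ins that return canned results, and a real app would replace them with actual probes.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str = ""


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, CheckResult]:
    """Run every check on its own so one failure never hides another."""
    results = {}
    for name, check in checks.items():
        try:
            ok = bool(check())
            detail = "" if ok else "check returned False"
            results[name] = CheckResult(name, ok, detail)
        except Exception as exc:
            # A check that crashes is recorded as unhealthy, with the reason kept.
            results[name] = CheckResult(name, False, f"{type(exc).__name__}: {exc}")
    return results


def overall_health(results: Dict[str, CheckResult]) -> str:
    """Combine the individual results into a single summary line."""
    failed = sorted(r.name for r in results.values() if not r.healthy)
    return "healthy" if not failed else "unhealthy: " + ", ".join(failed)


# Hypothetical stand-ins for real probes (assumptions, not real integrations).
def database_reachable() -> bool:
    return True


def third_party_service_up() -> bool:
    raise TimeoutError("no response after 5s")


def config_valid() -> bool:
    return True


checks = {
    "database": database_reachable,
    "third_party_service": third_party_service_up,
    "configuration": config_valid,
}

results = run_checks(checks)
print(overall_health(results))
for result in results.values():
    print(f"{result.name}: {'ok' if result.healthy else 'FAILED'} {result.detail}".rstrip())
```

The per-check results answer "which element is unhealthy?", while the summary line answers "is the app as a whole healthy?". Keeping both means you can alert on the combination and still drill into the individual cause.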
In the second part of this three-part series, I’ll discuss how augmenting exceptions with log data helps bring intelligence to your troubleshooting.