Even the smallest exception can cause material damage to your company’s operations, which, in turn, can have repercussions on revenue. So it’s important to have a firm plan for tracking and troubleshooting exceptions. It seems like a simple enough process: the exception is raised, you look at it, and then you make changes to your app so it never happens again. But we both know it’s never that easy.
An exception often only tells you that something went wrong. It won’t tell you why something went wrong, or how the issue occurred; only that it did. And the challenge is, you really need the why and the how to fix the problem.
You can’t simply rely on a single exception because it lacks context. In the case of troubleshooting an exception, context is the clarity required to fully assess and understand the cause of an exception. And you’re rarely going to get it with just the exception alone.
In this second article in a three-part series on simplifying and evolving your application troubleshooting, let’s look at three basic data sets (and there are more, but we’ll start with just three) that can help provide context:
- Other Exceptions – While you already know which exception started this troubleshooting process, you gain an understanding of its severity and intensity by determining how frequently it occurs, when specifically it occurs, and which other exceptions occur around the same time. Seeing exceptions across application boundaries can be especially powerful in modern distributed systems. If your web app throws an exception that’s triggered by a web service it’s consuming, you need to follow the trail to the web service and correlate the exceptions originating there. In some cases, you may be correlating exceptions across multiple apps, servers, data centers, and even different languages. It’s not always going to be easy to obtain exception data, so make sure you are working closely with IT to set up the processes and tools that let you see this info.
- Log Data – If you log extensively, your application logs have the potential to be the exposition of every method, process and task taken by your application leading up to the exception. They paint the step-by-step picture of both the how and the why. Getting log data can be just as much of a challenge, whether because IT holds the keys to the servers where your logs live, or because you run your apps in a dynamic cloud environment where servers come and go more readily, making it harder to get the logs even if you solve the IT access issue. When multiple servers are running many applications, the amount of log data to make sense of grows quickly – finding the relevant pieces fast and connecting the dots between exceptions and the relevant log statements is key.
- Application and Environment Health Metrics – All too often, an exception occurs as a result of something infrastructure-related – database timeouts, full disk, out of memory, network issues, and more can all conspire to trip up your application. Sometimes the root cause is obvious from the exception, but other times you need to correlate the error with the application and environment state to see where the aberrant behavior was at that exact moment. Furthermore, if you’re experiencing sporadic problems with your environment or a third-party resource, having error telemetry and application environment health metrics together can point out a trend of problems and their impact – extremely useful when the burden of proof is on you to get your cloud partner to fix an issue on their end that you have no control over.
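
To make the idea of correlating these three data sets concrete, here is a minimal Python sketch. It assumes that exceptions, log statements, and health metrics have already been collected into one list of timestamped events, and that related events share a request ID. The `Event` type, the `request_id` field, the `context_for` helper, and the 30-second window are all hypothetical, chosen for illustration rather than taken from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Event:
    timestamp: datetime
    source: str                       # app or server that produced the event
    kind: str                         # "exception", "log", or "metric"
    message: str
    request_id: Optional[str] = None  # hypothetical correlation ID


def context_for(error: Event, events: List[Event],
                window: timedelta = timedelta(seconds=30)) -> List[Event]:
    """Return events that share the error's request ID or that occurred
    within `window` of it, oldest first."""
    related = []
    for e in events:
        if e is error:
            continue
        same_request = (error.request_id is not None
                        and e.request_id == error.request_id)
        close_in_time = abs(e.timestamp - error.timestamp) <= window
        if same_request or close_in_time:
            related.append(e)
    return sorted(related, key=lambda e: e.timestamp)


# Made-up sample data: one exception in a web app plus surrounding events.
t0 = datetime(2024, 1, 1, 12, 0, 0)
error = Event(t0, "web-app", "exception", "HTTP 500 from order service", "req-42")
events = [
    error,
    Event(t0 - timedelta(seconds=2), "order-service", "exception", "database timeout", "req-42"),
    Event(t0 - timedelta(seconds=5), "order-service", "log", "opening DB connection", "req-42"),
    Event(t0 - timedelta(seconds=10), "db-server", "metric", "disk usage 99%"),
    Event(t0 - timedelta(hours=1), "web-app", "log", "user logged in", "req-7"),
]

for e in context_for(error, events):
    print(e.timestamp.time(), e.source, e.kind, e.message)
```

In this made-up data, the web app’s exception is surrounded by a disk-usage metric, a log statement, and a database timeout from the service it consumes, while the unrelated login an hour earlier is left out. A real setup would pull these events from wherever your exceptions, logs, and metrics actually live, and would probably need a smarter notion of "related" than a shared ID and a fixed time window.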
So, as you can see, while some exceptions are easy enough to address with little or no context, it’s far more likely you’re going to need additional data to go from just having information around what happened, to truly having some intelligence around why it happened.
In the final article of this three-part series, I’ll discuss how to go from gathering intelligence and context to truly having insight into your errors when troubleshooting.