When Disaster Strikes: What to Do When Your App Fails

By: Mark Billen

August 18, 2015

Imagine you moved to a house on a flood plain. Are you the type of person who would put up a flood defense system when you moved in? Or would you be out there with an armful of buckets once the rains came?

Now picture the house as an application your company is working on. In many cases, people will have insurance to protect their house from the effects of flood damage or fires, and likewise companies take the proactive approach of safeguarding their applications with troubleshooting tools.

However, for some, application monitoring and the effective management of errors and logs remain a buckets-out, all-hands-to-the-pump, “reactive” reflex. In the worst cases, apps are already launched, out in the wild, and critical to business services, revenue and reputation.

“Historically, the role of developers in their applications really kind of ended at deployment,” explains Craig Ferril, COO of the cloud-based application monitoring software, Stackify. “Then it goes to Ops and the Operations Team manages it, once there’s a problem they might tap developers on the shoulder and ask for some help but ultimately they kind of own the production application.”

Craig continues: “As applications have gotten significantly more complex developers became the only people who know how to troubleshoot these applications, it’s just time for developers to have better tools and to think a little bit differently about their applications when they go into production.”

Get an agent on the case

So what can be done when the fires ignite and more drastic action is called for? How do you quickly identify the source of the failure?

With no application monitoring or application performance management solution in place, you need to understand what is not working and how end users are affected. It starts with installing application and server monitoring, and then triaging the situation. With an integrated, all-in-one product such as Stackify, this process is simple. This SaaS-based, cross-server, cross-application suite can be installed at any stage and runs with minimal configuration on local, cloud or VM servers.

In little more than an hour, for example, this comparatively small investment of time and cost yields pivotal insight into the application stack. Right “out of the box,” the reactive fire-fighting work is underway. Lightweight, with a small resource footprint, Stackify immediately opens a window into application health. From here developers can identify whether unwanted processes are running, locate performance bottlenecks, inefficient pages and slow processes, review critical service activity, and see which resources are being used.

From this initial overview, Stackify’s deeper application performance management of database queries, code objects, queues, custom metrics, error rates and more is all accessible from one dashboard.

“What we do is really provide developers with out of the box functionality that helps them with very little effort on their part,” says Craig Ferril. “If you think about where you troubleshoot problems – looking at the database, application, servers, network, logs, errors and such – you start to develop patterns.”

What’s causing the crisis?

At a pressurized time of app crisis, the advantage of such integrated performance monitoring becomes more significant. These are the moments when reducing downtime and impact is invaluable, which places greater emphasis on getting to the heart of the problem and mitigating the disaster. One of the key benefits of Stackify, therefore, is that it integrates errors and logs, which makes it easier to identify the root cause of an issue.

Having this data in one place, synchronized to the time of the error, is very hard to achieve using separate log management and error tracking tools.

With separate tools, troubleshooters face legwork to organize and correlate when actions and responses took place. Stackify instead pulls all this information into a single pane of glass and synchronizes it by time, building a clear breadcrumb trail of what has happened and when. This is hugely important when establishing trends or patterns rapidly during heightened periods of performance concern.
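
To illustrate the idea of a time-synchronized breadcrumb trail, here is a minimal Python sketch. It is not Stackify's implementation or API. The `breadcrumbs` function, the 30-second window and the sample log entries are all hypothetical, chosen only to show how log lines can be lined up against the moment an error occurred.

```python
from datetime import datetime, timedelta


def breadcrumbs(logs, error_time, window=timedelta(seconds=30)):
    """Return the log entries recorded in the `window` leading up to
    (and including) `error_time`, oldest first.

    `logs` is a list of (timestamp, message) tuples.
    """
    start = error_time - window
    return sorted(
        (entry for entry in logs if start <= entry[0] <= error_time),
        key=lambda entry: entry[0],
    )


# Hypothetical sample data: four log lines and one error timestamp.
t0 = datetime(2015, 8, 18, 12, 0, 0)
logs = [
    (t0, "request started"),
    (t0 + timedelta(seconds=50), "db query slow: 4.2s"),
    (t0 + timedelta(seconds=58), "retrying payment call"),
    (t0 + timedelta(seconds=90), "request started"),
]
error_time = t0 + timedelta(seconds=60)

trail = breadcrumbs(logs, error_time)
for timestamp, message in trail:
    print(timestamp.isoformat(), message)
```

Only the two entries from the 30 seconds before the error make it into the trail, which is the kind of narrowing that is tedious to do by hand across separate tools.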

Follow the clues

In a reactive scenario, Stackify becomes particularly valuable in establishing root cause: the events that caused the problem. Much of the “panic” associated with this process is amplified by a lack of information. The essential questions are what is happening, and why? Indeed, Stackify’s own December Application Troubleshooting Report reveals just how much faster such questions can be answered depending on your monitoring setup.

It found that organizations with standalone tools reported 52% of issues taking half a day to determine root cause, as opposed to just 37% for those using integrated tools. What’s more, an impressive 46% of issues take just one hour to trace to root cause with integrated tools, versus 32% with standalone tools. With finding the “root cause” so vital in targeting the heart of the fire, developers and Ops can take much confidence from such efficiency.

Best practices checklist for app developers:

To reference The Hitchhiker’s Guide to the Galaxy: “DON’T PANIC.” In your app performance time of need, all is not lost, but swift action is vital.
Install a solution such as Stackify on the application server, or as many as you see fit. The speed here in “baselining” an erroneous app state with integrated, time-synchronised feedback is shown to reap dividends in identifying root cause.
Now the process of elimination can begin, with developers and Ops equally empowered to rule out and check issues off. Set up monitors on devices, third-party web services, the network, errors and logs, and specific exceptions pertaining to specific environments. If performance is being throttled you’ll need to know why, when and how error telemetry fits within such trends.
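
As a rough illustration of what an error monitor checks, the following Python sketch tracks the share of failed requests over a sliding window and flags when it crosses a threshold. It is an assumption-laden example, not Stackify's monitor configuration: the `ErrorRateMonitor` class, the window size and the threshold are hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of failed requests over the last `size` requests."""

    def __init__(self, size=100, threshold=0.05):
        # deque with maxlen drops the oldest result automatically.
        self.results = deque(maxlen=size)
        self.threshold = threshold

    def record(self, ok):
        """Record one request outcome: True for success, False for failure."""
        self.results.append(ok)

    def error_rate(self):
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def breached(self):
        """True when the current error rate exceeds the threshold."""
        return self.error_rate() > self.threshold


# Hypothetical usage: 7 successes followed by 3 failures.
monitor = ErrorRateMonitor(size=10, threshold=0.2)
for outcome in [True] * 7 + [False] * 3:
    monitor.record(outcome)

print(monitor.error_rate(), monitor.breached())
```

A real monitor would also need to alert someone and to account for time rather than request count, but the core comparison against a baseline is the same.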

It really is about building up snapshots of where your development is to address problems now and tomorrow. Ongoing application monitoring is iterative but with a solution like Stackify deployed reactively you’ll be back to “Don’t Panic!” in no time.

Moving from reactive to proactive

From this reactive state many development setups discover the best endorsement of application monitoring practices. By troubleshooting processes and tools, you are already identifying useful trends, patterns and behaviors for subsequent builds. From here you can easily instrument your code and redeploy, directing all data to monitors immediately and detecting those errors, spikes, and “smoking guns” much earlier.
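
What “instrumenting your code” can look like in practice is sketched below using only Python's standard library. This is a generic illustration rather than Stackify's agent or API: the `instrumented` decorator, the `app.metrics` logger name and the `load_orders` function are hypothetical.

```python
import functools
import logging
import time

# Hypothetical logger name; a monitoring tool would typically collect this output.
logger = logging.getLogger("app.metrics")


def instrumented(func):
    """Log how long `func` takes and log any exception it raises."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        except Exception:
            # Logs the message plus the full traceback, then re-raises.
            logger.exception("%s failed", func.__name__)
            raise
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.1f ms", func.__name__, elapsed_ms)

    return wrapper


@instrumented
def load_orders(customer_id):
    # Placeholder body standing in for a real database call.
    return [customer_id, customer_id + 1]


logging.basicConfig(level=logging.INFO)
orders = load_orders(1)
```

Because the decorator re-raises exceptions, the application behaves exactly as before; the only change is that timings and failures are now recorded where a monitor can pick them up.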

This is where things become much more proactive, reducing heavy dependence on emergency measures in future. “It can be a natural evolution,” explains Craig Ferril. “You don’t have to teach a development team overnight how to think and act like an operations team and monitor anything and everything. You can start out a little bit reactively.”

He continues: “Then over a series of iterations or releases you build this really comprehensive body of proactive monitors that thoroughly instrument, measure and monitor your application in all respects. You’ll then know quite quickly if anything at all is not quite right with your application and can pounce on it before it spirals out of control.”

This post is brought to you in collaboration with GetApp – GetApp is the leading premium business app discovery platform on the web. The site focuses on profiling established business apps — mostly software as a service (SaaS) — targeting an audience of small and medium-sized businesses and business buyers from enterprise departments. The article was written by Mark Billen.
