Carl and Richard of DotNetRocks talk to Jason Taylor of Stackify and Michael Patterson of Carbonite about the power of great code instrumentation to make awesome software. Carbonite lives on Azure and uses Stackify for instrumentation. Michael talks about how Stackify helped Carbonite understand performance problems and errors that were occurring in production and that were very hard to see from logs or customer service requests. Deeply instrumenting their application made all the difference!
Listen to the podcast
Or read the transcription:
DotNetRocks episode 1153, with guests Jason Taylor and Michael Patterson, recorded Friday, May 29th, 2015.
Welcome back to DotNetRocks. This is Carl Franklin, and this is Richard Campbell.
Carl: Our guests today: first of all, Jason Taylor, who we heard there a minute ago, has worked in a number of high-growth business units centered around delivering software as a service and is passionate about building scalable cloud platforms. The experience gained in those shops led him directly to Stackify, and those experiences helped shape the product. Jason has led many development teams through his career and is intently focused on delivering a great product while helping developers grow, learn and realize their full potential. In his free time Jason is a lover of motorcycles, cars, great beer and his family. Order counts, right?
Jason: I’m lucky that my family won’t be listening to this, so it’s a get-out-of-jail-free card.
Carl: Awesome. Michael Patterson is a principal software engineer for Carbonite who spent the last year implementing their Azure solution, consisting of websites, cloud services, table and blob storage, Redis and Stackify. How are you doing, guys?
Both: We’re doing great, it’s wonderful.
Carl: So, first of all, you can’t help but have heard of Carbonite, because you guys advertise on just about every NPR show that I listen to. I have heard about Carbonite for years and years and years, and it’s a pretty good product from what I understand. I haven’t bitten the bullet yet because I’m pretty good about backups and technically savvy and all that kind of stuff, but that’s the whole idea of Carbonite: you just plug it in and magically, in the background, all your files get backed up in the cloud, where you can download them if you need to, right?
Michael: Yeah, so we definitely are everywhere these days, and our initial market really was aimed at non-technical people who didn’t really understand how to do any backups. We’ve made a really astonishing transition over into small to medium businesses, so that they can help all of their employees who don’t really know anything about backup feel a little bit safer. You know, if somebody takes a vacation or business trip, you know, you arrive and your laptop doesn’t. We kind of help to fuel business continuity and those types of things, so it’s been a lot of fun engineering a lot of processes that enable that.
Carl: And I can imagine that you guys deal with a lot of data. I don’t know anybody who doesn’t have more than a terabyte of hard drive these days.
Michael: Yeah, I’m pretty sure we’re in the hundreds of petabytes of data backed up now.
Carl: Yeah, so that’s quite a problem to put up there. So Jason, first of all tell us about Stackify, and then we’ll see how these two things came together.
Jason: Okay, so at Stackify, we’re a fairly young company, we’re about 3 years old, and kind of our founding fathers here, we all came out of businesses where we were just fighting fires all the time. We were using an assortment of tools to determine where we had performance bottlenecks and to monitor our applications and infrastructure, and we kind of realized that there’s a problem. There are too many tools that we have to buy here, and they are too expensive, and then we have to glue them all together and make them work. So that kind of led to the birth of Stackify, and we aim to really help developers answer three questions about their applications: “Is the application down? Is it slow? Or is there some sort of weird behavior?” So we brought together a combination of server monitoring, application monitoring and code profiling, and error and log management, all into one platform. It’s easy to get started with, so you can see all these different pieces that are usually a part of troubleshooting and proactive monitoring, and really just kind of simplify the whole thing. So in theory, you know, it’s similar to what Carbonite’s done with backup. They’ve taken a lot of the complicated technical parts out of it and just have a simple, easy to get started, easy to use solution that’s all done in a SaaS model.
Carl: And so tell us how these things came together?
Jason: Last year at Stackify we started really hitting the conference circuit, getting some floor space and getting developers more aware of Stackify, and last year at DevIntersection in Las Vegas I had a group of developers from Carbonite come up. They looked at what we were doing and they kind of had this hallelujah moment. They said, “We’ve been looking for something exactly like this,” and within an hour we had them set up and running, and they had pretty much fallen in love with us. I can’t really blame them; we’re easy to fall in love with. I think Mike can probably pick up the thread on the other side of that, of what they were looking for in Stackify and why we were a good fit for them.
Carl: Because you do a suite of tools, right? Not just one or two things; you have a whole suite. And I guess, Michael, you were using individual tools from individual companies that did pieces and parts. Well, tell us about what your challenges were.
Michael: Yes, so it’s actually really interesting, because what led to Stackify is ultimately what led to us finding Stackify, in that we were using a bunch of different logging solutions, trying to find a centralized location for logging and exception management and, you know, server monitoring, whether in the data center or in the cloud. And shout out to Michele Leroux Bustamante; Jason, she’s actually the one who told us to come talk to you. I wasn’t actually able to attend DevIntersection, but when the rest of my team got back they were just kind of raving about Stackify, and I was kind of leading the charge from the implementation standpoint on my team, so I got a hold of it. I didn’t really do anything other than install the agent and the log4net appender, and it just blew my mind. Within like 5 minutes, all of the metrics that I was seeing, the streaming logs, the ability to see in near real time what’s happening inside of my applications: it just floored me. I could not believe what I was seeing and how much better it was than, you know, all of the poking around with all the tooling that we were trying to use before. So it was just awesome from the very get-go.
Carl: Okay, so let’s talk about some of the challenges that you guys have in particular. I mean, good tooling is good tooling, but your challenge is, I mean, you have high volume, a lot of data coming in and a lot of data coming out, probably more going in. What does that look like, and what’s different about you guys than most companies, you think?
Michael: First off, my team, we’re kind of in charge of the cash register, so if our systems are not working then Carbonite is not really making any money. So it’s really important that at any given moment we understand the health of our application stack, you know, from the browser all the way down to a number of different databases and everything in between. So our real challenge is, or was at the time, that we were trying to expand to different regions throughout the globe with our first application in Azure, and really trying to understand some of the differences: not only how to identify issues but how to fix them, how to write code for the cloud versus within the data center. And the real challenge once we implemented Stackify was actually digging through the enormous amount of data that we had and trying to figure out what’s good information, what’s bad information, and where the core of what’s actually important is, so that when we actually do a release or upgrade or whatever, it’s going to work flawlessly.
Carl: All right, so let’s go back to Stackify for a minute. What is the suite of tools? Without making this sound like an infomercial: there is a free tool if you use it in Azure, right?
Jason: Yeah, we’ve got an offering in the Azure Marketplace that is a completely free version. There are a couple of limits around, you know, how many different devices you can install on and that sort of thing, but it’s free to use perpetually. Absolutely, because at the end of the day, application performance monitoring is starting to become a prevalent space, and there are a few different options out there.
We think that application performance monitoring should kind of be a first-class citizen for any development team. These are tools people need, especially as developers take on more and more of the operational support of their applications. They’re no longer coding and then throwing it over the wall to an operations team and saying, “Hey, it’s your baby now.” When there are problems in production, developers are responsible for that, and they need all this data to really know how their applications are performing and why they are not performing well. And at Stackify we think that it should be accessible to everybody. It shouldn’t cost you more to monitor your applications than it does to run them.
Carl: So let’s talk about the different pieces of Stackify.
Jason: So we’ve got three main components, right? First, server monitoring. So, you know, it’s your basic server monitoring that you’re used to: knowing that the server is on, CPU, running processes. And we actually do some proxied access as well, so that you don’t have to give your development team full console access. You can get to the local file system and that sort of thing without having to get on to all your servers. After that we’ve got our application performance monitoring. So we do a couple of things there. A lot of .NET developers don’t necessarily know what they should be instrumenting and watching in the performance of their .NET apps, especially ASP.NET, right? They come to us and say, “Okay, well, I’ve got all these performance counters and all this WMI available. What should I be looking at? What’s important to my throughput and latency?” So out of the box, when our agent’s installed, we do a lot of automatic discovery. You don’t have to do a lot of configuration. We query IIS, we find your apps, we catalog them, we add in all the appropriate performance counters. So you can look at how much CPU and memory that particular application pool is taking compared to overall server CPU and memory.
We add in all the important performance counters for, you know, requests per second and queued requests and all of that good stuff. And the other thing that we just launched, it’s in public beta today and it’s going to general availability in about a week, is code profiling. So, method-level code profiling. What we look for there is basic throughput in your IIS pipeline. So how long is it taking to complete the request, and then what are some of the major things that are happening in there? So if you’re talking about an MVC application, how long it’s taking to execute the controller or compile a view. And then we look for anywhere that your app is crossing a service boundary. So you’re making an HTTP web request to another URL, or you are connecting to a database, performing database queries, talking to cache, anything that can cause a bottleneck in your application.
Carl: And speaking of bottlenecks, how do you prevent Stackify from causing bottlenecks in your application?
Jason: That’s a good question. So the way that we implement that, similar to some of the other vendors in the space, is with some CLR profiling. The .NET CLR has a profiling API that’s available, and if you write to that you can basically, at run time, inspect code that’s running and build a shadow stack and inspect object values; that’s how we see what SQL is being executed, for example. So a lot of developers have at one point in their career used a tool like ANTS to find a memory leak or a CPU problem, and you get this tremendous dump of your application at runtime, and it also makes your application run really slow, because it’s really profiling everything, every single method call in the framework. What we have done is profile just a fraction of a fraction of a fraction of everything that we could profile, to make sure that we capture just enough information to get a clear picture of what’s happening but without having a performance impact. And we put that through tons and tons of load testing in lots of different scenarios. It adds some overhead, but it’s pretty minimal; it’s a couple of milliseconds. And at Carbonite, Michael can probably talk a bit about performance and whether they’ve even noticed it.
Carl: Question, anyway: do you guys use background threading or background asynchronous processes at a low priority, that kind of thing, to get out of the way?
Jason: Yeah, absolutely. When you’re actually profiling the .NET code, I mean, that’s happening in process. You capture those method enters and leaves as they are happening. But our theory is to get in and out as fast as possible. We dump it to a log file, and then our Stackify agent is actually running in a different process, and it’s picking up all of that and queuing it up to us.
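The pattern Jason describes here (do as little as possible in process, write to a local file, and let a separate process ship the data) can be sketched roughly as follows. This is only an illustration in Python, not Stackify's profiler: the decorator `timed`, the helper `drain_spool` and the spool file path are all hypothetical names, and a real agent would need file locking or rotation so that it does not race with the application.

```python
import functools
import json
import os
import tempfile
import time

# Hypothetical spool file; a real profiler would choose its own location and format.
SPOOL_PATH = os.path.join(tempfile.gettempdir(), "method_timings.log")


def timed(func):
    """Time each call and append one JSON line describing it to the spool file."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()  # method "enter"
        try:
            return func(*args, **kwargs)
        finally:
            # method "leave": runs whether the call returned or raised
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            record = {"method": func.__qualname__, "elapsed_ms": elapsed_ms}
            # Keep in-process work minimal: one local append, no network calls.
            with open(SPOOL_PATH, "a", encoding="utf-8") as spool:
                spool.write(json.dumps(record) + "\n")
    return wrapper


def drain_spool(path=SPOOL_PATH):
    """What a separate agent process might do: read the spooled records, then clear the file."""
    if not os.path.exists(path):
        return []
    with open(path, "r", encoding="utf-8") as spool:
        records = [json.loads(line) for line in spool if line.strip()]
    os.remove(path)
    return records
```

Decorating a function with `@timed` and later calling `drain_spool()` from another process returns one record per call. The design choice being illustrated is that the slow part (queuing and uploading) never runs on the request path.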
Carl: So you also have sort of global exception aggregation, how does that work and how painful is that to set up?
Jason: Exception and log aggregation. So, same thing there. It’s pretty low performance overhead. So we built our library and REST service to handle all of that data coming in, and then you can write to our API directly. Or, we realize a lot of developers are using log management tools for exception handling in some way, so we built appenders for all the major frameworks: log4net, NLog, ELMAH. We have full support for Java, PHP and Ruby, and we support the major frameworks there as well.
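For readers who have not written one, an appender is just a plug-in for a logging framework that forwards each log event somewhere else. The sketch below shows the general idea using Python's standard `logging` module rather than log4net; it is not Stackify's library. `BatchingHandler` and its `send_batch` callback are hypothetical names, and the callback stands in for whatever would actually post the batch to a collection service.

```python
import logging


class BatchingHandler(logging.Handler):
    """Buffer log records and hand them to a callback in batches."""

    def __init__(self, send_batch, batch_size=10):
        super().__init__()
        self.send_batch = send_batch  # hypothetical callback, e.g. an HTTP POST wrapper
        self.batch_size = batch_size
        self._buffer = []

    def emit(self, record):
        # Keep only a few fields; a real appender would also carry timestamps,
        # stack traces and any custom properties attached to the record.
        self._buffer.append({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "has_exception": record.exc_info is not None,
        })
        if len(self._buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Swap the buffer out first so a failing callback cannot resend old records.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self.send_batch(batch)
```

Attaching it with `logger.addHandler(BatchingHandler(my_sender))` leaves existing logging calls unchanged, which is the appeal of the appender approach: the application keeps logging the way it already does.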
Carl: And Michael, that was what you were using, log4net, right?
Michael: Yep, we were using log4net across a bunch of our different applications. It was filling up our inboxes or taking up massive amounts of disk space, and we just couldn’t really get any real value out of it other than seeing that we have a bunch of log files or emails somewhere. So having the aggregation really made it easy to kind of prioritize what was most important, what we needed to pick off first. And that, combined with the streaming logs: every time we do a deployment now, you know, I’ll kind of sit back and watch the application, see how it’s behaving for 30 or 45 minutes after deployment, and it just makes me feel so much more comfortable when we make a production change.
Carl: I got it. So the error aggregation actually gives you some real insight by thinning out all that data, is that what you’re saying?
Michael: Yeah, definitely. So for example, it will tell me that over the last 4 hours I’ve seen this exception, say, 37 times, and you can see what log statements happened before or after that. And depending on what additional data you pass in, you can really get very, very granular into what was happening for a particular person or a particular request or child request that happened after that. It has just made troubleshooting much easier, without the hassle of having to parse through log files or try to count the number of error emails in my inbox.
Richard: Michael, when you first turned that thing on, did you get a bunch of errors you just didn’t know were happening in production?
Michael: So we actually saw a couple that we didn’t know about, but what it really surfaced was the sheer numbers: the number of times specific exceptions were occurring. That was really one of the big pieces that we couldn’t see without playing around with a bunch of different tools around the logging. We just didn’t really know if this guy was always getting this exception versus this guy who wasn’t. It made it much easier to figure out that this customer, he’s got some data that’s in a bad state, and that’s 45 percent of the errors right there, or whatever it is.
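The two views Michael mentions, how often an exception occurred in a recent window and what share of the errors a single customer accounts for, come down to simple counting once the events sit in one place. Here is a minimal sketch in Python, assuming each error event has already been reduced to a `(timestamp, exception_type, customer_id)` tuple. That event shape and the function name `summarize_errors` are assumptions made for illustration, not anything taken from Stackify.

```python
from collections import Counter
from datetime import datetime, timedelta


def summarize_errors(events, now, window=timedelta(hours=4)):
    """Count recent errors per exception type and compute each customer's share.

    `events` is a list of (timestamp, exception_type, customer_id) tuples.
    Returns (counts_by_type, percent_share_by_customer).
    """
    cutoff = now - window
    recent = [e for e in events if cutoff <= e[0] <= now]
    by_type = Counter(exc for _, exc, _ in recent)
    by_customer = Counter(cust for _, _, cust in recent)
    total = len(recent)
    # Percentage of all recent errors attributable to each customer.
    share = {c: 100.0 * n / total for c, n in by_customer.items()} if total else {}
    return by_type, share
```

With real data, a customer whose share sits far above everyone else's is the "data in a bad state" case from the conversation: one fix removes a large slice of the error volume.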
Carl: And you could specifically go into the logs and see what the errors around those exceptions are, I mean, what you were doing at that time. So I mean, that’s probably the most important thing in a tool like this: it’s one thing to gather data, it’s another not to have to plod through a bunch of it to actually find answers.
Jason: Yeah, it turns the data into information.
Carl: On the performance side you do things like that as well? So, like, Jason, if there is a slowdown in performance, can I tell pretty easily, without a whole lot of fanfare, what was causing it?
Jason: I mean, there is a lot of capability to do correlation there, and that’s something where we’re always looking for easier and faster ways to, like, bubble up the event that happened and all of the relevant contextual data that you need for that. And absolutely, by having your performance monitoring and all of your logging and exception stuff in one place, it’s really easy to zoom in to a time range to see the performance issues you are having and see the exceptions that were happening. And then with logging, being able to see who was impacted by that or who was the cause of it, that is really powerful.
Richard: All right, I want to dive into this, and I want to come at it from the angle of having an Azure website up and running today without instrumentation, because I have a bunch of choices, right? I mean, the Azure product has a bunch of Insights stuff. I want the free product; what does that do?
Jason: That’s a good question. You know, we feel that we actually do some things nicer than, say, Application Insights. App Insights works really well if you’re kind of using this trifecta: you’re using Azure (if we’re talking about Azure Websites, obviously you are), but you’re also using TFS Online, which not everybody’s going to use, and then App Insights. We think we present the data that we find in a little bit better way. We also have, I believe, easier and faster configuration and setup, because we want to work across your entire enterprise. We talk to so many people today who are running a hybrid of .NET and some Java stuff and Node.js, and they have a legacy PHP application, and then .NET developers even are picking up things like Elasticsearch. We actually use a lot of Elasticsearch ourselves. When we were at DevIntersection last week, somebody came up and said, “Do you guys know anything about monitoring Elasticsearch?” And I said, “Yeah, well, absolutely we do, because we have full support for nginx. So if you have nginx standing in front of Elasticsearch, we can monitor the heck out of that.” And for those sorts of customers who kind of have a mixed bag of technology, we work really well, so it’s one common toolset. I hate to sound like I’m plugging, but the pricing model is really simple and easy to understand. There, you pay based on the number of millions and millions of metrics your app sends. Do you know how many that is?
Richard: And how would you know? In theory, the way I set up App Insights is that I set it in free mode, which gives me 5,000,000 data points in a month, and within that month I’m going to find out pretty quickly whether I’m hitting that threshold or not. Then I’ll upgrade to the standard tier for $20 a month, and that takes me to 15,000,000 data points, and then I’ll find out if I can fit into that, and it’s about a buck 85 per million beyond the 15.
Jason: Right, but you had to do a lot of work to figure that out.
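The tiers Richard quotes are easy to turn into a small calculator, which makes the "how would you know?" question concrete. The sketch below uses only the figures as he states them in the conversation (5 million data points free, $20 a month up to 15 million, about $1.85 per million beyond that); they are his recollection at the time of recording and should not be read as current pricing. Whether overage is billed per whole million is not stated, so the `round_up_overage` flag is an assumption, and `monthly_cost` is a hypothetical helper name.

```python
import math

# Figures as quoted in the conversation; not verified against any price list.
FREE_LIMIT = 5_000_000
STANDARD_LIMIT = 15_000_000
STANDARD_PRICE = 20.00
OVERAGE_PER_MILLION = 1.85


def monthly_cost(data_points, round_up_overage=True):
    """Estimate the monthly bill in dollars for a given number of data points."""
    if data_points <= FREE_LIMIT:
        return 0.0
    if data_points <= STANDARD_LIMIT:
        return STANDARD_PRICE
    extra_millions = (data_points - STANDARD_LIMIT) / 1_000_000
    if round_up_overage:
        # Assumption: a partial million is billed as a whole one.
        extra_millions = math.ceil(extra_millions)
    return round(STANDARD_PRICE + extra_millions * OVERAGE_PER_MILLION, 2)
```

Under these assumptions, an app sending 18 million data points a month would come to $20 plus three overage millions. The hard part, as both speakers note, is not the arithmetic but knowing the data point count before the first month is over.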
Richard: Well, the bigger thing for me is configuration in general. Like, it turns out instrumenting my app isn’t actually a job I want to spend time on. I’m trying to figure out why my app isn’t performing as well as I thought, you know, those kinds of things. So I generally don’t turn on instrumentation until I’m already in trouble. So anybody who’s going to come at me and say, “Here, I’m going to discover all the bits that you’re using for you, and instrument them with the right set of metrics”... Like, I’m a performance tuning guy, and you listed off the right perf counter numbers for me, like what do I watch when I want to figure out if a website’s healthy: CPU, and on down to the .NET stack. So the four magic numbers will let me know, is it in pain? Now, it doesn’t actually tell you what the pain is at all, but if you could give me a tool where I’m going to dive into those things, then that’s the biggest problem: is it just a web server, how the network configuration is behaving, what the load balancer is doing. We look for the sort of things that you, as a developer, are going to look for, you know, whether load, server by server, is my problem, and it really helps isolate that down when I’m dealing with load.
Michael: In Azure cloud services, with Stackify, being able to click on that tab and see the breakdown, we’ve been able to get a lot of insight into that. So it’s good to still be able to spot-check and confirm that we’re getting the SLA that we’re supposed to get, and by having tools on our end that are doing this instrumentation, we’re able to hold our provider’s feet to the fire a little bit as well.
Richard: With my IT hat firmly on, this is the biggest challenge, whether it’s public cloud or not. It’s one thing to have a contract that says this is what you’re going to get. How do you translate that into numbers that you’re actually getting?
Jason: I have dealt with a lot of providers over the years, whether it’s our own colocation or someplace like Azure or AWS, and it’s exactly what you said: they advertise an SLA, but it’s still kind of up to you to make sure that they’re meeting that SLA. If they aren’t, they don’t usually make you aware of that. They don’t pick up the phone and call you and say, “By the way, we’ve missed your SLA for the last month; here is your bucket of money back.” That’s just not going to happen. You’ve got a bucket of money?
Carl: Wait a minute that was from the Canadian government to pay my AT&T bill actually.
Richard: So what if I want this all on-prem, Jason?
Jason: Oh, absolutely. So you can instrument on-prem, Azure cloud services, AWS, anywhere you can install our agent.
Carl: client side?
Jason: So you’re talking, like, client-side JavaScript code?
Jason: In theory, actually, you can. We haven’t really had anybody doing a lot of that yet, and that’s not a big focus for us. There are some differences in how we would need to present that data and aggregate it all on our side, so we just haven’t put a big focus on it. Right now it’s mainly about the server side.
Richard: It’s a different kind of error collection. Useful, but different.
Jason: Now, error collection and log aggregation, you absolutely could do that.
Richard: It’s very useful to know what errors are occurring on client machines before the complaints.
Jason: Right, absolutely, because what I should point out along with that question is you don’t have to install our agent to do error and log aggregation. You merely incorporate the library into your code, and it’s going to get to us.
Richard: Right, you guys take care of the rest. Priced accordingly.
Richard: This is not a part of the free product, either?
Jason: What, error and log aggregation? Actually, our Azure free product does have a certain amount of error and log data that you can send. I think it’s like 50 GB of log data that you can send us, but that’s all over a 30-day rolling window.
Richard: Nice. And immediately, your prices aren’t terribly expensive: 20 bucks a month for collecting all the logs and everything.
Carl: Per server. It’s a pretty good deal.
Michael: It’s been invaluable for Carbonite.
Richard: Is it expensive? That’s the real question, Michael.
Michael: No, honestly, it really hasn’t been. So we turned off the server monitoring and we continue to use the logging and error management in those environments, and then for staging and production, I mean, for the value that we get out of it, I want to say we’re paying maybe a couple hundred bucks a month. For a system as large as ours, it’s really just, you know, pennies.
Richard: You shave a few hours off the debugging, you’ve paid for the whole month. It’s ridiculous.
Carl: Yeah, that’s crazy. And you just mentioned knowing that the problem exists before the end user does.
Michael: I mean, we do that more or less routinely now. The number of calls to our customer service has dropped. You know, just overall, the health of our system has been better because of the way that Stackify surfaces this information for us.
Richard: I’ve been working with an organization whose goal for the tech support was to make more calls to customers than they received, so that when a problem occurred they could call the customer before the customer called them, to say, “Hey, we had a problem. Hey, we noticed you had a problem and we’re working on it.”
Carl: Key word being “had” a problem.
Richard: It’s like you’ve had a problem, you’ve detected it, you’ve been able to identify what it is, you’re able to get an action together and then contact the customer and say, “We’re making your life better. We know, and we’re working on it.”
Carl: I call that kind of service creepy awesome. “Thanks, I think.”
Richard: I honestly like it when my credit card company calls me and says, “Hey, are you in Chicago? We’re just seeing some transactions here,” and it’s like, oh, you’re paying attention. And I think it matters for us as service providers to be able to say, “Hey, we notice you’re struggling with this piece of the app.” There is another level here, beyond just generating errors: imagine being able to detect frustration, that behavior inside of the app of sort of going back and forth and trying to figure out how to use something. Like, you can see the signature of somebody struggling with the feature, and have your tech support people say, “Hey, I saw you struggling with this feature, can I help you?”
Jason: And do they respond, “Yeah, it’s my favorite music service”? Because we get that a lot. I have a relative who still, no matter how many times I tell this person, “No, it’s not Spotify,” is just like, “Can you give me some free music?” I don’t work at Spotify.
Carl: Just for the few minutes we have left: Jason, at DevIntersection we were talking about some of the cool things that Carbonite was doing, just in terms of pure geek appeal, like writing filesystems and doing all sorts of low-level stuff. Michael, was that you that did all that, or did you have to jump through a lot of hoops to get a system like Carbonite so fast and efficient?
Michael: No, we have a server team, and these guys are just absolutely brilliant, and their entire goal in life is to be able to compress files as absolutely small as possible given the amount of data that we store. So they handle all that, and it’s my job, from the application perspective, to make sure customers are able to actually send their data.
Carl: Pretty cool. You guys are pretty smart; you must be, to be able to handle all that data. Are there any other stories you guys have from your collaboration that our listeners might be interested in?
Michael: I think our most recent adventure together was we had a performance issue in production, and only in production, as the case always seems to be, and Jason let us know about the APM+ product, the one that’s in beta. And this was, correct me if I’m wrong, Jason, but I think this was even before it was in public beta, and he shipped us over the bits. We turned it on and within, I think, half an hour identified specifically what the cause was, and put a hot fix on it about 2 hours later. And that performance issue was really causing a lot of headaches, so all told, from start to finish, maybe like 6 hours instead of a week and a half. We also had a memory leak, and it kept rebooting every time I would try to log in and take a memory dump so I could figure out what’s going on ... and we had the problem fixed. And it’s really just, you know, time after time. Our team is relatively big and we’ve got a lot of hands moving, and having this kind of high-level view of the system all the way down to the nitty-gritty details, I think, makes everyone just breathe a lot easier, because we don’t just feel like the application is healthy, we actually know that it’s healthy.
Mechanism will be really useful.
Richard: In fact, one of the other things ... you are able to identify the number of requests. We were able to scale up piece by piece ... and able to do it in an intelligent way that ultimately saves us a lot of money. So you know what services to add more of.
Jason: We knew very specifically which services, given that we track all the requests. Certain services were kind of being hammered maybe 4 months ago ... and then, sure enough, the load-balancing strategy was just incorrect.
Carl: All right, guys, it’s been a great show. I’ve got to run. Thanks very much for talking to us; it’s been great.