Founder & CEO / Stackify
One of my favorite sayings is “Software is like fashion, there is no final version.” This statement is very true, and is also correct when it comes to application performance. As long as you keep changing your application code and/or the usage of it increases, you will forever find new problems. Software only becomes very stable if you stop changing the code, fix all the bugs, and the usage of it does not grow much.
Problem #1: Large object heap & garbage collection
We receive a ton of metrics, profile traces, errors, and log data as part of our service. Most of which all comes in via JSON. So we are always battling with how to optimize the usage of those large strings. We try to keep them as streams as much as possible and avoid unnecessarily deserializing the JSON. But we still battle garbage collection problems.
How did we find this problem? High memory & CPU usage quickly raises an eyebrow, but really identifying the problem required watching some .NET performance counters around GC. % GC Time being one of the most important to always watch. We monitor these metrics via Stackify’s ability to monitor performance counters.
Solution #1: Optimizing string usage and large object heap settings
As data comes into our system, we queue it up. As part of that process, we compress the JSON and have to break it into pieces to optimize how we queue it in Azure table storage. When we read the data back, we found that we were decompressing and reassembling it back to a string before actually deserializing the JSON. We were able to write some tricky code to read the pieces as a stream that saved a single large string allocation every time. This helped a ton with garbage collection! Large strings are always a performance killer.
The big problem with constant usage of the large object heap is memory fragmentation. Luckily, there is a new feature in .NET 4.5.2 that optimizes and compacts the large object heap. As much as we would love to avoid it, we can’t completely. This optimization has helped our GC time a lot.
Problem #2: Too many cache calls
We use a lot of shared cache locking (via redis) within our entire system. We use it to ensure that certain code is not executed more than once or to control how often it is run. This locking is key to allowing us to do a very high level of event processing in our backend worker roles.
Our code utilizes a simple 1 second timeout in most places when connecting to redis. We also optionally built in some retry logic in case it fails. Our code that is supposed to keep retrying, kept retrying when a known failure occurred due to the cache key being locked. This caused many more cache calls to happen than was supposed to. Probably an order of magnitude more often. Whoops! 🙁
So how did we find this problem? We of course use our APM product, Retrace, to profile our own apps, but just recently built support for non web apps, which we are dogfooding now.
By the way, we will soon be releasing support to our clients so they can also track the performance of their Windows Services, Azure Worker Roles, WebJobs and more. We are also getting ready to release Prefix Premium which will also be able to profile these types of apps on your dev box. Stay tuned!
Whole lot of cache calls… and a waste of a second (screenshot from Stackify Retrace).
Check out our previous blog post about distributed method mutexing: Distributed Method Mutex – How To Prevent Duplicate Event Processing At Scale
Solution #2: Fixing the cache retry logic
Solving this problem was very simple. We just fixed the retry logic to accept that the cache was locked after the first attempt and not count that as a failed attempt.
Problem #3: Higher monitoring agent CPU usage
As part of the v3 release of our Windows agent, we noticed CPU usage was higher than normal for some apps and some clients. The vast majority of the CPU usage by our agent is from parsing the JSON output of our profiler and from parsing SQL statements. We parse the SQL statements to remove any sensitive data and normalize the query so that we can uniquely identify a query. As part of Retrace (our dev APM product, we need to be able to provide reporting for our clients so they uniquely know which queries are used and how often.
To dig into the root cause of the higher CPU usage, we used the Visual Studio profiler to reprocess profiler output from several apps. This is something we do on a regular basis always looking to lower the CPU and memory usage of our agents.
Solution #3: Optimizing SQL parsing caching
In v3 we made a change to start tracking the names of SQL parameters. We found that this was messing up our caching that occurred in our SQL parsing logic. A couple simple changes fixed that issue. We also found that we had never really optimized the caching of batch INSERT statements, which is something we actually do a lot of in some parts of our apps. We figured out how to properly cache how those are parsed as well. BTW, if you want to help us when it comes to lowering CPU usage of our agent, please use parameterized queries! 😉
The 3 problems we highlighted in this article are all related to either code changes, or our growth. Our company is rapidly growing and so is the usage of our services. We will continue to find scalability problems along the way. Our redis caching issue, for example, manifested itself as high redis CPU usage, a problem that would have only occurred due to our rapid growth.
If your app is constantly changing or usage is increasing, it is critical that you have good tools in place for monitoring and finding the root cause of performance problems.
Also check out this previous blog post: ASP.NET Performance: 9 Types of Tools You Need to Know!