When it comes to data, let’s start with the obvious. Averages suck. As developers, we all know that percentiles are much more useful. Metrics like P90, P95, P99 give us a much better indication of how our software is performing. The challenge, historically, is how to track the underlying data and calculate the percentiles.
Today I will show you how amazingly easy it is to aggregate and create SQL based percentile rollups with Postgresql and t-digest histograms!
At Stackify, we deal with big data. We help developers track the performance of thousands of applications and billions of data points daily with Retrace. We track how long every transaction in your application takes to load. Plus, every SQL query, web service call, browser load timing, etc, etc.
The problem is, to properly calculate percentiles, you need all the raw data. So if your app runs on 100 servers and has 100 transactions a second, over the course of the month you need roughly 30,000,000,000 data points to then calculate a monthly percentile for a single performance metric.
What you can’t do is take percentiles every minute or hour from individual servers and then average them together. Averages of percentiles is a terrible idea and don’t work.
Let’s start with the basics. How do we track 30,000,000,000 data points just to calculate one simple P99 number for your custom dashboard?
The way to do this is with a data science technique called by many names and strategies. But most usually described as histograms, data sketches, buckets, or binning. People also use other types of algorithms. At the end of the day, they are all based on sampling and approximating your data.
A histogram is the most common term. Here is how wikipedia describes it:
“To construct a histogram, the first step is to “bin” (or “bucket”) the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.”
So basically you can create 100 “buckets” and sort those 30 billion data points into those buckets. Then you can track how many are in each bucket and the sum of the values in each bucket. Based on that information, you can approximate things like a median, P90, P95 with a pretty high degree of accuracy.
At Stackify, we are always looking at different ways to ingest and calculate performance data. We recently ran across a very cool Postgresql extension called tdigest that implements t-digest histograms. Ted Dunning originally wrote a white paper on T-Digest back in 2013. It has slowly grown in popularity since then.
T-Digest is a high-performance algorithm for calculating percentiles. There are implementations of it in many languages, including node.js, python, java, and, most importantly, Postgresql.
I will walk you through some high-level basics of how it works to give you some basic understanding.
T-Digest works by dynamically calculating “centroids.” Think of these like buckets, but they are basically key data points spread across your data. As you add the first data points, it dynamically evaluates what the centroids should be and adapts as you continue to add more data. It’s a little magical.
Here is an example of what a t-digest looks like:
flags 0 count 37362 compression 100 centroids 51 (0.164000, 1) (0.165000, 1) (0.166000, 1) (0.166000, 1) (0.167000, 1) (0.504000, 3) (0.843000, 5) (1.185000, 7) (2.061000, 12) (1.915000, 11) (3.437000, 19) (7.813000, 40) (11.765000, 57) (15.448000, 72) (24.421000, 109) (49.816000, 211) (88.728000, 346) (147.814000, 538) (260.275000, 907) (420.212000, 1394) (679.826000, 2153) (854.042000, 2577) (1495.861000, 3815) (3435.648000, 5290) (3555.114000, 4491) (3366.077000, 4198) (3474.402000, 3748) (2631.066000, 2593) (1809.314000, 1773) (980.488000, 956) (1692.846000, 781) (106168.275000, 473) (166453.499000, 233) (168294.000000, 211) (87554.000000, 109) (59128.000000, 73) (42188.000000, 49) (28435.000000, 29) (20688.000000, 21) (14902.000000, 15) (11462.000000, 11) (9249.000000, 8) (5832.000000, 5) (4673.000000, 4) (3511.000000, 3) (2345.000000, 2) (1174.000000, 1) (1174.000000, 1) (1174.000000, 1) (1174.000000, 1) (1176.000000, 1)
In this example, I have 37,362 data points spread across 51 centroids, with a max of 100 centroids. Each of the data points is the sum of the values in the bucket and how many items are in the bucket. So something like (3435.648000, 5290) means there are 5290 data points, and they add up to 3435 and would be 0.649 on average.
Based on these buckets, the t-digest library can quickly calculate any percentiles across these 37,362 data points in a few nanoseconds.
Alright, let’s get into some SQL code! Most SQL databases have percentile functions built into them, like PERCENTILE_DISC and PERCENTILE_CONT. You can also use tdigest to do these types of functions. It can take basic doubles to do this, or can you query across multiple tidgest histograms. The t-digest extension allows you to use tdigest as a special column data type.
You would first need to install the tdigest extension on your Postgresql server.
Here is a simple example of using the tdigest extension to get the 99th percentile from your raw data. 100 represents the maximum number of buckets to use and therefore sets a high accuracy. This is essentially similar to using PERCENTILE_DISC.
SELECT tdigest_percentile(your_column, 100, 0.99) from rumapmevent Output: 799.5183845699754
So, that is a simple example to get us started, but we need to focus on how to digest 30 billion metrics a month aggregated and sliced in a bunch of different ways.
If you are creating a true analytics database as Stackify has for Retrace, you need to aggregate your data together using different types of metrics and different timeframes.
Here is an example of calculating percentiles by tenant_id and by day.
SELECT tenant_id, date_trunc('day', trans_utc), tdigest_percentile(total, 100, 0.99) from rumapmevent group by tenant_id, date_trunc('day', trans_utc)
But to really digest billions of metrics a month and provide amazing dashboards, we don’t really want P99 percentiles, because we can’t turn daily percentiles into a monthly percentile. We need the actual histograms to do that, and ultimately we want to save them off to rollup tables.
We need to create and store raw histograms!
If we take the last query and tell it to create a tdigest data type instead, we can now get real histograms.
SELECT tenant_id, date_trunc('day', trans_utc), tdigest(total, 100) from rumapmevent group by tenant_id, date_trunc('day', trans_utc)
OK, so now we have histograms per day. The key is being able to add these tdigest histograms together to create one over a longer period of time. In a real-world system you would create histograms by the minute, by device, or a wide array of different ways. You then have to add those up to get the final reporting. Luckily, that is no problem with tdigest.
Here is an example of how to basically make a new tdigest that aggregates the other tdigest records. I will also show you how you can calculate a percentile from the rolled-up data.
SELECT tenant_id, tdigest_percentile(td, 0.99), tdigest(td) FROM ( SELECT tenant_id, date_trunc('day', trans_utc) as timeperiod, tdigest(total, 100) as td from rumapmevent group by tenant_id, date_trunc('day', trans_utc) ) as inner_query group by tenant_id
You can now see how tdigest can either calculate percentiles or create roll-ups to aggregate the histograms together.
The above examples have been pretty simple. My goal was to give you some basic ideas of how t-digest works and show off some of its capabilities. If you are really trying to ingest lots of data and create percentiles, it is time to take it to the next level!
At Stackify, we ingest billions of data points a day to power Retrace, our APM solution for software developers. If you are wanting to constantly ingest this data and create real-time dashboards as we do, you have to create an entire ingestion pipeline and data aggregation strategy.
We recommend using Kafka, Azure Events Hubs, or similar technology to queue up and ingest your data. We recommend storing all of this data in Postgresql and using the Citus extension to allow scaling Postgresql over multiple servers.
Citus allows you to create a nearly infinitely scalable SQL backed solution for data analytics. You can also use Postgresql Citus on Azure as a PaaS solution, which is the easiest and best way to do it. Citus was acquired by Microsoft and has their backing now.
To make all this really work, all you need to do is import all your raw data into Postgresql and then figure out how you want to aggregate or rollup the data. You then need to store those rollups in new Postgresql tables. You can use pg_partman to partition the tables by time periods and easily keep the raw data for only a few days but then retain your aggregated data for a longer period of time.
Most people do the rollups by running a SQL job every minute or so that incrementally processes the new data that just came in. Basically, you track the position of the data you last processed by date and time or the primary key and then scan for newer data.
Postgresql has a really nice function for doing incremental SQL rollups by using INSERT INTO, ON CONFLICT, DO UPDATE. This makes it really powerful to continually do the updates.
An example of that would look something like this below. You will want to create a table that uses the tdigest data column type. You can then aggregate and rollup tdigest histograms however you like.
INSERT INTO rollup_table (metric_id, timeperiod, total_count, my_tdigest) SELECT metric_id, date_trunc('minute', trans_utc), count(*), tdigest(raw_column, 100) FROM yourtable WHERE tenant_id = tenant_identifier AND id >= start_processing_range AND id <= end_processing_range GROUP BY metric_id, 2 ON CONFLICT (metric_id, timeperiod) DO UPDATE SET total_count = yourtable.total_count + EXCLUDED.total_count, my_tdigest = (select tdigest(td) from (select yourtable.my_tdigest as td union all select EXCLUDED.my_tdigest as td) as x), ;
BTW, tdigest_stackify_combine is a function I made to add two tdigest records together.
Once you are creating these histogram rollups, you can easily query your data; however, you would like and calculate percentiles using tdigests’s built in functions as shown from the basic examples. Creating P99 across 30 billion records is now no big deal thanks to Citus and t-digest.