In this post, we cover ways to address high-cardinality metrics in time series databases such as Prometheus, one of the most popular open-source time series databases, using Recording Rules and Streaming Aggregations. We look at the pros and cons of both approaches and their performance impact, across the following topics.
- What is High Cardinality?
- Ways to reduce Cardinality
- What are Recording Rules
- What are Streaming Aggregations
- Comparison of Recording Rules vs. Streaming Aggregations
What is High Cardinality?
High cardinality is a situation where a metric has a large number of unique label-value combinations, each of which is stored as a separate time series. In the context of monitoring and observability tools, high cardinality metrics can cause performance and scalability issues.
Consider a metric tracking API requests, such as http_requests.
# HELP http_request_duration_seconds Duration of all HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{code="200",handler="login",method="get",le="+Inf"} 5
http_request_duration_seconds_count{code="200",handler="login",method="get"} 5
# HELP http_requests_total Count of all HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 5
The metric is available across the following labels.
┌────────────┬──────────────┐
│ Labels     │ Sample       │
├────────────┼──────────────┤
│ code       │ 200          │
│ method     │ POST         │
│ handler    │ /login       │
│ le         │ 14.5         │
└────────────┴──────────────┘
- 10 possible status codes
- 4 methods (POST, GET, DELETE, PUT)
- 15 handlers
- 60 latency buckets (le), ranging from milliseconds to hours
The cardinality distribution looks like this.
┌────────────┬──────────────┐
│ Labels │ Cardinality │
├────────────┼──────────────┤
│ code │ 10 │
│ method │ 4 │
│ handler │ 15 │
│ le │ 60 │
└────────────┴──────────────┘
This means there are up to 36,000 (10 * 4 * 15 * 60) unique label combinations, and therefore up to 36,000 time series, for the http_request_duration_seconds_bucket metric (the le label exists only on the histogram's bucket series).
Let's say this API is being served from 100 hosts. So, we also need to consider the hostname label.
┌────────────┬──────────────┐
│ Labels │ Cardinality │
├────────────┼──────────────┤
│ code │ 10 │
│ method │ 4 │
│ handler │ 15 │
│ le │ 60 │
│ hostname │ 100 │
└────────────┴──────────────┘
If we add the hostname label with 100 hosts, the number becomes 3,600,000, or 3.6 million. This illustrates how quickly cardinality grows: every new label multiplies the series count.
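The arithmetic above can be checked with a few lines of Python. This is only a sketch of the worst case, assuming every label value can occur with every other one; the function name `series_count` is made up for this illustration.

```python
from math import prod

# Label cardinalities taken from the tables above.
label_cardinality = {"code": 10, "method": 4, "handler": 15, "le": 60}

def series_count(cardinalities):
    """Upper bound on time series: the product of every label's cardinality."""
    return prod(cardinalities.values())

without_hosts = series_count(label_cardinality)
# Adding one more label multiplies the total by that label's cardinality.
with_hosts = series_count({**label_cardinality, "hostname": 100})

print(without_hosts)  # 36000
print(with_hosts)     # 3600000
```

In practice the real number is usually lower, since not every combination of label values is ever observed, but the multiplicative growth is the same.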
Ways to reduce Cardinality
There are different ways of reducing cardinality. One can drop the labels causing high cardinality, such as hostname or container names, at the scraping source, so you never run into the data explosion. But that defeats the purpose of collecting the data: the whole point of scraping these metrics was to extract meaningful answers from them.
Repeat queries are common, especially with Grafana dashboards that refresh every few minutes. In that case, there is merit in pre-computing those queries and storing the results as time series themselves.
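To see what dropping a label does to the series count, here is a small Python sketch. The host, code, and method values are hypothetical, and `drop_label` is an illustrative helper, not part of any Prometheus API.

```python
from itertools import product

# Hypothetical label values, for illustration only.
hosts = [f"host-{i}" for i in range(100)]
codes = ["200", "404", "500"]
methods = ["GET", "POST"]

# Every combination that could be scraped is one time series.
series = [
    {"hostname": h, "code": c, "method": m}
    for h, c, m in product(hosts, codes, methods)
]

def drop_label(label_sets, label):
    """Return the distinct label sets that remain once `label` is removed."""
    remaining = {
        frozenset((k, v) for k, v in labels.items() if k != label)
        for labels in label_sets
    }
    return [dict(items) for items in remaining]

reduced = drop_label(series, "hostname")
print(len(series), len(reduced))  # 600 6
```

The series count falls by a factor of 100, but the per-host breakdown is gone for good, which is exactly the trade-off described above.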
Another advantage of this approach is the time not spent computing such queries at runtime. They can be resource and CPU-intensive, making the time series backend scan many time series, effectively slowing down the performance. We can eliminate the compute penalty by pre-computing such metrics before their actual use in Grafana dashboards or alerts. This helps optimize resources and have a high-performance, fault-tolerant time series database.
There are two broad ways of pre-computing time series data in aggregate form:
- Recording rules allow users to perform advanced operations on their metrics after the data is ingested using PromQL-based rules.
- Streaming Aggregation, on the other hand, allows users to define aggregation rules that run on ingested raw data in real-time, thus reducing the amount of data that needs to be stored in the time series database.
What Are Recording Rules?
Recording rules are a feature of Prometheus that allows users to define new time series based on existing ones. With recording rules, users can perform advanced operations on their metrics, such as aggregating, filtering, and transforming them into more meaningful metrics. Such aggregate metrics can then be used for querying using PromQL.
Recording rules get evaluated after data is ingested, meaning they have a performance penalty. They consume CPU and compute resources and may also introduce lag when the new time series is queried, as the recording rules may still be evaluated when such queries are run.
Example of a Recording Rule
Let's say you want to create a recording rule to aggregate the metrics from multiple targets into one metric. Specifically, you want to calculate the total request rate across all targets for a specific endpoint.
Here's how you could define the recording rule in the Prometheus config:
Define the aggregation
You want to sum the request rates across all targets for a specific endpoint. You can use the sum aggregation function to accomplish this. Here's an example expression:
sum(rate(http_requests_total{endpoint="/api/v1/status"}[5m])) by (job)
This expression calculates the total request rate across all targets for the /api/v1/status endpoint, using a sliding window of 5 minutes. The by (job) clause groups the results by the job label, which identifies the scrape job the targets belong to, so you get one total per job.
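As a rough illustration of what this expression computes, the Python sketch below derives a per-second rate for each series and then sums the rates per job. It is deliberately simplified: unlike PromQL's rate(), `simple_rate` ignores counter resets and extrapolation, and the sample data and function names are hypothetical.

```python
from collections import defaultdict

def simple_rate(samples):
    """Per-second increase between the first and last sample in the window.

    `samples` is a list of (timestamp_seconds, counter_value) pairs.
    Simplification: no counter-reset handling and no extrapolation.
    """
    if len(samples) < 2:
        return 0.0
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)

def sum_rate_by(series, label):
    """Sum the per-series rates, grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, samples in series:
        totals[labels[label]] += simple_rate(samples)
    return dict(totals)

# Hypothetical counter samples over a 5-minute (300 s) window.
series = [
    ({"job": "api", "instance": "host-1"}, [(0, 100), (300, 400)]),
    ({"job": "api", "instance": "host-2"}, [(0, 50), (300, 200)]),
    ({"job": "web", "instance": "host-3"}, [(0, 10), (300, 70)]),
]

rates = sum_rate_by(series, "job")
print(rates)
```

Three input series collapse into one value per job, which is the reduction the recording rule will persist.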
Specify the recording rule
Once you've defined the aggregation expression, you can create a recording rule to store the aggregated data in the new metric you defined. Here's an example of the YAML config:
groups:
  - name: http_requests
    rules:
      - record: http_requests_total:sum
        expr: sum(rate(http_requests_total{endpoint="/api/v1/status"}[5m])) by (job)
This recording rule creates a new group named http_requests, which contains a single rule that uses the sum aggregation expression to calculate the total request rate for the /api/v1/status endpoint. The result is stored in a new metric named http_requests_total:sum. The value of the record key is used as the new metric name.
With this recording rule in place, you can query the http_requests_total:sum metric using PromQL to see the total request rate across all targets for the /api/v1/status endpoint. This can be useful for monitoring the overall health of your system and identifying performance bottlenecks.
Why Do Recording Rules Matter?
Recording rules enable you to pre-calculate frequently used or compute-intensive computations and store their outcomes as a distinct collection of time series. By querying the pre-computed results, you can significantly speed up the process compared to executing the original computation each time it's required. This capability proves particularly advantageous for dashboards that require repetitive querying of the exact computation whenever they refresh.
What Is Streaming Aggregation?
Streaming aggregation allows users to define aggregation rules that run on data in real time, as it is ingested. Unlike recording rules, which are evaluated against data that has already been stored, these rules are applied to the input stream itself.
The output of the aggregation rules is the aggregated data, which can be used for various purposes, such as real-time monitoring and analysis.
The benefit of streaming aggregation is that it happens during data ingestion, so there is no performance penalty or chance of lag while querying the result of stream aggregation. This means that users can get real-time insights from the data without waiting for the data to be processed later. The data backend is also not overwhelmed with recording rules queries once the data is ingested, effectively aiding in the high performance of time series backends.
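The idea can be sketched in Python as an aggregator that sits on the ingestion path. This is a toy model, not how Levitate or any other backend implements it: the class `StreamAggregator`, its methods, and the sample data are all hypothetical, and each value is treated as an increment to be summed within a fixed time window.

```python
from collections import defaultdict

class StreamAggregator:
    """Sums incoming samples per time window while dropping one label."""

    def __init__(self, drop_label, window_seconds):
        self.drop_label = drop_label
        self.window_seconds = window_seconds
        # (window_start, remaining labels) -> running sum
        self.buckets = defaultdict(float)

    def ingest(self, timestamp, labels, value):
        """Fold one raw sample into its aggregate; the raw sample is not kept."""
        window_start = timestamp - (timestamp % self.window_seconds)
        key_labels = tuple(sorted(
            (k, v) for k, v in labels.items() if k != self.drop_label
        ))
        self.buckets[(window_start, key_labels)] += value

    def flush(self):
        """Return the aggregated series that would be written to storage."""
        out = [(ts, dict(labels), total)
               for (ts, labels), total in sorted(self.buckets.items())]
        self.buckets.clear()
        return out

agg = StreamAggregator(drop_label="hostname", window_seconds=60)
agg.ingest(5, {"handler": "/login", "hostname": "host-1"}, 3)
agg.ingest(20, {"handler": "/login", "hostname": "host-2"}, 4)
agg.ingest(65, {"handler": "/login", "hostname": "host-1"}, 2)

aggregated = agg.flush()
print(aggregated)
```

Only the aggregated rows leave the aggregator, so the storage layer never sees the per-host series and has nothing to re-read later.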
Example of Streaming Aggregation
The process of defining the aggregation rule remains the same using PromQL, but different time series databases may have different ways of evaluating such rules. Prometheus does not support streaming aggregation out of the box.
Levitate, Last9's time series warehouse, recently launched support for streaming aggregations via a YAML-based config and a GitOps pipeline based on GitHub or GitLab CI/CD workflows.
An example of a stream aggregation rule in Levitate looks like the following:
- promql: 'sum2 by (stack, le) (http_requests_duration_seconds_bucket{service="pushnotifs"}[1m])'
  as: aggregated_http_requests_duration_seconds_bucket
Read more about Levitate's streaming aggregation support here.
How Can Recording Rules and Streaming Aggregation Help You?
Recording rules and streaming aggregation can help SRE teams manage metrics more efficiently and effectively. Here are some of the ways these rules can be used to improve the performance, scalability, and availability of your monitoring system:
- Reliability: By reducing the number of metrics stored in the database, both recording rules and streaming aggregation can improve the reliability of the monitoring system.
- Scalability: By reducing the amount of data that needs to be stored in the database, recording rules and streaming aggregation can improve the scalability of the monitoring system.
- Performance: By reducing the amount of data that needs to be processed by time series storage, recording rules and streaming aggregation can improve the monitoring system's performance.
- Time to Results Availability: By providing users with more meaningful metrics that can be easily analyzed and acted upon, recording rules and streaming aggregation can reduce the time to results availability.
- Cost Efficiency vs. Data Loss: Streaming aggregation provides a cost-efficient way to handle high-cardinality metrics, because its aggregation rules reduce the amount of data that needs to be stored in the database. However, it carries a potential risk of data loss: the system aggregates data in real time, so data points that are not captured during the aggregation process may be missed.
Performance Penalties
Recording rules are applied to stored data to create new time series. Because this happens after the input data streams are ingested, it carries a performance penalty: the time series storage has to read the data from the database, apply the recording rule, and then store the new time series back into the database.
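The read, apply, and write-back cycle can be sketched in Python with a dictionary standing in for the database. Everything here is hypothetical (the `store` layout, the metric names, and `evaluate_recording_rule`); it only illustrates where the extra work comes from.

```python
# In-memory stand-in for a time series database: metric name -> list of
# (labels, value) pairs holding the latest value of each series.
store = {
    "http_requests_rate": [
        ({"job": "api", "hostname": "host-1"}, 1.0),
        ({"job": "api", "hostname": "host-2"}, 0.5),
        ({"job": "web", "hostname": "host-3"}, 0.25),
    ]
}

def evaluate_recording_rule(store, source, record, by):
    """One evaluation cycle: read stored series, aggregate, write back."""
    totals = {}
    for labels, value in store[source]:             # 1. read from storage
        key = labels[by]
        totals[key] = totals.get(key, 0.0) + value  # 2. apply the rule
    # 3. store the result as a new metric alongside the raw data
    store[record] = [({by: k}, v) for k, v in totals.items()]
    return store[record]

result = evaluate_recording_rule(
    store, source="http_requests_rate", record="http_requests_rate:sum", by="job"
)
print(result)
print(len(store["http_requests_rate"]))  # raw series are still stored
```

Every evaluation repeats the read and the write, and the raw series stay in the store, which is the cost that streaming aggregation avoids by doing the same reduction before storage.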
Recording Rules vs Streaming Aggregation: A Comparison
The table below summarizes the differences between the two:
| Aspect | Streaming Aggregation | Recording Rules |
|---|---|---|
| When does it happen? | During data ingestion | After data is ingested |
| Performance penalty | No | Yes |
| Reliability | Reduces the amount of data stored in the database | Reduces the number of metrics stored in the database |
| Scalability | Allows users to handle larger volumes of data without compromising performance | Reduces the amount of data that needs to be stored in time series storage |
| Performance | Improves performance by reducing the amount of data that needs to be stored in the database | Improves performance by reducing the amount of data that needs to be processed in time series storage |
| Time to results availability | Provides users with real-time insights into their metrics | Provides users with more meaningful metrics that can be easily analyzed and acted upon |
| Cost efficiency vs. data loss | Provides a cost-efficient way to handle high-cardinality metrics, but there is a potential risk of data loss | Cost can increase because of more compute resources; no possibility of data loss |