
May 24th, 2023 · 8 min read

Streaming Aggregation vs Recording Rules

Streaming Aggregation and Recording Rules are two ways to tame High Cardinality. What are they? Why do we need them? How are they different?


In this post, we will cover ways to address high cardinality metrics in time series databases such as Prometheus — one of the most popular open-source time series databases — via Recording Rules and Streaming Aggregations. We will understand the pros and cons of both approaches and their performance impacts. We will cover the following topics.

  • What is High Cardinality?
  • Ways to reduce Cardinality
  • What are Recording Rules?
  • What are Streaming Aggregations?
  • Comparison of Recording Rules vs. Streaming Aggregations

What is High Cardinality?

High cardinality is a situation where many unique values exist in a metric data set. In the context of monitoring and observability tools, high cardinality metrics can cause performance and scalability issues.

Consider metrics tracking API requests, such as http_request_duration_seconds and http_requests_total.

# HELP http_request_duration_seconds Duration of all HTTP requests
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{code="200",handler="login",method="get",le="+Inf"} 5
http_request_duration_seconds_count{code="200",handler="login",method="get"} 5
# HELP http_requests_total Count of all HTTP requests
# TYPE http_requests_total counter
http_requests_total{code="200",method="get"} 5

The metric is available across the following labels.

┌────────────┬──────────────┐
│ Labels     │  Sample      │
├────────────┼──────────────┤
│ code       │          200 │
│ method     │          GET │
│ handler    │       /login │
│ le         │         14.5 │
└────────────┴──────────────┘
  • Ten possible status codes (code)
  • Four methods: GET, POST, PUT, DELETE (method)
  • 15 handlers (handler)
  • 60 latency buckets (le), ranging from milliseconds to hours

The cardinality distribution looks like this.

┌────────────┬──────────────┐
│ Labels     │  Cardinality │
├────────────┼──────────────┤
│ code       │           10 │
│ method     │            4 │
│ handler    │           15 │
│ le         │           60 │
└────────────┴──────────────┘

This means there are 36,000 (10 * 4 * 15 * 60) unique label combinations, i.e., 36,000 time series, for the http_request_duration_seconds_bucket metric.

Let’s say this API is being served from 100 hosts. So, we need to consider the hostname label.

┌────────────┬──────────────┐
│ Labels     │  Cardinality │
├────────────┼──────────────┤
│ code       │           10 │
│ method     │            4 │
│ handler    │           15 │
│ le         │           60 │
│ hostname   │          100 │
└────────────┴──────────────┘

Adding the hostname label across 100 hosts takes the number to 3,600,000, or 3.6 million, time series. This illustrates how quickly cardinality grows: it multiplies.

💡
In cloud-native and Kubernetes environments, dealing with high cardinality has become an undeniable reality. Rather than simply reacting to the exponential growth of metrics at their point of origin, we need practical tools that can ingest high-cardinality metrics and offer control mechanisms to manage that cardinality effectively.

Ways to reduce Cardinality

There are different ways of reducing cardinality. One can drop the labels that cause high cardinality, such as hostname or container name, at the scraping source, so the data explosion never happens. But that works against the goal of getting better answers from the time series data: the whole point of scraping it was to extract meaningful insights from it.

If the same queries run repeatedly, as is typically the case with Grafana dashboards that refresh every few minutes, there is merit in pre-computing those queries and storing the results as time series themselves.

Another advantage of this approach is the time not spent computing such queries at runtime. These queries can be CPU- and resource-intensive, forcing the time series backend to scan many series and slowing down performance. Pre-computing such metrics before they are used in Grafana dashboards or alerts eliminates this compute penalty, which helps optimize resources and keep the time series database fast and fault-tolerant.
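The first option, dropping a label at the scrape source, can be sketched with Prometheus's metric_relabel_configs. A minimal sketch, assuming a hypothetical api job and the hostname label from the earlier example:

```yaml
scrape_configs:
  - job_name: "api"                     # hypothetical job name
    static_configs:
      - targets: ["localhost:8080"]     # illustrative target
    # metric_relabel_configs run after scraping but before samples are
    # stored, so the dropped label never reaches the TSDB.
    metric_relabel_configs:
      - action: labeldrop
        regex: hostname                 # drop the high-cardinality label
```

With the hostname label dropped, the example's cardinality falls from 3.6 million back to 36,000 series, at the cost of no longer being able to query per-host breakdowns.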

There are two broad ways of pre-computing the time series data in the aggregate form —

  1. Recording rules allow users to perform advanced operations on their metrics after the data is ingested using PromQL-based rules.
  2. Streaming Aggregation, on the other hand, allows users to define aggregation rules that run on ingested raw data in real-time, thus reducing the amount of data that needs to be stored in the time series database.

What Are Recording Rules?

Recording rules are a feature of Prometheus that allows users to define new time series based on existing ones. With recording rules, users can perform advanced operations on their metrics, such as aggregating, filtering, and transforming them into more meaningful metrics. Such aggregate metrics can then be used for querying using PromQL.

Recording rules are evaluated after data is ingested, which means they carry a performance penalty. They consume CPU and compute resources and can also introduce lag when the new time series is queried, since a rule may not yet have been evaluated when such a query runs.

Evaluation workflow for recording rules
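This schedule-driven evaluation shows up in the Prometheus configuration itself: rules run at the global evaluation_interval, after samples have already been ingested. A minimal sketch, with an illustrative rule-file path:

```yaml
global:
  scrape_interval: 30s
  evaluation_interval: 1m         # recording rules re-evaluate on this schedule
rule_files:
  - "rules/http_requests.yml"     # hypothetical file holding the recording rules
```

A freshly recorded series can therefore lag the raw data by up to one evaluation_interval.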
💡
Recording rules are extremely common in the Prometheus world, and many third-party tools, such as Sloth - Prometheus SLO generator, use recording rules to generate the metrics required for SLOs. If you use such a third-party library, audit the recording rules it adds; otherwise, they may impact the performance of your time series backend.

Example of a Recording Rule

Let's say you want to create a recording rule to aggregate the metrics from multiple targets into one metric. Specifically, you want to calculate the total request rate across all targets for a specific endpoint.


Here's how you could define the recording rule in the Prometheus config:

Define the aggregation

You want to sum the request rates across all targets for a specific endpoint. You can use the sum aggregation function to accomplish this. Here's an example expression:

sum(rate(http_requests_total{endpoint="/api/v1/status"}[5m])) by (job)

This expression calculates the total request rate across all targets for the /api/v1/status endpoint, using a sliding window of 5 minutes. The by (job) clause groups the results by the job label, identifying the target the metric came from.

Specify the recording rule

Once you've defined the aggregation expression, you can create a recording rule to store the aggregated data in the new metric you defined. Here's an example of the YAML config:

groups:
  - name: http_requests
    rules:
      - record: http_requests_total:sum
        expr: sum(rate(http_requests_total{endpoint="/api/v1/status"}[5m])) by (job)

This recording rule creates a new group named http_requests containing a single rule that uses the sum aggregation expression to calculate the total request rate for the /api/v1/status endpoint. The result is stored in a new metric named http_requests_total:sum; the value of the record key becomes the new metric name.

With this recording rule in place, you can query the http_requests_total:sum metric using PromQL to see the total request rate across all targets for the /api/v1/status endpoint. This can be useful for monitoring the overall health of your system and identifying performance bottlenecks.
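For example, the pre-computed series can back an alerting rule, so each evaluation reads one cheap series instead of re-running the rate() aggregation. The alert name, threshold, and labels below are illustrative:

```yaml
groups:
  - name: http_requests_alerts
    rules:
      - alert: HighStatusEndpointTraffic   # hypothetical alert name
        # Query the pre-computed metric rather than the expensive rate() expression.
        expr: http_requests_total:sum > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Request rate for /api/v1/status is unusually high"
```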

Why Do Recording Rules Matter?

Recording rules let you pre-calculate frequently used or compute-intensive queries and store their outcomes as a distinct set of time series. Querying the pre-computed results is significantly faster than executing the original computation each time it is required. This is particularly advantageous for dashboards, which re-run the same queries every time they refresh.

What Is Streaming Aggregation?

Streaming aggregation allows users to define aggregation rules that run on ingested data in real-time. Unlike recording rules, streaming aggregation happens in real time during data ingestion.

Streaming Aggregation Workflow

In this diagram, you see the process of streaming aggregation. Ingested data from the input stream is processed by the aggregation rules in real-time. The output of the aggregation rules is the aggregated data, which can be used for various purposes, such as real-time monitoring and analysis.

The benefit of streaming aggregation is that it happens during data ingestion, so there is no performance penalty and no lag when querying the result of a stream aggregation. Users get real-time insights from the data without waiting for it to be processed later. The backend is also not overwhelmed by recording-rule queries after ingestion, which helps keep the time series backend performant.

💡
Streaming aggregation is an enduring concept in data stream processing. It unlocks observability possibilities that are often passed over due to worries about performance or the resources needed to store crucial data. It can significantly speed up SLO queries, leading to better customer experiences, and it enables real-time monitoring of system performance and detection of abnormal infrastructure components without imposing excessive demands on query and storage-retention resources. That efficiency is worth pursuing, as it significantly increases the time series backend's throughput.

Example of Streaming Aggregation

The process of defining the aggregation rule remains the same using PromQL, but different time series databases may have different ways of evaluating such rules. By default, Prometheus does not support streaming aggregation out of the box.

Levitate, Last9's time series warehouse, recently launched support for streaming aggregation via a YAML-based config and a GitOps pipeline built on GitHub or GitLab CI/CD workflows.

An example of a stream aggregation rule in Levitate looks like the following:

- promql: 'sum2 by (stack, le) (http_requests_duration_seconds_bucket{service="pushnotifs"}[1m])'
  as: aggregated_http_requests_duration_seconds_bucket

Read more about Levitate's streaming aggregation support here.


How Can Recording Rules and Streaming Aggregation Help You?

Recording rules and streaming aggregation can help SRE teams manage metrics more efficiently and effectively. Here are some of the ways these rules can be used to improve the performance, scalability, and availability of your monitoring system:

  • Reliability: By reducing the number of metrics stored in the database, both recording rules and streaming aggregation can improve the reliability of the monitoring system.
  • Scalability: By reducing the amount of data that needs to be stored in the database, recording rules and streaming aggregation can improve the scalability of the monitoring system.
  • Performance: By reducing the amount of data that needs to be processed by time series storage, recording rules and streaming aggregation can improve the monitoring system's performance.
  • Time to Results Availability: By providing users with more meaningful metrics that can be easily analyzed and acted upon, recording rules and streaming aggregation can reduce the time to results availability.
  • Cost Efficiency vs. Data Loss: Streaming aggregation provides a cost-efficient way to handle high-cardinality metrics without compromising data quality, using aggregation rules to reduce the amount of data that needs to be stored in the database. However, there is a potential risk of data loss: because the system aggregates data in real time, data points that are not captured during the aggregation process are missed.

Performance Penalties

Recording rules are applied to already-stored data to create new time series. Because this happens after the input data streams are ingested, it carries a performance penalty: the time series storage has to read the data from the database, apply the recording rule, and then store the new time series back into the database.

Recording Rules vs Streaming Aggregation: A Comparison

The table below summarizes the differences between the two:

When does it happen?
  • Streaming aggregation: during data ingestion
  • Recording rules: after data is ingested

Performance penalty
  • Streaming aggregation: no
  • Recording rules: yes

Reliability
  • Streaming aggregation: reduces the amount of data stored in the database
  • Recording rules: reduces the number of metrics stored in the database

Scalability
  • Streaming aggregation: lets users handle larger volumes of data without compromising performance
  • Recording rules: reduces the amount of data that needs to be stored in time series storage

Performance
  • Streaming aggregation: improves performance by reducing the amount of data that needs to be stored
  • Recording rules: improves performance by reducing the amount of data that needs to be processed at query time

Time to results availability
  • Streaming aggregation: provides real-time insights into metrics
  • Recording rules: provides more meaningful metrics that can be easily analyzed and acted upon

Cost efficiency vs. data loss
  • Streaming aggregation: cost-efficient for high-cardinality metrics, but with a potential risk of data loss
  • Recording rules: cost can increase due to extra compute resources, but no possibility of data loss
💡
The Last9 promise — We will reduce your TCO by about 50%. Our managed time series data warehouse, Levitate, comes with streaming aggregation, data tiering, and the ability to manage high cardinality. If this sounds interesting, talk to us.


Authors

Last9

Last9 helps businesses gain insights into the Rube Goldberg of micro-services. Levitate - our managed time series data warehouse is built for scale, high cardinality, and long-term retention.