
Nov 8th, ’23 / 5 min read

Downsampling & Aggregating Metrics in Prometheus: Practical Strategies to Manage Cardinality and Query Performance

A comprehensive guide to downsampling metrics data in Prometheus with alternate robust solutions


Prometheus has become indispensable in the observability toolkit, offering robust monitoring capabilities critical for the performance of cloud-native applications and distributed systems. However, as the scale of data grows, so does the challenge of managing it effectively.

High cardinality data can lead to performance bottlenecks, making downsampling an essential strategy. This article explores the concept of downsampling in Prometheus, the use of recording rules, the role of tools like Thanos, and other approaches to managing data efficiently.

The Cardinality Challenge in Prometheus

Cardinality refers to the number of unique sets of labels for a given metric in a time series database like Prometheus. High cardinality metrics can significantly slow query performance and consume more storage, leading to increased costs and operational challenges. Identifying, managing, and mitigating high cardinality is crucial for maintaining an efficient monitoring system.

Related Read - What is High Cardinality?

Understanding Downsampling in Prometheus

Before diving into how to downsample within Prometheus, it's crucial to understand what downsampling is and why it's needed.

Downsampling is the process of reducing the resolution of time-series data. This is typically done by aggregating multiple data points into a single point over a specific interval, through averaging, summing, or taking the minimum or maximum values.
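
For example, sixty 1-second samples within a minute can collapse into a single point holding their average, or their maximum if peaks matter more than typical values.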

In the context of Prometheus, downsampling can help in several ways:

  • Reduces storage requirements: By storing fewer data points over the long term, you can significantly decrease the required disk space.
  • Improves query performance: Fewer data points mean faster queries, as Prometheus has fewer samples to process when answering a query.
  • Maintains older data at a usable resolution: Without downsampling, you might have to discard older data sooner to save space. Downsampling allows you to keep historical data longer but at a lower resolution.

However, downsampling is not without trade-offs.

Downsides of downsampling

The primary downside is the loss of detail in your data. For instance, if you downsample from 1-second intervals to 1-minute intervals, each stored point now summarizes 60 raw samples, and the variation within each minute is lost. This might be acceptable for long-term trends but can obscure important details when troubleshooting issues.

In Prometheus, downsampling isn't a native capability. As such, it requires either using recording rules to pre-aggregate data or integrating external tools like Levitate Aggregator Unit, Thanos, or M3, which can handle downsampling or aggregation more gracefully.

With this understanding of downsampling, we can explore how Prometheus recording rules can be used to achieve a basic level of downsampling and the limitations of this approach.

How to Downsample in Prometheus

Using Recording Rules

Prometheus offers a feature known as recording rules, which lets users precompute frequently needed or computationally expensive expressions and save the results as a new set of time series. Recording rules can be used to downsample data by creating a new metric that represents a lower-resolution view of the original.

For example, if you have a metric that records data every second, you could create a recording rule that averages this data over a minute:

groups:
- name: downsampled_metrics
  interval: 1m  # evaluate the rules once per minute
  rules:
  # Average the raw per-second samples over each 1-minute window,
  # aggregated by job (following the level:metric:operation naming convention)
  - record: job:request_duration_seconds:avg_1m
    expr: avg by (job) (avg_over_time(request_duration_seconds[1m]))
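
Once the rule has run, dashboards and alerts can query job:request_duration_seconds:avg_1m directly instead of re-evaluating avg_over_time against the raw series at query time.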

However, this approach has its drawbacks. It increases the complexity of your Prometheus configuration and can lead to confusion over which metrics to query. Additionally, it doesn't reduce the number of samples stored for the original high-resolution metric, so unless you also shorten that metric's retention, the storage savings are smaller than you might expect.

Read more on the differences between Recording Rules vs. Streaming Aggregation.

Adjusting Scrape Intervals

One of the simplest forms of downsampling is to increase the scrape interval of Prometheus. By scraping metrics less frequently, you inherently reduce the resolution of your data (see the configuration sketch after this list).

  • Capability: Easy to implement; just a configuration change.
  • Feature: Reduces the amount of data ingested in real time.
  • Drawback: Can miss short-term spikes or anomalies that could be critical.
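
As a minimal sketch (the job name and target are assumptions for illustration), raising the global scrape_interval in prometheus.yml lowers resolution across all jobs, and individual jobs can override it:

global:
  scrape_interval: 60s  # e.g. up from a common 15s default: 4x fewer samples

scrape_configs:
  - job_name: 'node'
    # Per-job override; this job is scraped even less frequently.
    scrape_interval: 120s
    static_configs:
      - targets: ['localhost:9100']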

Prometheus Downsampling Using External Tools

Several external tools can be integrated with Prometheus to provide downsampling capabilities. These tools can ingest high-resolution data and store it in a downsampled format.

Levitate Aggregator Unit

Streaming aggregation allows users to define aggregation rules that run on data as it is ingested. Unlike recording rules, which periodically evaluate expressions against already-stored data, streaming aggregation happens in real time during ingestion, before the data is stored, offering an alternative to the usual recording-rules-based approach to aggregation in Prometheus.

💡
Levitate offers a streaming aggregation pipeline that can be used instead of recording rules to pre-aggregate high cardinality metrics. Read more here.

The Thanos Approach to Downsampling

Thanos is a set of components that can be added to existing Prometheus deployments to extend their functionality. One of the critical features of Thanos is its ability to downsample data. Thanos takes high-resolution data and reduces its resolution, decreasing the data stored and improving query performance.
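
As a minimal sketch (retention values here are illustrative), downsampling runs in the Thanos compactor, which is enabled by default and produces 5-minute and 1-hour resolution blocks as data ages; retention can then be set per resolution:

# Run the compactor continuously against the object-store bucket.
# Downsampling is on by default; retention is configured per resolution.
thanos compact \
  --wait \
  --data-dir=/var/thanos/compact \
  --objstore.config-file=bucket.yml \
  --retention.resolution-raw=30d \
  --retention.resolution-5m=180d \
  --retention.resolution-1h=2y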

However, Thanos has its drawbacks. It introduces additional complexity to your monitoring system and requires careful configuration and management. Moreover, downsampling can lead to losing detail in your metrics, which might be critical for diagnosing issues.

Other tools & approaches

M3

M3 is an open-source metrics platform that integrates with Prometheus and provides native support for downsampling (see the configuration sketch after this list).

  • Capability: Designed for scalable, long-term storage of time-series data.
  • Feature: Provides highly configurable downsampling options.
  • Drawback: Requires separate deployment and management.
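
As a rough sketch (namespace names and retention values are assumptions), M3 routes downsampled data into aggregated namespaces defined in the m3coordinator configuration, each with its own resolution and retention:

clusters:
  - namespaces:
      # Raw, full-resolution data kept for two days
      - namespace: default
        type: unaggregated
        retention: 48h
      # Automatically downsampled to one datapoint per 5 minutes, kept 30 days
      - namespace: metrics_5m
        type: aggregated
        retention: 720h
        resolution: 5m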

Cortex

Cortex is another horizontally scalable, highly available, multi-tenant, long-term storage solution for Prometheus.

  • Capability: Built for large-scale, multi-tenant deployments with durable long-term retention.
  • Feature: Block storage backed by object stores, reusing the Prometheus TSDB format.
  • Drawback: Complex to set up and tune; unlike Thanos, it does not downsample natively, so resolution must be reduced upstream (for example, via recording rules).

Custom Scripts and Batch Jobs

For some use cases, custom scripts or batch jobs can be written to process the raw data and store it in a downsampled format (see the sketch after this list).

  • Capability: Highly customizable to specific downsampling needs.
  • Feature: Can be tailored to perform complex downsampling operations not supported by other tools.
  • Drawback: Requires development and maintenance of custom code, which can be error-prone and resource-intensive.
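
As a minimal sketch (the Prometheus URL, metric name, and CSV output are assumptions for illustration), a batch job can pull pre-aggregated values from the Prometheus HTTP API at a coarse step and persist them elsewhere:

import csv
import time

import requests

PROM_URL = "http://localhost:9090"  # assumption: a locally reachable Prometheus
# Average the (hypothetical) raw metric over 5-minute windows.
QUERY = "avg_over_time(request_duration_seconds[5m])"

end = int(time.time())
start = end - 7 * 24 * 3600  # last 7 days

# query_range evaluates the expression at each 5-minute step.
resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start, "end": end, "step": "5m"},
    timeout=30,
)
resp.raise_for_status()
series_list = resp.json()["data"]["result"]

# Persist one row per (series, timestamp) pair at 5-minute resolution.
with open("downsampled.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["labels", "timestamp", "value"])
    for series in series_list:
        for ts, value in series["values"]:
            writer.writerow([series["metric"], ts, value])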

Managing High Cardinality Without Downsampling

While downsampling is an effective technique for managing high cardinality metrics, the tradeoff of losing resolution has implications for your overall monitoring strategy.

In our comprehensive guide to managing high cardinality metrics in Prometheus, we cover several other techniques, including relabeling, horizontal scaling, and aggregation.

Levitate’s Superior Cardinality Support and Features

With superior default support for high cardinality and long retention, Levitate, our managed Prometheus-compatible hosted solution, ensures you never compromise on your cardinality needs. You can drop metrics at the source, reduce labels, or tame a cardinality explosion using streaming aggregations, our built-in Prometheus aggregator.

Learn more about Levitate’s High Cardinality Support

Prometheus Downsampling Best Practices

Balancing Granularity and Performance

Finding the right balance between granularity and performance is key to effective downsampling. While high granularity provides detailed information, it can lead to performance issues. Conversely, too much downsampling can obscure important details. It's essential to understand the monitoring needs of your system to strike the right balance.

Monitoring Downsampling Effectiveness

Once you've implemented a downsampling strategy, monitoring its effectiveness is essential. Prometheus offers tools to measure the performance of queries and the data storage size. By keeping an eye on these metrics, you can adjust your downsampling strategy to ensure that it continues to meet your system's needs.
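
For example, Prometheus's own self-instrumentation exposes series counts and query timings (metric names as found in recent Prometheus versions):

# Active series in the TSDB head; should fall as downsampled series replace raw ones
prometheus_tsdb_head_series

# 90th-percentile query evaluation time from Prometheus's self-instrumentation
prometheus_engine_query_duration_seconds{slice="inner_eval", quantile="0.9"}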

In Conclusion

The choice of downsampling technique or tool should be guided by the specific requirements of your monitoring system, such as the desired balance between data granularity, storage costs, and query performance. It's also important to consider the operational overhead and complexity that each option introduces to your Prometheus ecosystem.

