Prometheus, an open-source time series database for monitoring, has proven effective for system and application monitoring, but high cardinality remains its major weakness. High cardinality metrics can crash the Prometheus TSDB, slowing queries and leaving Grafana dashboards sluggish or unresponsive.
This comprehensive guide explains how to manage high cardinality metrics in Prometheus to maintain optimal monitoring.
What are High Cardinality Metrics?
Label cardinality in Prometheus refers to the number of unique label value combinations for a given metric. Labels are key-value pairs attached to each individual data point, allowing for fine-grained identification and categorization of metrics. The cardinality of a label is the number of distinct values it can take on. Labels with many unique values, such as user_id or user_email, result in high cardinality.
For instance, in an e-commerce application, a metric tracking product sales can easily become high cardinality because of dimensions like region, product category, and product type, which multiply into thousands of unique combinations. Yet this metric is genuinely useful to product and engineering teams for understanding the business. This is just one example of why high cardinality is a reality today when extracting answers from time series data.
High label cardinality means that a metric has many unique label combinations. This has implications for the performance and scalability of Prometheus, since each distinct label combination creates a separate time series in the system. High cardinality increases memory usage, CPU usage, query times, and storage requirements.
Keeping label cardinality within reasonable limits is generally recommended to ensure efficient usage of Prometheus. While no specific threshold is defined as "high" cardinality, it is important to strike a balance between providing sufficient granularity for monitoring needs and avoiding excessive unique label combinations.
Why Do High Cardinality Metrics Matter?
High cardinality data matters because it can significantly impact the performance and availability of time series databases. It can also lead to the loss of data and of the insights derived from metrics, often because metrics or labels are dropped to keep the database healthy. When fewer data points are retained, it becomes harder to troubleshoot issues and pinpoint their source. Moreover, high cardinality forces scaling, which means tangible investment in specialized hardware, infrastructure, and expertise.
Prometheus Cardinality Limits
Why do we run into High Cardinality?
Data is more indispensable and abundant than ever. Various systems, including networks and infrastructure components, generate data that informs decisions about service quality, especially in cloud-native environments such as Kubernetes-based deployments. Imagine hundreds of pods from a Kubernetes cluster scraped by a single Prometheus instance, with each pod ID increasing the cardinality.
More data = more metrics = high cardinality.
Increased adoption of cloud-native technologies like Kubernetes, which generate huge volumes of metrics, also contributes to the explosion of metrics. Traditional monitoring based on static configurations and manual threshold alerts adds to high cardinality as well, since it scrapes metrics that may not be relevant to performance.
What is the Maximum Cardinality in Prometheus?
The rule of thumb is to maintain a manageable cardinality. The maximum depends on two things:
Infrastructure size: A Prometheus server deployed in a small environment cannot handle the same monitoring scale as one deployed in a large environment.
Storage allocation: The index-based storage system in Prometheus can lead to premature hardware scaling and infrastructure costs when high cardinality metrics are collected. The storage allocation must, therefore, be well-thought-out to provide optimal storage performance.
Recommended Prometheus Metrics Cardinality
Prometheus’ storage system maintains an inverted index that maps each label name-value pair to the set of matching time series. For high-cardinality-prone metrics, this index grows correspondingly, increasing memory, CPU, and disk utilization and slowing down queries.
Although there is no fixed limit on Prometheus’ cardinality, you should consolidate and optimize metrics to avoid performance and stability issues.
Find High Cardinality Metrics
Prometheus provides several methods for identifying high cardinality metrics. Here's a general guide on how you can do it:
1. Using Prometheus UI:
Visit the Prometheus web interface, usually available at http://<your-prometheus-server-address>:9090. Use the following expressions to find high cardinality metrics:
- {__name__=~".+"}: It returns all series currently in memory. If the result is too high, it signifies high cardinality.
- count({__name__=~".+"}) by (__name__): It returns the count of series per metric name. Look for metrics with a significantly higher count than others.
- topk(10, count by (__name__, job)({__name__=~".+"})): It returns the top 10 highest series counts by metric name and job. It helps to identify the jobs that are producing high cardinality metrics.
- topk(10, count by (__name__, instance)({__name__=~".+"})): This is similar to the above but groups by instance. Useful to identify problematic instances.
2. Using the Prometheus Stats API:
Prometheus also exposes a Stats API endpoint that you can use to gather statistics about its internal metrics. Visit http://<your-prometheus-server-address>:9090/api/v1/status/tsdb to see statistics about the time-series database. Look for the numSeries field to see the total number of series Prometheus stores.
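You can also query the same endpoint from the command line. The sketch below assumes Prometheus is reachable on localhost:9090 and that jq is installed; on recent Prometheus versions the in-memory series count is reported under data.headStats.numSeries:
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'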
3. Using Grafana dashboards:
Using Grafana, you can import the official Prometheus Stats dashboard (ID: 1860) or Prometheus 2.0 Stats (ID: 3662) to visualize the total number of series and other valuable metrics. These dashboards provide insights into cardinality across all metrics stored in your Prometheus instance.
After identifying the high cardinality metrics, you may need to revise your metrics design or use techniques such as metric relabeling to drop unnecessary metrics or labels. Be aware that removing labels can cause loss of information, so be sure to evaluate the impact before making any changes.
Managing High Cardinality Prometheus Metrics
Handling high cardinality metrics requires sensitivity to the nature of the metrics being measured, as well as to the infrastructure in which the Prometheus server is deployed. The following techniques help keep cardinality under control.
Metrics Relabeling
Relabeling in Prometheus is a powerful feature that allows you to modify or filter labels on metrics before they are stored or processed. By using relabeling, you can reduce the cardinality of your metrics, which can help improve performance and resource usage. Here are some relevant examples of relabeling techniques that can be used in Prometheus:
1. Dropping labels: You can drop specific labels from your metrics at scrape time (using metric_relabel_configs), which reduces the overall cardinality. This is useful when labels are unnecessary for analysis or contribute to excessive cardinality. Note that removing a label requires the labeldrop action; the drop action discards whole samples or targets instead. For example:
metric_relabel_configs:
  - action: labeldrop
    regex: unwanted_label
2. Keeping only selected labels: Sometimes, you may have multiple labels on a metric, but only a subset of those labels is required for analysis. In such cases, you can keep only the necessary labels and drop the rest. Here's an example:
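(A minimal sketch: labelkeep removes every label whose name does not match the regex, so built-in labels such as __name__, job, and instance must be listed alongside the labels you actually need; method and status_code here are illustrative.)
metric_relabel_configs:
  - action: labelkeep
    regex: "__name__|job|instance|method|status_code"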
3. Relabeling with a constant value: If you have a label with a high cardinality but know its value is not significant for analysis, you can replace it with a constant value. This effectively reduces the cardinality without losing critical information.
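As a hedged sketch, assuming client_id is such a label, a replace rule can overwrite every value with a constant:
metric_relabel_configs:
  - action: replace
    source_labels: [client_id]   # hypothetical high-cardinality label
    regex: ".*"
    target_label: client_id
    replacement: all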
4. Label value mapping: You can sometimes map specific labels to a more general or aggregated value. This can help reduce cardinality while still preserving the essence of the metric. Here's an example:
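(Sketch only: the label names are illustrative. The rule below collapses every 5xx status_code into a single status_class value; once status_class is in place, status_code can be removed with a labeldrop rule.)
metric_relabel_configs:
  - action: replace
    source_labels: [status_code]
    regex: "5.."
    target_label: status_class
    replacement: 5xx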
5. Using regular expressions for relabeling: Regular expressions can be powerful tools. You can match and extract parts of labels using regular expressions and use them to create new labels. This allows you to extract relevant information while reducing the overall cardinality. Here's a basic example:
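(Assuming a pod label of the form <deployment>-<replicaset-hash>-<pod-hash>, which is the common Kubernetes naming pattern, the rule below captures only the deployment part into a new, much lower-cardinality label.)
metric_relabel_configs:
  - action: replace
    source_labels: [pod]
    regex: "(.+)-[a-z0-9]+-[a-z0-9]+"
    target_label: deployment
    replacement: "$1"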
These are just a few examples of how relabeling can be used in Prometheus to reduce cardinality. Your specific relabeling configuration will depend on your specific use case and the nature of your metrics.
Aggregation
Aggregation is a method that involves combining the metric values of multiple time series to create a single new time series. This approach can significantly reduce the number of time series that Prometheus needs to store and handle, minimizing memory usage, CPU usage, and disk space requirements.
For instance, if you have a metric http_requests_total with two labels: method and status_code, you can use the following query to aggregate the metrics into a single time series:
sum by (method) (http_requests_total)
This PromQL-based query sums up the http_requests_total metric by method, removing the status_code label from the result. This approach eliminates the need to store and manage status_code label values.
You can use recording rules and streaming aggregations to perform these aggregations.
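For example, a minimal recording-rule sketch (the group name is arbitrary and the rule name follows the level:metric:operations convention) persists the aggregated series so dashboards can query it cheaply:
groups:
  - name: http_aggregations
    rules:
      - record: job_method:http_requests_total:sum
        expr: sum by (job, method) (http_requests_total)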
Use bucketing or histogram techniques for metrics with continuous values to group data into predefined ranges. This reduces cardinality by limiting the number of distinct values while still providing insight into the data distribution. Prometheus provides the histogram_quantile() function to query aggregated data from histogram metrics.
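For instance, a typical latency query (http_request_duration_seconds is a placeholder metric name) computes a percentile from the aggregated buckets rather than from raw per-request data:
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))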
Rollup and downsampling
You can roll up or downsample data over time, depending on your long-term storage requirements. For example, you might decide to store high-resolution data for a short period and then downsample it to reduce cardinality and save storage space.
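Prometheus itself does not downsample automatically, but a recording rule evaluated at a coarser interval is a common approximation; the sketch below (group and rule names are illustrative, and node_load1 assumes node_exporter) records 5-minute averages that remain queryable after the raw series age out:
groups:
  - name: downsample_5m
    interval: 5m
    rules:
      - record: instance:node_load1:avg_over_time_5m
        expr: avg_over_time(node_load1[5m])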
Tuning Prometheus server configs
The Prometheus server has various config options that can impact cardinality, such as scrape_interval, scrape_timeout, and evaluation_interval. Fine-tuning these settings can help manage the amount of data collected and processed by Prometheus.
To avoid overload, you can also set the sample_limit field in a scrape_config: if a scrape returns more samples than the configured limit, it fails and the data is discarded. For instance, to make scraping fail when a target returns more than 5,000 samples, set the field as shown below:
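(A minimal sketch; the job name and target are placeholders.)
scrape_configs:
  - job_name: my-app
    sample_limit: 5000        # the whole scrape is rejected if it returns more than 5000 samples
    static_configs:
      - targets: ["my-app:8080"]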
Instead of collecting every raw data point, use efficient models like Prometheus histograms to aggregate measurement data. Histograms describe the distribution of a quantity using a fixed set of buckets, each tracked as its own time series. You can then drop the series for buckets you do not need, reducing the overall number of series. The query below helps identify candidates by finding series that received fewer than ten samples over the last hour (metric_name is a placeholder):
count_over_time(metric_name[1h]) < 10
PromQL cannot delete data on its own, so once you have identified such series, drop them at ingestion time with a metric_relabel_configs rule instead; a drop rule keyed on the le label removes unneeded histogram buckets.
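(Sketch only: the metric name and bucket boundaries below are illustrative; the rule discards the listed fine-grained buckets while keeping the rest of the histogram.)
metric_relabel_configs:
  - source_labels: [__name__, le]
    regex: "http_request_duration_seconds_bucket;(0.025|0.075|0.25|0.75)"
    action: drop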
Use efficient PromQL query patterns
Query patterns like topk and bottomk help reduce the number of data points queries return. Grafana and other visualization tools also enable the customization of smarter dashboards where you can filter data, highlight essential metrics, and identify teams and environments that contribute the most to Prometheus cardinality.
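For example, instead of charting every series, a dashboard panel can show only the top contributors (service is an assumed label here):
topk(5, sum by (service) (rate(http_requests_total[5m])))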
Optimize labels and tags
Enforce consistent label names and values across related metrics. Organize tags into a hierarchy or categories to avoid confusion, and only define the metrics and labels you actually need.
Control data frequency via DPM
To optimize your metrics, you may need to inspect your data frequency, which is governed by the scrape_interval setting and defaults to 15 seconds, or 4 data points per minute (DPM). To check how many data points per minute each target is actually producing, you can run the query below:
count_over_time(scrape_samples_scraped[1m])
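If 4 DPM is more resolution than you need, raising the scrape interval is a straightforward way to cut sample volume; for example (shown as a global setting, though it can also be set per scrape job):
global:
  scrape_interval: 60s   # 1 data point per minute instead of the default 4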
Implement Appropriate Retention Policies
Determine the granularity at which metrics will be stored using criteria such as the purpose and importance of each metric, legal and regulatory constraints, and cost. This helps ensure that the most important metrics are not lost to storage limitations, while metrics that are only needed briefly are not kept longer than necessary.
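In Prometheus itself, retention is controlled with server flags; a hedged example (the values are illustrative, and size-based retention requires a reasonably recent Prometheus release):
prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB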
Use Kubernetes annotations
If you run Prometheus with Kubernetes, use Kubernetes annotations to add metadata to your Kubernetes objects, reducing label usage in Prometheus metrics. Kubernetes annotations are key-value pairs that can be attached to Kubernetes objects such as pods or services.
Adding annotations to applications enhances metrics collection and querying. For example, annotations such as prometheus.io/scrape and prometheus.io/path can instruct Prometheus to scrape specific endpoints or paths for metrics from Kubernetes pods.
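These annotations are a convention honored by commonly used Kubernetes service-discovery relabeling rules (for example in the community Helm charts), not something Prometheus enforces by itself. A sketch of a pod using them, with the name, image, and port as placeholders:
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "8080"
spec:
  containers:
    - name: my-app
      image: my-app:latest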
Employ Horizontal Scaling
Use horizontal scaling in Kubernetes to spread the load of metrics collection and queries across multiple instances. You can use the Prometheus Operator to manage the installation of the Prometheus stack in your Kubernetes cluster. With this approach, often described as federation or sharding, each instance collects metrics from a subset of the monitored targets and stores them locally; queries and alerts are then federated or sharded across the instances, each serving a subset of the workload. This approach is especially effective for large-scale, distributed systems where the workload is spread across geographically sparse locations.
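As a sketch of the federation side, a central Prometheus can scrape aggregated series from shard instances through the /federate endpoint (the match[] selector and shard hostnames below are placeholders; the selector assumes you only federate recording-rule outputs):
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - prometheus-shard-1:9090
          - prometheus-shard-2:9090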
Conclusion
Managing high cardinality metrics in Prometheus is a crucial task that can significantly impact the performance and scalability of Prometheus. Using techniques discussed in this blog, developers can effectively manage high cardinality metrics while maintaining their data's integrity and accuracy.
However, most of these techniques require dropping metrics that may still be useful. With observability tools such as Levitate (a managed Prometheus-compatible solution), you can retain high cardinality metrics without dropping them. It helps you set aside high cardinality metrics for inspection so that you can make informed decisions about controlling them instead of discarding them blindly.
With its advanced algorithms and user-friendly interface, Levitate makes it easy for developers to analyze metrics, draw insights from their applications without worrying about high cardinality, and clearly understand their system's overall health.
Last9 helps businesses gain insights into the Rube Goldberg of microservices. Levitate, our managed time series data warehouse, is built for scale, high cardinality, and long-term retention.