Metrics explosion in time series data is problematic because it obscures valuable insights about system performance, especially in cloud-native environments with petabytes of metrics and metadata. High cardinality is the chief byproduct of metrics explosion, and Prometheus’ inability to resolve high cardinality issues further complicates the situation. This article defines high cardinality in a cloud-native environment and explores practical ways for SRE and DevOps teams to minimize and control it.
What is High Cardinality?
High cardinality occurs when a metric, dataset, or data field has many unique values or categories. In other words, the values are highly diverse, and each one accounts for only a small number of instances relative to the total.
For example, a metric that tracks user behavior on a website can develop high cardinality if the website has hundreds of pages and visits to each page are tracked separately.
Similarly, the customer ID field in an online store’s customer dataset has high cardinality if the dataset contains thousands or millions of unique customer IDs. Analysis becomes difficult because there are so many distinct IDs to consider.
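To make the definition concrete, the sketch below estimates how many distinct time series a single metric can produce. The upper bound is the product of the number of distinct values of each label. The label names and counts are invented for illustration and do not come from a real system.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on distinct series: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical labels on a page-view metric.
page_views = {"page": 300, "status_code": 5, "region": 4}

# The same metric after adding a per-customer label.
with_customer = {**page_views, "customer_id": 100_000}

print(max_series(page_views))     # 6000
print(max_series(with_customer))  # 600000000
```

A single unbounded label such as a customer ID multiplies the series count of every other label combination, which is why it dominates the total.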
Why Does High Cardinality Matter?
High cardinality matters because it can significantly impact the performance of systems. Monitoring tools and data analysis systems such as Prometheus, a popular open-source time series database, may need more processing time and storage for high cardinality workloads. This is particularly challenging in large, distributed computing environments with many instances or microservices running simultaneously.
Higher cardinality can also result in slower query performance and increased query latency, which can hamper debugging. This can negatively impact the responsiveness of analytics systems, leading to delays in detecting issues or identifying opportunities for system optimization.
For example, if a monitoring system tracks individual requests made to a web application, high cardinality could result in too many unique data points, making it difficult to understand the system's overall health. As a result, cloud-native applications and services require monitoring and alerting tools that can handle high cardinality without negatively impacting performance.
An observability platform should therefore be evaluated on whether it can handle high cardinality in cloud-native environments without causing degradations and outages.
High Cardinality in Cloud Native Environments
Containerization is a key characteristic of cloud-native architectures and microservices-based deployments. Each service runs as many container instances, each with its own unique ID, and metrics labeled with this metadata usually have high cardinality. Moreover, when these containers are updated or restarted, the metadata changes, generating even more unique values and worsening the problem.
Another factor that contributes to high cardinality is the use of ephemeral infrastructures. These refer to systems, such as virtual machines, that are created, used for short periods, and then destroyed. They enable fast and efficient scaling of services.
However, because ephemeral infrastructures are constantly created and destroyed, they generate a large number of unique data points, contributing significantly to high cardinality. In addition, the metrics collected from an ephemeral system stop updating as soon as it is destroyed, which makes historical monitoring difficult.
In Kubernetes, common sources of high cardinality include:
- Labels and Annotations: Kubernetes allows you to add labels and annotations to resources like pods, services, and nodes. These labels and annotations provide metadata and help in categorizing and organizing resources. However, if you have a large number of unique labels or annotations across your cluster, it can lead to high cardinality. For example, if you have thousands of unique labels assigned to your pods, it can impact the performance of queries and filtering operations that rely on these labels.
- Custom Metrics: Kubernetes supports custom metrics, which let you expose application-specific metrics through the Kubernetes custom metrics API. If your applications generate a high number of unique custom metrics, it can result in high cardinality. Each unique metric increases the complexity of data collection, storage, and analysis, and can potentially strain your monitoring systems.
- Pod or Container Names: In a Kubernetes cluster, each pod or container has a unique name. If you have a large number of pods or containers with unique names, it can contribute to high cardinality. This can impact monitoring and querying operations that rely on pod or container names to identify and analyze specific instances.
- Resource Identifiers: Kubernetes resources like pods, services, and deployments are assigned unique identifiers, such as UUIDs. If you have a large number of unique identifiers across your cluster, it can lead to high cardinality. This can affect the performance of operations that involve searching, grouping, or filtering based on these identifiers.
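One way to see which of these sources dominates is to count the distinct values observed for each label. The following is a minimal sketch that assumes each series is represented as a plain dictionary of labels. The namespace, deployment, and pod names are hypothetical.

```python
from collections import defaultdict


def cardinality_by_label(series):
    """Count the distinct values seen for each label name."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}


# Hypothetical label sets; the third pod name appears after a rollout.
series = [
    {"namespace": "shop", "deployment": "cart", "pod": "cart-7d9f-abcde"},
    {"namespace": "shop", "deployment": "cart", "pod": "cart-7d9f-fghij"},
    {"namespace": "shop", "deployment": "cart", "pod": "cart-5c2b-klmno"},
    {"namespace": "shop", "deployment": "checkout", "pod": "checkout-66aa-pqrst"},
]

counts = cardinality_by_label(series)

# Labels ordered from most to fewest distinct values.
worst = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(worst)  # [('pod', 4), ('deployment', 2), ('namespace', 1)]
```

In this toy data the pod name is the label that grows with every restart or rollout, while the deployment and namespace labels stay bounded.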
Challenges with Grafana and High Cardinality
Grafana faces particular challenges with high cardinality data in cloud-native environments. You may encounter the following:
- Data Source Limitations: Grafana relies on data sources to retrieve metrics and other monitoring data. Some data sources, especially traditional time-series databases, may have limitations in handling high cardinality data. These databases may struggle to efficiently store and query large volumes of unique data points, impacting Grafana's ability to retrieve and visualize the data.
- Query Performance: When dealing with high cardinality data, the performance of queries in Grafana can be affected. Queries involving high cardinality fields or complex filtering can take longer to execute and consume more system resources. As a result, query response times may increase, negatively impacting the user experience.
- Dashboard Loading Times: High cardinality data sets can lead to increased dashboard loading times in Grafana as unique time series explode. When a dashboard has multiple panels and each panel queries high cardinality data sources, the loading process can become slower. Users may experience delays when navigating between dashboards or when the browser needs to load and render a large number of data points.
- Visualization Scalability: High cardinality data can pose challenges in visualizations. When plotting charts or graphs in Grafana, a large number of unique data points can make visualizations overcrowded and hard to interpret. Scalability issues may arise when attempting to display all the data points, and the charts may become visually cluttered.
- Resource Consumption: High cardinality data requires more system resources, including CPU, memory, and storage. Grafana's infrastructure needs to be appropriately scaled to handle the increased resource demands. Inadequate resource allocation can lead to performance issues, slowdowns, or even crashes when dealing with high cardinality data.
- Templating and Variable Management: Grafana's templating and variable features allow for dynamic dashboards and filtering options. However, with high cardinality data, managing and populating variables can become complex. Generating variable options based on high cardinality fields may require additional processing power and can impact the performance of variable-dependent queries.
Manage High Cardinality in Cloud-Native Environments
To address high cardinality in cloud-native environments, several techniques can be employed:
- Tagging and metadata: Instead of treating each unique value as a separate dimension, you can use tags or metadata to categorize and aggregate related values. This reduces the overall cardinality and simplifies querying.
- Aggregation and summarization: Instead of storing every individual data point, you can aggregate and summarize the data at various levels of granularity. For example, you can compute averages, percentiles, or histograms over a time interval or a set of related dimensions.
- Data retention policies: Consider defining data retention policies to limit the duration for which high-cardinality data is stored. This allows you to focus on recent or relevant data while discarding older or less important information.
- Sampling: Instead of collecting and storing data for every instance, you can sample the telemetry data. This reduces the cardinality while still providing a representative view of the system's behavior.
- Time-series databases: Consider leveraging specialized time-series databases, such as Levitate, that are designed to handle high-cardinality data efficiently. These TSDB backends often provide optimizations for handling large numbers of time series. Levitate provides several such techniques, including streaming aggregations and a cardinality limiter.
By applying these techniques, you can manage high-cardinality challenges in cloud-native environments more effectively, enabling better monitoring, troubleshooting, and observability of your systems.
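The aggregation and summarization technique from the list above can be sketched as follows. This is a minimal illustration, assuming data points arrive as (timestamp in seconds, value) pairs. The function name, the 60-second interval, and the sample latencies are assumptions chosen for the example.

```python
from collections import defaultdict
from statistics import mean


def summarize(points, interval_s=60):
    """Collapse raw points into one summary per fixed time interval."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its interval.
        buckets[ts - ts % interval_s].append(value)
    return {
        start: {"count": len(vals), "avg": mean(vals), "max": max(vals)}
        for start, vals in sorted(buckets.items())
    }


# Hypothetical request latencies in milliseconds.
points = [(0, 100.0), (15, 120.0), (45, 110.0), (60, 300.0), (90, 100.0)]

summary = summarize(points)
print(summary[0])   # {'count': 3, 'avg': 110.0, 'max': 120.0}
print(summary[60])  # {'count': 2, 'avg': 200.0, 'max': 300.0}
```

Five raw points become two summaries. Keeping the maximum alongside the average preserves the 300 ms outlier that an average alone would hide.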
Other techniques can be applied during or after ingestion to control high cardinality.
Dropping Labels
Dropping certain labels from metrics, such as a unique identifier attached to each data point, helps reduce high cardinality. While this approach can be effective, it can also reduce the usefulness of the metrics data and make it harder to troubleshoot problems or understand performance issues.
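As a rough illustration, the sketch below removes one label from a set of samples and sums the values of the series that become identical afterwards. The sample data and the `request_id` label are hypothetical, and summing on collision is an assumption that suits counters rather than a description of any particular tool.

```python
from collections import Counter


def drop_label(samples, label):
    """Remove one label and sum the values of series that then collide."""
    merged = Counter()
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        merged[kept] += value
    return dict(merged)


# Hypothetical samples: one series per request because of request_id.
samples = [
    ({"path": "/cart", "request_id": "a1"}, 1),
    ({"path": "/cart", "request_id": "b2"}, 1),
    ({"path": "/pay", "request_id": "c3"}, 1),
]

reduced = drop_label(samples, "request_id")
print(len(samples), "->", len(reduced))  # 3 -> 2
```

In Prometheus itself, labels are usually dropped with a `labeldrop` relabeling action in `metric_relabel_configs`. The summing step here belongs to this sketch only. The trade-off described above is visible in the result: the per-request detail can no longer be recovered.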
Recording Rules
Recording rules are periodic queries executed by Prometheus, the results of which are saved as new time series. They allow users to precalculate complex and high cardinality metrics. This pre-calculation helps reduce overall metrics cardinality by computing values ahead of time. However, pre-calculation can be impossible or inefficient when the number of possible label combinations is too vast. Recording rules are also time-consuming to set up and may not be flexible enough to capture all needed information.
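The idea can be imitated in a few lines. The sketch below is not Prometheus code. It assumes a toy in-memory store keyed by (metric name, label pairs) and a hypothetical helper, `record_sum_by`, that a scheduler would call at each evaluation interval. The target name follows the `level:metric:operations` naming convention that Prometheus recommends for recorded series.

```python
from collections import defaultdict


def record_sum_by(store, source, target, by, ts):
    """Sum the latest value of each `source` series grouped by the `by`
    labels, and append the result to the store as a new `target` series."""
    totals = defaultdict(float)
    for (name, labels), points in store.items():
        if name == source and points:
            key = tuple((k, v) for k, v in labels if k in by)
            totals[key] += points[-1][1]
    # Write only after the scan so the store is not mutated mid-iteration.
    for key, total in totals.items():
        store.setdefault((target, key), []).append((ts, total))


# Toy store: (metric name, sorted label pairs) -> list of (timestamp, value).
store = {
    ("http_requests_total", (("pod", "cart-1"), ("service", "cart"))): [(0, 10.0)],
    ("http_requests_total", (("pod", "cart-2"), ("service", "cart"))): [(0, 5.0)],
    ("http_requests_total", (("pod", "pay-1"), ("service", "pay"))): [(0, 7.0)],
}

record_sum_by(store, "http_requests_total", "service:http_requests:sum",
              by={"service"}, ts=60)

print(store[("service:http_requests:sum", (("service", "cart"),))])
# [(60, 15.0)]
```

The raw per-pod series remain in the store. The gain is that dashboards and alerts can read the two precomputed per-service series instead of scanning every pod.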
Streaming Aggregations
Streaming aggregation groups and summarizes incoming data streams on the fly, in real time, and improves horizontal scalability. It is well suited to cloud-native environments, IoT data streams, and other high-volume data sources. Streaming aggregation is superior to the other approaches discussed for the following reasons:
- Streaming allows for real-time processing of metrics, optimizing performance so that metrics can be processed more efficiently.
- It reduces storage requirements and costs by summarizing data on the fly, since data aggregates can start from much lower cardinality values.
- Unlike dropping labels or recording rules, which can result in permanent value loss, streaming aggregation retains metadata and gives more specific numerical information.
- Streaming tools are also flexible, as they can be customized depending on the specific needs of the developer.
- Streaming aggregation reduces the complexity of data analysis by providing summarized information that is easier to interpret.
- Streaming aggregations are compatible with various data storage formats, making them ideal for businesses working with multiple data sources.
- Streaming aggregations enable faster detection of potential security breaches that could cause harm to the application, promoting cybersecurity best practices.
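A minimal sketch of the approach follows. It assumes samples arrive in timestamp order as (timestamp, labels, value) tuples. The `StreamingAggregator` class, the 60-second window, and the label names are hypothetical and do not represent the API of Levitate or any other product. Only the aggregated output would be written to storage.

```python
from collections import defaultdict


class StreamingAggregator:
    """Sum incoming samples per window, keeping only the chosen labels."""

    def __init__(self, keep, window_s=60):
        self.keep = frozenset(keep)
        self.window_s = window_s
        self._window_start = None
        self._sums = defaultdict(float)

    def ingest(self, ts, labels, value):
        """Add one sample. Returns the aggregates of the previous window
        when this sample opens a new one, otherwise an empty list."""
        start = ts - ts % self.window_s
        flushed = []
        if self._window_start is None:
            self._window_start = start
        elif start > self._window_start:
            flushed = self.flush()
            self._window_start = start
        key = tuple(sorted((k, v) for k, v in labels.items() if k in self.keep))
        self._sums[key] += value
        return flushed

    def flush(self):
        """Emit and reset the aggregates of the current window."""
        out = [(self._window_start, dict(key), total)
               for key, total in sorted(self._sums.items())]
        self._sums.clear()
        return out


# Hypothetical stream: the pod label is aggregated away before storage.
stream = [
    (0, {"service": "cart", "pod": "cart-1"}, 2.0),
    (10, {"service": "cart", "pod": "cart-2"}, 3.0),
    (20, {"service": "pay", "pod": "pay-1"}, 1.0),
    (65, {"service": "cart", "pod": "cart-3"}, 4.0),
]

agg = StreamingAggregator(keep={"service"})
emitted = []
for ts, labels, value in stream:
    emitted += agg.ingest(ts, labels, value)
emitted += agg.flush()

print(emitted)
# [(0, {'service': 'cart'}, 5.0), (0, {'service': 'pay'}, 1.0),
#  (60, {'service': 'cart'}, 4.0)]
```

Four raw samples across four pods become three stored points across two services. This sketch adds late samples to the current window, which a production system would need to handle explicitly.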
Conclusion
High cardinality is an inevitable challenge in cloud-native environments. It is thus essential to have a robust observability platform that can handle high cardinality metrics efficiently, without compromising storage, query speed, or performance, so that it supports all the use cases SRE and DevOps teams need to address.