High cardinality is a critical concept in data analysis. It comes up often in discussions about time series databases, data modeling, storage, and analysis, and it plays a crucial role in understanding the complexity and depth of time series data. In simple terms, high cardinality refers to a metric or attribute with a large number of distinct values or unique entities. It signifies the richness, granularity, and diversity of a dataset. However, high cardinality brings its own challenges for storage efficiency, query performance, visualization, and overall data analysis. In this post, we look at what high cardinality is and explore its implications and limitations.
What is High Cardinality
Cardinality refers to the number of elements or members in a set. It is a fundamental concept in mathematics, particularly in the field of set theory.
The concept of cardinality extends beyond sets and is used in various mathematical contexts, including functions, graphs, and other structures. It provides a way to compare and classify the "size" or "magnitude" of different mathematical objects based on the number of elements they contain.
In the context of observability, cardinality refers to the number of distinct values or unique entities that a specific attribute or field can have within a system or dataset. It is a measure of the diversity or variability of the data in that particular attribute.
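As a minimal sketch of this definition, the snippet below counts the distinct values of each attribute in a handful of records. The event fields (`method`, `status`, `user_id`) are hypothetical and only serve to contrast low and high cardinality attributes.

```python
from collections import defaultdict

def attribute_cardinality(records):
    """Return {attribute: number of distinct values} for a list of dicts."""
    distinct = defaultdict(set)
    for record in records:
        for key, value in record.items():
            distinct[key].add(value)
    return {key: len(values) for key, values in distinct.items()}

# Hypothetical request events; the field names are illustrative only.
events = [
    {"method": "GET", "status": 200, "user_id": "u-1001"},
    {"method": "GET", "status": 200, "user_id": "u-1002"},
    {"method": "POST", "status": 500, "user_id": "u-1003"},
    {"method": "GET", "status": 200, "user_id": "u-1004"},
]

cardinality = attribute_cardinality(events)
print(cardinality)  # {'method': 2, 'status': 2, 'user_id': 4}
```

Here `method` and `status` stay small no matter how many events arrive, while `user_id` grows with the number of users.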
When discussing observability, cardinality becomes relevant in determining the effectiveness of monitoring and analysis. Attributes with high cardinality, meaning a large number of distinct values, tend to provide more granularity and detail for observation. On the other hand, attributes with low cardinality, having only a few distinct values, may limit the level of insight that can be obtained.
In the context of time series data, high cardinality refers to a time series metric or attribute that exhibits a large number of distinct values or unique entities over time. It indicates a significant level of diversity or variability in the values of that specific metric across different time points.
High cardinality in time series data can occur in various scenarios. For example, consider a sensor measuring temperature readings every minute for a year. If each temperature reading is considered as a distinct value, the metric "temperature" would have a high cardinality due to the large number of unique values observed over the course of the year.
High cardinality in time series data can provide valuable insights into the temporal behavior and patterns of a system or phenomenon. By analyzing the diverse values over time, it becomes possible to detect anomalies, identify trends, observe periodic patterns, or capture seasonality in the data.
However, high cardinality in time series data can also present challenges in terms of storage, processing, and analysis. Handling a large number of distinct values over time may require efficient data structures and algorithms to maintain performance. Moreover, visualizing and interpreting high cardinality time series data can be complex, as displaying every unique value may result in cluttered or unreadable plots.
To address these challenges, it's common to aggregate or summarize high cardinality time series data into meaningful intervals or groups. Aggregation techniques such as averaging, min-max calculations, or histogram binning can help reduce the number of distinct values while preserving important patterns or trends.
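The aggregation idea can be sketched as follows, assuming per-minute samples that are rolled up into hourly average, minimum, and maximum values. The data is synthetic and the function name `aggregate_hourly` is illustrative, not part of any library.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

def aggregate_hourly(samples):
    """Group (timestamp, value) samples by hour and summarize each hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {
        hour: {"avg": mean(vals), "min": min(vals), "max": max(vals)}
        for hour, vals in sorted(buckets.items())
    }

start = datetime(2024, 1, 1, 0, 0)
# Synthetic data: one reading per minute for three hours.
samples = [(start + timedelta(minutes=i), 20.0 + (i % 60) * 0.1) for i in range(180)]

hourly = aggregate_hourly(samples)
print(len(samples), "raw points ->", len(hourly), "hourly summaries")
```

The trade-off is that individual readings are no longer recoverable, so the interval should match the finest resolution the analysis actually needs.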
What are High Cardinality Metrics?
A high cardinality metric refers to a metric or attribute that exhibits a large number of distinct values or unique entities within a dataset or system. It represents a high level of diversity or variability in the values of that particular metric.
High cardinality metrics often provide more detailed and granular information about a system or dataset. They allow for fine-grained analysis and can uncover patterns, trends, or anomalies that may not be apparent when observing metrics with lower cardinality.
For example, in a web application, a high cardinality metric could be the user agent string, which represents the web browser and operating system used by each visitor. Since there are numerous unique combinations of browsers and operating systems, the user agent string metric would have a high cardinality, with a large number of distinct values.
High cardinality metrics often come into play when dealing with user-related data, such as the user ID. A user ID is an identifier that uniquely identifies each user in a system. Because every user gets a distinct value, this attribute exhibits high cardinality, especially in large-scale systems or platforms with a large user base.
In a Kubernetes cluster, high cardinality can arise in various contexts, particularly when dealing with metrics and labels associated with the cluster's resources, workloads, and components. Here are a few examples:
1. Labels and selectors: Kubernetes uses labels and selectors to identify and group resources within the cluster. Labels are key-value pairs attached to objects such as pods, services, or nodes. If there are a large number of distinct label values or a significant number of labels associated with resources, it can result in high cardinality. This can have implications for efficient resource selection, querying, and grouping based on specific labels.
2. Pod or container names: Each pod or container within a Kubernetes cluster typically has a unique name. In large-scale deployments with numerous pods or containers, the sheer number of distinct names can lead to high cardinality. This can impact various operations, such as log aggregation, monitoring, or debugging, where filtering or analysis based on individual pod or container names is required.
3. Metrics and monitoring: Kubernetes monitoring systems often collect and analyze various metrics related to the cluster's resources, such as CPU usage, memory consumption, or network traffic. When these metrics are tagged with labels, the unique combination of a large number of metrics and distinct label values can result in high cardinality. Handling and analyzing such high cardinality metrics efficiently can be challenging, requiring careful consideration of storage, query performance, and visualization techniques.
4. Namespaces: Kubernetes namespaces provide a way to logically isolate resources within a cluster. If there are many distinct namespaces within a cluster, it can contribute to high cardinality. Operations like resource allocation, quota management, or access control can be impacted by high cardinality namespaces.
5. Custom labels and annotations: Kubernetes allows users to define custom labels and annotations for resources, providing flexibility and extensibility. However, an excessive number of custom labels or annotations, or a large number of unique values for these custom attributes, can contribute to high cardinality. This can impact querying, filtering, and resource management workflows.
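To illustrate how pod names drive cardinality, here is a small simulation that assumes every rollout replaces all pods with newly named ones. The naming scheme and the `shop` namespace are hypothetical; real pod names carry generated suffixes.

```python
def series_key(metric, labels):
    """A time series is identified by its metric name plus its full label set."""
    return (metric, frozenset(labels.items()))

def simulate_rollouts(replicas, rollouts):
    """Return the set of series seen when every rollout replaces all pods."""
    seen = set()
    for rollout in range(rollouts):
        for replica in range(replicas):
            # Hypothetical naming scheme, used only for illustration.
            labels = {"namespace": "shop", "pod": f"checkout-r{rollout}-{replica}"}
            seen.add(series_key("container_cpu_usage_seconds_total", labels))
    return seen

series = simulate_rollouts(replicas=10, rollouts=5)
print(len(series))  # 50 series, although only 10 pods run at any one time
```

Under these assumptions the series count grows with the number of rollouts, not just with the number of running pods.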
High Cardinality Limitations
While cardinality is a fundamental concept in data analysis, there can be limitations and challenges associated with high cardinality. Some of these limitations include:
1. Storage requirements: High cardinality metrics can result in a significant increase in storage requirements. Storing a large number of distinct values requires additional disk space, which can become a concern in resource-constrained environments or when dealing with a massive amount of data.
2. Performance impact: High cardinality metrics can impact system performance, especially during data ingestion, querying, and analysis. Handling a large number of unique values requires additional processing and memory resources, potentially leading to slower query response times and increased resource utilization.
3. Scalability issues: As the cardinality of metrics increases, the scalability of data processing systems may be affected. Handling a large number of distinct values can strain the capacity of the underlying infrastructure, limiting the system's ability to handle increasing workloads.
4. Query efficiency: When querying high cardinality metrics, it becomes more challenging to retrieve and analyze data efficiently. The increased number of distinct values may result in longer query evaluation times and more complex data retrieval operations, impacting real-time analysis or time-sensitive applications.
5. Visualization and analysis complexity: Visualizing high cardinality data can be challenging, especially in traditional charting tools. Displaying every unique value may lead to cluttered or unreadable visualizations. Aggregating the data into a single metric or applying sampling techniques may be required to reduce complexity and improve the interpretability of the data.
6. Indexing and metadata management: Managing indexes and metadata for high cardinality metrics can be resource-intensive. The system needs to maintain efficient data structures to support fast querying and filtering based on the distinct values of the metric.
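The indexing cost in the last point can be made concrete with a simplified inverted index. This is a sketch of the general technique, not the implementation of any particular database, and the `region` and `user_id` labels are assumptions chosen for the example.

```python
from collections import defaultdict

def build_label_index(series):
    """Map each (label, value) pair to the IDs of the series that carry it."""
    index = defaultdict(set)
    for series_id, labels in enumerate(series):
        for pair in labels.items():
            index[pair].add(series_id)
    return index

def select(index, **matchers):
    """Return IDs of series matching every label=value matcher."""
    postings = [index.get(pair, set()) for pair in matchers.items()]
    return set.intersection(*postings) if postings else set()

# 2 regions x 1000 users = 2000 series.
series = [{"region": r, "user_id": f"u{u}"} for r in ("eu", "us") for u in range(1000)]
index = build_label_index(series)

print(len(index))  # 1002 entries: 2 for region, 1000 for user_id
print(select(index, region="eu", user_id="u7"))  # {7}
```

The index needs one entry per distinct label value, so a single high cardinality label such as `user_id` dominates its size.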
To overcome these limitations, it's important to carefully evaluate the necessity of high cardinality metrics, consider appropriate data modeling and aggregation strategies, and optimize the observability platform’s resources and configurations accordingly. It may also be beneficial to explore specialized databases or techniques designed to handle high cardinality data, such as time series databases like Levitate or distributed systems with built-in cardinality management capabilities.
Prometheus is an open-source monitoring and alerting system commonly used for collecting and analyzing time series data. One of the challenges that Prometheus faces, especially in larger deployments, is handling high cardinality metrics efficiently.
High cardinality in Prometheus refers to time series metrics that have a large number of distinct labels or label value combinations. Labels in Prometheus are key-value pairs associated with each unique time series, which allow for dimensional data modeling and querying. However, when the number of unique label value combinations becomes large, it can strain the resources and performance of the Prometheus server.
High cardinality metrics can lead to increased storage requirements, longer query evaluation times, and increased memory usage. As the number of distinct label values grows, the Prometheus server needs to maintain indexes and metadata for efficient querying, which can impact scalability and resource utilization.
To handle high cardinality in Prometheus effectively, consider the following best practices:
1. Select labels carefully: Use labels judiciously and avoid creating labels with a large number of distinct values unless necessary. Evaluate the cardinality implications before adding new labels to time series.
2. Use relabeling and aggregation: Prometheus provides relabeling and aggregation mechanisms to reduce the cardinality of metrics. By applying appropriate relabeling rules and aggregating data at collection time, you can reduce the number of distinct label value combinations.
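3. Rely on chunking and compression: Prometheus's TSDB already stores samples in compressed chunks by default, which reduces the disk space required per sample. Keep in mind that this compression applies to sample data. It does not reduce the number of series that must be indexed, so it complements rather than replaces cardinality reduction.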
4. Use metric naming conventions: Employ a consistent and meaningful naming convention for metric names. Well-defined naming conventions make it easier to manage and analyze high cardinality metrics.
5. Leverage downsampling: If the high cardinality metric does not require high-resolution data, consider downsampling the data to reduce storage and processing overhead. Downsampled metrics can be stored in a separate TSDB or aggregated over longer time intervals.
6. Monitor and optimize: Continuously monitor the resource utilization of Prometheus, including memory usage, disk space, and query latencies. Optimize the Prometheus config based on the observed performance characteristics and cardinality patterns of your specific use case.
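The effect of dropping a label and aggregating (practice 2 above) can be sketched in plain Python. This models the idea only and is not Prometheus code; it is similar in spirit to a PromQL `sum without (user_id) (...)` aggregation. The `service` and `user_id` labels and the sample values are made up.

```python
from collections import defaultdict

def aggregate_without(samples, drop):
    """Sum sample values after removing the labels named in `drop`."""
    totals = defaultdict(float)
    for labels, value in samples:
        kept = frozenset((k, v) for k, v in labels.items() if k not in drop)
        totals[kept] += value
    # Convert keys to sorted tuples so the result is easy to read and compare.
    return {tuple(sorted(key)): total for key, total in totals.items()}

# Hypothetical input: one series per (service, user_id) combination.
samples = [
    ({"service": "api", "user_id": f"u{i}"}, 1.0) for i in range(500)
] + [
    ({"service": "web", "user_id": f"u{i}"}, 2.0) for i in range(300)
]

reduced = aggregate_without(samples, drop={"user_id"})
print(len(samples), "series ->", len(reduced), "series")  # 800 series -> 2 series
```

Per-user detail is lost in the aggregated result, which is the intended trade-off when that detail is not needed for alerting or dashboards.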
Cardinality Spikes
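Cardinality grows multiplicatively with the number of dimensions or attributes, because every new dimension multiplies the number of possible combinations. Let's consider an example where the combined cardinality reaches the millions even though no single dimension is that large.
Suppose we are monitoring a cloud-native platform that consists of the following dimensions:
1. Region: 4 distinct values
2. Service: 1,000 distinct values
3. Environment: 3 distinct values
4. Instance: 10,000 distinct values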
In this example, we have four dimensions with varying numbers of distinct values. The Region dimension has four distinct values, the Service dimension has 1000 distinct values, the Environment dimension has three distinct values, and the Instance dimension has 10,000 distinct values.
To calculate the multiplicative cardinality and showcase the cardinality growth, we multiply the number of distinct values in each dimension:
4 (Region) x 1000 (Service) x 3 (Environment) x 10,000 (Instance) = 12,000,000
As a result, the multiplicative cardinality of this example is 12,000,000. This means that there are 12 million unique combinations or tuples that can be formed by taking one value from each dimension. Each combination represents a specific context or configuration within the observability data.
The growth of cardinality into the millions in this example illustrates how cardinality spikes occur.
High Cardinality Metrics in Grafana
High cardinality metrics can present several challenges when used with Grafana:
1. Query performance: High cardinality metrics can impact the performance of data queries in Grafana. When executing queries that involve high cardinality dimensions or involve filtering based on individual values, query response times may increase due to the larger number of unique values to process. The increased computational load can strain the data source and result in slower query execution.
2. Visualization complexity: High cardinality metrics can lead to complex and cluttered visualizations in Grafana. Displaying every distinct value directly on the chart may overwhelm the visualization with too many data points, making it difficult to interpret and analyze. Simplification techniques such as data aggregation, grouping, or sampling may be necessary to create more readable and informative visualizations.
3. Resource utilization: High cardinality metrics can strain memory and processing power in both Grafana and the underlying data source. Querying and rendering a large number of distinct values requires additional resources, potentially affecting the scalability and performance of dashboards. Adequate resource allocation and optimization strategies may be required to handle the increased demands.
4. Dashboard design considerations: Designing dashboards that effectively present high cardinality metrics can be challenging. It's important to carefully select which dimensions and values to display, avoiding overwhelming users with excessive options. Using variables and templating features in Grafana can help navigate and filter the data, enabling more focused analysis without overwhelming the user interface.
5. Alerting and anomaly detection: Defining alerts or anomaly detection rules for high cardinality metrics may require careful consideration. Setting up alerts at the individual value level may not be practical due to the large number of unique values. Instead, aggregating data or defining alerts based on summarized metrics, patterns, or statistical thresholds can provide more meaningful insights and reduce false positives.
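One common simplification for the visualization problem is to plot only the largest contributors and fold the rest into a single bucket. The sketch below shows that grouping step in plain Python. It is independent of Grafana, and the customer names and request counts are synthetic.

```python
def top_n_with_other(totals, n):
    """Keep the n largest entries and fold the rest into one 'other' bucket."""
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    result = dict(ranked[:n])
    remainder = sum(value for _, value in ranked[n:])
    if remainder:
        # Assumes no real key is literally named "other".
        result["other"] = remainder
    return result

# Synthetic request counts for 100 hypothetical customers.
requests_by_customer = {f"customer-{i}": i for i in range(1, 101)}

chart = top_n_with_other(requests_by_customer, n=5)
print(chart)
```

The chart then shows six series instead of one hundred, while the totals still add up.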
Retention Strategies for High Cardinality Metrics
Retention refers to the duration for which data is stored and retained in a system. Retention policies determine how long data is kept before it is discarded or archived. Retention periods are typically defined based on business requirements, regulatory compliance, and storage considerations. Longer retention periods require more storage resources and may impact query performance, while shorter retention periods may result in data loss and limited historical analysis.
High cardinality metrics can influence retention decisions. Storing high cardinality metrics for a longer retention period can consume significant storage resources due to the large number of distinct values. It's important to consider the trade-off between the value of retaining detailed data for analysis and the associated costs in terms of storage and query performance.
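A rough back-of-the-envelope estimate helps frame this trade-off. The sketch below assumes a fixed scrape interval and a placeholder of 2 bytes per compressed sample. Real figures vary by system and workload, and index overhead and series churn are ignored.

```python
def estimate_storage_bytes(active_series, scrape_interval_s, retention_days, bytes_per_sample):
    """Estimate sample storage for a retention window (index overhead ignored)."""
    samples_per_series = retention_days * 86_400 // scrape_interval_s
    return active_series * samples_per_series * bytes_per_sample

# Assumed inputs: 15-second scrape interval, 30-day retention, 2 bytes per sample.
low = estimate_storage_bytes(10_000, 15, 30, 2)
high = estimate_storage_bytes(12_000_000, 15, 30, 2)

print(f"{low / 1e9:.2f} GB")    # 3.46 GB
print(f"{high / 1e12:.2f} TB")  # 4.15 TB
```

Under these assumptions storage scales linearly with both the series count and the retention period, so halving either one halves the estimate.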
Conclusion
In conclusion, high cardinality is a critical concept in observability, particularly in the realm of time series data. It refers to metrics or attributes with a large number of distinct values or unique entities. High cardinality provides richness, granularity, and diversity in datasets, enabling detailed analysis and uncovering patterns and anomalies. However, high cardinality also presents challenges in terms of storage efficiency, query performance, and visualization.
In subsequent posts, we will discuss how to manage high cardinality in more detail. Stay tuned.
Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.