High cardinality is a key concept in data analysis, especially when dealing with time series data and complex datasets. It refers to metrics or attributes that have a wide range of unique values, often indicating a high level of detail and variation within the data.
While high cardinality can provide valuable insights, it also presents challenges around storage, query speed, and data visualization.
In this post, we’ll explore what high cardinality means, its impact on data systems, and the considerations it brings for efficient analysis.
Definition of High Cardinality
High Cardinality: A measure of the number of elements in a set or the uniqueness of data values in a data set. In observability, it refers to metrics or attributes with numerous distinct values, providing detailed insights but posing challenges in data management and analysis.
High cardinality metrics are metrics or attributes within a dataset that take on a large number of distinct values. These metrics offer detailed, granular insights, often revealing patterns, trends, or anomalies that lower cardinality metrics would miss.
For example, in a web application, a high cardinality metric could be the user agent string. Since there are countless unique combinations of browsers and operating systems, this metric would have a large number of distinct values, making it highly diverse. Similarly, a user ID in a system with a large user base is another common example of high cardinality.
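To make the idea concrete, counting distinct values per column is enough to separate high from low cardinality. Here is a minimal sketch with pandas, using a made-up sample of request records (the column names and values are hypothetical):

```python
import pandas as pd

# A tiny, made-up sample of web request records.
requests_df = pd.DataFrame(
    {
        "user_id": ["u-1001", "u-1002", "u-1003", "u-1001"],
        "user_agent": [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 13_2)",
            "Mozilla/5.0 (X11; Linux x86_64)",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        ],
        "status": ["200", "200", "500", "200"],
    }
)

# nunique() counts distinct values per column: high for user_id and
# user_agent, low for status.
print(requests_df.nunique())
```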
What is High Cardinality in Time Series Data?
High cardinality in time series data refers to a metric or attribute that exhibits a large number of distinct values or unique entities over time. It indicates significant diversity or variability in the values of a specific metric across different time points.
Key Points:
Reflects diverse or variable values over time
Provides detailed insights into temporal behavior
Can present challenges in data management and analysis
Example:
Consider a temperature sensor recording measurements every minute for a year:
Total readings: 525,600 (365 days * 24 hours * 60 minutes)
Each unique temperature value contributes to high cardinality
The metric "temperature" has high cardinality due to numerous unique values
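A minimal sketch of this scenario, using simulated readings rather than a real sensor, shows how the counts line up:

```python
import random

import pandas as pd

# A year of minute-level readings from one hypothetical temperature sensor.
timestamps = pd.date_range("2024-01-01", periods=365 * 24 * 60, freq="min")
readings = pd.Series(
    [round(random.uniform(15.0, 30.0), 2) for _ in range(len(timestamps))],
    index=timestamps,
)

print("total readings:", len(readings))        # 525,600
print("distinct values:", readings.nunique())  # cardinality of the value set
```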
Benefits:
Detailed insights into system behavior
Ability to detect subtle anomalies
Identification of precise trends and patterns
Enhanced capability for fine-grained analysis
Challenges:
Increased storage requirements
Complex data processing needs
Potential performance issues in analysis
Difficulties in data visualization
Solutions for Managing High Cardinality Time Series Data
Data aggregation techniques (a minimal sketch follows this list):
Averaging
Min-max calculations
Histogram binning
Efficient data structures and algorithms
Specialized time series databases
Advanced visualization tools
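As referenced in the aggregation item above, here is a minimal sketch of averaging, min-max calculation, and histogram binning with pandas and NumPy; the per-second latency series and the bucket boundaries are made up for illustration:

```python
import numpy as np
import pandas as pd

# One day of hypothetical per-second latency measurements.
timestamps = pd.date_range("2024-01-01", periods=86_400, freq="s")
raw = pd.Series(
    np.random.lognormal(mean=3.0, sigma=0.5, size=len(timestamps)),
    index=timestamps,
)

# Averaging and min-max calculations: roll per-second points up to one row per minute.
per_minute = raw.resample("1min").agg(["mean", "min", "max"])

# Histogram binning: keep bucket counts instead of every individual value.
bucket_counts, _ = np.histogram(raw, bins=[0, 10, 25, 50, 100, 250, 1_000])

print(len(raw), "raw points ->", len(per_minute), "aggregated rows and", len(bucket_counts), "buckets")
```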
High Cardinality in Kubernetes
Kubernetes clusters can encounter high cardinality in various contexts, especially when dealing with metrics, labels, and workloads.
Here are a few examples:
Labels and Selectors: Kubernetes uses labels to group resources. When many distinct label values are applied to resources like pods or services, this can result in high cardinality, affecting how resources are selected and queried.
Pod or Container Names: In clusters with many unique pods or containers, the distinct names assigned to each can create high cardinality, especially during monitoring, logging, or debugging tasks.
Metrics and Monitoring: Kubernetes monitoring tools collect a vast array of metrics (CPU usage, memory, network traffic). When these metrics are tagged with unique labels, high cardinality can arise, impacting query performance and data visualization.
Managing high cardinality in Kubernetes requires careful handling to ensure efficient performance, storage management, and effective query capabilities. A useful first step is simply measuring which label keys carry the most distinct values, as in the sketch below.
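A minimal sketch, assuming the official `kubernetes` Python client and access to a cluster via a local kubeconfig:

```python
from collections import defaultdict

from kubernetes import client, config

# Assumes a reachable cluster and a local kubeconfig; inside a pod,
# config.load_incluster_config() would be used instead.
config.load_kube_config()
v1 = client.CoreV1Api()

label_values = defaultdict(set)
for pod in v1.list_pod_for_all_namespaces(watch=False).items:
    for key, value in (pod.metadata.labels or {}).items():
        label_values[key].add(value)

# Label keys with the most distinct values are the main cardinality drivers.
for key, values in sorted(label_values.items(), key=lambda kv: -len(kv[1])):
    print(f"{key}: {len(values)} distinct values")
```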
High Cardinality Limitations
While high cardinality metrics provide valuable granular insights, they also present certain limitations and challenges:
Storage Requirements: High cardinality metrics can significantly increase storage demands. Storing a vast number of unique values requires more disk space, which may become an issue in environments with limited resources or massive datasets.
Performance Impact: Managing and analyzing high cardinality data can slow down performance, particularly during data ingestion and querying. The additional processing power and memory required can result in slower response times and higher resource utilization.
Scalability Issues: As cardinality increases, scalability can become a concern. The system may struggle to handle the rising number of distinct values, putting a strain on infrastructure and limiting its ability to manage larger workloads.
Query Efficiency: Retrieving data from high cardinality metrics can lead to longer query times. The complexity involved in handling a large number of distinct values can slow down analysis, especially in time-sensitive applications.
Visualization and Analysis Complexity: Traditional visualization tools may struggle to present high cardinality data effectively. Displaying too many unique values can result in cluttered, hard-to-read charts, necessitating aggregation or sampling techniques to simplify the data.
Indexing and Metadata Management: Maintaining indexes and metadata for high cardinality metrics can be resource-intensive. Efficient data structures are needed to support fast querying, which adds to the system’s overhead.
To mitigate these challenges, it's essential to evaluate the necessity of tracking high cardinality metrics.
Consider employing data aggregation strategies, optimizing system configurations, or leveraging specialized databases like time series databases (e.g. Last9) or distributed systems that offer better management for high cardinality data.
High Cardinality in Prometheus
Prometheus, a widely used open-source monitoring and alerting tool, often faces challenges when dealing with high cardinality metrics, especially in large-scale deployments.
In Prometheus, high cardinality refers to the large number of unique label combinations associated with time series metrics. Labels, which are key-value pairs that describe the metrics, allow for powerful querying and dimensional analysis.
However, when these labels result in a vast number of distinct combinations, it can put a strain on Prometheus resources, leading to increased memory usage, slower queries, and higher storage demands.
To manage high cardinality effectively in Prometheus, here are a few strategies:
Careful Label Selection: Avoid using too many labels or labels with an excessive number of distinct values unless necessary. Evaluate the potential impact on cardinality before introducing new labels.
Relabeling and Aggregation: Utilize Prometheus’s relabeling and aggregation capabilities to reduce cardinality. Relabeling allows you to modify or drop labels, while aggregation helps combine similar data points, minimizing distinct time series.
Chunking and Compression: Configure Prometheus to take advantage of chunking and compression to optimize storage. This reduces the disk space required for high cardinality metrics.
Metric Naming Conventions: Implement clear and consistent naming conventions for your metrics. Meaningful names make it easier to manage high-cardinality data and ensure better organization.
Downsampling: When high-resolution data isn’t necessary, downsampling metrics can help. Aggregating data over longer intervals or storing downsampled data in a separate database can reduce storage and query overhead.
Monitoring and Optimization: Continuously monitor the performance of your Prometheus server, including memory, disk usage, and query latencies. Adjust configurations based on performance patterns and resource utilization to keep your system running smoothly.
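To support the monitoring point above, one option is to read the cardinality statistics that recent Prometheus versions expose on the TSDB status API. A minimal sketch, assuming a server reachable at a hypothetical localhost address:

```python
import requests

# Hypothetical address of a Prometheus server.
PROM_URL = "http://localhost:9090"

# The TSDB status endpoint reports cardinality statistics for the head block.
resp = requests.get(f"{PROM_URL}/api/v1/status/tsdb", timeout=10)
resp.raise_for_status()
stats = resp.json()["data"]

print("Metrics with the most series:")
for entry in stats["seriesCountByMetricName"]:
    print(f'  {entry["name"]}: {entry["value"]} series')

print("Label names with the most distinct values:")
for entry in stats["labelValueCountByLabelName"]:
    print(f'  {entry["name"]}: {entry["value"]} values')
```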
Cardinality Spikes
Multiplicative cardinality shows how quickly overall cardinality grows when several dimensions or attributes are combined. Let's consider an example where the combined cardinality climbs past one hundred million.
Suppose we are monitoring a cloud-native platform that consists of the following dimensions:
Region: 4 distinct values
Service: 1,000 distinct values
Environment: 3 distinct values
Instance: 10,000 distinct values
To calculate the multiplicative cardinality and showcase the cardinality growth, we multiply the number of distinct values in each dimension:
4 (Region) x 1,000 (Service) x 3 (Environment) x 10,000 (Instance) = 120,000,000
As a result, the multiplicative cardinality of this example is 120,000,000. This means that there are 120 million unique combinations or tuples that can be formed by taking one value from each dimension. Each combination represents a specific context or configuration within the observability data.
Growth to over a hundred million potential combinations in this example shows how quickly cardinality spikes can arise once dimensions multiply.
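The arithmetic is straightforward to reproduce; the dictionary below simply restates the dimension counts from the example:

```python
import math

# Distinct values per dimension, as in the example above.
dimensions = {"region": 4, "service": 1_000, "environment": 3, "instance": 10_000}

worst_case_series = math.prod(dimensions.values())
print(f"worst-case unique combinations: {worst_case_series:,}")  # 120,000,000
```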
High Cardinality Metrics in Grafana
When dealing with high cardinality metrics in Grafana, certain challenges can arise, affecting performance, visualization, and resource usage:
Query Performance: High cardinality metrics can slow down queries in Grafana, as filtering and processing numerous distinct values puts a higher load on the data source, increasing response times.
Visualization Complexity: Displaying a large number of distinct values directly on Grafana charts can lead to cluttered, hard-to-read visuals. Simplifying visualizations through data aggregation, grouping, or sampling can help make the charts more interpretable.
Resource Utilization: Storing and querying high cardinality metrics demands more memory and processing power, which may affect the performance and scalability of Grafana. Proper resource management and optimization strategies are key to handling these increased requirements.
Dashboard Design Considerations: When designing dashboards for high cardinality metrics, it's crucial to avoid overwhelming users. Utilizing variables, filters, and Grafana's templating features can help streamline the data and offer a more focused analysis.
Alerting and Anomaly Detection: Setting up meaningful alerts for high cardinality metrics can be tricky. Instead of alerting on individual values, it’s often more effective to aggregate data and establish alerts based on patterns, summaries, or statistical thresholds to reduce false positives.
Managing high cardinality metrics in Grafana requires careful planning and optimization to maintain performance and provide clear, actionable insights.
Retention Strategies for High Cardinality Metrics
Retention policies define how long data is stored before being discarded or archived, balancing business needs with storage and performance considerations.
High cardinality metrics, with their large number of distinct values, can heavily influence retention decisions due to their resource-intensive nature.
Here are some strategies to manage high cardinality metrics:
Shorten Retention Periods: Reduce the retention period for high cardinality metrics to minimize storage costs. Retain only the most recent or relevant data to support real-time analysis while limiting historical data retention.
Aggregate Data: Implement data aggregation to retain summarized or averaged metrics over time, reducing the volume of high cardinality data without losing meaningful insights. This helps retain historical trends without overwhelming storage.
Downsampling: Use downsampling techniques to store less granular data for older periods (see the sketch after this list). This reduces the cardinality while still preserving broader trends for long-term analysis.
Separate Retention Policies: Apply different retention policies for high cardinality and low cardinality metrics. Resource-heavy high-cardinality metrics may need shorter retention, while low-cardinality metrics can be stored for longer periods to support long-term trend analysis.
Optimize Storage Systems: Leverage storage systems or databases optimized for high cardinality data, such as time series databases that use compression and efficient indexing to handle large datasets.
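As a rough illustration of the downsampling strategy referenced above, the sketch below keeps full resolution for the most recent seven days and hourly averages beyond that; the metric, window, and interval are all assumptions:

```python
import numpy as np
import pandas as pd

# Thirty days of hypothetical per-minute samples for one metric.
timestamps = pd.date_range("2024-01-01", periods=30 * 24 * 60, freq="min")
samples = pd.Series(np.random.random(len(timestamps)), index=timestamps)

cutoff = timestamps.max() - pd.Timedelta(days=7)

recent = samples[samples.index > cutoff]                              # full resolution
historical = samples[samples.index <= cutoff].resample("1h").mean()   # hourly averages

print(len(samples), "raw points ->", len(recent) + len(historical), "retained points")
```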
Conclusion
High cardinality is a critical concept in observability, particularly in the realm of time series data. It refers to metrics or attributes with a large number of distinct values or unique entities. High cardinality provides richness, granularity, and diversity in datasets, enabling detailed analysis and uncovering patterns and anomalies. However, it also presents challenges in terms of storage efficiency, query performance, and visualization.
In the subsequent posts, we will talk about how to manage high cardinality in more detail. Stay tuned.
FAQs
What is the meaning of cardinality? Cardinality refers to the number of unique values in a dataset column or attribute. In data modeling, it's a crucial concept for understanding the structure and complexity of data.
What is high cardinality and low cardinality?
High cardinality: A column or attribute with a large number of unique values, such as timestamps, user IDs, or IP addresses.
Low cardinality: A column with few unique values, like boolean fields (true/false) or status categories (active/inactive/pending).
What is an example of high cardinality? An industrial IoT system collecting sensor data every second would have high cardinality in its timestamp column, with millions of distinct values over time.
Is high cardinality good or bad? High cardinality isn't inherently good or bad; it depends on the use case:
Advantages: Provides detailed, granular data for in-depth analysis.
Challenges: Can impact database performance, query optimization, and storage requirements.
What is a high degree of cardinality? A high degree of cardinality refers to columns or attributes with thousands or millions of unique values. The exact threshold varies based on the context and data volume.
How does high cardinality impact database performance? High cardinality can affect database performance by:
Increasing storage requirements
Slowing query execution times
Requiring more complex indexing
Complicating data partitioning and schema design
How does high cardinality affect query optimization? High cardinality can impact query optimization by:
Increasing the time needed for table scans
Complicating the creation of efficient indexes
Making it harder for query optimizers to choose the best execution plan
Potentially leading to more memory usage for query processing
What are the best techniques for optimizing data transfer with high cardinality data? For high cardinality data, optimization techniques include:
Using columnar storage formats such as Parquet (see the sketch after this list)
Implementing data compression
Using time-series databases like InfluxDB or TimescaleDB
Employing partitioning strategies
Using effective data modeling techniques
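As a minimal sketch of the columnar-storage and compression points above (assuming pandas with pyarrow or fastparquet installed; the event table is made up):

```python
import pandas as pd

# A made-up, high cardinality event table.
events = pd.DataFrame(
    {
        "timestamp": pd.date_range("2024-01-01", periods=100_000, freq="s"),
        "user_id": [f"u-{i % 25_000}" for i in range(100_000)],
        "latency_ms": range(100_000),
    }
)

# Columnar format plus compression keeps transfer and storage compact.
events.to_parquet("events.parquet", compression="zstd", index=False)
```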
Do you need every unique value you're storing? Not always. Strategies to manage this include:
Data aggregation for historical data
Downsampling high-frequency time series data
Using a separate column for less granular grouping
Implementing a retention policy for time-sensitive data
Why is understanding high-cardinality data important? Understanding high-cardinality data is crucial for:
Making informed data modeling decisions
Optimizing query performance
Managing storage requirements and costs
Enabling granular analysis
Choosing appropriate database technologies and monitoring systems
Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.