
Feb 14th, ‘25 / 8 min read

High Cardinality Explained: The Basics Without the Jargon

Learn about high cardinality—when data has many unique values like user IDs or timestamps. Essential for analytics, but tricky to scale efficiently!


What is Cardinality?

Cardinality refers to the number of unique values in a dataset column. A column with many distinct values—like a user ID or timestamp—has high cardinality, while a column with limited distinct values—like a boolean flag (true/false) or a category with a few possible options—has low cardinality.

For example, consider a database of an e-commerce platform:

  • A column for order status (e.g., 'Pending', 'Shipped', 'Delivered', 'Cancelled') has low cardinality since there are only a few possible values.
  • A column for order IDs, where each order has a unique identifier like 'ORD12345678', has high cardinality since every row has a different value.
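
To see this concretely, here is a minimal query sketch, assuming a hypothetical orders table with order_status and order_id columns:

```sql
-- Illustrative only: compares distinct counts for a low- and a high-cardinality column.
SELECT
    COUNT(*)                     AS total_rows,
    COUNT(DISTINCT order_status) AS status_cardinality,   -- typically a handful of values
    COUNT(DISTINCT order_id)     AS order_id_cardinality  -- roughly one per row
FROM orders;
```

If a column's distinct count grows in step with the row count, it's high cardinality; if it stays flat as the table grows, it's low cardinality.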

Low vs. High Cardinality

Low Cardinality

  • Definition: A column with fewer unique values.
  • Examples:
    • Gender: Male, Female
    • HTTP Status Codes: 200, 404, 500
    • Subscription Tiers: Free, Basic, Premium
  • Why it’s easy to handle: These values repeat often, making them efficient to index and compress, which speeds up searches and reduces storage costs.
💡
To efficiently handle high-cardinality data in observability, a robust telemetry data platform is essential—learn more about it here.

High Cardinality

  • Definition: A column with many unique values.
  • Examples:
    • User IDs in a global SaaS platform
    • IP addresses in server logs
    • Precise timestamps in real-time monitoring
  • Why it’s challenging: High-cardinality fields require more storage and make indexing, searching, and aggregations slower. Without optimization, they can bog down databases and monitoring tools.

High Cardinality: Pros and Cons

High cardinality plays a critical role in data modeling and impacts how efficiently data is stored, retrieved, and analyzed.

While high cardinality enables powerful analytics and personalization, it also presents challenges in terms of performance, storage, and observability, particularly in deployments handling large-scale time series data.

Benefits of High Cardinality

  1. Enables detailed and granular analytics
    • High cardinality datasets allow businesses to perform fine-grained analysis, such as tracking individual user behavior or monitoring unique events over time.
    • In dashboards used for observability, high cardinality data helps uncover hidden patterns by slicing data across multiple dimensions.
  2. Allows for better personalization and segmentation
    • Businesses can use high cardinality datasets to create highly specific customer segments, improving marketing, recommendation systems, and fraud detection.
    • Dimensional data in customer analytics ensures that each unique customer interaction is considered when making real-time decisions.
  3. Essential for tracking individual entities in large-scale applications
    • Cloud infrastructure best practices emphasize that tracking users, sessions, or transactions at scale requires high cardinality datasets.
    • Systems monitoring high cardinality metrics, such as user IDs or unique request patterns, can provide deeper insights into performance and security.
💡
High-cardinality data can make distributed tracing complex and resource-intensive—learn more about the challenges of distributed tracing here.

Drawbacks of High Cardinality

  1. Slower query performance
    • High cardinality columns can degrade query speed, as indexing and searching through massive amounts of distinct values is resource-intensive.
    • In SQL databases, queries on high cardinality fields often require additional indexing strategies or partitioning to optimize performance.
  2. Increased storage costs
    • More unique values mean larger indexes, which can increase storage requirements significantly, particularly in time series data used for logging and monitoring.
    • Data structures used to store high cardinality data, such as inverted indexes or hash maps, require additional space and maintenance overhead.
  3. Complexity in observability
    • Monitoring systems like Prometheus struggle with high cardinality metrics: too many unique label values increase memory consumption and slow down dashboard rendering.
    • High cardinality in time series data makes it difficult to aggregate and visualize trends efficiently in real-time monitoring systems.

Impact on Data Systems

  1. SQL Databases
    • High cardinality in relational databases can cause performance issues, requiring advanced data modeling techniques like partitioning, indexing, or materialized views.
    • Dimensional data structures help optimize queries by reducing unnecessary lookups across large datasets.
  2. Time Series Databases (TSDBs)
    • Observability tools that rely on time series data, such as Prometheus and InfluxDB, struggle when labels have excessive cardinality, leading to performance bottlenecks.
    • Techniques like downsampling or pre-aggregated views help mitigate high cardinality issues in time series monitoring.
💡
Managing high-cardinality data is easier with a single pane of glass approach—learn how it works here.

How to Measure Cardinality: High Degree Explained

When we talk about a high degree of cardinality, we’re referring to datasets with an extremely large number of unique values. But how do we measure it?

Ways to Measure Cardinality

  1. Counting Unique Values
    • The simplest way: Use SQL’s COUNT(DISTINCT column_name) to see how many unique values exist.
    • Example: SELECT COUNT(DISTINCT user_id) FROM users;
  2. Cardinality Estimation with Approximate Algorithms
    • HyperLogLog: A probabilistic algorithm that estimates the number of unique values in large datasets without storing every distinct value (see the sketch after this list).
    • Bloom Filters: Helps check if a value exists before querying a dataset, reducing unnecessary lookups.
  3. Database Index Inspection
    • Many databases provide ways to check cardinality through system views or statistics:
    • In PostgreSQL: SELECT attname, n_distinct FROM pg_stats WHERE tablename = 'your_table'; returns the planner's estimated distinct count for each column.
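
For very large tables, exact distinct counts get expensive. Here's a hedged sketch of the approximate route, assuming PostgreSQL with the third-party postgresql-hll extension installed (table and column names are illustrative):

```sql
-- Exact: accurate, but must track every distinct value while scanning.
SELECT COUNT(DISTINCT user_id) FROM events;

-- Approximate: builds a small HyperLogLog sketch instead.
-- Requires the postgresql-hll extension; use hll_hash_bigint for numeric IDs.
SELECT hll_cardinality(hll_add_agg(hll_hash_text(user_id))) AS approx_unique_users
FROM events;
```

The estimate is typically within a few percent of the true count while using only kilobytes of memory, which is why observability backends lean on sketches when label cardinality explodes.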

Examples of High Cardinality Data

  • User IDs in a global application: Every user has a unique identifier.
  • Timestamps in logs: Each log entry has a different timestamp, leading to high cardinality.
  • UUIDs and session tokens: These are almost always unique, giving them extremely high cardinality.

Role of Cardinality in Observability and Metrics

  • Metric labels in Prometheus: Labels with high cardinality (e.g., user_id, request_id) can slow down queries and increase resource consumption.
  • OpenTelemetry and tracing: Spans with high cardinality tags can explode in volume, making trace aggregation difficult.
💡
Handling high-cardinality data effectively is key to unified observability—learn more about its impact here.

3 Important Cardinality Optimization Strategies

High cardinality can strain databases, increase storage costs, and slow down queries. Optimizing cardinality helps maintain performance, reduce overhead, and improve observability.

Here’s a detailed breakdown of key strategies:

1. Aggregation

What it is

Aggregation reduces the number of unique values by grouping similar data points. Instead of storing highly granular information, aggregated data provides summarized insights.

How it works

  • Time-based aggregation: Instead of storing raw logs with millisecond precision, aggregate them into minute- or hour-level summaries.
    • Metric aggregation: In observability, logs and metrics can be rolled up using statistical functions like SUM, AVG, MIN/MAX, and distinct counts to reduce dataset size while retaining valuable insights.
  • Business logic-based aggregation: For transactional systems, grouping sales data by region or product category instead of storing every transaction individually reduces high cardinality issues.

Example

Imagine a dataset tracking website visitors, logging each page view with timestamps. Instead of storing every timestamp, aggregate visits by the hour and store just the total visit count for each time window.
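
A minimal sketch of that rollup, assuming a hypothetical page_views table with viewed_at and visitor_id columns (PostgreSQL syntax):

```sql
-- Collapse individual page views into one summary row per hour.
SELECT
    date_trunc('hour', viewed_at) AS hour_bucket,
    COUNT(*)                      AS total_visits,
    COUNT(DISTINCT visitor_id)    AS unique_visitors
FROM page_views
GROUP BY 1
ORDER BY 1;
```

Each hour now contributes a single row instead of thousands, and the raw events can be expired or moved to cheaper storage once the rollup exists.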

2. Sampling

What it is

Sampling selectively drops data points while retaining enough information to make accurate inferences. This technique is useful when high cardinality data is overwhelming but exact precision isn’t necessary.

How it works

  • Random sampling: A fraction of the data is retained while discarding the rest (e.g., logging only 10% of non-critical API requests).
  • Adaptive sampling: More important data points (like errors or anomalies) are retained at a higher rate, while normal events are logged less frequently.
  • Reservoir sampling: A fixed-size sample is maintained from a data stream, ensuring a representative subset is retained.

Example

A monitoring system tracking application requests might collect logs for every user interaction. Instead of storing every log entry, it could keep logs for every 1 in 100 successful requests but retain all errors to maintain system health insights.
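
A hedged sketch of that policy, assuming hypothetical raw_requests and sampled_requests tables with an HTTP-style status_code column (PostgreSQL syntax):

```sql
-- Keep every error, but only about 1 in 100 successful requests.
INSERT INTO sampled_requests
SELECT *
FROM raw_requests
WHERE status_code >= 500     -- always retain errors
   OR random() < 0.01;       -- retain roughly 1% of everything else
```

In practice the sampling decision usually happens in the collection pipeline (agent or collector) rather than in the database, but the principle is the same: bias retention toward the events you can't afford to lose.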

💡
High-cardinality data is a key challenge in LLM observability. Discover how to handle it efficiently here.

3. Partitioning and Indexing

What it is

Partitioning and indexing optimize how data is stored and retrieved, making high-cardinality datasets more manageable.

Partitioning Strategies

  • Time-based partitioning: Splitting logs or metrics into daily, weekly, or monthly partitions ensures that queries only scan relevant partitions, improving retrieval speed.
  • Key-based partitioning: Data can be split based on user ID, region, or category to avoid scanning an entire dataset.
  • Hybrid partitioning: Combining time-based and key-based partitioning improves efficiency in large-scale systems.

Indexing Strategies

  • Composite indexes: Combining multiple columns into a single index reduces the need for multiple scans.
  • Partial indexes: Indexing only frequently queried data instead of the full dataset minimizes storage overhead.
  • Bloom filters: Used in distributed systems, these help efficiently check for the presence of a value before querying large datasets.

Example

A finance application storing millions of transactions per day can partition data by month and index it by user ID. This setup ensures that queries fetching user-specific transactions only scan a small fraction of the database.
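
A minimal sketch of that layout in PostgreSQL, with illustrative table and column names:

```sql
-- Monthly, time-based partitioning on the transaction timestamp.
CREATE TABLE transactions (
    txn_id     BIGINT       NOT NULL,
    user_id    BIGINT       NOT NULL,
    amount     NUMERIC(12, 2),
    created_at TIMESTAMPTZ  NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE transactions_2025_01 PARTITION OF transactions
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Composite index; created on the parent, it cascades to every partition.
CREATE INDEX ON transactions (user_id, created_at);

-- Partial index on one partition: only the rows queried constantly.
CREATE INDEX ON transactions_2025_01 (user_id) WHERE amount > 10000;
```

A query filtering on user_id and a date range then prunes to a single monthly partition and uses the composite index inside it, instead of scanning the whole table.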

Practical Applications and Challenges with High Cardinality

While high-cardinality data provides valuable insights, handling it efficiently is key to keeping systems fast and scalable.

Challenges in Data Analysis

  • Slow queries: When a database has to scan entire tables because indexes can’t keep up, queries drag, affecting performance.
  • Exploding metric cardinality: Monitoring tools struggle when faced with too many unique labels, leading to sluggish dashboards and increased storage costs.
  • Scalability issues: Real-time data streams can quickly become unmanageable if cardinality isn't controlled, causing delays and bottlenecks.

Techniques for Scalability

  • Bloom filters and sketches: Bloom filters cheaply check whether a value has already been seen, and sketches like HyperLogLog estimate cardinality without storing every unique value, saving space and speeding up lookups.
  • Downsampling: Instead of keeping every raw data point, summarize it; store hourly metrics instead of per-second samples (see the sketch after this list).
  • Optimized databases: Open-source solutions like ClickHouse and TimescaleDB are built to handle high-cardinality workloads efficiently.
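
A hedged sketch of downsampling with a pre-aggregated materialized view, assuming a hypothetical metrics table of per-second samples (PostgreSQL syntax; time-series databases offer equivalents such as continuous aggregates or recording rules):

```sql
-- Hourly rollup stored once and reused, instead of recomputed per dashboard query.
CREATE MATERIALIZED VIEW metrics_hourly AS
SELECT
    date_trunc('hour', recorded_at) AS hour_bucket,
    metric_name,
    AVG(value) AS avg_value,
    MAX(value) AS max_value
FROM metrics
GROUP BY 1, 2;

-- Refresh on a schedule (e.g., from cron) to fold in new raw samples.
REFRESH MATERIALIZED VIEW metrics_hourly;
```

Dashboards read from the hourly view, while the raw per-second data can expire on a short retention window.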

Final Thoughts

High cardinality isn’t inherently bad—it’s just a challenge that needs the right approach. If you’re dealing with analytics or large-scale applications, understanding how to manage high cardinality effectively can significantly impact performance and scalability.

And, if you're looking for a managed observability solution that can help you with high cardinality without increasing your observability budgets, Last9 is the right tool for you.

Last9’s Telemetry Warehouse now supports Logs and Traces


Schedule a demo with us to learn more, or just try it for free!

FAQs

What is high cardinality and low cardinality?

Cardinality refers to the number of unique values in a dataset column.

  • High cardinality means a column has many distinct values (e.g., user IDs, timestamps).
  • Low cardinality means a column has fewer unique values (e.g., gender, status codes).

Is high cardinality good or bad?

It depends on the use case.

  • Good: Provides detailed insights, essential for analytics and tracking unique entities.
  • Bad: Can slow down queries, increase storage costs, and create challenges in monitoring systems.

What is a high degree of cardinality?

A dataset has a high degree of cardinality when a column contains a vast number of unique values, making indexing, searching, and storing more complex.

What is the meaning of cardinality?

Cardinality represents the number of distinct values in a dataset attribute. In databases, it affects query performance, storage, and indexing efficiency.

What is an example of high cardinality?

Examples include:

  • User IDs in a global SaaS application
  • Precise timestamps in log files
  • Transaction IDs in financial systems

What is high cardinality?

High cardinality refers to datasets where a field has many unique values. It is common in observability, analytics, and large-scale applications.

Do observability tools handle high cardinality issues?

Yes! Observability platforms like Last9 are designed to handle high cardinality efficiently, reducing noise while improving monitoring precision.

What is high cardinality, and how do time-series databases like InfluxDB and TimescaleDB compare?

High cardinality in time-series databases means storing a large number of unique metric labels.

  • InfluxDB can struggle with high cardinality due to its storage model.
  • TimescaleDB handles it better by using PostgreSQL’s indexing and partitioning strategies.

Why is understanding high-cardinality data important?

It helps optimize performance, reduce costs, and improve observability by preventing metric explosions in monitoring systems like Prometheus.

What are the best tools and techniques for data transfer rate optimization?

  • Compression algorithms (Snappy, Zstandard)
  • Data aggregation and downsampling
  • Efficient indexing strategies
  • Partitioning and sharding in distributed databases

Do you need every unique value that you’re storing?

Not always. Sampling, aggregation, and approximation techniques (like HyperLogLog) can help reduce storage and processing overhead.

How does high cardinality impact database performance?

  • Increases index size, slowing down lookups
  • Makes joins and aggregations expensive
  • Consumes more storage and memory resources

How does high cardinality affect database performance and query optimization?

  • Indexing challenges: Large indexes slow down retrieval.
  • Query slowness: More distinct values mean expensive scans.
  • Storage overhead: More unique values increase disk usage.
  • Optimization techniques: Partitioning, indexing, and caching strategies help mitigate performance issues.

Authors
Anjali Udasi

Helping to make tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.