Have you noticed your storage costs rising even when you're keeping an eye on them? The reason might be something easy to overlook: high cardinality data. For data engineers and developers balancing performance and costs, understanding its impact isn’t just useful—it’s key to avoiding unnecessary spending and system slowdowns.
Let’s break down what high cardinality means for your infrastructure, how it affects performance, and why it matters for your budget—without the fluff, just the technical depth needed to address the problem.
What Makes Data "High Cardinality" in Technical Terms
High cardinality occurs when a data field contains a massive number of unique values relative to the total record count. Think user IDs, email addresses, or timestamps – columns where almost every entry is different from the others.
The cardinality of a column is mathematically represented as:
Cardinality Ratio = Number of Distinct Values / Total Number of Records
When this ratio approaches 1.0, you're dealing with a perfect high cardinality field. Common examples include:
- UUID/GUID identifiers: 128-bit values with 2^128 possible combinations
- Session IDs: Typically unique per user interaction
- IP addresses: While technically finite, they behave as high cardinality in most datasets
- Full URLs with query parameters: Can create millions of unique values
- Precise timestamps: Down to millisecond precision
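To check where one of your own columns stands, the ratio takes a single query. Here's a minimal PostgreSQL sketch, with events and user_id standing in for whatever table and column you suspect:
-- Cardinality ratio = distinct values / total rows
SELECT
    COUNT(DISTINCT user_id)::numeric / COUNT(*) AS cardinality_ratio,
    COUNT(DISTINCT user_id) AS distinct_values,
    COUNT(*) AS total_rows
FROM events;
-- A result near 1.0 means nearly every row carries a unique value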
While your UUID column might seem innocent enough, it's secretly the data equivalent of that friend who orders the most expensive item on the menu and then suggests splitting the bill evenly.
The Exponential Storage Tax on High Cardinality Data
When high cardinality data takes over, your storage costs don't just increase – they grow far faster than your raw data volume. Here's exactly why, and how to estimate the impact:
B-Tree Index Explosion:
Relational databases typically use B-tree indexes to speed up queries. With high cardinality fields, these indexes become massively inefficient:
- Index Entry Overhead: Each unique value requires its own entry in the B-tree
- Pointer Proliferation: Each entry contains pointers to actual data locations
- Balancing Overhead: B-trees require additional space for maintaining balance
- Fragmentation Issues: High-churn, high-cardinality data creates significant fragmentation
In PostgreSQL, for example, each index entry typically consumes:
- 8 bytes for the indexed value pointer
- 6 bytes for the tuple identifier (TID)
- 2-4 bytes for structural overhead
For a table with 10 million rows and a high cardinality column, the index alone could consume:
10,000,000 × (8 + 6 + 3) bytes ≈ 170 MB raw, often growing to 250-300 MB with overhead
A typical customer database with low cardinality fields might have indexes that consume 10-15% of the total storage. Add high cardinality fields to the mix, and those indexes can balloon to 50% or more of your total storage footprint.
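You don't have to guess at these numbers – PostgreSQL will tell you exactly how big each index is. This query against the standard statistics views lists your largest indexes; pair it with the formula above to spot the ones driven by high cardinality columns:
-- Largest indexes by on-disk size
SELECT
    indexrelname AS index_name,
    relname AS table_name,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
ORDER BY pg_relation_size(indexrelid) DESC
LIMIT 20;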
Entropy-Based Compression Failure:
High cardinality data resists compression because it drives up data entropy. Shannon's information theory tells us that the theoretical compression limit is tied to entropy:
Entropy = -∑(p(x) × log₂(p(x)))
Where p(x) is the probability of a given value occurring.
With high cardinality fields, p(x) approaches 1/n for all values, maximizing entropy and minimizing compressibility.
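You can put a number on this for your own data. The sketch below computes the empirical entropy (in bits) of a single column in PostgreSQL, with events and user_id again as stand-in names; for a column where every value is unique, the result converges to log₂(n), which is where the figures in the table below come from:
-- Empirical entropy in bits: -SUM(p * log2(p)) over the value distribution
SELECT -SUM(p * ln(p) / ln(2)) AS entropy_bits
FROM (
    SELECT COUNT(*)::double precision / SUM(COUNT(*)) OVER () AS p
    FROM events
    GROUP BY user_id
) AS value_frequencies;
-- Higher entropy means less redundancy for compression algorithms to exploit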
Cardinality Level | Entropy (bits) | Typical Compression Ratio | Storage Required for 1TB Raw Data | Annual Cost (S3 Standard) |
---|---|---|---|---|
Low (10³ values) | 9.97 | 8:1 to 12:1 | 83-125GB | $23-35/year |
Medium (10⁵ values) | 16.61 | 3:1 to 6:1 | 167-333GB | $47-93/year |
High (10⁷ values) | 23.25 | 1.3:1 to 2.5:1 | 400-769GB | $112-215/year |
Extreme (10⁹ values) | 29.90 | 1.1:1 to 1.8:1 | 556-909GB | $156-255/year |
As this table shows, the storage requirement difference between low and extreme cardinality can be 7-11x for the same data volume, with direct cost implications.
Cascading Infrastructure Costs:
The pain doesn't end with primary storage. Every byte of inefficiently stored data cascades through your entire data infrastructure:
- Backup Amplification: 3-5x primary storage for a sound backup strategy
- Replication Overhead: 2-3x for high-availability configurations
- Network Transfer Costs: Intra-region AWS data transfer (~$0.01/GB) adds up when moving terabytes
- Memory Requirements: High cardinality index caching requires additional RAM
- CPU Utilization: More data = more processing time = higher compute costs
In a typical enterprise setup, the true cost multiplier for high cardinality data is often 7-10x the base storage cost when all factors are considered.
Performance Degradation Vectors and Their Impact Beyond Storage
Cost isn't the only casualty of high cardinality – performance suffers across multiple vectors, often in ways that aren't immediately obvious:
Why Your Database Makes Poor Execution Decisions
When you run analytics on high cardinality fields, query optimizers struggle with cardinality estimation, leading to catastrophic execution plans:
-- This query might run efficiently with low cardinality department_id
-- But will likely cause problems with high cardinality user_id
SELECT AVG(amount)
FROM transactions
WHERE user_id IN (SELECT user_id FROM users WHERE signup_date > '2023-01-01')
Your database optimizer might make these suboptimal choices:
- Statistics Sampling Issues: Most optimizers use sampling to estimate cardinality, which becomes less accurate with high cardinality data
- Join Strategy Mistakes: Choosing nested loops joins when hash joins would be more efficient
- Bitmap Index Scan Overhead: Creating enormous bitmap indexes in memory
- Sort Operation Explosion: Allocating excessive memory for sorting operations
PostgreSQL query execution times can vary by 100x or more based purely on optimizer estimation errors caused by high cardinality fields.
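The fastest way to check whether this is hurting you is to compare the planner's row estimates against what actually happens. In PostgreSQL, run the earlier query through EXPLAIN and look for estimates that are off by orders of magnitude:
-- Compare estimated rows vs. actual rows for each plan node
EXPLAIN (ANALYZE, BUFFERS)
SELECT AVG(amount)
FROM transactions
WHERE user_id IN (SELECT user_id FROM users WHERE signup_date > '2023-01-01');
-- Nodes where the planned "rows=" and the "actual rows=" differ by 100x or more
-- are where high cardinality columns are defeating the statistics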
Profiling Resource-Intensive Operations
High cardinality data creates specific memory pressure points:
- Hash Table Growth: Operations like GROUP BY on high cardinality fields create massive hash tables
- Distinct Value Tracking: COUNT(DISTINCT user_id) must track millions of values
- Sort Buffer Explosions: ORDER BY operations require full materialization
- Aggregate Temporary Storage: Computing statistics across high cardinality dimensions
This query looks innocent but can bring down your database:
SELECT device_type, COUNT(DISTINCT user_id)
FROM sessions
WHERE created_at > CURRENT_DATE - INTERVAL '30 days'
GROUP BY device_type
With millions of unique user_ids, this requires:
- Tracking each unique ID (16 bytes each)
- Hash table overhead (typically 1.5x the data size)
- Temporary buffer allocations
Resulting in: 1M unique IDs × 16 bytes × 1.5 overhead = ~24MB memory usage
Multiply this across concurrent queries and you'll quickly exhaust available RAM.
How to Tackle High Cardinality with Better Architecture
No need to resign yourself to ever-growing AWS costs. Here are practical strategies with implementation details to help you take control.
Strategic Denormalization and Surrogate Key Implementation
Sometimes the answer isn't normalizing more but denormalizing strategically:
-- Instead of using natural keys directly
CREATE TABLE user_events (
id SERIAL PRIMARY KEY,
user_email VARCHAR(255), -- High cardinality
event_type VARCHAR(50),
created_at TIMESTAMP
);
-- Create a surrogate key mapping table
CREATE TABLE users (
user_id SERIAL PRIMARY KEY,
user_email VARCHAR(255) UNIQUE
);
-- Use the surrogate key in the main table
CREATE TABLE user_events_optimized (
id SERIAL PRIMARY KEY,
user_id INTEGER REFERENCES users(user_id), -- Lower cardinality
event_type VARCHAR(50),
created_at TIMESTAMP
);
This approach reduces the storage requirements for the events table and improves both compression and indexing efficiency.
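Migrating existing rows into this layout is mostly two INSERT ... SELECT statements. A sketch against the tables above (the ON CONFLICT clause relies on the UNIQUE constraint on user_email):
-- Populate the mapping table with every email seen so far
INSERT INTO users (user_email)
SELECT DISTINCT user_email
FROM user_events
ON CONFLICT (user_email) DO NOTHING;

-- Rewrite events against the compact surrogate key
INSERT INTO user_events_optimized (user_id, event_type, created_at)
SELECT u.user_id, e.event_type, e.created_at
FROM user_events e
JOIN users u ON u.user_email = e.user_email;
Each event row now carries a 4-byte integer instead of a variable-length email, and every index on user_id shrinks accordingly.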
Smart Bucketing for Balanced Precision and Granularity
Group high cardinality values into logical buckets with carefully calculated precision loss:
-- Instead of storing precise timestamps
timestamp_ms BIGINT -- High cardinality, poor compression
-- Store bucketed timestamps
timestamp_hour TIMESTAMP -- Truncated to hour, cuts cardinality by a factor of 3.6 million versus millisecond precision
Implementation examples:
- Temporal Bucketing: DATE_TRUNC('hour', timestamp) in PostgreSQL
- Numeric Range Bucketing: FLOOR(value / bucket_size) * bucket_size
- String Hashing: MD5(long_string), then SUBSTRING(hash, 1, precision_length)
The key is calculating appropriate bucket sizes that maintain analytical utility while reducing cardinality.
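As a concrete example, here's what temporal bucketing looks like as an hourly rollup of the sessions table used earlier (table and column names are placeholders to adapt):
-- One row per hour and device type instead of one row per session
CREATE TABLE session_counts_hourly AS
SELECT
    DATE_TRUNC('hour', created_at) AS bucket_hour,
    device_type,
    COUNT(*) AS session_count,
    COUNT(DISTINCT user_id) AS unique_users
FROM sessions
GROUP BY 1, 2;
Dashboards that only need hourly trends can hit the rollup, so the high cardinality user_id column stays out of the hot path.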
How Can Schema Design Improve Columnar Storage Efficiency?
For analytics workloads, switching to columnar storage formats provides dramatic benefits:
- Apache Parquet: Implements run-length encoding, dictionary encoding, and bit-packing
- Apache ORC: Provides string dictionary encoding and run-length encoding
- ClickHouse: Uses specialized encodings for specific data types
Implementation example with Parquet:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# Create DataFrame with high cardinality data
df = pd.DataFrame({
    'user_id': [f'user_{i}' for i in range(1000000)],
    'timestamp': pd.date_range('2023-01-01', periods=1000000, freq='S'),
    'value': range(1000000)
})
# Convert to an Arrow Table (dictionary encoding is applied at write time)
table = pa.Table.from_pandas(df, preserve_index=False)
# Write with optimized compression and encoding
pq.write_table(
    table,
    'optimized_data.parquet',
    compression='zstd',
    compression_level=3,
    use_dictionary=['user_id', 'timestamp'],  # Force dictionary encoding
    version='2.6',
    data_page_size=1048576  # 1MB pages
)
This approach can achieve 3-4x better compression ratios for high cardinality data compared to row-oriented storage.
Time-Series Partitioning with Hybrid Storage Models
If you're dealing with time-series data, consider specialized databases with partitioning strategies:
-- TimescaleDB hypertable creation example
CREATE TABLE metrics (
time TIMESTAMP NOT NULL,
device_id TEXT NOT NULL, -- High cardinality
sensor_id TEXT NOT NULL, -- High cardinality
value DOUBLE PRECISION
);
-- Convert to hypertable with time partitioning
SELECT create_hypertable('metrics', 'time',
chunk_time_interval => INTERVAL '1 day');
-- Create a compound index that leverages partitioning
CREATE INDEX idx_metrics_device ON metrics (device_id, time DESC);
This approach segments the high cardinality data across partitions, limiting the impact of any single high cardinality field on the entire dataset.
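If you're running a recent TimescaleDB release, native compression stacks on top of this. Segmenting compressed chunks by the high cardinality device_id keeps each device's rows together, so delta and dictionary encodings still have something to work with. A sketch based on the TimescaleDB 2.x API (verify the options against your version):
-- Enable columnar compression, grouped by device and ordered by time
ALTER TABLE metrics SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id',
    timescaledb.compress_orderby = 'time DESC'
);

-- Compress chunks automatically once they're older than 7 days
SELECT add_compression_policy('metrics', INTERVAL '7 days');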
Advanced Optimization Techniques For Maximum Efficiency
For those needing to push the boundaries of optimization:
Bloom Filter Implementation for Membership Testing
Bloom filters provide probabilistic set membership testing with minimal storage:
from pybloom_live import BloomFilter
# Create a Bloom filter with 1M elements and 1% error rate
bf = BloomFilter(capacity=1000000, error_rate=0.01)
# Add high cardinality values
for user_id in user_ids:
bf.add(user_id)
# Test membership (false positives possible, but no false negatives)
if 'user_12345' in bf:
# Possible match, verify with exact lookup
pass
A Bloom filter can reduce storage requirements by 90-95% compared to storing the original high cardinality values when exact matches aren't required.
Probabilistic Data Structures for Approximate Analytics
For many analytics use cases, approximate answers are sufficient:
- HyperLogLog: Count distinct values with 2-5% error using minimal memory
- Count-Min Sketch: Frequency estimation for high cardinality dimensions
- T-Digest: Approximate percentiles with high accuracy at the tails
Implementation in Redis:
# Using Redis HyperLogLog for user counting
> PFADD users user1 user2 user3
(integer) 1
> PFADD users user4 user5
(integer) 1
> PFCOUNT users
(integer) 5
This tops out at roughly 12KB of memory, whether you're counting 5 users or 5 billion.
The Efficiency Ripple Effect Across Your Stack
Addressing high cardinality issues doesn't just save on storage – it creates a cascade of benefits throughout your entire data pipeline:
Development Velocity Improvements: Faster Iteration Cycles
- Reduced Test Data Requirements: Testing with smaller datasets becomes feasible
- Simplified Schema Management: Fewer index optimizations needed
- Faster CI/CD Pipelines: Data operations complete more quickly
- Improved Developer Experience: Queries against local databases return in seconds rather than minutes
End-User Experience Enhancements: Measurable Performance Gains
- Dashboard Rendering Speed: Reports can load 3-5x faster
- API Response Times: Typically 40-60% reduction in p95 latency
- Export Performance: Data extraction jobs complete in minutes vs hours
- Real-Time Analytics Possibilities: Previously impossible calculations become feasible
Architectural Simplification: Reduced Infrastructure Complexity
- Less Reliance on Caching: Primary queries fast enough to reduce cache dependencies
- Fewer Specialized Workarounds: Less need for summary tables and materialized views
- Simplified Scaling Strategy: Vertical scaling becomes more effective
- Extended Hardware Lifecycles: Existing infrastructure supports more users/data
What’s the Best Way to Tackle High Cardinality Challenges?
The path to optimizing high cardinality data should follow this prioritized approach:
1. Audit and Identify:
   - Run cardinality analysis on all database columns (a starter query is sketched after this list)
   - Focus on columns with cardinality ratio > 0.7
   - Prioritize by storage impact and query frequency
2. Quick Wins First:
   - Implement appropriate index types (Hash vs B-tree)
   - Apply proper column data types and lengths
   - Adjust statistics sampling for better query planning
3. Schema Evolution:
   - Implement surrogate keys for high cardinality natural keys
   - Create appropriate bucketing strategies
   - Consider vertical partitioning for mixed-cardinality tables
4. Storage Technology Alignment:
   - Evaluate columnar formats for analytical data
   - Consider specialized databases for specific workloads
   - Implement time-series specific optimizations
5. Continuous Monitoring:
   - Implement cardinality drift detection
   - Set up alerting for new high cardinality fields
   - Create storage efficiency dashboards
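For the audit step and the statistics quick win, PostgreSQL already keeps per-column distinct-value estimates in its planner statistics, so a first pass doesn't require scanning every table. A rough sketch (pg_stats reports n_distinct as a negative fraction of the row count when distinct values scale with the table; the thresholds and the transactions/user_id names are assumptions to adapt to your schema):
-- Flag columns the planner already believes are high cardinality
SELECT schemaname, tablename, attname, n_distinct
FROM pg_stats
WHERE n_distinct < -0.7      -- distinct values exceed ~70% of rows
   OR n_distinct > 100000    -- or a large absolute distinct count
ORDER BY n_distinct;

-- Quick win: give the planner more detailed statistics on a known offender
ALTER TABLE transactions ALTER COLUMN user_id SET STATISTICS 1000;
ANALYZE transactions;
These estimates are only as fresh as your last ANALYZE, so treat the output as a shortlist to confirm with exact COUNT(DISTINCT ...) checks.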
Why Technical Optimization Is a Business Necessity
High cardinality's impact on data storage efficiency isn't just a technical issue – it's a business problem with direct P&L implications:
- Cloud Cost Optimization: Storage often represents 40-60% of cloud data platform costs
- Infrastructure Investment Deferral: Optimizing existing data extends hardware lifecycles
- Performance as Competitive Advantage: Faster systems = better user experience = higher retention
- Engineering Resource Allocation: Less time fighting performance fires = more time building features
Wrapping Up
The choice is simple: tackle high cardinality issues now, or keep writing increasingly larger checks to your storage provider while watching your systems slow to a crawl.
Last9 makes handling high cardinality easy. Out of the 20 biggest livestreamed events ever, 12 used Last9 for monitoring—proof that it scales when it counts.
Use high cardinality and dimensionality data without slowing down your monitoring. No more sluggish dashboards or noisy alerts. With faster queries, smarter workflows, and better alerting, high-cardinality data works for you, not against you.
If high cardinality data is eating into your budget and peace of mind, book some time with us!