Why High-Cardinality Metrics Break Everything

What actually breaks when teams add high cardinality metrics and why those failures are hard to avoid unless the system is built for it.

Dec 31st, ‘25

High-cardinality metrics are one of those ideas that sound obviously right - until you try to use them in production.

In theory, they promise precision. Instead of averages and rollups, you get specificity: per-request, per-userid, per-container, per-feature insights. The kind of detail we all immediately want when something is on fire.

And then things start breaking. Not immediately. Not loudly. But quietly. Often in ways that feel like a mysterious bug until you realize the system itself was never designed for this shape of data.

This isn't a post about why high-cardinality metrics are useful. If you're here, you already know. You might already be in pain.

It’s about what actually breaks when teams add them, and why those failures are hard to avoid unless the system is built for it.

A Familiar Failure Pattern

One team we worked with triggered it accidentally through a small, reasonable change. They added a user_segment label to their request metrics to understand behavior across cohorts.

Within 48 hours, their active time-series count jumped from ~60,000 to over 1.1 million.

Nothing crashed. Dashboards still loaded. Alerts still fired. Deployments continued as usual. They only noticed when finance asked why the observability bill had tripled, without any material change in the business.

We’ve made versions of this mistake ourselves. The painful part isn’t the spike; it’s realizing too late that nothing in the system warned you it was coming.

That’s the dangerous part: when high cardinality breaks things, it rarely announces itself.

Surprises are fine in movies. They’re not fine on invoices, or while debugging a P0.

Break #1: Cost Stops Being Predictable


The first thing teams notice is the bill. It’s high - yes. But more importantly, it’s hard to explain.

You expect cost to scale with:

  • number of metrics
  • scrape interval
  • retention

With high cardinality, cost scales with behavior.

A new label added in a deployment.  A feature flag enabled for 5% of users. A single tenant with unusually shaped traffic.

None of these look risky in code review. All of them can multiply time-series counts overnight.

Why this is risky

We all hate being surprised by consequences we didn’t explicitly code, and finance teams hate systems where cost is an emergent property of runtime behavior.

And metrics systems are especially bad at this because the feedback loop is slow: the bill arrives after the damage is done.

Whether you're running Prometheus, a managed TSDB, or a vendor platform, the pattern is the same. (If you're on Datadog or similar, you've seen this called "custom metrics." Same problem, friendlier billing name.)

Why this happens (systems-level)

Here’s what’s happening underneath: every unique label combination creates a distinct time series. Each time series requires:

  • separate storage allocation (chunks or blocks on disk)
  • its own index entries so it can be queried efficiently
  • memory during ingestion to keep writes fast
  • write-ahead log entries and compaction work

When your labels multiply, you’re storing more data and multiplying the bookkeeping. That’s why most vendors price by series count: cardinality is what actually drives storage, memory, and write amplification, not request volume or raw bytes.
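
To make that bookkeeping concrete, here is a back-of-the-envelope sketch in Python. The label names, value counts, and per-series overhead are hypothetical, not figures from any particular TSDB; the point is that the total is a product of the label cardinalities.

```python
# Rough, hypothetical estimate of how label choices drive active series count.
# The labels, value counts, and per-series overhead below are illustrative only.

label_cardinalities = {
    "service": 25,        # bounded: you can list these on a whiteboard
    "endpoint": 40,       # bounded
    "status_code": 8,     # bounded
    "user_segment": 18,   # the "one more label" that multiplies everything
}

# Worst case: every label combination appears at least once.
active_series = 1
for values in label_cardinalities.values():
    active_series *= values

# Assume a few KB of index and head-block overhead per active series
# (the real figure varies by TSDB and configuration).
BYTES_PER_SERIES = 4 * 1024
estimated_memory_mb = active_series * BYTES_PER_SERIES / (1024 * 1024)

print(f"worst-case active series: {active_series:,}")               # 144,000
print(f"rough ingest-side memory: ~{estimated_memory_mb:,.0f} MB")  # ~560 MB
```

The exact numbers don’t matter. What matters is that the total is a product, so one new label multiplies every existing combination.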

How teams avoid this

A useful mental model to adopt is:

If a label’s possible values can’t be listed on a whiteboard, it probably doesn’t belong on a metric without guardrails.

That forces teams to distinguish between:

  • bounded dimensions (safe)
  • unbounded but intentional dimensions (dangerous but acceptable)
  • accidental dimensions (bugs waiting to happen)

Some teams add cardinality estimation before deploy: pre-merge checks that approximate how many new series a change might introduce. (We’ll show exactly how teams build this in Part 2.)
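
A minimal sketch of what such a pre-merge check might look like, assuming a hypothetical series budget and a reviewer-supplied estimate of the new label’s cardinality:

```python
# Hypothetical pre-merge guardrail: fail the build if a change could multiply
# the active series count beyond an agreed budget. Thresholds and inputs are
# illustrative; a real check would read them from config and production samples.

import sys

EXISTING_SERIES = 60_000     # current active series for the affected metric
SERIES_BUDGET = 200_000      # the team's agreed ceiling

# Estimated distinct values for each label the change introduces.
new_labels = {"user_segment": 18}

projected = EXISTING_SERIES
for label, cardinality in new_labels.items():
    projected *= cardinality

if projected > SERIES_BUDGET:
    print(
        f"Adding {sorted(new_labels)} could grow series from {EXISTING_SERIES:,} "
        f"to ~{projected:,} (budget: {SERIES_BUDGET:,}). Needs an explicit sign-off."
    )
    sys.exit(1)   # block the merge until someone signs off

print(f"Projected series: ~{projected:,} (within budget)")
```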

The goal is fewer surprises and more predictability.

Break #2: Queries Get Slower Exactly When You Need Them Fast


During an incident, high-cardinality metrics tend to fail in a very specific way.

Suddenly:

  • dashboards take minutes to load instead of milliseconds
  • exploratory queries time out
  • engineers fall back to logs just to stay unblocked

Why this is risky

Observability is supposed to reduce cognitive load during incidents.  When queries slow down under pressure, engineers stop trusting the system as a primary tool.

They don’t switch because logs are better.  They switch because logs are predictable.

Why this happens (systems-level)

At the query layer, the math is simple and unforgiving.

Every unique time series that matches a query must be:

  • located via the index
  • read from disk or cache
  • decompressed
  • aggregated

With low cardinality, that’s thousands of series. With high cardinality, it’s hundreds of thousands, often millions.

High-cardinality filters also collapse index selectivity. A filter like user_id=X doesn’t narrow the search space meaningfully if user_id has a million possible values. During incidents, engineers widen time ranges and loosen filters, making scan costs explode at exactly the worst moment.
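
A rough way to see why scan cost explodes: model query cost as series matched times samples per series. The numbers below are made up; the shape of the blow-up is the point.

```python
# Hypothetical query-cost model: samples scanned ~= series matched x samples per series.
# Widening the time range and loosening filters during an incident multiplies both.

def samples_scanned(series_matched: int, range_minutes: int, scrape_interval_s: int = 15) -> int:
    samples_per_series = (range_minutes * 60) // scrape_interval_s
    return series_matched * samples_per_series

# Calm dashboard query: bounded labels, narrow selector, last 30 minutes.
calm = samples_scanned(series_matched=2_000, range_minutes=30)

# Incident query: the same metric now carries user-level labels, so the service
# selector matches hundreds of thousands of series, and the engineer widens the
# range to six hours "just to be safe".
incident = samples_scanned(series_matched=400_000, range_minutes=360)

print(f"calm query scans     ~{calm:,} samples")      # ~240,000
print(f"incident query scans ~{incident:,} samples")  # ~576,000,000
```

Three orders of magnitude more work, requested at the exact moment the team needs an answer fastest.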

The query engine isn’t slow. It’s doing exactly what you asked, across far more data than you realized you’d created.

And this is the key distinction: high-cardinality data isn’t an ingest problem.

Most systems can ingest and store it just fine. The trouble shows up at query time, when that data has to be searched, joined, and aggregated under pressure, with finite compute.

How teams avoid this

Teams that survive high cardinality design for fast narrowing, not brute-force scans.

That usually means:

  • pre-aggregating known questions early
  • preserving raw cardinality for investigation paths
  • making ingest-time decisions instead of deferring everything to query time

(We’ll walk through concrete ingest and recording-rule patterns in Part 2.)
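
Until then, here’s the flavor of the ingest-time idea as a minimal, hypothetical sketch: keep the raw high-cardinality stream for investigations, but also write a low-cardinality copy that answers the questions dashboards ask constantly.

```python
# Hypothetical sketch of ingest-time aggregation: emit a cheap, low-cardinality
# copy of each sample alongside the raw one. Dashboards query the cheap series;
# investigations can still drill into the raw ones.

from collections import defaultdict

HIGH_CARDINALITY_LABELS = {"user_id", "container_id"}   # illustrative choice

raw_store: list = []                # stands in for the full-fidelity store
aggregated = defaultdict(float)     # (metric, low-cardinality labels) -> value

def ingest(metric: str, labels: dict, value: float) -> None:
    # 1. Keep the raw series for deep investigation paths.
    raw_store.append((metric, dict(labels), value))

    # 2. Roll the sample into a pre-aggregated series keyed only by cheap labels.
    cheap = tuple(sorted((k, v) for k, v in labels.items()
                         if k not in HIGH_CARDINALITY_LABELS))
    aggregated[(metric, cheap)] += value

ingest("http_requests_total", {"service": "checkout", "user_id": "u-123"}, 1.0)
ingest("http_requests_total", {"service": "checkout", "user_id": "u-456"}, 1.0)

# Two raw samples, but only one aggregated series for dashboards to scan.
print(len(raw_store), len(aggregated))  # 2 1
```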

High cardinality only works when systems are optimized for interactive exploration under pressure, not worst-case scans.

Break #3: Engineers Lose Trust in the Data


This is the most dangerous break and the hardest to notice and fix.

As cardinality grows:

  • charts flicker
  • time series appear and disappear
  • queries return inconsistent shapes

Engineers start saying things like:

  • “Let me check the logs instead.”
  • “Is this dashboard even current?”
  • “That number doesn’t look right - let me re-run it.”

Why this is risky

Observability doesn’t fail when data is missing.  It fails when engineers stop believing what they see.

Once trust erodes, adoption collapses quietly, and it almost never recovers.

Why this happens (systems-level)

High-cardinality series tend to be sparse and short-lived.

A container spins up, emits metrics for twenty minutes, then disappears. The series doesn’t end—it just stops receiving data. Now the query engine has to decide: is that a zero, a gap, or a stale series?

Multiply that ambiguity across thousands of ephemeral series and add:

  • runtime label changes
  • partial scrapes under load
  • sampling or eviction when limits are hit

The result is charts that flicker and queries that return different shapes for the same question - not because the system is wrong, but because the data is ambiguous.
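
One way to see why this is a design decision rather than a bug: the query layer has to pick a rule for when a silent series stops counting. A minimal sketch of such a rule, with a made-up staleness window:

```python
# Hypothetical rule for interpreting a silent series: missing, live, or stale.
# Real TSDBs bake in something similar (a staleness window); the threshold here
# is made up for illustration.

from datetime import datetime, timedelta
from typing import Optional

STALENESS_WINDOW = timedelta(minutes=5)

def classify(last_sample_at: Optional[datetime], query_time: datetime) -> str:
    if last_sample_at is None:
        return "missing"   # never reported in this range: don't draw it as zero
    if query_time - last_sample_at <= STALENESS_WINDOW:
        return "live"      # recent enough to carry forward into the chart
    return "stale"         # an ephemeral series that went away, not a zero

now = datetime.now()
print(classify(now - timedelta(minutes=2), now))    # live
print(classify(now - timedelta(minutes=40), now))   # stale
print(classify(None, now))                          # missing
```

Whatever rule you pick, thousands of ephemeral series will sit right at its edges, which is exactly where the flicker comes from.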

How teams avoid this

Teams that maintain trust make uncertainty explicit:

  • dashboards distinguish zero vs missing vs filtered-out
  • sampling and aggregation are visible, not hidden
  • schema changes are intentional, not accidental

When engineers understand why a chart looks the way it does, they’ll trust it - even when it’s imperfect.

Understanding and explaining system behavior is not simply about UX. It’s the core foundation of trust.

Break #4: Teams Over-Instrument Without Knowing Why

Cardinality is multiplicative.

High cardinality lowers the friction to add “just one more label.”  So teams do.

Over time:

  • instrumentation becomes inconsistent
  • the same concept is modeled differently across services or business units
  • no one remembers why certain dimensions exist

The result isn’t better observability. It’s technical debt presenting itself operationally.

Why this is risky

Teams start to feel it before they can explain it.

Queries don’t correlate the way they should. Dashboards feel brittle. Engineers develop a vague sense that “something is off,” but can’t point to a single place to fix it, or even agree on what the fix should be.

At that point, refactoring telemetry feels riskier than leaving it alone. There’s no single place to inspect and change it.

Why this happens (systems-level)

The math is unforgiving: cardinality is multiplicative, not additive.

Five labels with ten possible values each isn’t fifty series. It’s 10⁵ - one hundred thousand.

Add a sixth label with a hundred values and you’re suddenly at 10 million possible combinations.

Metrics systems make this easy to do accidentally. They’re append-only: adding labels is cheap, but removing them is expensive. Old series linger. Queries keep working; until they don’t.

So entropy wins, and the label space grows faster than anyone intended.

How teams avoid this

Mature teams introduce intent and ownership:

  • high-cardinality labels have owners
  • dimensions exist for a reason, not “just in case”
  • telemetry evolves deliberately, not accidentally

High-cardinality metrics work best when treated like APIs:  designed, reviewed, and occasionally removed.
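
In practice, “treated like APIs” can start as something as small as a reviewed registry of dimensions, each with an owner, a reason, and a budget. The shape below is hypothetical; the point is that it gives reviewers and CI one place to check instrumentation against.

```python
# Hypothetical label registry: every high-cardinality dimension has an owner,
# a stated reason to exist, and a rough cardinality budget. Instrumentation
# changes can be validated against it in review or CI.

LABEL_REGISTRY = {
    "user_segment": {
        "owner": "growth-team",
        "reason": "cohort-level conversion dashboards",
        "max_cardinality": 25,
    },
    "region": {
        "owner": "platform-team",
        "reason": "per-region SLOs",
        "max_cardinality": 15,
    },
}

def unregistered_labels(labels: set[str]) -> list[str]:
    """Return the labels that have no owner or stated reason in the registry."""
    return sorted(name for name in labels if name not in LABEL_REGISTRY)

for name in unregistered_labels({"user_segment", "session_id"}):
    print(f"label '{name}' has no owner; add it to the registry before shipping")
```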

The Real Risk Is Blindness

Most teams don’t struggle with high-cardinality metrics because the idea is wrong.

They struggle because:

  • cost behavior is opaque
  • performance tradeoffs are hidden
  • data trust erodes silently
  • modeling decisions aren’t explicit

This is why high-cardinality observability feels risky - not because it’s unstable, but because the system doesn’t surface its own limits.

Teams that succeed don’t “turn it on.”  They design for it.

They assume cardinality will exist, and build guardrails, visibility, and intent around it.

Why This Matters When Choosing an Observability Platform

If a platform treats high cardinality as:

  • a pricing tier → risk stays with you
  • a query problem → incidents get harder
  • a sampling problem → signal gets lost

Then the risk hasn’t gone away; it has just been deferred. The safer path is tooling that assumes teams will add high-cardinality data whether you like it or not.

And is designed to make that reality:

  • predictable
  • explainable
  • controllable

That’s not about features.  It’s about whether the system matches how modern teams actually operate.

The Quiet Takeaway

High-cardinality metrics still sound obviously right. The real question is whether your observability system was built to agree.

Before you add the next label, ask yourself:

Do I know what this will cost, how it will perform, and whether my team will still trust it six months from now?

If the answer is “I’m not sure,” that’s not a failure of discipline. It’s a signal that the system wasn’t designed for the data you actually have.

Coming next

In Part 2, we’ll get concrete.

  • CI checks that flag exploding labels
  • a Storage Cardinality Explorer
  • streaming aggregations and ingest patterns that keep queries fast
  • dashboard configurations that make “no data” explicit instead of misleading

No theory - just the implementation patterns teams actually rely on to make high-cardinality metrics safe.

Authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.
