Why High-Cardinality Metrics Break Everything

What actually breaks when teams add high cardinality metrics and why those failures are hard to avoid unless the system is built for it.

Dec 31st, ‘25

High-cardinality metrics are one of those ideas that sound obviously right - until you try to use them in production.

In theory, they promise precision. Instead of averages and rollups, you get specificity: per-request, per-userid, per-container, per-feature insights. The kind of detail we all immediately want when something is on fire.

And then things start breaking. Not immediately. Not loudly. But quietly. Often in ways that feel like a mysterious bug until you realize the system itself was never designed for this shape of data.

This isn't a post about why high-cardinality metrics are useful. If you're here, you already know. You might already be in pain.

It’s about what actually breaks when teams add them, and why those failures are hard to avoid unless the system is built for it.

A Familiar Failure Pattern

One team we worked with triggered it accidentally through a small, reasonable change. They added a user_segment label to their request metrics to understand behavior across cohorts.

Within 48 hours, their active time-series count jumped from ~60,000 to over 1.1 million.

Nothing crashed. Dashboards still loaded. Alerts still fired. Deployments continued as usual. They only noticed when finance asked why the observability bill had tripled, without any material change in the business.

We’ve made versions of this mistake ourselves. The painful part isn’t the spike; it’s realizing too late that nothing in the system warned you it was coming.

That’s the dangerous part: when high cardinality breaks things, it rarely announces itself.

Surprises are fine in movies. They’re not fine on invoices, or while debugging a P0.

Break #1: Cost Stops Being Predictable


The first thing teams notice is the bill. It’s high - yes. But more importantly, it’s hard to explain.

You expect cost to scale with:

  • number of metrics
  • scrape interval
  • retention

With high cardinality, cost scales with behavior.

A new label added in a deployment.  A feature flag enabled for 5% of users. A single tenant with unusually shaped traffic.

None of these look risky in code review. All of them can multiply time-series counts overnight.

Why this is risky

We all hate being surprised by consequences we didn’t explicitly code, and finance teams hate systems where cost is an emergent property of runtime behavior.

And metrics systems are especially bad at this because the feedback loop is slow: the bill arrives after the damage is done.

Whether you're running Prometheus, a managed TSDB, or a vendor platform, the pattern is the same. (If you're on Datadog or similar, you've seen this called "custom metrics." Same problem, friendlier billing name.)

Why this happens (systems-level)

Here’s what’s happening underneath: every unique label combination creates a distinct time series. Each time series requires:

  • separate storage allocation (chunks or blocks on disk)
  • its own index entries so it can be queried efficiently
  • memory during ingestion to keep writes fast
  • write-ahead log entries and compaction work

When your labels multiply, you’re storing more data and multiplying the bookkeeping. That’s why most vendors price by series count: cardinality is what actually drives storage, memory, and write amplification, not request volume or raw bytes.
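
To make that bookkeeping concrete, here is a back-of-the-envelope sketch in Python. The label names, value counts, and per-series overhead are hypothetical, not figures from any particular TSDB; the point is that the total is a product of the label cardinalities.

```python
# Rough, hypothetical estimate of how label choices drive active series count.
# The labels, value counts, and per-series overhead below are illustrative only.

label_cardinalities = {
    "service": 25,        # bounded: you can list these on a whiteboard
    "endpoint": 40,       # bounded
    "status_code": 8,     # bounded
    "user_segment": 18,   # the "one more label" that multiplies everything
}

# Worst case: every label combination appears at least once.
active_series = 1
for values in label_cardinalities.values():
    active_series *= values

# Assume a few KB of index and head-block overhead per active series
# (the real figure varies by TSDB and configuration).
BYTES_PER_SERIES = 4 * 1024
estimated_memory_mb = active_series * BYTES_PER_SERIES / (1024 * 1024)

print(f"worst-case active series: {active_series:,}")               # 144,000
print(f"rough ingest-side memory: ~{estimated_memory_mb:,.0f} MB")  # ~560 MB
```

The exact numbers don’t matter. What matters is that the total is a product, so one new label multiplies every existing combination.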

How teams avoid this

A useful mental model to adopt is:

If a label’s possible values can’t be listed on a whiteboard, it probably doesn’t belong on a metric without guardrails.

That forces teams to distinguish between:

  • bounded dimensions (safe)
  • unbounded but intentional dimensions (dangerous but acceptable)
  • accidental dimensions (bugs waiting to happen)

Some teams add cardinality estimation before deploy: pre-merge checks that approximate how many new series a change might introduce. (We’ll show exactly how teams build this in Part 2.)
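
A minimal sketch of what such a pre-merge check might look like, assuming a hypothetical series budget and a reviewer-supplied estimate of the new label’s cardinality:

```python
# Hypothetical pre-merge guardrail: fail the build if a change could multiply
# the active series count beyond an agreed budget. Thresholds and inputs are
# illustrative; a real check would read them from config and production samples.

import sys

EXISTING_SERIES = 60_000     # current active series for the affected metric
SERIES_BUDGET = 200_000      # the team's agreed ceiling

# Estimated distinct values for each label the change introduces.
new_labels = {"user_segment": 18}

projected = EXISTING_SERIES
for label, cardinality in new_labels.items():
    projected *= cardinality

if projected > SERIES_BUDGET:
    print(
        f"Adding {sorted(new_labels)} could grow series from {EXISTING_SERIES:,} "
        f"to ~{projected:,} (budget: {SERIES_BUDGET:,}). Needs an explicit sign-off."
    )
    sys.exit(1)   # block the merge until someone signs off

print(f"Projected series: ~{projected:,} (within budget)")
```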

The goal is fewer surprises and more predictability.

Break #2: Queries Get Slower Exactly When You Need Them Fast


During an incident, high-cardinality metrics tend to fail in a very specific way.

Suddenly:

  • dashboards take minutes to load instead of milliseconds
  • exploratory queries time out
  • engineers fall back to logs just to stay unblocked

Why this is risky

Observability is supposed to reduce cognitive load during incidents.  When queries slow down under pressure, engineers stop trusting the system as a primary tool.

They don’t switch because logs are better.  They switch because logs are predictable.

Why this happens (systems-level)

At the query layer, the math is simple and unforgiving.

Every unique time series that matches a query must be:

  • located via the index
  • read from disk or cache
  • decompressed
  • aggregated

With low cardinality, that’s thousands of series. With high cardinality, it’s hundreds of thousands, often millions.

High-cardinality filters also collapse index selectivity. A filter like user_id=X doesn’t narrow the search space meaningfully if user_id has a million possible values. During incidents, engineers widen time ranges and loosen filters, making scan costs explode at exactly the worst moment.
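
A rough way to see why scan cost explodes: model query cost as series matched times samples per series. The numbers below are made up; the shape of the blow-up is the point.

```python
# Hypothetical query-cost model: samples scanned ~= series matched x samples per series.
# Widening the time range and loosening filters during an incident multiplies both.

def samples_scanned(series_matched: int, range_minutes: int, scrape_interval_s: int = 15) -> int:
    samples_per_series = (range_minutes * 60) // scrape_interval_s
    return series_matched * samples_per_series

# Calm dashboard query: bounded labels, narrow selector, last 30 minutes.
calm = samples_scanned(series_matched=2_000, range_minutes=30)

# Incident query: the same metric now carries user-level labels, so the service
# selector matches hundreds of thousands of series, and the engineer widens the
# range to six hours "just to be safe".
incident = samples_scanned(series_matched=400_000, range_minutes=360)

print(f"calm query scans     ~{calm:,} samples")      # ~240,000
print(f"incident query scans ~{incident:,} samples")  # ~576,000,000
```

Three orders of magnitude more work, requested at the exact moment the team needs an answer fastest.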

The query engine isn’t slow. It’s doing exactly what you asked, across far more data than you realized you’d created.

And this is the key distinction: high-cardinality data isn’t an ingest problem.

Most systems can ingest and store it just fine. The trouble shows up at query time, when that data has to be searched, joined, and aggregated under pressure, with finite compute.

How teams avoid this

Teams that survive high cardinality design for fast narrowing, not brute-force scans.

That usually means:

  • pre-aggregating known questions early
  • preserving raw cardinality for investigation paths
  • making ingest-time decisions instead of deferring everything to query time

(We’ll walk through concrete ingest and recording-rule patterns in Part 2.)
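
Until then, here’s the flavor of the ingest-time idea as a minimal, hypothetical sketch: keep the raw high-cardinality stream for investigations, but also write a low-cardinality copy that answers the questions dashboards ask constantly.

```python
# Hypothetical sketch of ingest-time aggregation: emit a cheap, low-cardinality
# copy of each sample alongside the raw one. Dashboards query the cheap series;
# investigations can still drill into the raw ones.

from collections import defaultdict

HIGH_CARDINALITY_LABELS = {"user_id", "container_id"}   # illustrative choice

raw_store: list = []                # stands in for the full-fidelity store
aggregated = defaultdict(float)     # (metric, low-cardinality labels) -> value

def ingest(metric: str, labels: dict, value: float) -> None:
    # 1. Keep the raw series for deep investigation paths.
    raw_store.append((metric, dict(labels), value))

    # 2. Roll the sample into a pre-aggregated series keyed only by cheap labels.
    cheap = tuple(sorted((k, v) for k, v in labels.items()
                         if k not in HIGH_CARDINALITY_LABELS))
    aggregated[(metric, cheap)] += value

ingest("http_requests_total", {"service": "checkout", "user_id": "u-123"}, 1.0)
ingest("http_requests_total", {"service": "checkout", "user_id": "u-456"}, 1.0)

# Two raw samples, but only one aggregated series for dashboards to scan.
print(len(raw_store), len(aggregated))  # 2 1
```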

High cardinality only works when systems are optimized for interactive exploration under pressure, not worst-case scans.

Break #3: Engineers Lose Trust in the Data


This is the most dangerous break and the hardest to notice and fix.

As cardinality grows:

  • charts flicker
  • time series appear and disappear
  • queries return inconsistent shapes

Engineers start saying things like:

  • “Let me check the logs instead.”
  • “Is this dashboard even current?”
  • “That number doesn’t look right - let me re-run it.”

Why this is risky

Observability doesn’t fail when data is missing.  It fails when engineers stop believing what they see.

Once trust erodes, adoption collapses quietly, and it almost never recovers.

Why this happens (systems-level)

High-cardinality series tend to be sparse and short-lived.

A container spins up, emits metrics for twenty minutes, then disappears. The series doesn’t end—it just stops receiving data. Now the query engine has to decide: is that a zero, a gap, or a stale series?

Multiply that ambiguity across thousands of ephemeral series and add:

  • runtime label changes
  • partial scrapes under load
  • sampling or eviction when limits are hit

The result is charts that flicker and queries that return different shapes for the same question - not because the system is wrong, but because the data is ambiguous.
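
One way to see why this is a design decision rather than a bug: the query layer has to pick a rule for when a silent series stops counting. A minimal sketch of such a rule, with a made-up staleness window:

```python
# Hypothetical rule for interpreting a silent series: missing, live, or stale.
# Real TSDBs bake in something similar (a staleness window); the threshold here
# is made up for illustration.

from datetime import datetime, timedelta
from typing import Optional

STALENESS_WINDOW = timedelta(minutes=5)

def classify(last_sample_at: Optional[datetime], query_time: datetime) -> str:
    if last_sample_at is None:
        return "missing"   # never reported in this range: don't draw it as zero
    if query_time - last_sample_at <= STALENESS_WINDOW:
        return "live"      # recent enough to carry forward into the chart
    return "stale"         # an ephemeral series that went away, not a zero

now = datetime.now()
print(classify(now - timedelta(minutes=2), now))    # live
print(classify(now - timedelta(minutes=40), now))   # stale
print(classify(None, now))                          # missing
```

Whatever rule you pick, thousands of ephemeral series will sit right at its edges, which is exactly where the flicker comes from.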

How teams avoid this

Teams that maintain trust make uncertainty explicit:

  • dashboards distinguish zero vs missing vs filtered-out
  • sampling and aggregation are visible, not hidden
  • schema changes are intentional, not accidental

When engineers understand why a chart looks the way it does, they’ll trust it - even when it’s imperfect.

Understanding and explaining system behavior is not simply about UX. It’s the core foundation of trust.

Break #4: Teams Over-Instrument Without Knowing Why

Cardinality is multiplicative.

High cardinality lowers the friction to add “just one more label.”  So teams do.

Over time:

  • instrumentation becomes inconsistent
  • the same concept is modeled differently across services or business units
  • no one remembers why certain dimensions exist

The result isn’t better observability. It’s technical debt presenting itself operationally.

Why this is risky

Teams start to feel it before they can explain it.

Queries don’t correlate the way they should. Dashboards feel brittle. Engineers develop a vague sense that “something is off,” but can’t point to a single place to fix it, or even agree on what the fix should be.

At that point, refactoring telemetry feels riskier than leaving it alone. There’s no single place to inspect and change it.

Why this happens (systems-level)

The math is unforgiving: cardinality is multiplicative, not additive.

Five labels with ten possible values each isn’t fifty series. It’s 10⁵ - one hundred thousand.

Add a sixth label with a hundred values and you’re suddenly at 10 million possible combinations.

Metrics systems make this easy to do accidentally. They’re append-only: adding labels is cheap, but removing them is expensive. Old series linger. Queries keep working; until they don’t.

So entropy wins, and the label space grows faster than anyone intended.

How teams avoid this

Mature teams introduce intent and ownership:

  • high-cardinality labels have owners
  • dimensions exist for a reason, not “just in case”
  • telemetry evolves deliberately, not accidentally

High-cardinality metrics work best when treated like APIs:  designed, reviewed, and occasionally removed.
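
In practice, “treated like APIs” can start as something as small as a reviewed registry of dimensions, each with an owner, a reason, and a budget. The shape below is hypothetical; the point is that it gives reviewers and CI one place to check instrumentation against.

```python
# Hypothetical label registry: every high-cardinality dimension has an owner,
# a stated reason to exist, and a rough cardinality budget. Instrumentation
# changes can be validated against it in review or CI.

LABEL_REGISTRY = {
    "user_segment": {
        "owner": "growth-team",
        "reason": "cohort-level conversion dashboards",
        "max_cardinality": 25,
    },
    "region": {
        "owner": "platform-team",
        "reason": "per-region SLOs",
        "max_cardinality": 15,
    },
}

def unregistered_labels(labels: set[str]) -> list[str]:
    """Return the labels that have no owner or stated reason in the registry."""
    return sorted(name for name in labels if name not in LABEL_REGISTRY)

for name in unregistered_labels({"user_segment", "session_id"}):
    print(f"label '{name}' has no owner; add it to the registry before shipping")
```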

The Real Risk Is Blindness

Most teams don’t struggle with high-cardinality metrics because the idea is wrong.

They struggle because:

  • cost behavior is opaque
  • performance tradeoffs are hidden
  • data trust erodes silently
  • modeling decisions aren’t explicit

This is why high-cardinality observability feels risky - not because it’s unstable, but because the system doesn’t surface its own limits.

Teams that succeed don’t “turn it on.”  They design for it.

They assume cardinality will exist, and build guardrails, visibility, and intent around it.

Why This Matters When Choosing an Observability Platform

If a platform treats high cardinality as:

  • a pricing tier → risk stays with you
  • a query problem → incidents get harder
  • a sampling problem → signal gets lost

Then the risk hasn’t gone away; it has just been deferred. The safer path is tooling that assumes teams will add high-cardinality data whether you like it or not.

And is designed to make that reality:

  • predictable
  • explainable
  • controllable

That’s not about features.  It’s about whether the system matches how modern teams actually operate.

The Quiet Takeaway

High-cardinality metrics still sound obviously right. The real question is whether your observability system was built to agree.

Before you add the next label, ask yourself:

Do I know what this will cost, how it will perform, and whether my team will still trust it six months from now?

If the answer is “I’m not sure,” that’s not a failure of discipline. It’s a signal that the system wasn’t designed for the data you actually have.

Coming next

In Part 2, we’ll get concrete.

  • CI checks that flag exploding labels
  • a Storage Cardinality Explorer
  • streaming aggregations and ingest patterns that keep queries fast
  • dashboard configurations that make “no data” explicit instead of misleading

No theory - just the implementation patterns teams actually rely on to make high-cardinality metrics safe.

Authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.
