
The Business Case for High-Cardinality Observability

Understand how high-cardinality observability solves costly tool-switching and slow incident resolution for engineering teams.

Apr 30th, ‘25

Traditional monitoring excels at detecting when something breaks. The challenge comes in the next step: understanding what broke, for whom, and why. When your metrics lack sufficient dimensions, isolating the root cause becomes an exercise in tool-hopping and manual correlation.

High-cardinality observability changes this. With the ability to add labels like region, feature_flag, or user_tier, you can isolate problems faster and make clearer product decisions. The result is lower incident costs, better uptime, and monitoring spend that better reflects the value you get from it — though actual savings depend on your specific incident patterns and current tooling costs.

In this piece, we’ll look at:

  • The hidden costs of traditional monitoring.
  • How high-cardinality data improves investigation speed and scope clarity.
  • Options for implementing high-cardinality observability in your stack.

Traditional Monitoring Costs More Than It Seems

On paper, traditional monitoring stacks look straightforward — predictable per-metric pricing, clear storage estimates, and scaling plans that seem manageable. In reality, the spending often goes far beyond the invoice.

The real cost shows up in the time, effort, and lost clarity when your tools can’t answer the key question fast enough: who’s affected, where, and why?

1. Missing Context Slows You Down

When metrics can’t be segmented by dimensions like tenant, customer_tier, or region, small-scope incidents get averaged into the whole. A single tenant’s latency issue might disappear into an overall “healthy” metric until the same problem spreads to multiple tenants — by then, the impact has widened. Without those granular labels, mean time to detect (MTTD) increases, and engineers often hear about incidents from customers before monitoring alerts fire.
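
When those labels do exist, per-tenant alerting is straightforward. Here is a minimal sketch of a Prometheus alerting rule, assuming a hypothetical api_latency_seconds histogram that carries a tenant label:

groups:
  - name: per-tenant-latency
    rules:
      # Fires when any single tenant's p99 checkout latency exceeds 2s for 10 minutes,
      # even while the service-wide average still looks healthy.
      - alert: TenantCheckoutLatencyHigh
        expr: |
          histogram_quantile(
            0.99,
            sum by (le, tenant) (rate(api_latency_seconds_bucket{service="checkout"}[5m]))
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "p99 checkout latency above 2s for tenant {{ $labels.tenant }}"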

2. More Tools, More Hops

When one system can’t give complete visibility, teams often adopt a mix of tools:

  • Metrics in Prometheus for time series monitoring.
  • Logs in ELK for event-level analysis.
  • Traces in Jaeger for request path visibility.
  • APM in Datadog or similar for application performance tracking.

This mix-and-match approach usually emerges for two reasons:

  1. Capability gaps — Most tools don’t cover all telemetry types with the required depth.
  2. Cost constraints — High-volume logs or traces can be expensive to store in one platform, prompting teams to offload them to cheaper or open-source alternatives.

While it solves immediate gaps, this sprawl makes incident investigations slower and more manual. Each tool has its own query language, data model, and access controls.

During an outage, engineers jump between dashboards, manually align timestamps, and stitch together context from disconnected datasets. At high cardinality, queries in these separate systems can take minutes to return, turning a quick diagnosis into a drawn-out investigation.

3. Scaling Brings New Overhead

Not every use case requires high cardinality. Many applications perform well with traditional monitoring focused on service-level and infrastructure metrics. High cardinality becomes valuable when you need to isolate issues by customer, tenant, geographic region, or feature flags — scenarios common in multi-tenant SaaS applications, e-commerce platforms, or systems serving diverse user segments.

Self-managed systems can handle high-cardinality workloads, but not without thoughtful tuning and monitoring. In practice, a well-provisioned Prometheus server comfortably handles 1–2 million active time series and about 100,000 samples per second, and experienced users report scaling up to 10 million active series with enough hardware and tuning.

Pushing beyond a single server’s limits usually requires architectural changes:

  • Shard Prometheus instances to spread the load.
  • Deploy a high-availability (HA) setup for resilience.
  • Add a long-term storage layer for retention beyond local disk limits.

These changes often require dedicated engineers to maintain query speed and prevent outages in the observability layer itself.
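
For the first of those changes, sharding is commonly implemented with hashmod relabeling so that each Prometheus instance scrapes only its slice of targets. A minimal sketch for one of two shards; the job name, service discovery, and shard count are assumptions:

scrape_configs:
  - job_name: "app"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Hash each target address into one of 2 buckets...
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      # ...and keep only the targets assigned to this shard (shard 0).
      - source_labels: [__tmp_shard]
        regex: "0"
        action: keep

The second instance keeps regex "1", and a query layer such as Thanos or Mimir stitches the shards back together for global queries.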

Your monthly monitoring bill is just the visible cost. The hidden costs — slower detection, fragmented workflows, engineering time spent on tool management, and scaling complexity — can be far greater.

Where the ROI Comes From

High-cardinality observability addresses these challenges by keeping detailed labels available and enabling unified correlation across metrics, logs, and traces. The result: faster, clearer investigations — not just from richer labels, but from faster detection, sharper scoping, and more direct workflows.

  • Unified query and correlation
    A spike in api_latency_seconds{service="checkout"} can be filtered by tenant_id or feature_flag and instantly linked to related logs and traces from the same time range — all in one place. This eliminates context switching and reduces MTTD.

  • Better selectivity for faster detection
    Adding more specific label filters narrows the number of time series scanned, which shortens the investigation loop and lowers MTTD.

api_latency_seconds{service="checkout"}                                    → ~1M series
api_latency_seconds{service="checkout", region="us-east", tier="premium"}  → ~10K series

With proper indexing and sufficient memory allocation in a time-series backend, the second query returns results from a much smaller, relevant subset of data. This means engineers can go from a broad “checkout latency” alert to a pinpointed issue — like “premium users in us-east” — in seconds, reducing the time spent sifting through unrelated signals.

The fundamental challenge with high cardinality is that storage and query costs scale with the number of unique label combinations rather than with raw sample volume, which is why traditional time-series databases struggle beyond certain thresholds. For many teams, sampling strategies provide a middle ground: collecting high-cardinality data for a percentage of requests while maintaining lower-cardinality aggregates for overall system health.
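
One common way to implement that middle ground is head sampling in an OpenTelemetry Collector pipeline. Here is a minimal sketch using the probabilistic_sampler processor from the Collector contrib distribution; the endpoint is a hypothetical placeholder, and the 10% rate is an assumption you would tune:

receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  batch: {}
  probabilistic_sampler:
    # Keep roughly 10% of traces, with their full set of high-cardinality attributes intact.
    sampling_percentage: 10

exporters:
  otlphttp:
    endpoint: https://telemetry.example.internal   # hypothetical backend endpoint

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [probabilistic_sampler, batch]
      exporters: [otlphttp]

Lower-cardinality service-level metrics can continue to flow through a separate, unsampled metrics pipeline.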

  • Operational efficiency at scale
    A backend built for high cardinality uses techniques like label-aware indexing, ingest-time aggregation, and tiered retention to avoid constant sharding or downsampling (see the recording-rule sketch after this list for the simplest form of pre-aggregation). This keeps detailed labels queryable in real time, so engineers can investigate issues without adding infrastructure complexity.

  • Retention without losing detail
    A storage engine that de-duplicates label keys and values across series can keep high-cardinality datasets in hot storage for longer without a steep cost increase, particularly when label values have high repetition rates. This means you can preserve full context — every label you need — for historical investigations, without relying solely on downsampled or aggregated data.

The payoff comes from combining these gains: precise isolation, faster detection, fewer manual steps, and less engineering effort spent just keeping telemetry running — all of which translate to faster incident resolution and better use of your team’s time.

2 Approaches to High-Cardinality Observability

There are two main ways teams approach high-cardinality observability — run your own open-source stack or go with a commercial platform.

Both can deliver the ability to store and query high-dimensional telemetry, but the trade-offs — and how you plan for them — are very different.

Self-Managed (Open Source & Custom Stacks)

A self-managed approach gives you full control over how telemetry is ingested, stored, queried, and scaled — and the trade-offs that come with each choice.

Some teams extend what they already run. If you’re using Prometheus, adding a backend like VictoriaMetrics or Grafana Mimir raises the ceiling well beyond Prometheus’s usual limits without forcing you to change queries or dashboards.

A single Prometheus instance can handle 1–2M active time series comfortably, and with careful tuning and sufficient hardware, teams have pushed beyond 10M.

At that scale, latency climbs because every extra label combination is a new series the engine must scan and aggregate. Storage pressure rises too — each series carries index data and chunk metadata in addition to samples, so millions more series mean more memory and disk use.

Backends like VictoriaMetrics and Mimir are built to distribute load across nodes, handle more concurrent queries, and optimize high-cardinality workloads with features like streaming aggregation and index compression — all while keeping the PromQL developer experience intact.
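
Wiring an existing Prometheus into such a backend is usually a remote_write change rather than a migration; dashboards and PromQL stay the same, only the storage target moves. A minimal sketch, assuming a hypothetical VictoriaMetrics endpoint:

remote_write:
  - url: "https://vm.example.internal/api/v1/write"   # hypothetical ingestion endpoint
    queue_config:
      # Larger batches and more in-flight shards help sustain high series counts.
      max_samples_per_send: 10000
      capacity: 20000
      max_shards: 30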

Another option is to start fresh with databases built for time series at a massive scale:

  • ClickHouse — A columnar database that shines for analytical queries across billions of rows. With optimized schemas and adequate hardware provisioning, and when data is partitioned by time and indexed by dimensions, it can deliver sub-200 ms queries even when scanning millions of series. Dictionary encoding and advanced codecs (such as ZSTD) can reduce storage by 5–10× for repeated label values, since adjacent identical values compress extremely well.

  • TimescaleDB — Built on PostgreSQL, offering hypertables and native time-series functions. It’s ideal when you need both relational joins (e.g., metadata tables) and time-series analytics in the same query, and want to reuse SQL skills.

  • Apache Druid — Optimized for real-time ingestion plus sub-second aggregations. Its segment-based storage model handles high-dimensional data well, making it suitable for interactive dashboards with constantly updating streams.

Key considerations:

  • You’ll need to manage sharding across nodes and ensure HA setups for resilience.
  • Query performance depends on smart indexing, partitioning, and in some cases, pre-computation (recording rules, materialized views).
  • Storage growth is non-linear with cardinality; compression formats, label de-duplication, and aggregation strategies are essential to keep costs in check.
  • Consider whether your use case actually requires high cardinality. Many teams successfully run Prometheus with proper label design and selective high-cardinality collection for specific debugging scenarios.

This path suits teams with strong infrastructure skills and a willingness to invest engineering time in tuning and scaling. The upside: you decide exactly what trade-offs to make between performance, cost, and retention.

Commercial Platforms

A commercial platform takes most of the infrastructure burden off your team by running the observability backend for you.

Teams often choose between approaches based on their operational maturity and specific requirements. Self-managed solutions work well for organizations with strong infrastructure expertise and predictable scaling needs. Commercial platforms suit teams prioritizing development velocity over infrastructure control, especially when compliance, SLAs, and vendor support matter more than customization flexibility.

Vendors like Datadog, New Relic, and others operate ingestion pipelines, query execution, indexing, retention tiering, and multi-region replication. You can send high-cardinality metrics, logs, and traces and run multi-dimensional queries without having to manage storage node scaling, tune PromQL execution, or design sharding strategies yourself.

In commercial setups, the focus shifts from managing infrastructure to managing data volume and cost — deciding how much telemetry to send, which dimensions to retain, and how long to keep it without overshooting budgets or performance goals.

These platforms typically:

  • Distribute ingest across horizontally scaled storage and query clusters.
  • Maintain label indexes and time-partitioned data to resolve high-cardinality filters efficiently.
  • Apply compression, dictionary encoding, and segment pruning to keep storage growth and query times predictable.
  • Automatically rebalance workloads and manage hot/cold storage tiers to optimize performance and cost.

Key considerations:

  • Cost is the main constraint, but performance still depends on efficient label filtering and query design.
    Pricing is usage-based: per host/container for infrastructure metrics, per million spans for traces, per GB for logs, and either bundled or separate pricing for metrics. High ingest volumes, long retention periods, or extremely detailed label sets can quickly drive up monthly costs.
  • Backend control is limited — you can’t always change storage formats, query execution strategies, or retention policies the way you can in a self-managed environment.

Here’s how you can address these constraints:

  • Built-in aggregation pipelines — Aggregate or bucket data as it’s ingested to cut down on storage and query costs, while keeping the label dimensions that matter for investigations.
  • Retention tiering — Keep high-cardinality data in hot storage for immediate operational use, then move older data to lower-resolution, cheaper tiers without losing long-term visibility.
  • Label filtering — Drop low-value or noisy labels (e.g., test_env, debug_id) before ingestion to keep series counts under control and costs predictable.
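
If metrics pass through Prometheus or an agent on their way to the platform, this filtering can happen at scrape time with metric_relabel_configs. A minimal sketch that drops the two labels mentioned above; the target and label names are assumptions:

scrape_configs:
  - job_name: "app"
    static_configs:
      - targets: ["app:9090"]
    metric_relabel_configs:
      # Remove labels that add cardinality but little investigative value.
      - regex: "(test_env|debug_id)"
        action: labeldrop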

At a glance:

  • Self-managed stacks give you maximum control and tuning flexibility, but keeping performance predictable at high cardinality requires ongoing engineering effort.
  • Commercial platforms handle scaling and backend performance for you, but keeping costs predictable means actively managing ingest volume, retention, and label strategy.

Solving the Performance vs Cost Equation

Self-managed stacks put you in control but require ongoing performance tuning as cardinality grows. Commercial platforms handle performance for you, but make cost control the primary challenge.

Last9 bridges that gap — delivering high-cardinality performance without the operational tax, and predictable costs without per-host or per-metric penalties.

Purpose-Built for High Cardinality

Last9’s quotas start at 20M+ active time series per metric per day while keeping queries fast.

The Control Plane also lets you use Streaming Aggregation, which processes telemetry before it’s stored, turning incoming metrics, events, and logs into scoped metrics that keep the labels you care about while being cheaper to store and faster to query. All of this is purpose-built to optimize performance and storage at high cardinality, giving customers a better return on investment.

“Using Last9’s high cardinality workflows, we accurately measured customer SLAs across dimensions, extract knowledge about our systems, and measure customer impact proactively.”
— Ranjeet Walunj, SVP Engineering, Clevertap

Unified Telemetry Control

With the Control Plane, you can govern ingestion, storage, and query behavior in real time. From the moment telemetry enters the system, you can:

  • Drop noisy or unused metrics
  • Transform labels for consistency
  • Route data to appropriate storage tiers

All without touching application code or redeploying services.

Unified Signals, Faster Detection

Last9 connects metrics, logs, and traces using OpenTelemetry standards. From a metric spike, jump straight to the exact log line or trace span in one click — no grepping, no timestamp matching.

Real-time remapping automatically standardizes inconsistent field names like custID and customerID, removing query friction and accelerating investigations. The result: lower MTTD and faster, more confident root cause isolation.

Built-In Cost Awareness

The Cardinality Explorer shows how metric cardinality is trending at the cluster level, so you can spot growth early, see which labels contribute most, and take action before costs or query performance are affected.

“That we don’t have to fight our system’s natural high cardinality anymore is liberating.”
— Matt Iselin, Head of SRE, Replit

Transparent, Event-Based Pricing

Event-based pricing means you pay only for the telemetry events you ingest — whether metrics, logs, or traces — rather than infrastructure size. This keeps costs predictable even as cardinality grows, so you can retain all the labels you need for fast, precise investigations.

Faster Path to Value

Because Last9 works with your existing instrumentation and dashboards, teams often reach production readiness in weeks. You get both speed and control from day one — along with the visibility needed to detect, investigate, and resolve incidents faster.

The simplest way to understand the value of high-cardinality observability is to see it working on your data.

Talk to our experts and get started today!

In the upcoming articles, we’ll break down dimensionality, how it shapes high cardinality, and practical ways to manage it.
