
Hidden Correlations Traditional Monitoring Misses

Understand how detailed telemetry uncovers correlations and reveals why certain user groups face issues that top-line metrics miss.

Sep 15, 2025

Your dashboards are all green.
Response times look fine, error rates are low, and CPU usage is steady.

And yet, support tickets start piling up — customers are reporting slow checkouts.

Here’s what your dashboards show:

  • Average response time: 280 ms
  • Error rate: 0.2%
  • CPU utilization: 60%

All well within thresholds. Nothing to panic about.

But when you slice the data by attributes, the picture changes:

checkout_latency{region="us-west", payment_method="apple_pay", user_segment="premium", device="mobile"} = 8.2s
checkout_latency{region="us-west", payment_method="stripe", user_segment="premium", device="mobile"} = 190ms

For premium mobile users in us-west, Apple Pay checkouts are ~43× slower than Stripe. This group is just 1.8% of overall traffic — statistically invisible in aggregate metrics — but they’re also high-value customers.

So why does this happen?
Because most monitoring pipelines optimize for cost and speed. They pre-aggregate data at ingestion or strip out certain dimensions before queries run. That makes dashboards fast and lightweight — but it also erases the multi-dimensional combinations (region, payment method, device, customer tier) you need to pinpoint localized or segment-specific issues.

In this part of the High Cardinality series, we explore the correlation gap — how it shows up in telemetry pipelines, where dimensions are lost, and the patterns that help preserve them.

The Correlation Gap

A correlation gap occurs when the relationship between multiple signals exists in the raw telemetry but cannot be observed in the processed dataset. This usually happens when aggregation, sampling, or context loss removes the key dimensions needed to connect those signals.

From a systems perspective, there are three main causes:

  1. Metric-level isolation
    Metrics are collected and stored per service or component without a common correlation key. For example, database latency metrics and API request latency metrics might both show anomalies, but without a shared trace_id or request_id, it’s not possible to confirm they are part of the same user transaction.

  2. Lossy transformations
    Aggregation is essential for keeping query performance predictable, but where in the pipeline it happens matters. If aggregation is applied before essential dimensions — such as region + version + client_id — are retained or indexed, those combinations can’t be reconstructed later.

    Correlation-aware designs, such as streaming aggregation with preserved dimensions, keep critical context available while still managing query latency and storage usage.

  3. Context fragmentation
    Logs, metrics, and traces often use different timestamp resolutions, inconsistent label names, or incompatible identifiers. Without alignment at collection time, joining them later is computationally expensive or infeasible in real time.

The result is a visibility gap: the relevant correlation technically exists in the system’s telemetry, but the stored representation no longer contains enough aligned, granular detail to surface it. In high-throughput environments with millions of unique series, even a small amount of early data loss or schema mismatch can make cross-signal analysis slow or incomplete.
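To make the first cause concrete, here's a minimal sketch in Python, with made-up records and field names, of what confirming that kind of cross-signal relationship takes. The join works only because both signals kept a shared request_id; drop that key at ingestion and the two anomalies stay disconnected.

# Minimal sketch: correlating two telemetry streams on a shared key.
# Records and field names are hypothetical; the point is that the join
# exists only because both sides retained "request_id".

api_latency = [
    {"request_id": "req-101", "endpoint": "/checkout", "latency_ms": 8200},
    {"request_id": "req-102", "endpoint": "/checkout", "latency_ms": 190},
]
db_latency = [
    {"request_id": "req-101", "query": "payments_lookup", "latency_ms": 7900},
    {"request_id": "req-102", "query": "payments_lookup", "latency_ms": 45},
]

# Index one stream by the correlation key, then join the other against it.
db_by_request = {row["request_id"]: row for row in db_latency}

for api_row in api_latency:
    db_row = db_by_request.get(api_row["request_id"])
    if db_row and api_row["latency_ms"] > 2000:
        share = db_row["latency_ms"] / api_row["latency_ms"]
        print(f'{api_row["request_id"]}: {share:.0%} of request latency spent in the database')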

Where Correlations Disappear

In complex systems, most incidents aren’t caused by a single failing component. They surface when multiple factors — region, version, configuration, workload distribution — intersect in unexpected ways. These are exactly the kinds of patterns that vanish when dimensional context is stripped away:

  1. Regional Performance Traps

    An API’s global latency metric might show stable performance: api_latency{endpoint="/search"} = 180ms

    But introducing regional and tier breakdowns can surface outliers: api_latency{endpoint="/search", region="ap-southeast", user_tier="enterprise"} = 4.1s

    Here, enterprise-tier users in Southeast Asia experience a ~23× slowdown. Without those dimensions, the degradation is invisible at the aggregate level — even though it affects a high-value segment.

  2. Version-Specific Failures

    At a service level, payment processing success appears healthy: success_rate{service="payment-processor"} = 99.1%

    Correlating with deployment and client metadata reveals a targeted issue: success_rate{service="payment-processor", version="v3.2.1", client_sdk="mobile-2.4"} = 73%

    The issue occurs only when a specific backend release interacts with a particular mobile SDK version. Without cross-referencing service deployment data and client identifiers, this correlation is hard to detect.

  3. Routing and Load Distribution Asymmetry

    Cluster metrics can show balanced utilization: memory_usage{cluster="production"} = 58%

    But when filtered by pod and routing configuration: memory_usage{cluster="production", pod="user-service-7f", session_affinity="enabled"} = 94%

    Session affinity causes certain pods to receive a disproportionate share of requests, leading to localized resource pressure. The correlation becomes clear only when routing policy, pod identity, and usage metrics are viewed together.

Across all three cases, the underlying issue is the same: the signals that explain the anomaly exist, but they can only be connected when the right dimensions are preserved and accessible at query time.

This is why correlation-ready telemetry — where metrics, logs, and traces share aligned metadata — is critical for fast, targeted investigation.

Why Correlations Get Lost

The examples above show what it looks like when important signals never make it to your dashboards. The underlying reason often comes down to how telemetry pipelines manage data. Sampling and aggregation are essential — they keep systems stable, cost-effective, and performant. But depending on when and where they’re applied, they can also remove the dimensions you later need for cross-signal analysis.

1. Sampling in Distributed Tracing

Sampling helps control storage, processing load, and cost. Capturing every trace in a high-throughput system isn’t feasible, so teams decide which traces to keep. Different strategies surface different trade-offs:

  • Head-based sampling makes the decision as soon as a trace starts, often before meaningful work occurs. This keeps overhead low and ingestion predictable. The trade-off is that rare conditions later in the request path — a slow downstream call, or an edge case behind a feature flag — may not be captured.

  • Tail-based sampling waits until the trace finishes. This makes it possible to apply rules like keep if duration > 2s or keep if error=true, ensuring anomalies are more likely to be retained. The challenge is that every span must be buffered until the trace completes, which adds memory overhead and latency at scale.

In practice, tail-based sampling favors richer visibility, while head-based sampling favors efficiency.
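As a rough illustration of the difference, here's a small Python sketch of a tail-based sampling decision. It assumes spans are buffered in memory per trace; the thresholds and field names are illustrative, not taken from any particular tracing backend.

# Sketch: tail-based sampling decides keep/drop only once the trace is complete.
from collections import defaultdict

LATENCY_THRESHOLD_MS = 2000          # keep if total duration > 2s
buffered_spans = defaultdict(list)   # trace_id -> spans seen so far

def on_span_finished(span):
    # Buffer every span until its trace can be judged as a whole.
    buffered_spans[span["trace_id"]].append(span)

def on_trace_completed(trace_id):
    # The keep rule can look at the whole trace: duration, errors, any span.
    spans = buffered_spans.pop(trace_id, [])
    total_ms = sum(s["duration_ms"] for s in spans)
    has_error = any(s.get("error") for s in spans)
    return has_error or total_ms > LATENCY_THRESHOLD_MS, spans

on_span_finished({"trace_id": "t1", "duration_ms": 2500, "error": False})
print(on_trace_completed("t1"))      # kept, because total duration exceeds 2s

A head-based sampler would instead flip a coin (or hash the trace_id) at the first span, before the total duration or error status is knowable, which is exactly why rare, late-developing anomalies can be dropped.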

2. Aggregation Point in Metrics Pipelines

Metrics pipelines also use aggregation to stay manageable. By reducing the number of time series stored, they:

  • Keep storage and indexing costs under control.
  • Improve dashboard responsiveness for common queries.
  • Stabilize ingestion during cardinality spikes.

Where aggregation happens, however, makes a difference:

  • Early aggregation reduces data size quickly, for example, by averaging checkout_latency across all payment_method values before persisting. Dashboards become faster, but fine-grained combinations — region + version + client_id — are no longer available.

  • Streaming aggregation with preserved dimensions holds onto key labels like trace_id, payment_method, and region long enough for correlation, then rolls up less critical ones for long-term storage. This keeps performance predictable while still supporting detailed analysis when needed.
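Here's a rough sketch of that second approach, assuming a simple in-process rollup rather than a real streaming engine. Correlation-relevant labels are kept as part of the aggregation key, while labels not needed for correlation (here, pod_id) are rolled up away before anything is persisted; which labels count as "preserved" is an illustrative choice.

# Sketch: streaming-style aggregation that preserves correlation keys.
from collections import defaultdict

PRESERVED = ("region", "payment_method", "user_segment", "device")

sums = defaultdict(float)
counts = defaultdict(int)

def record(labels, latency_ms):
    # Aggregate by preserved dimensions only; all other labels are dropped.
    key = tuple(labels.get(d, "unknown") for d in PRESERVED)
    sums[key] += latency_ms
    counts[key] += 1

record({"region": "us-west", "payment_method": "apple_pay",
        "user_segment": "premium", "device": "mobile", "pod_id": "checkout-7f"}, 8200)
record({"region": "us-west", "payment_method": "stripe",
        "user_segment": "premium", "device": "mobile", "pod_id": "checkout-2a"}, 190)

for key, total in sums.items():
    print(dict(zip(PRESERVED, key)), "avg_ms =", total / counts[key])

Averaging across payment_method before this step would collapse both rows into one flattened number, and the Apple Pay outlier would disappear.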

3. Combined Effect

When sampling decisions are made before anomalies appear, and aggregation happens before essential dimensions are stored, certain correlations naturally become harder to detect.

Take a rare latency spike tied to a specific mobile client and payment gateway. If head-based sampling drops the trace, the anomaly isn’t recorded. If the trace is kept but early aggregation removes client and payment details, the connection to related metric patterns can’t be made later.

This is why many teams adopt correlation-aware pipelines:

  • Sampling rules are tuned to investigative goals.
  • Key dimensions are preserved until cross-signal joins are possible.
  • Streaming aggregation balances cost efficiency with the need for detailed context.

The goal is to keep enough aligned detail available when you need to explain an anomaly.

How Cloud-Native Architectures Create More Dimensions

Cloud-native systems generate telemetry with far more detail than traditional setups. That’s because every layer of the stack adds its own labels, and those labels combine in ways that quickly multiply.

  • Pods and containers: In Kubernetes, pods and containers are short-lived. Each restart or redeploy introduces new identifiers like pod_id or container_id. Even when the workload is steady, the churn creates new series.

  • Service meshes and sidecars: Adding a service mesh means each request passes through proxies. Those proxies attach attributes such as upstream_service, cluster, or policy, expanding the dimensional space for traces and metrics.

  • Continuous delivery: Frequent releases mean version labels like commit_sha, build_id, or image_tag change constantly. Each change leaves behind a unique time series tied to that version.

  • Multi-region and multi-cloud: Running across regions or providers multiplies the series count. A metric that looks simple in one cluster becomes many when broken down by region, zone, or cloud provider.

  • Feature flags and experiments: Modern applications often include feature toggles and A/B tests. Dimensions like feature_flag or experiment_group create fine-grained series that are valuable for debugging but increase overall cardinality.

All of these dimensions are useful — they describe how the system is running, who is affected, and under what conditions. Together, they explain why cloud-native environments naturally produce millions of unique time series, even when the actual traffic load hasn’t changed.
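The multiplication is easy to underestimate. A back-of-the-envelope calculation in Python, with made-up but plausible per-dimension counts:

import math

# Hypothetical cardinalities per dimension for a single metric name.
dimensions = {
    "pod_id": 300,            # churned pods over the retention window
    "version": 25,            # recent commit_sha / image_tag values
    "region": 6,
    "cloud_provider": 2,
    "feature_flag": 12,
    "experiment_group": 4,
}

# Worst case, every combination can appear as its own time series.
potential_series = math.prod(dimensions.values())
print(f"potential unique series for one metric: {potential_series:,}")   # 4,320,000

One metric, six ordinary-looking dimensions, and you're already past four million potential series before traffic grows at all.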

The Math Behind Correlation at Scale

Identifying correlations in high-cardinality telemetry is not only a question of retaining the right data — it’s also a computational challenge. When each metric can have millions of unique label combinations, naive analysis quickly becomes infeasible.

At the simplest level, running pairwise correlation analysis across all metric pairs has O(N²) complexity, and the numbers get out of hand quickly: even a few thousand series already imply millions of pairs, and once you consider higher-order combinations the candidate count runs into the trillions. At a scale of 1 million unique time series, pairwise analysis alone means approximately 500 billion comparisons (N²/2 with N = 1,000,000), a computational demand well beyond what’s practical for real-time incident response systems.

More advanced techniques are worse: classical hierarchical clustering requires O(N³) time and O(N²) memory for the distance matrix, so memory costs grow quickly alongside the correlation matrices themselves.

For correlation matrices, a 10,000-metric matrix requires approximately 100 million entries (~400 MB), while scaling to 1 million metrics pushes memory requirements into the terabyte range due to the quadratic growth in matrix size.
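The arithmetic behind those numbers, as a quick sketch (assuming 4-byte float32 matrix entries):

# Back-of-the-envelope cost of naive correlation analysis.
N = 1_000_000                              # unique time series
pairwise = N * (N - 1) // 2                # ~5.0e11 pairwise comparisons
small_matrix_mb = (10_000 ** 2) * 4 / 1e6  # 10,000 metrics -> ~400 MB
large_matrix_tb = (N ** 2) * 4 / 1e12      # 1,000,000 metrics -> ~4 TB
print(f"{pairwise:.2e} comparisons, {small_matrix_mb:.0f} MB, {large_matrix_tb:.0f} TB")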

To manage these constraints, large-scale telemetry systems often use a combination of strategies:

  • Targeted statistical checks: Running simple correlations (e.g., Pearson, mutual information) on scoped subsets of metrics to validate suspected relationships without scanning the entire dataset.

  • Scoped anomaly detection: Applying clustering or ML-based analysis only to metrics filtered by relevant attributes such as service, region, or anomaly tags.

  • Stream-based joins: Correlating metrics in-flight using tools like Apache Flink or Kafka Streams, which avoids querying large historical datasets.

  • Memory-efficient counting: Using probabilistic data structures like HyperLogLog or Count-Min Sketch to track unique label combinations or frequency estimates in constant space.
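As a concrete example of that last point, here's a minimal Count-Min Sketch in Python: a fixed-size structure that estimates how often each label combination appears without storing every combination exactly. The width, depth, and key encoding are illustrative choices.

# Minimal Count-Min Sketch: approximate frequency counts in constant space.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _columns(self, key):
        # One hash per row, derived by prefixing the row index into the key.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, col in self._columns(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Collisions only ever inflate counts, so the minimum is the estimate.
        return min(self.table[row][col] for row, col in self._columns(key))

cms = CountMinSketch()
cms.add("region=us-west|payment_method=apple_pay|user_segment=premium")
print(cms.estimate("region=us-west|payment_method=apple_pay|user_segment=premium"))   # 1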

These approaches are not mutually exclusive — the right mix depends on architecture, workload, and latency requirements.

In practice, the key is to narrow the search space quickly, retain the dimensions most likely to explain anomalies, and apply heavier analysis only when the scope is small enough to be computationally efficient.

Patterns That Preserve Correlation

Reducing the time it takes to uncover meaningful correlations depends as much on how telemetry is collected and stored as on the queries run against it.

In high-cardinality environments, these design principles consistently improve correlation visibility.

1. Preserve Context Across Boundaries

Every telemetry signal — metric, log, trace, event — should carry the identifiers that allow it to be linked to related signals.
This means:

  • Using consistent trace propagation formats, such as W3C Trace Context, so trace_id and span_id survive service hops, message queues, and async workflows.
  • Ensuring business and operational metadata (e.g., customer_tier, payment_method, region) is tagged consistently at the point of collection.

Without this, even if the raw data exists, joining across services or telemetry types becomes computationally expensive or infeasible in real time.
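The W3C traceparent header itself is just a versioned, dash-separated string, and carrying it across a hop is mechanically simple. The sketch below shows the shape of it, with hypothetical helper names; in practice an OpenTelemetry propagator handles this for you.

# Sketch: carrying W3C Trace Context across a service hop by hand.
import secrets

def build_traceparent(trace_id=None, span_id=None, sampled=True):
    # traceparent = version - trace-id (16 bytes hex) - parent-id (8 bytes hex) - flags
    trace_id = trace_id or secrets.token_hex(16)
    span_id = span_id or secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

def extract_traceparent(headers):
    # Parse the incoming header so the next span reuses the same trace_id.
    _version, trace_id, parent_id, flags = headers["traceparent"].split("-")
    return {"trace_id": trace_id, "parent_span_id": parent_id, "sampled": flags == "01"}

# Outbound call: attach the header so downstream spans join the same trace.
outbound_headers = {"traceparent": build_traceparent()}
print(extract_traceparent(outbound_headers))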

2. Design for Dimensions

Aggregation should be a conscious trade-off, not a default. Keep dimensional richness — such as version, deployment environment, device type, and routing policy — intact until after investigative queries have run.

Once pre-aggregation removes these attributes, correlations that rely on them cannot be reconstructed later.

A correlation-ready backend will:

  • Retain high-cardinality dimensions during active analysis.
  • Apply streaming aggregation only after correlation keys are indexed and available.
  • Support querying on combinations that would be invisible in pre-aggregated metrics.

3. Implement Streaming Correlation

Correlations are most valuable while an incident is still unfolding. Waiting to query large historical datasets slows detection and increases time-to-resolution.
Stream-based correlation pipelines — using platforms like Apache Flink or Kafka Streams — allow:

  • Real-time joins across metrics and traces using shared identifiers.
  • Immediate anomaly tagging for specific dimensions.
  • Progressive narrowing of the search space before heavier analysis.
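A toy version of that idea, in plain Python rather than Flink or Kafka Streams: hold recent records from each stream in a short window keyed by trace_id, and emit a joined record as soon as both sides arrive. The window size and record shapes are illustrative.

# Toy windowed stream join keyed on trace_id; real pipelines would use
# Flink or Kafka Streams with proper event-time windows.
import time

WINDOW_SECONDS = 60
pending = {}   # trace_id -> (arrival_time, source, record)

def on_record(source, record):
    # source is "metrics" or "traces"; returns a joined dict when both sides match.
    now = time.time()
    for key in [k for k, (t, _, _) in pending.items() if now - t > WINDOW_SECONDS]:
        del pending[key]                      # expire stale, unmatched records
    trace_id = record["trace_id"]
    other = pending.get(trace_id)
    if other and other[1] != source:
        _, other_source, other_record = pending.pop(trace_id)
        return {"trace_id": trace_id, source: record, other_source: other_record}
    pending[trace_id] = (now, source, record)
    return None

on_record("traces", {"trace_id": "abc123", "duration_ms": 8200})
print(on_record("metrics", {"trace_id": "abc123", "payment_method": "apple_pay"}))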

4. Unify Telemetry Schemas

Correlation queries are faster and simpler when all telemetry types share:

  • Aligned timestamps at a consistent resolution.
  • Standardized metadata keys (region, not geo in one dataset and location in another).
  • Common correlation identifiers across metrics, logs, and traces.

A unified schema removes the need for complex transformations or slow, expensive joins during an investigation.
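A small example of what that alignment can look like when applied at collection time, with a hypothetical alias table:

# Sketch: normalize metadata keys and timestamp resolution at collection
# time, so later joins need no per-source translation.
KEY_ALIASES = {"geo": "region", "location": "region", "svc": "service"}

def normalize(record):
    out = {KEY_ALIASES.get(key, key): value for key, value in record.items()}
    # Standardize on millisecond epoch timestamps across all signal types.
    if "timestamp_s" in out:
        out["timestamp_ms"] = int(out.pop("timestamp_s") * 1000)
    return out

print(normalize({"geo": "us-west", "svc": "checkout", "timestamp_s": 1726400000.25}))
# {'region': 'us-west', 'service': 'checkout', 'timestamp_ms': 1726400000250}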

The Wide Events Pattern in Modern Observability

The wide events approach has been advocated by observability practitioners for years. Brandur Leach described it in 2019 as “Canonical Log Lines” at Stripe, and AWS now recommends it as a best practice. The idea is straightforward: instead of scattering details across multiple logs, metrics, and traces, emit a single context-rich event per service hop that carries everything needed to understand what happened.

A typical wide event includes:

  • Trace and span IDs for correlation
  • Service and instance identifiers
  • Business context, such as subscription tier, payment method, and region
  • Performance metrics like latency or error codes
  • Request metadata and user attributes

With this model, every row of data is already packed with context — query identifiers, pod names, version metadata, and network details — without requiring pre-aggregation or dimension trimming. That means you can run queries like:

SELECT AVG(latency_ms)
FROM events
WHERE payment_method = 'apple_pay'
AND region = 'us-west'
AND user_segment = 'premium';

These queries remain fast and reliable because the context was stored upfront, rather than stitched together later with expensive joins.
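For a sense of what emitting one context-rich event per service hop looks like, here's a hedged sketch; the emit_wide_event helper and the field names are illustrative rather than any specific SDK's API.

# Sketch: one canonical, context-rich event assembled at the end of a handler.
import json, time

def emit_wide_event(event):
    # In a real system this goes to your event pipeline; printing stands in.
    print(json.dumps(event))

def handle_checkout(request, response, started_at, span):
    emit_wide_event({
        # Correlation identifiers
        "trace_id": span["trace_id"],
        "span_id": span["span_id"],
        # Service and instance
        "service": "checkout",
        "version": "v3.2.1",
        "pod": "checkout-7f",
        # Business context
        "user_segment": request["user_segment"],
        "payment_method": request["payment_method"],
        "region": request["region"],
        # Performance
        "latency_ms": int((time.time() - started_at) * 1000),
        "status_code": response["status"],
    })

handle_checkout(
    {"user_segment": "premium", "payment_method": "apple_pay", "region": "us-west"},
    {"status": 200},
    started_at=time.time() - 8.2,
    span={"trace_id": "abc123", "span_id": "def456"},
)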

Netflix has discussed its event-based observability systems, where domain-specific events, distributed tracing, and consistent tagging are combined with stream-processing pipelines that handle billions of events in real time.

Across the industry, teams building at scale are adopting wide events as part of a cultural shift toward high-cardinality observability. Instead of treating logs, metrics, and traces as separate silos, the wide events model preserves context at the row level and keeps it available for correlation when it matters most.

The Value of Preserving Dimensions

A correlation-ready backend ensures the right dimensions remain queryable exactly when you need them. When these principles are applied, incident investigation changes in three important ways:

  1. No dead ends in queries
    Metrics, logs, and traces retain linking keys — trace_id, span_id, service, and business dimensions — until after you’ve had the chance to use them. This makes it straightforward to connect a latency spike in a checkout service to a specific payment provider and customer tier without rebuilding datasets.
  2. Real-time, high-cardinality filtering
    Essential dimensions are preserved during streaming aggregation, so you can run queries like:
    region="us-west" AND payment_method="apple_pay" AND user_segment="premium"
    while an incident is still in progress, instead of waiting for roll-ups that might flatten the pattern.
  3. Aligned telemetry across signals
    Metrics, events, logs, and traces share the same schema and timestamps, so joins are simple and computationally efficient.

In many environments, these capabilities are trade-offs — teams often pre-aggregate to keep queries fast, drop certain labels to control cost, or sample more aggressively to reduce ingestion.

Last9, a telemetry data platform, is designed so these trade-offs are optional, not forced:

  • Keep all the context: Retain every attribute you care about, even during cardinality spikes.
  • Query while it’s happening: Run high-dimensional correlation checks in real time, without waiting for data roll-ups.
  • Adjust without redeploying: Drop noisy metrics, rename labels, or route data directly from the Control Plane.
  • Unified telemetry: Metrics, events, logs, and traces share the same view and timestamps, making investigation smoother.

“Using Last9’s high cardinality workflows, we accurately measured customer SLAs across dimensions, extract knowledge about our systems, and measure customer impact proactively.”
— Ranjeet Walunj, SVP Engineering, CleverTap

The outcome is straightforward: when a pattern matters — whether it’s a regional slowdown, a version-specific bug, or a previously unseen interaction you didn’t know to look for — the dataset already contains the context to explain it, and the answer is available while it’s still actionable.

If this sounds like something you’ve been looking for, talk to our experts about how Last9 can fit into your stack — so the correlations you need are always available when you need them.

In upcoming articles, we’ll cover OpenTelemetry and modern tooling for high cardinality — from OTel’s cardinality-friendly design and attribute strategies to moving beyond Prometheus limits while still sharpening your PromQL skills at scale.
