
7 Observability Solutions for Full-Fidelity Telemetry

A quick guide to how seven leading observability tools support full-fidelity telemetry and the architectural choices behind them.

Nov 24, 2025

TL;DR

You don’t have to choose between capturing every signal and keeping costs predictable. Modern observability stacks blend full-fidelity storage (time series or columnar systems like ClickHouse and Apache Druid), tail-based sampling for heavy traffic, and tiered storage (hot/warm/cold with S3-backed archives).
This gives you full-fidelity incident forensics with the day-to-day cost profile of a sampled setup.

Platform comparison (Top 7)

  • Last9 — A telemetry data platform with a pre-ingestion Control Plane for aggregation, logs-to-metrics, filtering, and routing. Event-based pricing. Out of the 20 largest livestreamed events in history, 12 were monitored with Last9, which speaks to proven scalability.
  • Datadog — Easiest to adopt and heavily agent-centric. Head-based sampling, polished UI, 800+ integrations. Typically $5K–25K/mo for 50–200 hosts.
  • New Relic — Strong high-cardinality support with adaptive sampling and tail-based “Infinite Tracing.” User-based pricing works well for APM-heavy teams.
  • Grafana Cloud — Open standards (Mimir, Loki, Tempo). No mandatory sampling. Adaptive Metrics reduces unused series. Usually 6–9× cheaper than Datadog, but more ops-heavy.
  • Calyptia / Fluent Bit — High-throughput collectors for logs/metrics/traces. Ideal for shaping noisy telemetry at the edge before it hits storage.
  • Elastic Observability — Search-first architecture with hot/warm/frozen tiers and S3-backed snapshots. Great for log-heavy environments.
  • Lightstep — Focused on distributed tracing with tail-based sampling, service diagrams, and detailed latency analysis. Strong debugging workflows, good for teams deep into tracing-oriented observability.

Introduction

You can run an observability platform that ingests and keeps full-fidelity telemetry, trims noise in real time, and stores data without blowing up your budget — as long as you use columnar storage with smart sampling and tiered retention.

You've probably felt the tension between two choices: collect everything and watch costs climb, or reduce data and worry about missing the one trace that would've explained last night's issue. And when something breaks at 3 am, having the right signals in front of you often decides whether you fix it in minutes or keep guessing.

The usual full-fidelity-versus-sampling debate is changing. With columnar storage, tail-based sampling, and layered retention, you can keep investigation-ready data without holding everything in hot storage.

In this blog, we talk about how full-fidelity telemetry systems actually work, how observability platforms like Last9, Datadog, Grafana Cloud and more approach these choices, and the patterns you can rely on when you're running observability at scale.

Why Full-Fidelity is Important (And What It Costs)

Full-fidelity telemetry gives you the full story of what happened in your system — every flow, packet, log line, metric sample, and trace span. During an outage, this level of detail helps you reconnect the sequence of events without wondering whether a key signal dropped somewhere in the pipeline.

This matters because modern systems produce failure patterns that are rarely clean:

  • A slow dependency triggers a queue buildup
  • A retry spike pushes CPU usage upward
  • A small config change in one service affects another in unexpected ways

When you have all the signals, you can walk through that chain calmly and fix the issue with confidence instead of piecing things together from partial data.

The challenge is scale

Moving from a monolith to Kubernetes changes your telemetry footprint completely:

  • A workload that once generated 150,000 time series can grow into 150 million unique series
  • Each pod, node, and label combination adds more dimensionality
  • Cluster churn (pod restarts, autoscaling) adds even more series over time

A single 200-node cluster tracking labels like userAgent, sourceIPs, nodes, and status codes can create roughly 1.8 million custom metrics — often translating to about $68,000 per month on traditional platforms.
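To see how those labels multiply, here's a small sketch with hypothetical per-label counts (the values below are illustrative assumptions, not measured numbers) that lands near the 1.8 million figure:

```python
# Hypothetical label counts for a 200-node cluster (illustrative assumptions).
label_cardinalities = {
    "node": 200,        # one value per node in the cluster
    "userAgent": 30,    # assumed number of distinct user agents
    "sourceIP": 60,     # assumed number of distinct source IP buckets
    "statusCode": 5,    # e.g. 200, 301, 404, 500, 503
}

# Every unique combination of label values becomes its own time series.
total_series = 1
for label, count in label_cardinalities.items():
    total_series *= count

print(f"Unique series: {total_series:,}")  # 1,800,000, roughly the 1.8M figure above
```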

Coinbase's 2021 example shows how fast this grows across the industry:

  • Its Datadog bill reached $65M
  • Monthly transacting users climbed to 11.4 million
  • Pricing was later renegotiated, but the episode highlighted how quickly telemetry-related costs rise as systems scale

Full-fidelity telemetry gives you clarity during incidents; the cost comes from how modern infrastructure multiplies the volume and variety of what you collect. Balancing both ends — the detail you want and the scale you operate at — is now part of every observability decision.

How Full-Fidelity Systems Work

A full-fidelity system keeps every signal intact, stores it in the right layer, and still responds quickly even as your telemetry volume grows. The ingestion layer is where this reliability starts.

Ingestion: Getting Data In Without Loss

Your telemetry pipeline usually moves through a straightforward sequence: sources → agents → buffering → processing → storage → queries. The configuration at each step determines whether you can hold full-fidelity data while keeping things efficient.

Modern systems accept a wide mix of protocols. OTLP — over gRPC (4317) or HTTP (4318) — has become the standard because Protocol Buffers keep messages compact, and gzip compression typically cuts bandwidth by 50–60%. Alongside OTLP, you'll often see Prometheus Remote Write, StatsD, and various network flow formats, especially in hybrid or multi-team environments.
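As a concrete illustration, here's a minimal sketch of pointing an application at an OTLP gRPC endpoint with gzip compression using the OpenTelemetry Python SDK (the endpoint is a placeholder, and the plaintext in-cluster connection is an assumption):

```python
import grpc
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# OTLP over gRPC on the conventional port 4317, with gzip to cut bandwidth.
exporter = OTLPSpanExporter(
    endpoint="collector.example.internal:4317",  # placeholder endpoint
    insecure=True,                               # assuming a plaintext in-cluster collector
    compression=grpc.Compression.Gzip,
)

provider = TracerProvider()
# BatchSpanProcessor buffers spans and exports them in batches rather than one at a time.
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("process-order"):
    pass  # application work happens here
```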

High-volume platforms regularly ingest 400,000 to 1,000,000 samples per second. Prometheus Remote Write handles this scale through shard-based parallelism. It runs multiple concurrent streams, with batch sizes around 2,000 samples by default and often tuned to 5,000–10,000 for busy clusters. The Write-Ahead Log (WAL) adds resilience with roughly two hours of buffered data during outages.

The OpenTelemetry Collector adds a small 200–500ms batching delay, but it brings important features you rely on:

  • memory limiting to avoid out-of-memory situations
  • PII redaction
  • protocol translation
  • automatic retries
  • routing to multiple backends

Most teams deploy it using an agent + gateway pattern. Lightweight agents collect telemetry locally on each host, while centralized gateways apply organization-wide rules and handle heavier tasks like tail-based sampling.

Storage: How Columnar Databases Change Full-Fidelity Economics

Time-series databases have powered production monitoring for years. They're excellent for workloads with stable metric names, predictable labels, and fast point lookups. For cluster health, system metrics, container stats, and steady operational signals, TSDBs remain one of the most efficient tools you can run.

Full-fidelity telemetry, however, behaves more like an analytical dataset than a transactional one. When you want every flow, span, field, and tag — especially across hundreds of millions of combinations — the workload shifts toward wide scans, grouping, filtering, and high-cardinality exploration. That's where columnar storage brings clear advantages.

Traditional TSDBs use row-oriented models where each unique time series carries its own index entry. This keeps ingestion very fast, but at scale, the in-memory index can reach tens of gigabytes. Once you grow into the tens or hundreds of millions of series, complex queries simply have more metadata to touch, which affects latency.

Columnar engines — ClickHouse, Apache Druid, InfluxDB IOx (Arrow/Parquet) — store each field and tag in its own column. This enables:

  • 10–100× compression
  • vectorized scans with SIMD
  • predictable query times even when cardinality grows
  • efficient grouping and filtering across rich, high-dimensional data

This is why many large-scale organizations use both TSDBs and columnar stores side by side. The pattern is consistent: TSDBs excel at steady operational metrics, and columnar engines shine when you need fast exploration across detailed, high-cardinality telemetry. Full-fidelity systems tend to lean on columnar backends not as a replacement, but as the right fit for the type of queries full-fidelity data encourages.

Tiered Storage: The Cost-Efficient Architecture

Storing everything in hot SSD-backed storage isn't sustainable when you retain full-fidelity data. Modern systems solve this with a three-tier model:

Tier 1 — Real-Time (Seconds to Hours)

  • SSD
  • Sub-second query latency
  • Used for alerting and immediate troubleshooting

Tier 2 — Hot/Exploration (Days to Weeks)

  • SSD/HDD blend
  • Great for debugging, on-call work, and ad-hoc exploration
  • Often ~14 days for logs/traces, ~90 days for metrics

Tier 3 — Cold/Long-Term (Months to Years)

  • Object storage, such as S3
  • Queries take longer but are ideal for audits and historical analysis
  • 10–20× cheaper than hot storage

The real unlock is on-demand rehydration: older data sits in cheap object storage, and you pull it back into a queryable tier only when an investigation needs it.

Last9 follows a similar idea with a single write endpoint, where your services emit once and the platform applies automated tiering. Data is written into Blaze, Hot, or Cold based on the retention settings you configure, and the system moves it between tiers as it ages.

You don't manage separate pipelines or adjust instrumentation for different tiers — the Control Plane and tiering engine handle promotion and retention automatically.

Sampling Versus Full-Fidelity: The Real Trade-Offs

Sampling is popular because it keeps costs predictable and infrastructure light, while full-fidelity gives you every signal — even the ones you didn't know you needed. Most teams use both in different ways, depending on availability goals, data volume, and budget.

Below is how the approaches differ and why many organizations still choose sampling even when they value full detail.

Why Sampling Is Preferred in Many Systems

Even teams that want complete visibility often choose sampling because:

Costs scale with volume
Keeping every span, log, and metric burst leads to storage growth that's hard to forecast.

Infrastructure stays simpler
Sampling reduces ingestion traffic, backend writes, index sizes, and query load.

Most requests look the same
If 95% of your traffic is healthy, sampling a small fraction of it still paints an accurate baseline.

Vendors price by cardinality, spans, or events
Sampling is the easiest way to keep bills stable when workloads grow horizontally.

This is why sampling became the default in most vendor products — not because it produces better debugging signals, but because it keeps operational overhead and pricing predictable.

When you work with Last9, you never tune sampling rates or chase missing spans because the architecture is designed to hold every signal. The platform stores full-fidelity data by default and uses tiered retention to manage cost instead of dropping data.

Head-Based Sampling: Fast and Lightweight

Head-based sampling makes a decision as soon as a trace begins. A 10% probabilistic sampler keeps 1 out of 10 traces; Datadog's agent defaults to 10 traces/sec per agent, adjusting based on incoming traffic.
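As a rough sketch, a 10% probabilistic sampler in the OpenTelemetry Python SDK looks something like this; ParentBased keeps child spans consistent with the decision made at the root:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep roughly 1 in 10 traces; the decision is made when the root span starts.
# ParentBased makes downstream spans follow the root's sampling decision,
# so traces are either kept whole or dropped whole.
sampler = ParentBased(root=TraceIdRatioBased(0.10))

trace.set_tracer_provider(TracerProvider(sampler=sampler))
```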

Why teams pick it:

  • Zero buffering
  • Very low processing overhead
  • Costs scale down linearly with sample rate
  • Good for predictable traffic patterns

It works well when your system behaves consistently and when the goal is cost control rather than detailed incident reconstruction.

The gap is visibility into rare issues. Because the decision happens upfront, slow requests and errors can get skipped.

Tail-Based Sampling: Outcome-Aware, With More Overhead

Tail-based sampling waits until the full trace completes before deciding what to keep. This enables rules such as:

  • Keep 100% of errors
  • Keep 100% of requests slower than 2 seconds
  • Keep 5% of everything else

This makes it ideal for debugging: decisions are made using the result, not guesswork.

How it works technically:
All spans from a trace must arrive at the same collector instance (usually by hashing the trace ID). That's why OTel collectors often run as StatefulSets with load-balanced exporters.
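The rules themselves usually live in a collector's tail-sampling processor config. The Python sketch below only illustrates the decision logic and the trace-ID routing idea; the span fields and function names are hypothetical, not the collector's actual implementation:

```python
import hashlib
import random

def route_to_collector(trace_id: str, num_collectors: int) -> int:
    """Hash the trace ID so every span of a trace lands on the same collector instance."""
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    return int(digest, 16) % num_collectors

def keep_trace(spans: list) -> bool:
    """Decide only after the whole trace has arrived (fields are hypothetical)."""
    if any(span.get("status") == "ERROR" for span in spans):
        return True                    # keep 100% of errors
    root = next((s for s in spans if s.get("parent_id") is None), None)
    if root and root.get("duration_ms", 0) > 2000:
        return True                    # keep 100% of requests slower than 2 seconds
    return random.random() < 0.05      # keep 5% of everything else
```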

The trade-off:

  • Adds 10–30 seconds to the decision window
  • Requires memory to buffer in-flight traces
  • Needs dedicated infrastructure at high scale

Tail-based sampling improves clarity, but it increases operational complexity.

Both head- and tail-based sampling introduce bias in percentile calculations. If certain request types are dropped more frequently, P95 and P99 shift away from the actual user experience.

7 Tools That Help You with Full-Fidelity

1. Last9: Full-Fidelity, High-Cardinality Observability

Last9 is a telemetry data platform designed from the ground up to store full-fidelity telemetry at scale. You don't manage sampling rules or chase missing spans — everything your systems emit is captured, searchable, and correlated across logs, metrics, and traces.

We keep:

  • Zero sampling
  • Full attribute search
  • No dropped logs or traces

The platform comfortably handles 20 million active time series per metric per day. Last9 has proven this at scale during a major live streaming event with 50+ million concurrent viewers, demonstrating production-grade reliability under extreme load.

Shaping Telemetry Before It Hits Storage

A big part of how we keep full-fidelity practical is the pre-ingestion Control Plane. It lets you handle high-cardinality data and noise at ingestion — without touching your application code.

Streaming Aggregation
You can define PromQL-style rules that reshape high-cardinality metrics into more meaningful aggregates while still keeping raw data accessible.

Example: convert per-user metrics into per-region aggregates and bring millions of series down to hundreds.

LogMetrics & TraceMetrics
High-volume log streams and trace spans can be converted into compact time series in real time. You get the trendlines you need, without the cost of indexing everything.

Filtering & Routing Rules
Drop noisy telemetry, enrich attributes, apply regex-based filters, or route data to specific tiers. Everything is configurable at runtime.

Delta Temporality
For OpenTelemetry metrics, delta temporality avoids the memory growth that cumulative metrics cause when paired with high-cardinality labels.
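One common way to request delta temporality is through the OTLP exporter's preferred-temporality mapping in the OpenTelemetry Python SDK; treat this as a sketch, since parameter names can shift between SDK versions and the endpoint shown is a placeholder:

```python
from opentelemetry.sdk.metrics import Counter, MeterProvider
from opentelemetry.sdk.metrics.export import (
    AggregationTemporality,
    PeriodicExportingMetricReader,
)
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Ask the exporter to emit counters as deltas instead of ever-growing cumulative values.
exporter = OTLPMetricExporter(
    endpoint="collector.example.internal:4317",  # placeholder endpoint
    insecure=True,                               # assuming a plaintext in-cluster collector
    preferred_temporality={Counter: AggregationTemporality.DELTA},
)

reader = PeriodicExportingMetricReader(exporter)
provider = MeterProvider(metric_readers=[reader])
```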

Storage Designed for Scale

You write once — through a single endpoint — and the platform handles tiering based on your retention configuration:

  • Blaze: ~1 hour of real-time, low-latency access
  • Hot: 14–90 days of fast, exploratory access
  • Cold: long-term storage in S3

Each tier operates like a complete TSDB. Data moves automatically as it ages, and you never manage separate pipelines or change instrumentation depending on where it should land.

Event-Based Pricing That Mirrors How You Emit Telemetry

Pricing maps directly to what your systems produce:

1 event = 1 metric sample, 1 log line, or 1 trace span

Tiers include:

  • Free: 100M events/month with 7-day retention
  • Pro: unlimited events, 90-day metrics, 14-day logs/traces, full Control Plane

It keeps things predictable for teams that generate a lot of cardinality.

When Last9 Makes Sense

If you're operating high-cardinality, cloud-native infrastructure, and you want full-fidelity telemetry without wrestling with sampling, custom pipelines, or pre-aggregation, Last9 gives you a cost-efficient telemetry data platform that scales cleanly and a control plane that helps you reduce noise without losing meaningful detail.

Last9 removed the toil of setting up scalable, high cardinality monitoring—letting us focus on serving our users, not our dashboards.

- Ashish Garg, CTO, Probo

2. Datadog: Comprehensive but Expensive

Datadog continues to rely on head-based sampling as its primary strategy. Agents try to keep about 10 traces per second per agent, adjusting automatically as traffic changes. Live Search gives you all spans for the first 15 minutes, and after that window closes, sampling rules decide which spans stay indexed for the next 15 days.

To their credit, Datadog still computes RED metrics from 100% of incoming traffic, so core latency and error numbers reflect real behavior even when only a portion of traces are stored.

OpenTelemetry support is available (Agent v6.32.0+ for traces/metrics and v6.48.0+ for logs), but there are some practical limitations. OTLP-ingested data doesn't currently feed into Application Security Monitoring, the Continuous Profiler, or certain custom sampling rules. The broader architecture remains agent-centric, which means many of Datadog's advanced capabilities depend on using its agent rather than OTel pipelines.

Cardinality handling

Datadog's Metrics Without Limits™ feature gives teams the ability to control which tag combinations are indexed for querying, but each unique tag combination at ingestion still produces a new custom metric. Attributes with high variation—like customer IDs—are strongly discouraged in Datadog's best practices because they multiply the number of billable series exponentially.

Cost structure

Datadog's pricing spans several independent SKUs, and this is where things get complicated for most teams:

  • Infrastructure monitoring: $15–23 per host/month
  • APM: $31–40 per host/month
    • includes 150GB of spans + 1M indexed spans
    • overages billed at $0.10/GB
  • Logs: $0.10/GB ingestion + $1.70/million events for indexing
  • Custom metrics: 100 included per host; additional metrics cost $1–5 per 100

In practice, a 100-host deployment with moderate APM usage and about 500GB of logs typically ends up in the $5,000–6,000/month range. Much of the mental overhead comes from juggling multiple line items: infrastructure, APM, logs, custom metrics, synthetics, and more — each billed separately.

When to choose Datadog

Datadog is a strong option if your priority is fast onboarding, a polished user experience, and a large set of integrations (800+). It tends to work best for mid-size engineering teams (50–500 engineers) who want an all-in-one platform and are comfortable paying a premium for convenience.

3. New Relic: APM-First Architecture with Scalable Cardinality

New Relic's APM starts with head-based adaptive sampling, and teams that need deeper visibility can enable Infinite Tracing, which adds tail-based sampling by sending 100% of traces to a trace observer. Their docs note that this mode can lead to noticeably higher egress traffic, so it's usually something teams enable with care.

One of New Relic's strengths is how much cardinality its backend can handle. Accounts support 15 million unique attribute combinations (cardinality) per day by default, with optional expansion up to 200 million. Per-metric cardinality limits are generous as well — 100,000 by default.

OpenTelemetry support is solid. New Relic offers native OTLP endpoints, W3C Trace Context, Prometheus remote_write, and ingestion paths for major cloud provider formats, making it easy to bring in diverse workloads without custom plumbing.

Pricing model

New Relic uses a data + user pricing structure:

Data ingestion

  • $0.30–0.35/GB for the standard tier
  • $0.55–0.60/GB for Data Plus
  • First 100GB/month is free

Users

  • Full Platform: $349/year ($29/mo annual) or $418.80/mo on-demand
  • Core users: $49/mo

Example calculation
For 1TB/month of data and 10 Full Platform users:
(1,000 GB − 100 GB free) × $0.30 + 10 × ($349 / 12) ≈ $561/month

New Relic's schemaless event model is a long-standing differentiator — it allows flexible querying through NRQL without worrying about pre-aggregation or predefined dimensions.

When to choose New Relic

New Relic is a good fit for teams working with high-cardinality workloads that need strong APM capabilities and prefer a data/user-based pricing model rather than host-based billing. It works especially well when you want predictable ingestion pricing and fast, flexible queries across many dimensions.

4. Grafana Cloud: Open Standards at Lower Cost

Grafana Cloud stays close to open standards and gives you full control over how much telemetry you ingest. The platform doesn't enforce sampling at all — the decision sits entirely in your OpenTelemetry SDK configuration or your Grafana Alloy collector settings. If you want 100% ingestion, you can run it that way.

The storage layer uses familiar components from the open-source world: Mimir for metrics, Loki for logs, and Tempo for traces. This keeps the system approachable for teams already comfortable with the Prometheus/Loki/Grafana ecosystem.

One feature teams rely on is Adaptive Metrics. It looks at 30 days of usage, identifies metrics that aren't queried often, and suggests safe aggregation rules. In most setups, this trims 20–50% of the volume, while keeping dashboards and alerts intact.

Cardinality management

Grafana Cloud defines an "active series" as any series that produced data in the last 20 minutes. Loki takes a more guarded approach to indexing by enforcing a 15-label default limit, pushing high-cardinality data into log lines through Structured Metadata.

The platform also includes Cardinality Management dashboards, which surface your busiest metrics and help you understand where series count is growing.

Pricing

The free tier gives you 10,000 active series, 50GB of logs/traces, and 14-day retention. Most teams that grow beyond that end up on the Pro tier, which is priced as:

  • $6.50 per 1,000 active series
  • $0.50/GB for logs
  • $0.30/GB for traces
  • $8 per user, plus a $19 platform fee

A common setup — roughly 50,000 active series, 500GB logs, 100GB traces, and 8 users — usually lands around $623/month.

When to choose Grafana Cloud

Grafana Cloud works well when you want an open, standards-aligned stack and are comfortable owning some operational pieces. Many teams pick it for its strong cost advantage over Datadog and for the ability to keep their telemetry pipelines vendor-neutral.

5. Calyptia / Fluent Bit: High-Throughput Telemetry Shipping with Smart Filtering

Fluent Bit — along with Calyptia's enterprise distribution — is often the first hop for teams that want to ship full-fidelity telemetry without slowing down their nodes. It handles logs, metrics, and traces, and supports more than 100 input and output plugins, which makes it easy to slot into almost any environment.

Because the core engine is written in C, it runs with a very small memory footprint (typically under 1MB) and can push hundreds of thousands of events per second. That efficiency is a big reason it's common in clusters where you want to capture everything but still keep the collector light.

Fluent Bit's feature set is built around shaping telemetry as early as possible:

  • filtering at the edge using regex, Lua, and routing rules
  • structured parsing for JSON, logfmt, and Kubernetes log formats
  • forwarding to any backend — Elasticsearch, ClickHouse, S3, Kafka, OpenTelemetry collectors, and many more
  • multi-destination routing, so different data subsets go to different systems
  • record modification for enriching tags or cleaning up noisy fields

In full-fidelity environments, this early shaping helps you retain all your signals while avoiding unnecessary load on downstream collectors and storage.

When to choose Calyptia / Fluent Bit

Calyptia and Fluent Bit are a strong fit when you need a high-throughput, low-footprint collector with flexible routing and early filtering — especially in Kubernetes-heavy setups where cost and performance depend on what you drop or transform before the data reaches your backend.

6. Elastic Observability: Full-Fidelity via a Search-First Architecture

Elastic Observability builds on Elasticsearch, which handles logs, metrics, and traces alongside Elastic APM agents. Because Elasticsearch is fundamentally a search engine, it works well for full-fidelity ingestion, large index volumes, and fast queries across high-cardinality fields. The design naturally supports storing everything and drilling into any dimension without needing predefined schemas.

Key features include:

  • schema-flexible ingestion for logs and traces
  • ECS (Elastic Common Schema) for consistent attributes
  • Index Lifecycle Management (ILM) to move data through hot → warm → cold → frozen tiers
  • searchable snapshots for low-cost long-term retention
  • machine learning jobs for anomaly detection

Cardinality handling

Elasticsearch can index high-cardinality fields effectively, but hot indices can become expensive as data grows. Warm and frozen tiers, combined with S3-backed searchable snapshots, make long-term, full-fidelity storage far more practical. Teams often keep recent data hot for speed and use snapshots for the majority of their retention window.

Pricing model

Elastic uses a resource-based pricing model, so cost scales with compute, RAM, and storage rather than per-event or per-host billing. Hot tiers tend to be the most expensive part of the setup, while warm/frozen tiers — especially when backed by object storage — offer a significantly cheaper path for long-term data.

When to choose Elastic Observability

Elastic is a strong choice for teams with heavy log workloads or organizations already invested in the Elastic Stack. The search-first architecture fits use cases that require full-fidelity data, flexible indexing, and fast, ad-hoc exploration across large datasets.

7. Lightstep (ServiceNow): Distributed Tracing with Emphasis on Accuracy

Lightstep approaches observability with a tracing-first design. The platform samples at ingest using Satellite processes, which act as collectors deployed close to your services. These Satellites analyze traffic locally, apply tail-based rules, and forward only the traces that match your policies. This design helps capture meaningful data while keeping the central infrastructure lighter.

Because Satellites evaluate the entire trace before deciding, Lightstep can preserve rare or unusual events even when normal traffic is heavily sampled. The platform also computes service diagrams, dependency maps, and latency path breakdowns directly from trace relationships, giving teams a precise view of where performance shifts originate.

Lightstep uses a combination of:

  • tail-based sampling with policy evaluation
  • local Satellites for preprocessing and load distribution
  • distributed snapshots to compare performance between versions, deploys, and time windows
  • auto-generated service diagrams and dependency analysis
  • full OpenTelemetry-native ingestion and export

The system is built to show "what changed and why" using version-aware analysis, allowing teams to tie regressions to specific deploys or configuration updates.

Cardinality handling

Because Satellites handle sampling and attribute filtering before data reaches the backend, Lightstep manages high-cardinality attributes by evaluating them locally. This reduces the indexing pressure on the centralized platform while still capturing outliers and unusual requests.

Pricing model

Lightstep's pricing focuses on three dimensions:

  • retention (days of trace data)
  • ingestion throughput (events, spans, and metrics per minute)
  • number of services

This gives teams predictable costs even as infrastructure scales. Lower retention + high ingest is one of the more common setups for large microservices deployments.

When to choose Lightstep

Lightstep works well for organizations that rely heavily on distributed tracing and need high-quality visibility during deploys and performance regressions. It's especially helpful when you want tail-based sampling with clear version-to-version comparison, or when you want a tracing-first workflow aligned closely with OpenTelemetry.

Data Patterns in Cloud-Native Environments

Cardinality expansion in cloud-native systems

As you move from a monolith to microservices, the amount of telemetry your system emits naturally increases. More services, richer labels, and a more dynamic environment all contribute to a larger time-series footprint.

A basic example makes this clear:

  • An HTTP request metric with method (10 values), status code (50 values), and region (20 values) produces 10,000 unique series
  • Adding the service name across 100 microservices brings that to 1 million
  • Adding container or pod identifiers increases the count further, as each pod restart creates a new series

Prometheus notes that keeping cardinality modest can help with ingestion and query performance, but cloud-native systems often operate comfortably well beyond the 1 million mark. Kubernetes adds many helpful dimensions — and with them, a richer picture of how your system behaves. It's normal for modern workloads to cross 100 million active series.

Missing important signals in large telemetry sets

Teams often tune or shape ingestion to keep their systems predictable and manageable. Industry research shows that 98% of organizations limit data ingestion to control observability costs, with 82% frequently limiting log volume.

These decisions usually come from practical needs — governance requirements, cost planning, and keeping pipelines stable. They also create some trade-offs, according to the same research:

  • governance or compliance challenges (47%)
  • additional time spent preparing or curating data (47%)
  • coordination across teams (42%)
  • fewer insights available during active debugging (38%)

Many organizations find that a significant portion of collected telemetry goes unused in day-to-day operations. This isn't necessarily a waste; it's the result of modern systems generating more signals than teams actively query. The effect becomes more noticeable during an incident, when engineers must context-switch between:

  • multiple observability tools
  • different data formats
  • dashboards that weren't designed together
  • separate timelines and views

In these situations, the signal you need often exists — it's just sitting among a large volume of healthy request paths.

A common way engineers describe this is: most telemetry reflects normal behavior, and the real challenge is surfacing the few meaningful signals during an outage — not because the data is missing, but because of how distributed systems generate and store it.

Cost Control Strategies That Preserve Visibility

Teams running large, distributed systems often lean on a handful of proven patterns to keep costs in check without giving up the visibility they rely on during incidents. None of these patterns are shortcuts—they're practical ways to shape telemetry so it stays useful and efficient as systems scale.

Pattern 1: Tail-based sampling with outcome-oriented rules

Tail-based sampling waits until a trace is complete before deciding whether to keep it. This lets teams preserve the important traffic automatically: every error, every slow request, and every call to critical endpoints like payments or authentication. The rest—usually healthy traffic—is sampled at a much lighter rate.

In most setups, that combination brings stored volume down to about 10–20% of the original while still capturing the signals that matter.

Pattern 2: Streaming aggregation at ingestion

Another common approach is to reshape high-cardinality metrics during ingestion. Instead of storing per-user or per-container metrics directly, teams convert them into broader aggregates such as per-region, per-service, or per-node summaries.

Platforms like Uber's M3 and Last9's Control Plane apply these PromQL-style transformations in real time, significantly reducing overall volume while keeping raw data available when deeper investigation is needed.
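The transformation itself is conceptually simple. Here's an illustrative Python sketch (not any platform's actual rule syntax) that collapses per-user request counts into per-region, per-service aggregates:

```python
from collections import defaultdict

# Incoming samples keyed by their full label set (per-user cardinality).
raw_samples = [
    ({"region": "us-east-1", "user_id": "u-1042", "service": "checkout"}, 3),
    ({"region": "us-east-1", "user_id": "u-2291", "service": "checkout"}, 7),
    ({"region": "eu-west-1", "user_id": "u-0007", "service": "checkout"}, 5),
]

# Aggregation rule: sum by (region, service). The user_id dimension is dropped,
# so millions of per-user series become one series per region/service pair.
aggregated = defaultdict(int)
for labels, value in raw_samples:
    key = (labels["region"], labels["service"])
    aggregated[key] += value

for (region, service), total in aggregated.items():
    print(region, service, total)
```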

Pattern 3: Multi-tier storage with rehydration

Tiered storage is now standard across observability platforms. Recent data lives in a hot tier where queries are fast; older data moves to object storage but can still be rehydrated on demand. This model provides full-fidelity historical access at a fraction of the cost.

The difference is striking: hot tiers run around $0.10–2.00/GB, while S3 sits at $0.023/GB, a 4–87× reduction for long-term retention.

Pattern 4: Logs-to-metrics conversion

High-volume logs often tell the same story over and over. Converting these into metrics preserves the trendline without storing every individual entry.

For example, a service emitting a million "request completed" logs per hour can be represented as a single counter producing just 60 data points per hour. That's a 99.994% reduction, and the data remains perfectly usable for alerting and dashboards.
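A minimal sketch of that conversion, counting matching log lines into per-minute buckets instead of storing each line (the log format shown is a made-up example):

```python
from collections import Counter
from datetime import datetime

def logs_to_metric(lines):
    """Count 'request completed' log lines per minute instead of storing each line."""
    per_minute = Counter()
    for line in lines:
        # Assumed format: "2025-11-24T10:15:32Z request completed path=/checkout"
        if "request completed" in line:
            timestamp = line.split(" ", 1)[0]
            minute = datetime.fromisoformat(
                timestamp.replace("Z", "+00:00")
            ).strftime("%Y-%m-%dT%H:%M")
            per_minute[minute] += 1
    return per_minute  # at most 60 data points per hour, whatever the log volume

sample = ["2025-11-24T10:15:32Z request completed path=/checkout"]
print(logs_to_metric(sample))
```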

Pattern 5: Cardinality-aware instrumentation

Instrumentation choices also play a big role in how telemetry scales. Using route patterns instead of full URLs, shifting identifiers like session IDs into traces instead of metrics, or bucketing continuous values such as latency all help keep series counts predictable. Setting limits on label values ensures that the data stays meaningful without ballooning into unnecessary combinations.
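For example, here's a small before/after sketch of labeling a request counter; the label names and bucket edges are illustrative, not a prescribed scheme:

```python
# High-cardinality: every user and every concrete URL creates a new series.
# labels = {"url": "/users/8f3a2c/orders/99121", "session_id": session_id}

# Cardinality-aware: route templates and coarse buckets keep the series count bounded.
def latency_bucket(latency_ms: float) -> str:
    """Bucket a continuous latency value into a handful of coarse ranges."""
    for upper in (50, 100, 250, 500, 1000):
        if latency_ms <= upper:
            return f"<={upper}ms"
    return ">1000ms"

labels = {
    "route": "/users/{user_id}/orders/{order_id}",  # template, not the concrete URL
    "method": "GET",
    "status": "2xx",                                # status class, not the exact code
    "latency_bucket": latency_bucket(183.0),
}
# Identifiers like user_id or session_id belong on trace spans, not metric labels.
```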

How Last9 Delivers Full-Fidelity Without the Cost Trade-Off

Most observability platforms force you to choose: capture everything and pay for it, or sample aggressively and miss critical signals. Last9's architecture eliminates this trade-off.

Last9 takes a different approach by treating full-fidelity as the default, not an add-on. Our platform shapes telemetry at ingestion through a Control Plane that handles aggregation, filtering, and logs-to-metrics without touching application code.

Everything is written once to a single endpoint and stored across Blaze, Hot, and Cold tiers for fast queries and long-term retention. Pricing stays predictable because you pay per event — not per host, metric, or label — so visibility grows with your traffic, not your cardinality.

If you're evaluating observability platforms and full-fidelity visibility matters to your team, explore how Last9's architecture eliminates the sampling vs cost trade-off.

Last9 stood out in our rigorous PoC evaluations due to its reliability-first observability model, cost efficiency, and engineering-friendly interface.

Setup was quick, and the platform fits seamlessly into our workflow. Last9's customer support is responsive and helpful, which makes adoption smooth.
- Ganesh Konkimalla, Software Developer, Tazapay

FAQs

Can I achieve true full-fidelity telemetry without bankrupting my team?
Yes, but it requires architectural choices. Use columnar storage backends (ClickHouse, Druid, InfluxDB IOx) that handle high cardinality efficiently.

Implement three-tier storage where only recent data lives in expensive hot storage, while everything gets archived to S3. Apply streaming aggregation at ingestion to reduce cardinality before storage while preserving raw data for targeted queries.

How do I decide between head-based and tail-based sampling?
Use head-based sampling when you have predictable, well-understood systems with modest traffic and can tolerate some blind spots. Use tail-based sampling when you need to capture 100% of errors and slow requests while sampling normal traffic — this is outcome-aware sampling that preserves critical signals.

The best production approach combines both: head-based for initial cost control, tail-based with intelligent policies for debugging, plus full-fidelity archives with on-demand rehydration for forensics.

What's the difference between traditional TSDBs and columnar storage for observability?
Traditional time-series databases store each unique time series with its own index entry and metadata in memory. Performance typically starts to degrade around 1 million series, and the index alone can exceed 50GB at 100 million series.

Columnar databases store each field and tag separately in columns, enabling 10–100x compression ratios and maintaining query performance at 20M+ series per metric. The architectural advantage is fundamental — columnar databases are built for OLAP analytical workloads, which is exactly what observability queries are.

How do I prevent cardinality explosion in Kubernetes?
Prevent unbounded cardinality at instrumentation: replace full URLs with route templates, drop pod IDs and container IDs from metric labels, and use service-level tags instead of instance-level.

Configure label cardinality limits — for example, the Datadog agent on Kubernetes with DD_CHECKS_TAG_CARDINALITY set to "low" collects only basic tags, while "high" includes all dimensions. Implement streaming aggregation to convert per-pod metrics to per-service aggregates before storage.

Set up cardinality budgets per service and alert when exceeded. A single instrumentation change, adding userAgent as a label, can create millions of new series — proactive management prevents surprise bills.

Which observability platform should I choose for my team?
Choose based on your constraints: Datadog if you prioritize ease of use and comprehensive integrations over cost (budget $5,000–25,000 monthly for 50–200 hosts). New Relic, if you need deep high-cardinality analysis with simplified user-based pricing ($2,000–15,000 monthly for 1–5TB with 5–20 Full Platform users).

Grafana Cloud, if you're cost-conscious and committed to open standards ($500–5,000 monthly, 6–9x cheaper than Datadog). Last9 if you require true full-fidelity telemetry at high cardinality with event-based pricing that doesn't penalize detailed instrumentation.

Author
Anjali Udasi

Helping to make tech a little less intimidating.
