
ETL vs ELT for Observability: Technical Approaches and Practical Tradeoffs

Explore the key differences between ETL and ELT for observability data, their technical approaches, and the tradeoffs that impact performance and cost.

In the world of monitoring software, how you process telemetry data can significantly impact your ability to derive insights, troubleshoot issues, and manage costs.

There are two primary use cases for how telemetry data is used:

  • Radar (monitoring of systems) usually falls into the bucket of known knowns and known unknowns. Some data is almost 'pre-determined' to behave and be plotted in a certain way, because we know what we are looking for.
  • Black box (debugging, RCA, etc.) is about unknown unknowns: the things we don't yet know and may need to hunt for to build an understanding of the system.
Related read: Software Observability from the Lens of Radar and a Black Box | Last9, on how observability can be viewed through the lens of a radar and a black box.

Understanding Telemetry Data Challenges

Before diving into processing approaches, it's important to understand the unique challenges of telemetry data:

  • Volume: Modern systems generate enormous amounts of telemetry data
  • Velocity: Data arrives in continuous, high-throughput streams
  • Variety: Multiple formats across metrics, logs, traces, profiles, and events
  • Time-sensitivity: Value often decreases with age
  • Correlation needs: Data from different sources must be linked together

These characteristics create special considerations when choosing between ETL and ELT approaches.

ETL for Telemetry: Transform-First Architecture

Technical Architecture

In an ETL approach, telemetry data is transformed before reaching its final destination:

ETL for Telemetry

A typical implementation stack might include:

  • Collection: OpenTelemetry, Prometheus, Fluent Bit
  • Transport: Kafka, Kinesis, or an in-memory buffer
  • Transformation: Stream processing
  • Storage: Time-series databases (e.g., Prometheus), specialized indices, or object storage (S3)

Key Technical Components

1. Aggregation Techniques

Pre-aggregation significantly reduces data volume and query complexity. A typical pre-aggregation flow looks like this:

Aggregation Techniques

This transformation condenses raw data into 5-minute summaries, dramatically reducing storage requirements and improving query performance.

Example: For a gaming application handling millions of requests per day, raw request latency metrics (potentially billions of data points) can be grouped by service and endpoint, then aggregated into 5-minute (or 1-minute) windows.

A single API endpoint generating 100 latency data points per second (8.64 million per day) is reduced to just 288 aggregated entries per day (one per 5-minute window), while still preserving the p50/p90/p99 percentiles needed for SLA monitoring.
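
Here's a minimal sketch of that kind of pre-aggregation using pandas; the field names, sample data, and 5-minute window are illustrative, not a prescribed schema:

```python
import pandas as pd

# Hypothetical raw latency samples: one row per request.
raw = pd.DataFrame({
    "timestamp": pd.date_range("2025-03-13", periods=10_000, freq="s"),
    "service": "checkout",
    "endpoint": "/api/pay",
    "latency_ms": (pd.Series(range(10_000)) % 250) + 20,
})

# Roll raw samples up into 5-minute windows, keeping only the counts and
# percentiles needed for SLA monitoring instead of every data point.
agg = (
    raw.groupby([pd.Grouper(key="timestamp", freq="5min"), "service", "endpoint"])["latency_ms"]
       .agg(
           count="count",
           p50=lambda s: s.quantile(0.50),
           p90=lambda s: s.quantile(0.90),
           p99=lambda s: s.quantile(0.99),
       )
       .reset_index()
)
print(agg.head())
```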

2. Cardinality Management

High-cardinality metrics can break time-series databases. The cardinality management process follows this pattern:

Cardinality Management

Effective strategies include:

  • Label filtering and normalization
  • Cardinality budgets
  • Strategic aggregation of specific dimensions
  • Hashing techniques for high-cardinality values while preserving query patterns

Example: A microservice tracking HTTP requests includes user IDs and request paths in its metrics. With 50,000 daily active users and thousands of unique URL paths, this creates millions of unique label combinations.

The cardinality management system drops user IDs entirely (a configurable rule; they are far too high-cardinality), normalizes URL paths by replacing dynamic segments with placeholders (e.g., /users/123/profile becomes /users/{id}/profile), and applies consistent categorization to errors. This reduces the number of unique time series from millions to hundreds, allowing the time-series database to function efficiently.
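
A small sketch of this kind of label filtering and path normalization in Python; the rules and label names are assumptions for illustration:

```python
import re

# Illustrative rules: drop user-level labels and collapse dynamic URL segments.
DROP_LABELS = {"user_id"}
PATH_RULES = [
    (re.compile(r"/users/\d+"), "/users/{id}"),
    (re.compile(r"/orders/[0-9a-f-]{36}"), "/orders/{uuid}"),
]

def normalize_labels(labels: dict) -> dict:
    """Reduce label cardinality before the series reaches the time-series database."""
    out = {k: v for k, v in labels.items() if k not in DROP_LABELS}
    path = out.get("path")
    if path:
        for pattern, replacement in PATH_RULES:
            path = pattern.sub(replacement, path)
        out["path"] = path
    return out

print(normalize_labels({"path": "/users/123/profile", "user_id": "123", "method": "GET"}))
# {'path': '/users/{id}/profile', 'method': 'GET'}
```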

💡
If you're working with telemetry data, high cardinality can be tricky. Learn how it impacts performance in this guide.

3. Real-time Enrichment

Adding context to metrics during the transformation phase involves integrating external data sources:

Real-time Enrichment

This process adds critical business and operational context to raw telemetry data, enabling more meaningful analysis and alerting based on service importance, customer impact, and other factors beyond pure technical metrics.

Example: A payment processing service emits basic metrics like request counts, latencies, and error rates. The enrichment pipeline joins this telemetry with service registry data to add metadata about the service tier (critical), SLO targets (99.99% availability), and team ownership (payments-team).

It then incorporates business context to tag transactions with their type (subscription renewal, one-time purchase, refund) and estimated revenue impact. When an incident occurs, alerts are automatically prioritized based on business impact rather than just technical severity, and routed to the appropriate team with rich context.
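
A simplified sketch of this enrichment step in Python; the registry contents, transaction types, and revenue figures are made up for illustration:

```python
# Hypothetical service registry and business metadata used for enrichment.
SERVICE_REGISTRY = {
    "payments-api": {"tier": "critical", "slo_availability": 0.9999, "owner": "payments-team"},
}
TXN_REVENUE = {"subscription_renewal": 29.0, "one_time_purchase": 75.0, "refund": -40.0}

def enrich(event: dict) -> dict:
    """Attach ownership, SLO, and revenue context to a raw telemetry event."""
    meta = SERVICE_REGISTRY.get(event["service"], {})
    enriched = {**event, **meta}
    enriched["est_revenue_impact"] = TXN_REVENUE.get(event.get("txn_type"), 0.0)
    return enriched

raw_event = {"service": "payments-api", "status": 500, "latency_ms": 1240, "txn_type": "subscription_renewal"}
print(enrich(raw_event))
```

With that context attached, an alerting rule can prioritize on fields like tier and est_revenue_impact rather than raw error counts alone.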

Technical Advantages

  1. Query performance: Pre-calculated aggregates eliminate computation at query time
  2. Predictable resource usage: Both storage and query compute are controlled
  3. Schema enforcement: Data conformity is guaranteed before storage
  4. Optimized storage formats: Data can be stored in formats optimized for specific access patterns

Technical Limitations

  1. Loss of granularity: Some detail is permanently lost
  2. Schema rigidity: Adapting to new requirements requires pipeline changes
  3. Processing overhead: Real-time transformation adds complexity and resource demands
  4. Transformation-time decisions: Analysis paths must be known in advance
💡
If you're using OpenTelemetry, knowing how metrics aggregation works can help. Check out this guide to learn more.

ELT for Telemetry: Raw Storage with Flexible Transformation

Technical Architecture

ELT architecture prioritizes getting raw data into storage, with transformations performed at query time:

ELT for Telemetry

A typical implementation might include:

  • Collection: OpenTelemetry, Prometheus, Fluent Bit
  • Transport: Direct ingestion without complex processing
  • Storage: Object storage (S3, GCS) or data lakes in Parquet format
  • Transformation: SQL engines (Presto, Athena), Spark jobs, or specialized OLAP systems

Key Technical Components

1. Efficient Raw Storage

Optimizing for long-term storage of raw telemetry requires careful consideration of file formats and storage organization:

Efficient Raw Storage

This approach leverages columnar storage formats like Parquet with appropriate compression (ZSTD for traces, Snappy for metrics), dictionary encoding, and optimized column indexing based on common query patterns (trace_id, service, time ranges).

Example: A cloud-native application generates 10TB of trace data daily across its distributed services. Instead of discarding or heavily sampling this data, the complete trace information is captured using OpenTelemetry collectors and converted to Parquet format with ZSTD compression.

Key fields like trace_id, service name, and timestamp are indexed for efficient querying. This approach reduces the storage footprint by 85% compared to raw JSON while maintaining query performance.

When a critical customer-impacting issue occurred, engineers were able to access complete trace data from 3 months prior, identifying a subtle pattern of intermittent failures that would have been lost with traditional sampling.
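
A minimal sketch of writing span records to Parquet with ZSTD compression and dictionary encoding using pyarrow; the span fields are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A few hypothetical span records; a real pipeline would stream these
# from an OpenTelemetry collector exporter.
spans = pa.table({
    "trace_id": ["a1b2", "a1b2", "c3d4"],
    "service": ["checkout", "payments-api", "checkout"],
    "timestamp": [1699351200000, 1699351200120, 1699351260000],
    "duration_ms": [120, 45, 310],
})

# Columnar storage with ZSTD compression; dictionary encoding keeps
# repetitive string columns (trace_id, service) compact.
pq.write_table(
    spans,
    "traces.parquet",
    compression="zstd",
    use_dictionary=["trace_id", "service"],
)
```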

2. Partitioning Strategies

Effective partitioning is crucial for query performance against raw telemetry. A well-designed partitioning strategy follows this hierarchy:

Partitioning Strategies

This partitioning approach enables efficient time-range queries while also allowing filtering by service and tenant, which are common query dimensions. The partitioning strategy is designed to:

  1. Optimize for time-based retrieval (most common query pattern)
  2. Enable efficient tenant isolation for multi-tenant systems
  3. Allow service-specific queries without scanning all data
  4. Separate telemetry types for optimized storage formats per type

Example: A SaaS platform with 200+ enterprise customers uses this partitioning strategy for its observability data lake.

When a high-priority customer reports an issue that occurred last Tuesday between 2-4 pm, engineers can immediately query just those specific partitions: /year=2023/month=11/day=07/hour=1[4-5]/tenant=enterprise-x/*.

This approach reduces the scan size from potentially petabytes to just a few gigabytes, enabling responses in seconds rather than hours.

When comparing current performance against historical baselines, the time-based partitioning allows efficient month-over-month comparisons by scanning only the relevant time partitions.
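
A sketch of reading only the relevant Hive-style partitions with pyarrow; the bucket path and partition column names follow the example above but are assumptions:

```python
import pyarrow.dataset as ds

# Read only the partitions that match the incident window instead of
# scanning the whole data lake (path and partition columns are assumed).
dataset = ds.dataset(
    "s3://telemetry-lake/traces/",
    format="parquet",
    partitioning="hive",
)

incident = dataset.to_table(
    filter=(
        (ds.field("year") == 2023)
        & (ds.field("month") == 11)
        & (ds.field("day") == 7)
        & (ds.field("hour").isin([14, 15]))
        & (ds.field("tenant") == "enterprise-x")
    )
)
print(incident.num_rows)
```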

💡
Profiling in OpenTelemetry helps you see where performance issues hide. Check out this guide to learn more.

3. Query-time Transformations

SQL and analytical engines provide powerful query-time transformations. The query processing flow for on-the-fly analysis looks like this:

Query-time Transformations

This query flow demonstrates how complex analysis like calculating service latency percentiles, error rates, and usage patterns can be performed entirely at query time without needing pre-computation.

The analytical engine applies optimizations like predicate pushdown, parallel execution, and columnar processing to achieve reasonable performance even against large raw datasets.

Example: A DevOps team investigating a performance regression discovered it only affected premium customers using a specific feature.

Using query-time transformations against the ELT data lake, they wrote a single query that first filtered to the affected period, joined customer tier information, extracted relevant attributes about feature usage, calculated percentile response times grouped by customer segment, and identified that premium customers with high transaction volumes were experiencing degraded performance only when a specific optional feature flag was enabled.

This analysis would have been impossible with pre-aggregated data since the customer segment + feature flag dimension hadn't been previously identified as important for monitoring.
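
A sketch of this kind of query-time analysis using DuckDB over Parquet files in object storage; the column names, time range, and S3 path are illustrative:

```python
import duckdb

# Query-time aggregation over raw Parquet files in the data lake; the
# schema is only discovered at read time, so new dimensions (customer
# tier, feature flags) can be sliced without any pipeline change.
result = duckdb.query("""
    SELECT
        customer_tier,
        feature_flag_enabled,
        approx_quantile(latency_ms, 0.99) AS p99_latency_ms,
        avg(CASE WHEN status >= 500 THEN 1 ELSE 0 END) AS error_rate
    FROM read_parquet('s3://telemetry-lake/requests/year=2023/month=11/**/*.parquet')
    WHERE ts BETWEEN TIMESTAMP '2023-11-01 00:00:00' AND TIMESTAMP '2023-11-08 00:00:00'
    GROUP BY customer_tier, feature_flag_enabled
    ORDER BY p99_latency_ms DESC
""").df()
print(result)
```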

Last9 Log Explorer

Technical Advantages

  1. Full data fidelity: No information is lost in pre-processing
  2. Schema flexibility: New dimensions can be analyzed without pipeline changes
  3. Cost-effective storage: Object storage is significantly cheaper than specialized DBs
  4. Retroactive analysis: Historical data can be examined with new perspectives

Technical Limitations

  1. Query performance challenges: Interactive analysis may be slow on large datasets
  2. Resource-intensive analysis: Compute costs can be high for complex queries
  3. Implementation complexity: Requires more sophisticated query tooling
  4. Storage overhead: Raw data consumes significantly more space
💡
Environment variables can make OpenTelemetry setup smoother. Here's a guide to help: OpenTelemetry Environment Variables.

Technical Implementation: The Hybrid Approach

Core Architecture Components

Hybrid Telemetry Architecture

Implementation Strategy

1. Dual-path processing

Dual-path processing

Example: A global ride-sharing platform implemented a dual-path telemetry system that routes service health metrics and customer experience indicators (ride wait times, ETA accuracy) through the ETL path for real-time dashboards and alerting.

Meanwhile, all raw data including detailed user journeys, driver activities, and application logs flow through the ELT path to cost-effective storage. When a regional outage occurred, operations teams used the real-time dashboards to quickly identify and mitigate the immediate issue.

Later, data scientists used the preserved raw data to perform a comprehensive root cause analysis, correlating multiple factors that wouldn't have been visible in pre-aggregated data alone.
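
A minimal sketch of the dual-path fan-out in Python; the sink interfaces and the real-time metric allowlist are assumptions, not part of any specific collector API:

```python
# Curated metrics that deserve the transform-first (ETL) treatment.
REALTIME_METRICS = {"ride_wait_seconds", "eta_error_seconds", "service_error_rate"}

def transform(event: dict) -> dict:
    # Placeholder for the ETL-side work: aggregation, enrichment, label pruning.
    return {**event, "pipeline": "realtime"}

def process(event: dict, realtime_sink, lake_sink) -> None:
    # ELT path: every event is archived raw.
    lake_sink.write(event)
    # ETL path: only operationally critical metrics feed dashboards and alerting.
    if event.get("metric") in REALTIME_METRICS:
        realtime_sink.write(transform(event))
```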

2. Smart data routing

Smart data routing

Example: A financial services company deployed a smart routing system for their telemetry data. All data is preserved in the data lake, but critical metrics like transaction success rates, fraud detection signals, and authentication service health metrics are immediately routed to the real-time processing pipeline.

Additionally, any security-related events such as failed login attempts, permission changes, or unusual access patterns are immediately sent to a dedicated security analysis pipeline.

During a recent security incident, this routing enabled the security team to detect and respond to an unusual pattern of authentication attempts within minutes, while the complete context of user journeys and application behavior was preserved in the data lake for subsequent forensic analysis.
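
A sketch of rule-based routing in Python; the event names and pipeline labels are illustrative:

```python
# Illustrative routing rules: the data lake always gets a copy, and
# critical or security-relevant events also fan out to dedicated pipelines.
CRITICAL_METRICS = {"txn.success_rate", "fraud.signal", "auth.service_health"}
SECURITY_EVENTS = {"auth.failed_login", "iam.permission_change", "access.anomaly"}

def route(event: dict) -> list[str]:
    """Return every pipeline an event should be sent to."""
    destinations = ["data_lake"]
    name = event.get("name")
    if name in CRITICAL_METRICS:
        destinations.append("realtime_pipeline")
    if name in SECURITY_EVENTS:
        destinations.append("security_pipeline")
    return destinations

print(route({"name": "auth.failed_login", "user": "alice"}))
# ['data_lake', 'security_pipeline']
```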

3. Unified query interface

Unified query interface

Real-world Implementation Example

A specific engineering implementation at last9.io demonstrates how this hybrid approach works in practice:

For a large-scale Kubernetes platform with hundreds of clusters and thousands of services, we implemented a hybrid telemetry pipeline with:

  1. Critical-path metrics processed through a pipeline that:
    • Performs dimensional reduction (limiting label combinations)
    • Pre-calculates service-level aggregations
    • Computes derived metrics like success rates and latency percentiles
  2. Raw telemetry stored in a cost-effective data lake:
    • Partitioned by time, data type, and tenant
    • Optimized for typical query patterns
    • Compressed with appropriate codecs (Zstd for traces, Snappy for metrics)
  3. Unified query layer that:
    • Routes dashboard and alerting queries to pre-aggregated storage
    • Redirects exploratory and ad-hoc analysis to the data lake
    • Manages correlation queries across both systems

This approach delivered both the query performance needed for real-time operations and the analytical depth required for complex troubleshooting.
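
A simplified sketch of how such a unified query layer might choose a backend; the classification heuristics and backend names are assumptions for illustration:

```python
from datetime import datetime, timedelta

def choose_backend(query: dict) -> str:
    """Route short-range dashboard/alert queries to pre-aggregated storage,
    and exploratory or long-range queries to the data lake."""
    lookback = query["end"] - query["start"]
    if query.get("kind") in {"dashboard", "alert"} and lookback <= timedelta(days=14):
        return "preaggregated_tsdb"
    return "data_lake"

print(choose_backend({
    "kind": "dashboard",
    "start": datetime(2023, 11, 7, 14),
    "end": datetime(2023, 11, 7, 16),
}))  # preaggregated_tsdb
```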

Decision Framework

When architecting telemetry pipelines, these technical considerations should guide your approach:

Decision Factor            | Use ETL      | Use ELT
---------------------------|--------------|-----------------
Query latency requirements | < 1 second   | Can wait minutes
Data retention needs       | Days/Weeks   | Months/Years
Cardinality                | Low/Medium   | Very high
Analysis patterns          | Well-defined | Exploratory
Budget priority            | Compute      | Storage

Conclusion

The technical realities of telemetry data processing demand thinking beyond simple ETL vs. ELT paradigms. Engineering teams should architect tiered systems that leverage the strengths of both approaches:

  1. ETL-processed data for operational use cases requiring immediate insights
  2. ELT-processed data for deeper analysis, troubleshooting, and historical patterns
  3. Metadata-driven routing to intelligently direct queries to the appropriate tier

This engineering-centric approach balances performance requirements with cost considerations while maintaining the flexibility required in modern observability systems.

Authors

Nishant Modak, Founder at Last9. Loves building dev tools and listening to The Beatles.

Aditya Godbole, CTO at Last9.