
ETL vs ELT for Observability: Technical Approaches and Practical Tradeoffs

Explore the key differences between ETL and ELT for observability data, their technical approaches, and the tradeoffs that impact performance and cost.

In the world of monitoring software, how you process telemetry data can significantly impact your ability to derive insights, troubleshoot issues, and manage costs.

There are two primary use cases for how telemetry data is used:

  • Radar (monitoring of systems) usually falls into the bucket of known knowns and known unknowns. Some data is almost 'pre-determined' to behave and be plotted in a certain way, because we know what we are looking for.
  • Black box (debugging, RCA, etc.) is about unknown unknowns: the things we don't yet know and may need to hunt for to build an understanding of the system.
Related read: Software Observability from the Lens of Radar and a Black Box | Last9, on how observability can be viewed through the lens of a radar and a black box.

Understanding Telemetry Data Challenges

Before diving into processing approaches, it's important to understand the unique challenges of telemetry data:

  • Volume: Modern systems generate enormous amounts of telemetry data
  • Velocity: Data arrives in continuous, high-throughput streams
  • Variety: Multiple formats across metrics, logs, traces, profiles, and events
  • Time-sensitivity: Value often decreases with age
  • Correlation needs: Data from different sources must be linked together

These characteristics create special considerations when choosing between ETL and ELT approaches.

ETL for Telemetry: Transform-First Architecture

Technical Architecture

In an ETL approach, telemetry data is transformed before reaching its final destination:

ETL for Telemetry

A typical implementation stack might include:

  • Collection: OpenTelemetry, Prometheus, Fluent Bit
  • Transport: Kafka, Kinesis, or an in-memory buffer
  • Transformation: Stream processing
  • Storage: Time-series databases (e.g., Prometheus), specialized indices, or object storage (S3)

Key Technical Components

1. Aggregation Techniques

Pre-aggregation significantly reduces data volume and query complexity. A typical pre-aggregation flow looks like this:

Aggregation Techniques

This transformation condenses raw data into 5-minute summaries, dramatically reducing storage requirements and improving query performance.

Example: For a gaming application handling millions of requests per day, raw request latency metrics (potentially billions of data points) can be grouped by service and endpoint, then aggregated into 5-minute (or 1-minute) windows.

A single API endpoint generating 100 latency data points per second (8.64 million per day) is reduced to just 288 aggregated entries per day (one per 5-minute window), while still preserving the p50/p90/p99 percentiles needed for SLA monitoring.
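
Here's a minimal sketch of that kind of pre-aggregation using pandas; the field names, sample data, and 5-minute window are illustrative, not a prescribed schema:

```python
import pandas as pd

# Hypothetical raw latency samples: one row per request.
raw = pd.DataFrame({
    "timestamp": pd.date_range("2025-03-13", periods=10_000, freq="s"),
    "service": "checkout",
    "endpoint": "/api/pay",
    "latency_ms": (pd.Series(range(10_000)) % 250) + 20,
})

# Roll raw samples up into 5-minute windows, keeping only the counts and
# percentiles needed for SLA monitoring instead of every data point.
agg = (
    raw.groupby([pd.Grouper(key="timestamp", freq="5min"), "service", "endpoint"])["latency_ms"]
       .agg(
           count="count",
           p50=lambda s: s.quantile(0.50),
           p90=lambda s: s.quantile(0.90),
           p99=lambda s: s.quantile(0.99),
       )
       .reset_index()
)
print(agg.head())
```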

2. Cardinality Management

High-cardinality metrics can break time-series databases. The cardinality management process follows this pattern:

Cardinality Management

Effective strategies include:

  • Label filtering and normalization
  • Cardinality budgets
  • Strategic aggregation of specific dimensions
  • Hashing techniques for high-cardinality values while preserving query patterns

Example: A microservice tracking HTTP requests includes user IDs and request paths in its metrics. With 50,000 daily active users and thousands of unique URL paths, this creates millions of unique label combinations.

The cardinality management system drops user IDs entirely (a configurable rule; they are far too high-cardinality), normalizes URL paths by replacing dynamic segments with placeholders (e.g., /users/123/profile becomes /users/{id}/profile), and applies consistent categorization to errors. This reduces the number of unique time series from millions to hundreds, allowing the time-series database to function efficiently.
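
A small sketch of this kind of label filtering and path normalization in Python; the rules and label names are assumptions for illustration:

```python
import re

# Illustrative rules: drop user-level labels and collapse dynamic URL segments.
DROP_LABELS = {"user_id"}
PATH_RULES = [
    (re.compile(r"/users/\d+"), "/users/{id}"),
    (re.compile(r"/orders/[0-9a-f-]{36}"), "/orders/{uuid}"),
]

def normalize_labels(labels: dict) -> dict:
    """Reduce label cardinality before the series reaches the time-series database."""
    out = {k: v for k, v in labels.items() if k not in DROP_LABELS}
    path = out.get("path")
    if path:
        for pattern, replacement in PATH_RULES:
            path = pattern.sub(replacement, path)
        out["path"] = path
    return out

print(normalize_labels({"path": "/users/123/profile", "user_id": "123", "method": "GET"}))
# {'path': '/users/{id}/profile', 'method': 'GET'}
```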

💡
If you're working with telemetry data, high cardinality can be tricky. Learn how it impacts performance in this guide.

3. Real-time Enrichment

Adding context to metrics during the transformation phase involves integrating external data sources:

Real-time Enrichment

This process adds critical business and operational context to raw telemetry data, enabling more meaningful analysis and alerting based on service importance, customer impact, and other factors beyond pure technical metrics.

Example: A payment processing service emits basic metrics like request counts, latencies, and error rates. The enrichment pipeline joins this telemetry with service registry data to add metadata about the service tier (critical), SLO targets (99.99% availability), and team ownership (payments-team).

It then incorporates business context to tag transactions with their type (subscription renewal, one-time purchase, refund) and estimated revenue impact. When an incident occurs, alerts are automatically prioritized based on business impact rather than just technical severity, and routed to the appropriate team with rich context.
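
A simplified sketch of this enrichment step in Python; the registry contents, transaction types, and revenue figures are made up for illustration:

```python
# Hypothetical service registry and business metadata used for enrichment.
SERVICE_REGISTRY = {
    "payments-api": {"tier": "critical", "slo_availability": 0.9999, "owner": "payments-team"},
}
TXN_REVENUE = {"subscription_renewal": 29.0, "one_time_purchase": 75.0, "refund": -40.0}

def enrich(event: dict) -> dict:
    """Attach ownership, SLO, and revenue context to a raw telemetry event."""
    meta = SERVICE_REGISTRY.get(event["service"], {})
    enriched = {**event, **meta}
    enriched["est_revenue_impact"] = TXN_REVENUE.get(event.get("txn_type"), 0.0)
    return enriched

raw_event = {"service": "payments-api", "status": 500, "latency_ms": 1240, "txn_type": "subscription_renewal"}
print(enrich(raw_event))
```

With that context attached, an alerting rule can prioritize on fields like tier and est_revenue_impact rather than raw error counts alone.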

Technical Advantages

  1. Query performance: Pre-calculated aggregates eliminate computation at query time
  2. Predictable resource usage: Both storage and query compute are controlled
  3. Schema enforcement: Data conformity is guaranteed before storage
  4. Optimized storage formats: Data can be stored in formats optimized for specific access patterns

Technical Limitations

  1. Loss of granularity: Some detail is permanently lost
  2. Schema rigidity: Adapting to new requirements requires pipeline changes
  3. Processing overhead: Real-time transformation adds complexity and resource demands
  4. Transformation-time decisions: Analysis paths must be known in advance
💡
If you're using OpenTelemetry, knowing how metrics aggregation works can help. Check out this guide to learn more.

ELT for Telemetry: Raw Storage with Flexible Transformation

Technical Architecture

ELT architecture prioritizes getting raw data into storage, with transformations performed at query time:

ELT for Telemetry

A typical implementation might include:

  • Collection: OpenTelemetry, Prometheus, Fluent Bit
  • Transport: Direct ingestion without complex processing
  • Storage: Object storage (S3, GCS) or data lakes in Parquet format
  • Transformation: SQL engines (Presto, Athena), Spark jobs, or specialized OLAP systems

Key Technical Components

1. Efficient Raw Storage

Optimizing for long-term storage of raw telemetry requires careful consideration of file formats and storage organization:

Efficient Raw Storage

This approach leverages columnar storage formats like Parquet with appropriate compression (ZSTD for traces, Snappy for metrics), dictionary encoding, and optimized column indexing based on common query patterns (trace_id, service, time ranges).

Example: A cloud-native application generates 10TB of trace data daily across its distributed services. Instead of discarding or heavily sampling this data, the complete trace information is captured using OpenTelemetry collectors and converted to Parquet format with ZSTD compression.

Key fields like trace_id, service name, and timestamp are indexed for efficient querying. This approach reduces the storage footprint by 85% compared to raw JSON while maintaining query performance.

When a critical customer-impacting issue occurred, engineers were able to access complete trace data from 3 months prior, identifying a subtle pattern of intermittent failures that would have been lost with traditional sampling.
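
A minimal sketch of writing span records to Parquet with ZSTD compression and dictionary encoding using pyarrow; the span fields are illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A few hypothetical span records; a real pipeline would stream these
# from an OpenTelemetry collector exporter.
spans = pa.table({
    "trace_id": ["a1b2", "a1b2", "c3d4"],
    "service": ["checkout", "payments-api", "checkout"],
    "timestamp": [1699351200000, 1699351200120, 1699351260000],
    "duration_ms": [120, 45, 310],
})

# Columnar storage with ZSTD compression; dictionary encoding keeps
# repetitive string columns (trace_id, service) compact.
pq.write_table(
    spans,
    "traces.parquet",
    compression="zstd",
    use_dictionary=["trace_id", "service"],
)
```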

2. Partitioning Strategies

Effective partitioning is crucial for query performance against raw telemetry. A well-designed partitioning strategy follows this hierarchy:

Partitioning Strategies

This partitioning approach enables efficient time-range queries while also allowing filtering by service and tenant, which are common query dimensions. The partitioning strategy is designed to:

  1. Optimize for time-based retrieval (most common query pattern)
  2. Enable efficient tenant isolation for multi-tenant systems
  3. Allow service-specific queries without scanning all data
  4. Separate telemetry types for optimized storage formats per type

Example: A SaaS platform with 200+ enterprise customers uses this partitioning strategy for its observability data lake.

When a high-priority customer reports an issue that occurred last Tuesday between 2-4 pm, engineers can immediately query just those specific partitions: /year=2023/month=11/day=07/hour=1[4-5]/tenant=enterprise-x/*.

This approach reduces the scan size from potentially petabytes to just a few gigabytes, enabling responses in seconds rather than hours.

When comparing current performance against historical baselines, the time-based partitioning allows efficient month-over-month comparisons by scanning only the relevant time partitions.
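
A sketch of reading only the relevant Hive-style partitions with pyarrow; the bucket path and partition column names follow the example above but are assumptions:

```python
import pyarrow.dataset as ds

# Read only the partitions that match the incident window instead of
# scanning the whole data lake (path and partition columns are assumed).
dataset = ds.dataset(
    "s3://telemetry-lake/traces/",
    format="parquet",
    partitioning="hive",
)

incident = dataset.to_table(
    filter=(
        (ds.field("year") == 2023)
        & (ds.field("month") == 11)
        & (ds.field("day") == 7)
        & (ds.field("hour").isin([14, 15]))
        & (ds.field("tenant") == "enterprise-x")
    )
)
print(incident.num_rows)
```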

💡
Profiling in OpenTelemetry helps you see where performance issues hide. Check out this guide to learn more.

3. Query-time Transformations

SQL and analytical engines provide powerful query-time transformations. The query processing flow for on-the-fly analysis looks like this:

Query-time Transformations

This query flow demonstrates how complex analysis like calculating service latency percentiles, error rates, and usage patterns can be performed entirely at query time without needing pre-computation.

The analytical engine applies optimizations like predicate pushdown, parallel execution, and columnar processing to achieve reasonable performance even against large raw datasets.

Example: A DevOps team investigating a performance regression discovered it only affected premium customers using a specific feature.

Using query-time transformations against the ELT data lake, they wrote a single query that first filtered to the affected period, joined customer tier information, extracted relevant attributes about feature usage, calculated percentile response times grouped by customer segment, and identified that premium customers with high transaction volumes were experiencing degraded performance only when a specific optional feature flag was enabled.

This analysis would have been impossible with pre-aggregated data since the customer segment + feature flag dimension hadn't been previously identified as important for monitoring.
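
A sketch of this kind of query-time analysis using DuckDB over Parquet files in object storage; the column names, time range, and S3 path are illustrative:

```python
import duckdb

# Query-time aggregation over raw Parquet files in the data lake; the
# schema is only discovered at read time, so new dimensions (customer
# tier, feature flags) can be sliced without any pipeline change.
result = duckdb.query("""
    SELECT
        customer_tier,
        feature_flag_enabled,
        approx_quantile(latency_ms, 0.99) AS p99_latency_ms,
        avg(CASE WHEN status >= 500 THEN 1 ELSE 0 END) AS error_rate
    FROM read_parquet('s3://telemetry-lake/requests/year=2023/month=11/**/*.parquet')
    WHERE ts BETWEEN TIMESTAMP '2023-11-01 00:00:00' AND TIMESTAMP '2023-11-08 00:00:00'
    GROUP BY customer_tier, feature_flag_enabled
    ORDER BY p99_latency_ms DESC
""").df()
print(result)
```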

Last9 Log Explorer

Technical Advantages

  1. Full data fidelity: No information is lost in pre-processing
  2. Schema flexibility: New dimensions can be analyzed without pipeline changes
  3. Cost-effective storage: Object storage is significantly cheaper than specialized DBs
  4. Retroactive analysis: Historical data can be examined with new perspectives

Technical Limitations

  1. Query performance challenges: Interactive analysis may be slow on large datasets
  2. Resource-intensive analysis: Compute costs can be high for complex queries
  3. Implementation complexity: Requires more sophisticated query tooling
  4. Storage overhead: Raw data consumes significantly more space
💡
Environment variables can make OpenTelemetry setup smoother. Here's a guide to help: OpenTelemetry Environment Variables.

Technical Implementation: The Hybrid Approach

Core Architecture Components

Hybrid Telemetry Architecture

Implementation Strategy

1. Dual-path processing

Dual-path processing

Example: A global ride-sharing platform implemented a dual-path telemetry system that routes service health metrics and customer experience indicators (ride wait times, ETA accuracy) through the ETL path for real-time dashboards and alerting.

Meanwhile, all raw data including detailed user journeys, driver activities, and application logs flow through the ELT path to cost-effective storage. When a regional outage occurred, operations teams used the real-time dashboards to quickly identify and mitigate the immediate issue.

Later, data scientists used the preserved raw data to perform a comprehensive root cause analysis, correlating multiple factors that wouldn't have been visible in pre-aggregated data alone.
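
A minimal sketch of the dual-path fan-out in Python; the sink interfaces and the real-time metric allowlist are assumptions, not part of any specific collector API:

```python
# Curated metrics that deserve the transform-first (ETL) treatment.
REALTIME_METRICS = {"ride_wait_seconds", "eta_error_seconds", "service_error_rate"}

def transform(event: dict) -> dict:
    # Placeholder for the ETL-side work: aggregation, enrichment, label pruning.
    return {**event, "pipeline": "realtime"}

def process(event: dict, realtime_sink, lake_sink) -> None:
    # ELT path: every event is archived raw.
    lake_sink.write(event)
    # ETL path: only operationally critical metrics feed dashboards and alerting.
    if event.get("metric") in REALTIME_METRICS:
        realtime_sink.write(transform(event))
```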

2. Smart data routing

Smart data routing

Example: A financial services company deployed a smart routing system for their telemetry data. All data is preserved in the data lake, but critical metrics like transaction success rates, fraud detection signals, and authentication service health metrics are immediately routed to the real-time processing pipeline.

Additionally, any security-related events such as failed login attempts, permission changes, or unusual access patterns are immediately sent to a dedicated security analysis pipeline.

During a recent security incident, this routing enabled the security team to detect and respond to an unusual pattern of authentication attempts within minutes, while the complete context of user journeys and application behavior was preserved in the data lake for subsequent forensic analysis.
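
A sketch of rule-based routing in Python; the event names and pipeline labels are illustrative:

```python
# Illustrative routing rules: the data lake always gets a copy, and
# critical or security-relevant events also fan out to dedicated pipelines.
CRITICAL_METRICS = {"txn.success_rate", "fraud.signal", "auth.service_health"}
SECURITY_EVENTS = {"auth.failed_login", "iam.permission_change", "access.anomaly"}

def route(event: dict) -> list[str]:
    """Return every pipeline an event should be sent to."""
    destinations = ["data_lake"]
    name = event.get("name")
    if name in CRITICAL_METRICS:
        destinations.append("realtime_pipeline")
    if name in SECURITY_EVENTS:
        destinations.append("security_pipeline")
    return destinations

print(route({"name": "auth.failed_login", "user": "alice"}))
# ['data_lake', 'security_pipeline']
```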

3. Unified query interface

Unified query interface

Real-world Implementation Example

A specific engineering implementation at last9.io demonstrates how this hybrid approach works in practice:

For a large-scale Kubernetes platform with hundreds of clusters and thousands of services, we implemented a hybrid telemetry pipeline with:

  1. Critical-path metrics processed through a pipeline that:
    • Performs dimensional reduction (limiting label combinations)
    • Pre-calculates service-level aggregations
    • Computes derived metrics like success rates and latency percentiles
  2. Raw telemetry stored in a cost-effective data lake:
    • Partitioned by time, data type, and tenant
    • Optimized for typical query patterns
    • Compressed with appropriate codecs (Zstd for traces, Snappy for metrics)
  3. Unified query layer that:
    • Routes dashboard and alerting queries to pre-aggregated storage
    • Redirects exploratory and ad-hoc analysis to the data lake
    • Manages correlation queries across both systems

This approach delivered both the query performance needed for real-time operations and the analytical depth required for complex troubleshooting.
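
A simplified sketch of how such a unified query layer might choose a backend; the classification heuristics and backend names are assumptions for illustration:

```python
from datetime import datetime, timedelta

def choose_backend(query: dict) -> str:
    """Route short-range dashboard/alert queries to pre-aggregated storage,
    and exploratory or long-range queries to the data lake."""
    lookback = query["end"] - query["start"]
    if query.get("kind") in {"dashboard", "alert"} and lookback <= timedelta(days=14):
        return "preaggregated_tsdb"
    return "data_lake"

print(choose_backend({
    "kind": "dashboard",
    "start": datetime(2023, 11, 7, 14),
    "end": datetime(2023, 11, 7, 16),
}))  # preaggregated_tsdb
```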

Decision Framework

When architecting telemetry pipelines, these technical considerations should guide your approach:

Decision Factor            | Use ETL      | Use ELT
---------------------------|--------------|-----------------
Query latency requirements | < 1 second   | Can wait minutes
Data retention needs       | Days/Weeks   | Months/Years
Cardinality                | Low/Medium   | Very high
Analysis patterns          | Well-defined | Exploratory
Budget priority            | Compute      | Storage

Conclusion

The technical realities of telemetry data processing demand thinking beyond simple ETL vs. ELT paradigms. Engineering teams should architect tiered systems that leverage the strengths of both approaches:

  1. ETL-processed data for operational use cases requiring immediate insights
  2. ELT-processed data for deeper analysis, troubleshooting, and historical patterns
  3. Metadata-driven routing to intelligently direct queries to the appropriate tier

This engineering-centric approach balances performance requirements with cost considerations while maintaining the flexibility required in modern observability systems.

Authors

Nishant Modak, Founder at Last9. Loves building dev tools and listening to The Beatles.

Aditya Godbole, CTO at Last9.