You’ve deployed OpenTelemetry across your stack — traces, metrics, and logs flow through the Collector to your observability backend. Most days, everything runs fine. Until the day it doesn’t.
When production breaks at 2 AM and your telemetry pipeline goes dark — right when you need insights most — how do you debug it? When spans disappear, metrics get dropped, or the Collector crashes under load, the cause isn’t always clear. You’re left wondering if the issue lies in your application instrumentation, Collector configuration, network, or backend.
In this part of the OTel series, we focus on systematic ways to debug OpenTelemetry pipelines: common failure modes, practical debugging workflows, and tools that help you pinpoint issues faster.
The Observability Paradox: Monitoring Your Monitoring
When the OpenTelemetry Collector fails, it can take your entire observability stack with it. Suddenly, there’s no data from your applications—right when you need it most.
A solution to this is to make the Collector observable as well. It’s the classic “who watches the watchers” problem. The Collector should emit its own telemetry: metrics that track data flow, logs that capture errors, and health endpoints that surface its status. Without that signal, troubleshooting becomes guesswork.
By default, the Collector exposes internal metrics on port 8888 in Prometheus format, including:
- `otelcol_receiver_accepted_spans`: incoming data
- `otelcol_exporter_sent_spans`: outgoing data
- `otelcol_exporter_queue_size`: exporter queue utilization
These metrics help you confirm whether the Collector is running as expected and where data might be stuck.
The key shift is to treat your telemetry pipeline like any other production system—with its own dashboards, alerts, and reliability targets.
Debugging Toolkit by OTel
OpenTelemetry includes several built-in tools to help troubleshoot your setup. Let’s walk through them:
Debug Exporter
The most straightforward one is the debug exporter, which prints telemetry data directly to the console. It’s a quick way to verify that the Collector is receiving and processing data correctly—no backend required.
```yaml
exporters:
  debug:
    verbosity: detailed # basic | normal | detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug] # Outputs to console
```

Verbosity levels:
- basic: Shows a single-line summary with record counts
- normal: Displays one line per telemetry record
- detailed: Prints complete details for every record
Start with the debug exporter when you’re setting up a new pipeline, testing configuration changes, or investigating data transformation issues. It’s essentially “printf debugging” for OpenTelemetry—fast feedback without depending on external systems.
The debug exporter replaces the deprecated logging exporter starting from version v0.111.0. If you’re using older configurations, update logging to debug.
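For example, a config that used the old exporter changes only in the exporter name and the pipeline reference; a minimal before/after sketch:

```yaml
# Deprecated (pre-v0.111.0)
exporters:
  logging:
    verbosity: detailed

# Replacement (v0.111.0 and later)
exporters:
  debug:
    verbosity: detailed
```

Remember to update any `exporters: [logging]` entries in your pipelines to `exporters: [debug]` as well.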
Internal telemetry
By default, the OTel collector exposes Prometheus metrics on port 8888, giving you visibility into how the Collector itself is performing. This self-monitoring layer is crucial for production setups because it helps you spot issues before they cascade into lost data.
What to Look For
The Collector’s internal metrics tell a story about your pipeline’s health:
- Data flow: Metrics like `otelcol_receiver_accepted_*` and `otelcol_exporter_sent_*` confirm data is being received and sent correctly. A rise in `otelcol_receiver_refused_*` or `otelcol_exporter_send_failed_*` usually means something’s stuck.
- Queue health: Watch `otelcol_exporter_queue_size` and `otelcol_exporter_queue_capacity` to see if your exporter queues are filling up. Frequent `otelcol_exporter_enqueue_failed_*` signals dropped data.
- Batch performance: The `otelcol_processor_batch_*` metrics show how batches are sent — whether because they’ve reached their size or timeout thresholds.
- Resource usage: `otelcol_process_memory_rss`, `otelcol_process_cpu_seconds_total`, and `otelcol_process_runtime_heap_alloc_bytes` reveal how much system memory and CPU the Collector consumes over time.
Here’s how you can enable it:
Internal telemetry is configured in the service.telemetry section of the Collector config.
```yaml
service:
  telemetry:
    metrics:
      level: normal # none | basic | normal | detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 8888
    logs:
      level: info
      encoding: json
```

This setup exposes metrics on port 8888. You can then scrape these metrics using Prometheus or even configure the Collector to monitor itself.
For example, to send these internal metrics to your backend:
```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector-internal
          scrape_interval: 10s
          static_configs:
            - targets: ["localhost:8888"]

exporters:
  otlp:
    endpoint: monitoring-backend:4317

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
```

With this setup, the Collector becomes self-aware — it monitors its own throughput, queue depth, and resource usage. That’s often the first clue when something starts to degrade silently.
zPages
The zPages extension gives you a live, web-based view of what the Collector is doing in real time. It’s one of the fastest ways to understand what’s happening inside your telemetry pipeline—no logs, no dashboards, just direct insight into active operations.
```yaml
extensions:
  zpages:
    endpoint: 0.0.0.0:55679 # Expose on all interfaces for containers

service:
  extensions: [zpages]
```

Once enabled, you can open the TraceZ interface at http://localhost:55679/debug/tracez to:
- Spot spans that never finish — often signs of deadlocks or missing `endSpan()` calls.
- Identify slow operations contributing to latency.
- Review error types and counts as they occur.
- Drill into specific traces to understand Collector processing behavior.
The zPages extension is especially useful during live troubleshooting, when you need to inspect the Collector’s behavior without exporting data elsewhere. It gives you a direct window into the spans being processed and helps confirm whether the Collector is stalled, overloaded, or simply waiting for data.
Health check extension
The health_check extension provides a simple HTTP endpoint that reports the Collector’s status. It’s useful for ensuring your telemetry pipeline stays healthy and for integrating with monitoring or orchestration systems.
```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    path: "/health/status"
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5

service:
  extensions: [health_check]
```

Once enabled, you can access the health endpoint at http://localhost:13133/health/status (the path matches the `path` setting above).
Use this extension for:
- Kubernetes probes: Configure liveness and readiness checks against the endpoint (a probe sketch follows this list).
- Load balancers: Route traffic only to healthy Collectors.
- Monitoring systems: Trigger alerts when the Collector reports unhealthy status.
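For the Kubernetes case, a minimal probe sketch might look like the following; the container name and timings are assumptions to adapt to your deployment, and the path matches the `/health/status` value configured above.

```yaml
# Hypothetical Deployment fragment: probes against the health_check extension
containers:
  - name: otel-collector
    ports:
      - containerPort: 13133
    livenessProbe:
      httpGet:
        path: /health/status
        port: 13133
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:
      httpGet:
        path: /health/status
        port: 13133
      periodSeconds: 10
```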
Configuration Validation
Before deploying any configuration changes, validate them using the built-in command:
```bash
otelcol validate --config=/path/to/config.yaml
```

This command checks for syntax errors, missing fields, and invalid component references. It’s a simple way to prevent bad configs from reaching production—ideal for running in CI/CD pipelines.
Listing Available Components
You can also inspect which components your Collector build supports:
```bash
otelcol components
```

This lists all receivers, processors, exporters, and extensions included in your current Collector distribution, along with their stability levels (development, alpha, beta, or stable). It’s a quick sanity check before adding a new component to your configuration.
Testing tools
The easiest way to test your telemetry pipeline is with telemetrygen, the official OpenTelemetry tool for generating test traces, metrics, and logs. It lets you simulate load and verify that your Collector is configured correctly—no application instrumentation required.
```bash
# Generate test traces
docker run --network host \
  ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest \
  traces \
  --otlp-endpoint localhost:4317 \
  --otlp-insecure \
  --duration 30s \
  --rate 10
```

This command sends 10 traces per second for 30 seconds to your Collector.
Use telemetrygen to:
- Test new Collector configurations
- Validate pipeline transformations and data routing
- Load test Collectors before production rollout
- Reproduce pipeline issues in isolation
It’s especially handy when you want to test how the Collector behaves under load or after a configuration change—without waiting for real traffic.
Common Issues in OpenTelemetry Pipelines
Telemetry pipelines evolve with scale, and that’s usually when small things start surfacing. Data stops mid-flight, memory usage grows unexpectedly, or traces arrive incomplete.
Let’s look at what typically breaks, why it happens, and what you can do to stabilize your pipeline.
Data Disappears Somewhere in the Pipeline
Your application generates spans, but they never reach your backend. Sometimes only part of a trace appears while the rest vanishes.
Common causes:
- Queue overflow: The Collector can’t export data fast enough, and its in-memory queue fills up. The default queue size is 1000 batches; once full, new data is dropped.
- Backend unavailable: When your backend is slow or unavailable, the Collector retries for up to five minutes (the default timeout) before giving up.
- Missing memory_limiter: Without this processor, the Collector continues accepting data until it runs out of memory and crashes — losing everything in flight.
- Batch processor misconfiguration: Oversized batches can exceed backend request limits, causing the entire batch to be rejected.
You’ll notice this pattern in metrics:
- Sustained increases in `otelcol_receiver_refused_spans` or `otelcol_exporter_send_failed_spans`
- Log messages such as “Dropping data because sending_queue is full”
Monitor these queue and exporter metrics continuously, add the memory_limiter processor to prevent crashes, and tune batch sizes and retry settings to match your backend’s limits and latency.
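Queue and retry behavior is tuned on the exporter itself. Here’s a sketch; the values are illustrative rather than recommendations, so size them against your backend’s rate limits and typical latency.

```yaml
# Exporter queue and retry tuning sketch (values are illustrative)
exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000         # default is 1000 batches
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s   # default retry window is five minutes
```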
Memory Pressure and Out-of-Memory Crashes
A steady rise in memory usage followed by OOM kills usually points to the Collector holding onto more data than expected.
Common causes:
- Tail sampling holding large traces in memory until the decision_wait timeout expires.
- High-cardinality metrics multiplying into millions of time series.
- Batch processor leaks from metadata-based grouping (e.g., tenant_id).
- Resource churn — attributes like `process.pid` changing with every restart.
Start by enabling the pprof extension to analyze where memory is being used:
```yaml
extensions:
  pprof:
    endpoint: localhost:1777

service:
  extensions: [pprof]
```

Access http://localhost:1777/debug/pprof/heap to review heap allocations.
Then configure proper limits:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 20

  batch:
    send_batch_size: 1000
    timeout: 5s

  tail_sampling:
    decision_wait: 10s
    num_traces: 20000
```

If you’re deploying on Kubernetes, give the Collector enough headroom for Go’s garbage collector:

```yaml
resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "3Gi"
    cpu: "2000m"
```

As a rule of thumb, keep memory limits 25-30% above baseline usage.
When dealing with high-cardinality metrics, you can filter or drop unnecessary labels:
```yaml
processors:
  filter/cardinality:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*_bucket" # Drop unneeded histogram buckets

  resource:
    attributes:
      - key: process.pid
        action: delete
      - key: process.parent_pid
        action: delete
```

Traces Are Incomplete or Fragmented
Traces sometimes appear with missing parent spans, forming orphaned segments instead of a complete path.
Common causes:
- Tail sampling challenges: Spans for the same trace land on different Collectors, so the tail_sampling processor can’t see the full trace before deciding.
- Context propagation failures: Trace headers (traceparent, tracestate) aren’t passed correctly between services — sometimes malformed or stripped by proxies.
- Processor context loss: Some processors (like tail_sampling) rebuild span batches and can lose original context links.
You can spot this in the tail sampling metrics when sampling_trace_dropped_too_early climbs; it counts traces evicted from memory before the decision_wait window elapsed.
To fix this:
- Enable trace-ID-based load balancing so spans for a trace reach the same Collector (see the sketch after this list).
- Configure the tail_sampling processor with a realistic decision_wait timeout.
- Check that all services propagate trace headers correctly.
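One common way to get trace-ID affinity is a two-tier setup: a stateless front tier that routes by trace ID to a pool of sampling Collectors. Below is a minimal sketch of the front tier using the contrib loadbalancing exporter; the headless-service hostname is an assumption, so substitute your own resolver.

```yaml
# Front-tier Collector: routes all spans of a trace to the same sampling Collector
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector-sampling-headless.observability.svc.cluster.local # assumed name
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [loadbalancing]
```

The tail_sampling processor then runs only on the second tier, where each Collector sees complete traces.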
Network and Connectivity Failures
Errors such as “connection refused” or “context deadline exceeded” in Collector logs usually point to network or protocol mismatches.
Common causes:
- Protocol mismatches: The application sends gRPC (4317) while the Collector listens on HTTP (4318). A receiver sketch that accepts both follows this list.
- Missing timeouts: gRPC clients without deadlines hang indefinitely.
- Firewall rules: Traffic blocked by Kubernetes or corporate policies.
- Wrong endpoint format: Using `http://collector:4318` instead of `collector:4318`.
- DNS resolution issues: Service names not resolving correctly in containerized environments.
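To rule out the protocol-mismatch case, it’s common to enable both OTLP transports on the receiver so either client works. This is a sketch; the 0.0.0.0 bind address assumes the Collector runs in a container.

```yaml
# OTLP receiver with both transports enabled
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317 # gRPC clients
      http:
        endpoint: 0.0.0.0:4318 # HTTP/protobuf and HTTP/JSON clients
```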
How to verify connectivity:
```bash
# HTTP/OTLP
curl -v http://collector:4318/v1/traces

# gRPC/OTLP
grpcurl -plaintext collector:4317 list

# Port reachability
telnet collector 4317
```

If these fail, check network policies, open ports, or DNS resolution:

```bash
nslookup collector
kubectl exec -it <pod> -- nslookup collector
```

For gRPC clients, add timeouts:

```go
// Dial with a deadline so a misconfigured endpoint fails fast instead of hanging
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

conn, err := grpc.DialContext(
	ctx,
	"collector:4317",
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithBlock(), // block until connected or the context deadline expires
)
```

Configuration That Fails Silently
The Collector starts fine but behaves unexpectedly.
Common causes:
- A receiver, processor, or exporter isn’t added to `service.pipelines`.
- Processor order is incorrect (batch before memory_limiter).
- YAML formatting errors or missing keys.
- Using a processor for the wrong job (attributes instead of span).
Before deploying, validate your configuration:
```bash
otelcol validate --config=config.yaml
```

This simple step catches broken references and syntax issues before rollout.
Backend Rejecting Batches
When logs show HTTP 413 — Request Entity Too Large, the backend is rejecting oversized batches.
The usual cause is `send_batch_size` or `send_batch_max_size` set beyond the backend’s request limits.
Estimate a safe size from your backend’s limits. For example, if your backend allows 3.2 MB per request and each record averages 2 KB, the ceiling is roughly 1,600 records per batch; setting the batch size to 1024 keeps you comfortably below it:
```yaml
processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 1024
    timeout: 5s
```

Collector Not Receiving Data
Applications export telemetry successfully, but the Collector sees nothing — otelcol_receiver_accepted_spans stays at zero.
Here’s a checklist you can use:
- Is the receiver defined and included in the pipeline?
- Are ports (4317 or 4318) correct and open?
- Are DNS and network policies allowing traffic?
To confirm:
```bash
# gRPC test
telnet localhost 4317

# HTTP test
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'
```

Also check that your SDK configuration matches protocol expectations:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```

Each of these patterns points to how telemetry flows — or fails to. Once you start watching these signals closely, debugging stops being a guessing game and becomes a matter of tracing the story your pipeline is already telling you.
Make OpenTelemetry Pipelines Production-Ready
Observability works best when it’s part of your system from the start. Waiting for incidents before adding it almost always costs more time later.
Build Observability Into Your Pipelines Early
Enable internal telemetry so the Collector can report its own performance.
```yaml
service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: https://your-backend:4318
    logs:
      level: info
      encoding: json
```

Once that’s in place, you can track a few core metrics:
- Queue utilization: `otelcol_exporter_queue_size` / `otelcol_exporter_queue_capacity` — ideally under 80%.
- Export failures: `otelcol_exporter_send_failed_*` — should stay close to zero.
- Receiver refusals: `otelcol_receiver_refused_*` — can indicate backpressure.
- Memory usage: growing trends often point to leaks.
Set graduated alerts so you have enough time to react:
- 70% queue utilization → informational
- 85% → warning
- 95% → critical escalation
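As a rough sketch, these thresholds translate into Prometheus alerting rules like the following, assuming you scrape the Collector’s internal metrics; the group name, alert names, and `for` durations are placeholders.

```yaml
# Graduated alerts on exporter queue utilization (names and durations are placeholders)
groups:
  - name: otel-collector-queue
    rules:
      - alert: CollectorQueueUtilizationInfo
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.70
        for: 10m
        labels:
          severity: info
      - alert: CollectorQueueUtilizationWarning
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.85
        for: 5m
        labels:
          severity: warning
      - alert: CollectorQueueUtilizationCritical
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.95
        for: 2m
        labels:
          severity: critical
```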
Test Configuration Changes Before Production
Treat Collector configuration changes like code — test before deploying.
You can start with a syntax check:
```bash
otelcol validate --config=config.yaml
```

For a visual review, upload your configuration to otelbin.io to see the full pipeline layout.
To simulate real traffic, use telemetrygen:
```bash
docker run ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest \
  traces --otlp-endpoint localhost:4317 --otlp-insecure --duration 30s --rate 100
```

And if you’re in staging, keep the debug exporter active to confirm that processors behave as expected.
```yaml
exporters:
  debug:
    verbosity: detailed
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [your-processors-here]
      exporters: [debug, otlp]
```

Right-Size Your Collectors
Collectors that are too small drop data; ones that are oversized waste compute and mask inefficiencies.
A good starting point is:
- 2 CPU cores and 2 GB memory for moderate traffic
- Monitor `otelcol_process_cpu_seconds` and `otelcol_process_memory_rss`
- Scale horizontally (more instances) instead of vertically (bigger ones); a scaling sketch follows this list
- Leave 25-30% headroom for Go garbage collection
- Test with realistic traffic volumes before rollout
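For the horizontal-scaling point, a Kubernetes HorizontalPodAutoscaler is one option. This is a sketch that assumes the Collector runs as a Deployment named otel-collector, with illustrative replica counts and CPU targets.

```yaml
# Hypothetical HPA for a Collector Deployment (names and targets are illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```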
Protect Against Memory Exhaustion
Always include the memory_limiter processor as the first step in every pipeline.
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, ...] # FIRST
      exporters: [otlp]
```

It stops the Collector from taking in new data when memory usage crosses the soft limit (limit_mib minus spike_limit_mib) and forces garbage collection at the hard limit (limit_mib). Without it, the Collector keeps accepting data until it crashes — losing everything in memory.
Use Persistent Queues for Critical Data
The default in-memory queue is quick but temporary. When the Collector restarts, anything still in memory is gone. If you’re dealing with critical telemetry, that’s risky.
Persistent queues solve this by writing queued data to disk, so it survives restarts and crashes.
```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage # Reference the extension

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

It’s a simple trade-off — a bit more disk I/O for much better durability.
If you’re running in containers, make sure /var/lib/otelcol/file_storage is backed by a persistent volume claim (PVC) so queued data stays intact even when pods restart.
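A minimal sketch of that wiring, assuming a pre-created PVC named otel-collector-queue (both the container and claim names are placeholders):

```yaml
# Pod spec fragment: mount a PVC at the file_storage directory
containers:
  - name: otel-collector
    volumeMounts:
      - name: otel-queue
        mountPath: /var/lib/otelcol/file_storage
volumes:
  - name: otel-queue
    persistentVolumeClaim:
      claimName: otel-collector-queue
```

Note that each replica needs its own volume, so a StatefulSet with volumeClaimTemplates is often a better fit than a Deployment once you scale out.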
Get More from OpenTelemetry Debugging — with Last9
Once your OpenTelemetry pipelines are optimized and properly monitored, the next challenge is ensuring your backend can handle high-cardinality data without slowing down or driving up costs.
Many backends struggle with attributes such as user_id, request_id, or container_id — fields that can have millions of unique values. To cope, they either:
- Drop attributes to reduce costs, or
- Suffer query performance issues as cardinality increases
Last9, an OTel-native telemetry data platform, is built to handle this scenario efficiently.
With our platform:
- Every attribute you instrument with OpenTelemetry stays fully searchable, even during cardinality spikes.
- You can filter traces by any attribute combination without hitting cardinality or performance limits.
- Attributes like `environment`, `customer_id`, and `feature_flag` remain queryable, giving you all the context you’ve instrumented.
Beyond data ingestion, Last9 can ingest the Collector’s internal telemetry metrics, allowing you to monitor your OpenTelemetry pipeline’s health alongside your application telemetry in a unified observability platform.
“What I like about Last9 is how convenient it makes debugging. Recently, I had to investigate an incident across 20 microservices. With Last9, I could quickly find the exact service I needed and trace the issue through its logs.”
— Sushant Gupta, Software Engineer, Tazapay
Getting started takes about five minutes. The Collector configuration uses standard OTLP—just add Last9’s endpoint and credentials to your existing setup. There’s no vendor lock-in or proprietary format to manage. And if you’re stuck at any point, connect with our experts on how Last9 fits within your stack!
In our next guide, we’ll explore Logs-to-Metrics with OpenTelemetry — why it matters, how to set it up with the Collector, and ways to improve performance.