You’ve deployed OpenTelemetry across your stack — traces, metrics, and logs flow through the Collector to your observability backend. Most days, everything runs fine. Until the day it doesn’t.
When production breaks at 2 AM and your telemetry pipeline goes dark — right when you need insights most — how do you debug it? When spans disappear, metrics get dropped, or the Collector crashes under load, the cause isn’t always clear. You’re left wondering if the issue lies in your application instrumentation, Collector configuration, network, or backend.
In this part of the OTel series, we focus on systematic ways to debug OpenTelemetry pipelines: common failure modes, practical debugging workflows, and tools that help you pinpoint issues faster.
The Observability Paradox: Monitoring Your Monitoring
When the OpenTelemetry Collector fails, it can take your entire observability stack with it. Suddenly, there’s no data from your applications—right when you need it most.
A solution to this is to make the Collector observable as well. It’s the classic “who watches the watchers” problem. The Collector should emit its own telemetry: metrics that track data flow, logs that capture errors, and health endpoints that surface its status. Without that signal, troubleshooting becomes guesswork.
By default, the Collector exposes internal metrics on port 8888 in Prometheus format, including:
- `otelcol_receiver_accepted_spans`: incoming data
- `otelcol_exporter_sent_spans`: outgoing data
- `otelcol_exporter_queue_size`: exporter queue utilization
These metrics help you confirm whether the Collector is running as expected and where data might be stuck.
The key shift is to treat your telemetry pipeline like any other production system—with its own dashboards, alerts, and reliability targets.
Debugging Toolkit by OTel
OpenTelemetry includes several built-in tools to help troubleshoot your setup. Let’s walk through them:
Debug Exporter
The most straightforward one is the debug exporter, which prints telemetry data directly to the console. It’s a quick way to verify that the Collector is receiving and processing data correctly—no backend required.
```yaml
exporters:
  debug:
    verbosity: detailed # basic | normal | detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug] # Outputs to console
```

Verbosity levels:
- basic: Shows a single-line summary with record counts
- normal: Displays one line per telemetry record
- detailed: Prints complete details for every record
Start with the debug exporter when you’re setting up a new pipeline, testing configuration changes, or investigating data transformation issues. It’s essentially “printf debugging” for OpenTelemetry—fast feedback without depending on external systems.
The debug exporter replaces the deprecated logging exporter starting from version v0.111.0. If you’re using older configurations, update logging to debug.
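For example, a config that used the old exporter changes only in the exporter name and the pipeline reference; a minimal before/after sketch:

```yaml
# Deprecated (pre-v0.111.0)
exporters:
  logging:
    verbosity: detailed

# Replacement (v0.111.0 and later)
exporters:
  debug:
    verbosity: detailed
```

Remember to update any `exporters: [logging]` entries in your pipelines to `exporters: [debug]` as well.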
Internal telemetry
By default, the OTel collector exposes Prometheus metrics on port 8888, giving you visibility into how the Collector itself is performing. This self-monitoring layer is crucial for production setups because it helps you spot issues before they cascade into lost data.
What to Look For
The Collector’s internal metrics tell a story about your pipeline’s health:
- Data flow: Metrics like `otelcol_receiver_accepted_*` and `otelcol_exporter_sent_*` confirm data is being received and sent correctly. A rise in `otelcol_receiver_refused_*` or `otelcol_exporter_send_failed_*` usually means something’s stuck.
- Queue health: Watch `otelcol_exporter_queue_size` and `otelcol_exporter_queue_capacity` to see if your exporter queues are filling up. Frequent `otelcol_exporter_enqueue_failed_*` signals dropped data.
- Batch performance: The `otelcol_processor_batch_*` metrics show how batches are sent — whether because they’ve reached their size or timeout thresholds.
- Resource usage: `otelcol_process_memory_rss`, `otelcol_process_cpu_seconds_total`, and `otelcol_process_runtime_heap_alloc_bytes` reveal how much system memory and CPU the Collector consumes over time.
Here’s how you can enable it:
Internal telemetry is configured in the service.telemetry section of the Collector config.
```yaml
service:
  telemetry:
    metrics:
      level: normal # none | basic | normal | detailed
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 8888
    logs:
      level: info
      encoding: json
```

This setup exposes metrics on port 8888. You can then scrape these metrics using Prometheus or even configure the Collector to monitor itself.
For example, to send these internal metrics to your backend:
```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector-internal
          scrape_interval: 10s
          static_configs:
            - targets: ["localhost:8888"]

exporters:
  otlp:
    endpoint: monitoring-backend:4317

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [otlp]
```

With this setup, the Collector becomes self-aware — it monitors its own throughput, queue depth, and resource usage. That’s often the first clue when something starts to degrade silently.
zPages
The zPages extension gives you a live, web-based view of what the Collector is doing in real time. It’s one of the fastest ways to understand what’s happening inside your telemetry pipeline—no logs, no dashboards, just direct insight into active operations.
```yaml
extensions:
  zpages:
    endpoint: 0.0.0.0:55679 # Expose on all interfaces for containers

service:
  extensions: [zpages]
```

Once enabled, you can open the TraceZ interface at http://localhost:55679/debug/tracez to:
- Spot spans that never finish — often signs of deadlocks or missing `endSpan()` calls.
- Identify slow operations contributing to latency.
- Review error types and counts as they occur.
- Drill into specific traces to understand Collector processing behavior.
The zPages extension is especially useful during live troubleshooting, when you need to inspect the Collector’s behavior without exporting data elsewhere. It gives you a direct window into the spans being processed and helps confirm whether the Collector is stalled, overloaded, or simply waiting for data.
Health check extension
The health_check extension provides a simple HTTP endpoint that reports the Collector’s status. It’s useful for ensuring your telemetry pipeline stays healthy and for integrating with monitoring or orchestration systems.
```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    path: "/health/status"
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5

service:
  extensions: [health_check]
```

Once enabled, you can access the health endpoint at http://localhost:13133/health/status (the path matches the `path` setting above).
Use this extension for:
- Kubernetes probes: Configure liveness and readiness checks against the endpoint (a probe sketch follows this list).
- Load balancers: Route traffic only to healthy Collectors.
- Monitoring systems: Trigger alerts when the Collector reports unhealthy status.
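For the Kubernetes case, a minimal probe sketch might look like the following; the container name and timings are assumptions to adapt to your deployment, and the path matches the `/health/status` value configured above.

```yaml
# Hypothetical Deployment fragment: probes against the health_check extension
containers:
  - name: otel-collector
    ports:
      - containerPort: 13133
    livenessProbe:
      httpGet:
        path: /health/status
        port: 13133
      initialDelaySeconds: 10
      periodSeconds: 15
    readinessProbe:
      httpGet:
        path: /health/status
        port: 13133
      periodSeconds: 10
```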
Configuration Validation
Before deploying any configuration changes, validate them using the built-in command:
```bash
otelcol validate --config=/path/to/config.yaml
```

This command checks for syntax errors, missing fields, and invalid component references. It’s a simple way to prevent bad configs from reaching production—ideal for running in CI/CD pipelines.
Listing Available Components
You can also inspect which components your Collector build supports:
```bash
otelcol components
```

This lists all receivers, processors, exporters, and extensions included in your current Collector distribution, along with their stability levels (development, alpha, beta, or stable). It’s a quick sanity check before adding a new component to your configuration.
Testing tools
The easiest way to test your telemetry pipeline is with telemetrygen, the official OpenTelemetry tool for generating test traces, metrics, and logs. It lets you simulate load and verify that your Collector is configured correctly—no application instrumentation required.
```bash
# Generate test traces
docker run --network host \
  ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest \
  traces \
  --otlp-endpoint localhost:4317 \
  --otlp-insecure \
  --duration 30s \
  --rate 10
```

This command sends 10 traces per second for 30 seconds to your Collector.
Use telemetrygen to:
- Test new Collector configurations
- Validate pipeline transformations and data routing
- Load test Collectors before production rollout
- Reproduce pipeline issues in isolation
It’s especially handy when you want to test how the Collector behaves under load or after a configuration change—without waiting for real traffic.
Common Issues in OpenTelemetry Pipelines
Telemetry pipelines evolve with scale, and that’s usually when small things start surfacing. Data stops mid-flight, memory usage grows unexpectedly, or traces arrive incomplete.
Let’s look at what typically breaks, why it happens, and what you can do to stabilize your pipeline.
Data Disappears Somewhere in the Pipeline
Your application generates spans, but they never reach your backend. Sometimes only part of a trace appears while the rest vanishes.
Common causes:
- Queue overflow: The Collector can’t export data fast enough, and its in-memory queue fills up. The default queue size is 1000 batches; once full, new data is dropped.
- Backend unavailable: When your backend is slow or unavailable, the Collector retries for up to five minutes (the default timeout) before giving up.
- Missing memory_limiter: Without this processor, the Collector continues accepting data until it runs out of memory and crashes — losing everything in flight.
- Batch processor misconfiguration: Oversized batches can exceed backend request limits, causing the entire batch to be rejected.
You’ll notice this pattern in metrics:
- Sustained increases in `otelcol_receiver_refused_spans` or `otelcol_exporter_send_failed_spans`
- Log messages such as “Dropping data because sending_queue is full”
Monitor these queue and exporter metrics continuously, add the memory_limiter processor to prevent crashes, and tune batch sizes and retry settings to match your backend’s limits and latency.
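Queue and retry behavior is tuned on the exporter itself. Here’s a sketch; the values are illustrative rather than recommendations, so size them against your backend’s rate limits and typical latency.

```yaml
# Exporter queue and retry tuning sketch (values are illustrative)
exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000         # default is 1000 batches
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s   # default retry window is five minutes
```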
Memory Pressure and Out-of-Memory Crashes
A steady rise in memory usage followed by OOM kills usually points to the Collector holding onto more data than expected.
Common causes:
- Tail sampling holding large traces in memory until the decision_wait timeout expires.
- High-cardinality metrics multiplying into millions of time series.
- Batch processor leaks from metadata-based grouping (e.g., tenant_id).
- Resource churn — attributes like `process.pid` changing with every restart.
Start by enabling the pprof extension to analyze where memory is being used:
```yaml
extensions:
  pprof:
    endpoint: localhost:1777

service:
  extensions: [pprof]
```

Access http://localhost:1777/debug/pprof/heap to review heap allocations.
Then configure proper limits:
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_percentage: 75
    spike_limit_percentage: 20

  batch:
    send_batch_size: 1000
    timeout: 5s

  tail_sampling:
    decision_wait: 10s
    num_traces: 20000
```

If you’re deploying on Kubernetes, give the Collector enough headroom for Go’s garbage collector:

```yaml
resources:
  requests:
    memory: "2Gi"
    cpu: "1000m"
  limits:
    memory: "3Gi"
    cpu: "2000m"
```

As a rule of thumb, keep memory limits 25-30% above baseline usage.
When dealing with high-cardinality metrics, you can filter or drop unnecessary labels:
```yaml
processors:
  filter/cardinality:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*_bucket" # Drop unneeded histogram buckets

  resource:
    attributes:
      - key: process.pid
        action: delete
      - key: process.parent_pid
        action: delete
```

Traces Are Incomplete or Fragmented
Traces sometimes appear with missing parent spans, forming orphaned segments instead of a complete path.
Common causes:
- Tail sampling challenges: Spans for the same trace land on different Collectors, so the tail_sampling processor can’t see the full trace before deciding.
- Context propagation failures: Trace headers (traceparent, tracestate) aren’t passed correctly between services — sometimes malformed or stripped by proxies.
- Processor context loss: Some processors (like tail_sampling) rebuild span batches and can lose original context links.
You can spot this in the tail sampling metrics when sampling_trace_dropped_too_early climbs; it counts traces evicted from memory before the decision_wait window elapsed.
To fix this:
- Enable trace-ID-based load balancing so spans for a trace reach the same Collector (see the sketch after this list).
- Configure the tail_sampling processor with a realistic decision_wait timeout.
- Check that all services propagate trace headers correctly.
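One common way to get trace-ID affinity is a two-tier setup: a stateless front tier that routes by trace ID to a pool of sampling Collectors. Below is a minimal sketch of the front tier using the contrib loadbalancing exporter; the headless-service hostname is an assumption, so substitute your own resolver.

```yaml
# Front-tier Collector: routes all spans of a trace to the same sampling Collector
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true
    resolver:
      dns:
        hostname: otel-collector-sampling-headless.observability.svc.cluster.local # assumed name
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter]
      exporters: [loadbalancing]
```

The tail_sampling processor then runs only on the second tier, where each Collector sees complete traces.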
Network and Connectivity Failures
Errors such as “connection refused” or “context deadline exceeded” in Collector logs usually point to network or protocol mismatches.
Common causes:
- Protocol mismatches: The application sends gRPC (4317) while the Collector listens on HTTP (4318). A receiver sketch that accepts both follows this list.
- Missing timeouts: gRPC clients without deadlines hang indefinitely.
- Firewall rules: Traffic blocked by Kubernetes or corporate policies.
- Wrong endpoint format: Using `http://collector:4318` instead of `collector:4318`.
- DNS resolution issues: Service names not resolving correctly in containerized environments.
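To rule out the protocol-mismatch case, it’s common to enable both OTLP transports on the receiver so either client works. This is a sketch; the 0.0.0.0 bind address assumes the Collector runs in a container.

```yaml
# OTLP receiver with both transports enabled
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317 # gRPC clients
      http:
        endpoint: 0.0.0.0:4318 # HTTP/protobuf and HTTP/JSON clients
```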
How to verify connectivity:
```bash
# HTTP/OTLP
curl -v http://collector:4318/v1/traces

# gRPC/OTLP
grpcurl -plaintext collector:4317 list

# Port reachability
telnet collector 4317
```

If these fail, check network policies, open ports, or DNS resolution:

```bash
nslookup collector
kubectl exec -it <pod> -- nslookup collector
```

For gRPC clients, add timeouts:

```go
// Dial with a deadline so a misconfigured endpoint fails fast instead of hanging
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()

conn, err := grpc.DialContext(
	ctx,
	"collector:4317",
	grpc.WithTransportCredentials(insecure.NewCredentials()),
	grpc.WithBlock(), // block until connected or the context deadline expires
)
```

Configuration That Fails Silently
The Collector starts fine but behaves unexpectedly.
Common causes:
- A receiver, processor, or exporter isn’t added to `service.pipelines`.
- Processor order is incorrect (batch before memory_limiter).
- YAML formatting errors or missing keys.
- Using a processor for the wrong job (attributes instead of span).
Before deploying, validate your configuration:
```bash
otelcol validate --config=config.yaml
```

This simple step catches broken references and syntax issues before rollout.
Backend Rejecting Batches
When logs show HTTP 413 — Request Entity Too Large, the backend is rejecting oversized batches.
The usual cause is `send_batch_size` or `send_batch_max_size` set beyond the backend’s request limits.
Estimate a safe size from your backend’s limits. For example, if your backend allows 3.2 MB per request and each record averages 2 KB, the ceiling is roughly 1,600 records per batch; setting the batch size to 1024 keeps you comfortably below it:
```yaml
processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 1024
    timeout: 5s
```

Collector Not Receiving Data
Applications export telemetry successfully, but the Collector sees nothing — otelcol_receiver_accepted_spans stays at zero.
Here’s a checklist you can use:
- Is the receiver defined and included in the pipeline?
- Are ports (4317 or 4318) correct and open?
- Are DNS and network policies allowing traffic?
To confirm:
```bash
# gRPC test
telnet localhost 4317

# HTTP test
curl -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans":[]}'
```

Also check that your SDK configuration matches protocol expectations:

```bash
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc
```

Each of these patterns points to how telemetry flows — or fails to. Once you start watching these signals closely, debugging stops being a guessing game and becomes a matter of tracing the story your pipeline is already telling you.
Make OpenTelemetry Pipelines Production-Ready
Observability works best when it’s part of your system from the start. Waiting for incidents before adding it almost always costs more time later.
Build Observability Into Your Pipelines Early
Enable internal telemetry so the Collector can report its own performance.
```yaml
service:
  telemetry:
    metrics:
      level: detailed
      readers:
        - periodic:
            exporter:
              otlp:
                protocol: http/protobuf
                endpoint: https://your-backend:4318
    logs:
      level: info
      encoding: json
```

Once that’s in place, you can track a few core metrics:
- Queue utilization: `otelcol_exporter_queue_size` / `otelcol_exporter_queue_capacity` — ideally under 80%.
- Export failures: `otelcol_exporter_send_failed_*` — should stay close to zero.
- Receiver refusals: `otelcol_receiver_refused_*` — can indicate backpressure.
- Memory usage: growing trends often point to leaks.
Set graduated alerts so you have enough time to react:
- 70% queue utilization → informational
- 85% → warning
- 95% → critical escalation
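As a rough sketch, these thresholds translate into Prometheus alerting rules like the following, assuming you scrape the Collector’s internal metrics; the group name, alert names, and `for` durations are placeholders.

```yaml
# Graduated alerts on exporter queue utilization (names and durations are placeholders)
groups:
  - name: otel-collector-queue
    rules:
      - alert: CollectorQueueUtilizationInfo
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.70
        for: 10m
        labels:
          severity: info
      - alert: CollectorQueueUtilizationWarning
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.85
        for: 5m
        labels:
          severity: warning
      - alert: CollectorQueueUtilizationCritical
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.95
        for: 2m
        labels:
          severity: critical
```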
Test Configuration Changes Before Production
Treat Collector configuration changes like code — test before deploying.
You can start with a syntax check:
```bash
otelcol validate --config=config.yaml
```

For a visual review, upload your configuration to otelbin.io to see the full pipeline layout.
To simulate real traffic, use telemetrygen:
```bash
docker run ghcr.io/open-telemetry/opentelemetry-collector-contrib/telemetrygen:latest \
  traces --otlp-endpoint localhost:4317 --otlp-insecure --duration 30s --rate 100
```

And if you’re in staging, keep the debug exporter active to confirm that processors behave as expected.
```yaml
exporters:
  debug:
    verbosity: detailed
  otlp:
    endpoint: backend:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [your-processors-here]
      exporters: [debug, otlp]
```

Right-Size Your Collectors
Collectors that are too small drop data; ones that are oversized waste compute and mask inefficiencies.
A good starting point is:
- 2 CPU cores and 2 GB memory for moderate traffic
- Monitor `otelcol_process_cpu_seconds` and `otelcol_process_memory_rss`
- Scale horizontally (more instances) instead of vertically (bigger ones); a scaling sketch follows this list
- Leave 25-30% headroom for Go garbage collection
- Test with realistic traffic volumes before rollout
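For the horizontal-scaling point, a Kubernetes HorizontalPodAutoscaler is one option. This is a sketch that assumes the Collector runs as a Deployment named otel-collector, with illustrative replica counts and CPU targets.

```yaml
# Hypothetical HPA for a Collector Deployment (names and targets are illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: otel-collector
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: otel-collector
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```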
Protect Against Memory Exhaustion
Always include the memory_limiter processor as the first step in every pipeline.
```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, ...] # FIRST
      exporters: [otlp]
```

It stops the Collector from taking in new data when memory usage crosses the soft limit (limit_mib minus spike_limit_mib) and forces garbage collection at the hard limit (limit_mib). Without it, the Collector keeps accepting data until it crashes — losing everything in memory.
Use Persistent Queues for Critical Data
The default in-memory queue is quick but temporary. When the Collector restarts, anything still in memory is gone. If you’re dealing with critical telemetry, that’s risky.
Persistent queues solve this by writing queued data to disk, so it survives restarts and crashes.
```yaml
extensions:
  file_storage:
    directory: /var/lib/otelcol/file_storage

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      storage: file_storage # Reference the extension

service:
  extensions: [file_storage]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```

It’s a simple trade-off — a bit more disk I/O for much better durability.
If you’re running in containers, make sure /var/lib/otelcol/file_storage is backed by a persistent volume claim (PVC) so queued data stays intact even when pods restart.
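A minimal sketch of that wiring, assuming a pre-created PVC named otel-collector-queue (both the container and claim names are placeholders):

```yaml
# Pod spec fragment: mount a PVC at the file_storage directory
containers:
  - name: otel-collector
    volumeMounts:
      - name: otel-queue
        mountPath: /var/lib/otelcol/file_storage
volumes:
  - name: otel-queue
    persistentVolumeClaim:
      claimName: otel-collector-queue
```

Note that each replica needs its own volume, so a StatefulSet with volumeClaimTemplates is often a better fit than a Deployment once you scale out.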
Get More from OpenTelemetry Debugging — with Last9
Once your OpenTelemetry pipelines are optimized and properly monitored, the next challenge is ensuring your backend can handle high-cardinality data without slowing down or driving up costs.
Many backends struggle with attributes such as user_id, request_id, or container_id — fields that can have millions of unique values. To cope, they either:
- Drop attributes to reduce costs, or
- Suffer query performance issues as cardinality increases
Last9, an OTel-native telemetry data platform, is built to handle this scenario efficiently.
With our platform:
- Every attribute you instrument with OpenTelemetry stays fully searchable, even during cardinality spikes.
- You can filter traces by any attribute combination without hitting cardinality or performance limits.
- Attributes like `environment`, `customer_id`, and `feature_flag` remain queryable, giving you all the context you’ve instrumented.
Beyond data ingestion, Last9 can ingest the Collector’s internal telemetry metrics, allowing you to monitor your OpenTelemetry pipeline’s health alongside your application telemetry in a unified observability platform.
“What I like about Last9 is how convenient it makes debugging. Recently, I had to investigate an incident across 20 microservices. With Last9, I could quickly find the exact service I needed and trace the issue through its logs.”
— Sushant Gupta, Software Engineer, Tazapay
Getting started takes about five minutes. The Collector configuration uses standard OTLP—just add Last9’s endpoint and credentials to your existing setup. There’s no vendor lock-in or proprietary format to manage. And if you’re stuck at any point, connect with our experts on how Last9 fits within your stack!
In our next guide, we’ll explore Logs-to-Metrics with OpenTelemetry — why it matters, how to set it up with the Collector, and ways to improve performance.