In earlier posts, we looked at what OpenTelemetry is and how it helps you generate traces, metrics, and logs from your code. But collecting telemetry isn’t the hard part. Managing it, without ballooning costs, silos, or manual setup, is where most teams get stuck. This is where infrastructure teams lean on the OpenTelemetry Collector.
You need a way to move telemetry across services, enrich it, and shape it for downstream systems, without overwhelming your network or backend. And with today’s setups spanning batch jobs, AI inference layers, and cross-region components, that middle layer matters more than ever. The Collector is built to handle this.
What is the OpenTelemetry Collector?
The OpenTelemetry Collector is a vendor-agnostic telemetry pipeline that sits between your applications and observability backends. While you could send telemetry directly from your SDKs to backends, you’ll want the Collector when you need to transform data, reduce costs through sampling, or avoid vendor lock-in.
It accepts data from multiple sources, processes it, and ships it to one or more destinations. It’s like a switchboard operator receiving, transforming, and forwarding telemetry without hard-wiring any app to a specific backend.
It supports common protocols and formats — OTLP, Jaeger, Zipkin, Prometheus, FluentBit, and others, and can export to systems like Last9, Kafka, Prometheus, Jaeger, etc. This means your services can stay loosely coupled to the backend stack, which makes upgrades, migrations, and experimentation a lot easier.
What makes the Collector especially useful is its modular design. Instead of a one-size-fits-all agent, it gives you building blocks to create custom pipelines, tailored to your infrastructure, traffic patterns, and cost limits.
Where Does the Collector Fit?
The Collector can run in a few different ways, depending on how your infrastructure is laid out:
- Agent Mode: Deployed alongside each app (usually as a sidecar or daemon). Useful for injecting host-level metadata, filtering junk early, and reducing network chatter. Trade-off: lower latency and better reliability, but higher resource overhead per service.
- Gateway Mode: A central service that receives telemetry from multiple sources. Better suited for batching, routing, and processing at scale. Trade-off: better for compliance and cost control, but creates a single point of failure and potential bottleneck.
- Hybrid Mode: Combines the two. Agents handle local tasks, while the gateway takes care of the heavy lifting. This is the go-to setup for larger, distributed systems where you need both local processing and centralized control (a minimal agent-to-gateway config sketch follows this list).
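To make the agent and gateway roles concrete, here's a minimal sketch of an agent-mode Collector that receives OTLP locally and forwards everything to a central gateway. The gateway address otel-gateway.internal:4317 and the resource limits are placeholder assumptions, not values from this post.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    limit_percentage: 75
    spike_limit_percentage: 15
    check_interval: 2s
  batch:
    timeout: 5s

exporters:
  otlp:
    # Placeholder address: point this at your gateway Collector
    endpoint: otel-gateway.internal:4317
    tls:
      insecure: true   # assumes plaintext inside the cluster; enable TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

The gateway side would run a similar config, but with the heavier processing (sampling, transformation) and the exporters that point at your actual backends.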
How the Collector Works: Pipelines
The OpenTelemetry Collector runs on a simple but flexible pipeline model: receivers bring data in, processors shape it, and exporters send it out. Connectors are optional; they come into play when you need to route data between multiple pipelines, for example, converting traces to metrics or reusing telemetry in different workflows.
Here’s what each stage does:
- Receivers: Entry points for telemetry. They support protocols like OTLP, Jaeger, Zipkin, and Prometheus. Once data is ingested, it's translated into a consistent format.
- Processors: This is where data gets shaped. Processors can be chained together to drop unused or noisy data, enrich or modify metadata, apply sampling to reduce volume, or batch and compress payloads for efficiency. It's the tuning stage, where you optimize signals for clarity, cost, and performance.
- Exporters: Responsible for sending data out. Whether it's Last9, Prometheus, or Kafka, exporters handle delivery, retries, and any protocol conversions.
- Connectors: Route data between multiple pipelines, for example, converting traces to metrics or reusing telemetry across workflows.
- Extensions: Not in the core data path, but essential for operations. Extensions handle things like health checks, TLS, authentication, and debug endpoints like /metrics, /debug/pprof, and zPages.
Put together, these building blocks turn the Collector into your observability control plane, tuned for the kind of modern workloads that break traditional monitoring setups.
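To make the connector idea concrete, here's a sketch using the spanmetrics connector from the contrib distribution: it acts as an exporter on the traces pipeline and as a receiver on the metrics pipeline, deriving request-rate and latency metrics from spans. The backend endpoint is a placeholder.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

connectors:
  spanmetrics: {}   # derives call count and duration metrics from incoming spans

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://otlp.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp, spanmetrics]   # the connector sits on the exporter side here
    metrics:
      receivers: [spanmetrics]             # ...and on the receiver side here
      processors: [batch]
      exporters: [otlphttp]
```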
OTel Collector Pipelines: Types and Boundaries
When working with the Collector, it’s easy to focus on wiring together receivers, processors, and exporters. But an important design choice underpins all of this: traces, metrics, and logs are handled separately.
For example, trace pipelines may use sampling, while metrics pipelines often batch or aggregate data. This separation prevents data loss and instability, especially under load.
Here’s how pipeline isolation plays out in practice:
- Trace pipelines: These deal with high cardinality and volume. You'll often want to enrich spans with metadata, apply sampling to reduce noise, and batch them before export. Done right, this keeps your tracing backend from getting overwhelmed.
- Metrics pipelines: Metrics are more predictable, but they pile up fast, especially with containerized apps or scraping-heavy setups. The focus here tends to be on detecting resources, aggregating time series, and batching for efficient export.
- Log pipelines: Logs are messy: unstructured text, multiline events, and noisy debug statements. These pipelines often need parsing, filtering, formatting, or even redaction processors before the logs are usable.
Keeping pipelines isolated gives you a few key advantages.
For example, if there’s a sudden spike in log volume, say, someone enables verbose debug logs, the log pipeline might slow down or drop data. But your traces and metrics continue to flow just fine, because they’re running in separate pipelines.
You can also scale each pipeline based on its load. If the metrics traffic grows, you scale just the metrics pipeline. And when something breaks, isolation makes it easier to debug; you know exactly where to look without guessing which part of the system caused the issue.
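Here's a minimal sketch of that isolation in a Collector config: three pipelines share one OTLP receiver but run different processor chains, so a slow or noisy signal doesn't stall the others. The probabilistic_sampler rate and the otlphttp endpoint are illustrative assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    limit_percentage: 80
    spike_limit_percentage: 20
    check_interval: 2s
  probabilistic_sampler:
    sampling_percentage: 10   # illustrative: keep 10% of traces
  batch: {}

exporters:
  otlphttp:
    endpoint: https://otlp.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```

If the log pipeline backs up, only the logs pipeline's queues grow; the traces and metrics pipelines keep exporting.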
Now the next step is understanding how those pipelines work — how data flows through them and why the order of each component matters.
Pipeline Overview
The OpenTelemetry Collector’s pipeline is a structured, performance-conscious flow designed to keep up with modern telemetry demands. From how data moves through each stage to how threads are allocated, every part is tuned for reliability under pressure.
Core Processors
NOTE: Processors in a Collector pipeline run in the order you define in your configuration. That order directly affects performance and reliability.
Here’s a typical production sequence:
```yaml
processors: [memory_limiter, resourcedetection, attributes, batch]
```
These aren’t the bare minimum, but they’re usually the first set you add when moving beyond defaults, especially in production workloads.
Here’s what each essential processor does:
- memory_limiter: Prevents the Collector from running out of memory by capping total usage and allowing for controlled spikes.

Absolute Memory Limits:

```yaml
processors:
  memory_limiter:
    limit_mib: 2048       # Hard limit: 2 GiB
    spike_limit_mib: 512  # Soft limit kicks in at limit minus spike = 1536 MiB
    check_interval: 2s
```

Percentage-Based Limits (Recommended):

```yaml
processors:
  memory_limiter:
    limit_percentage: 80        # Hard limit: 80% of available system memory
    spike_limit_percentage: 20  # Soft limit at 60%; the spike value must stay below the limit
    check_interval: 2s
```

Choosing the Right Percentage:
- 70%: More conservative, triggers earlier, good for shared environments
- 80%: Balanced approach, common in production
- 90%: Aggressive, maximizes memory use, but higher risk of OOM
How Memory Limiting Works:
- The Collector starts rejecting new data when limits are hit
- Existing data continues processing
- Memory pressure reduces as data flows out
- Normal operation resumes when below the limits
Why use percentages: They automatically adapt to different deployment environments without manual tuning, making your configuration more portable across different infrastructures.
- resourcedetection: Auto-detects metadata from the environment, like cloud provider, region, or Kubernetes namespace.

```yaml
processors:
  resourcedetection:
    detectors: [env, system, docker, kubernetes]
```

Beyond YAML: Visual Data Transformation
Managing complex transformation rules across environments gets messy with YAML alone. Last9’s Control Plane offers Extract/Remap/Sensitive Data through a visual interface:
- Extract: Pull specific fields from telemetry
- Remap: Transform field names and values
- Sensitive Data: Auto-detect and redact PII, API keys
- Fanout: Route data to different backends
Use YAML for simple, static rules. Use the UI for complex logic, team collaboration, and frequent changes.
- attributes: Gives you direct control over which fields to keep, drop, rename, or enrich.

Typical uses:
- Remove sensitive fields (e.g., auth headers)
- Normalize service names
- Add deployment metadata from environment variables
```yaml
processors:
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: deployment.environment
        value: ${env:ENVIRONMENT}
        action: upsert
```

This processor is your main tool for enforcing naming consistency and privacy policies across telemetry data.
- batch: Groups telemetry before export for better throughput and efficiency. Always place this last.

```yaml
processors:
  batch:
    timeout: 5s
    send_batch_size: 512
```
Specialized Processors
Beyond the core processors, the Collector offers specialized processors for complex telemetry scenarios.
- transform: Complex Logic, Declaratively

The transform processor is more flexible and expressive than attributes. It uses the OpenTelemetry Transformation Language (OTTL) to apply conditions, regexes, and logic for advanced manipulation.

Example use cases:
- Redact PII in span names, log messages, and metric labels
- Tag timeouts in error spans
- Modify metric names conditionally
- Filter sensitive data across all telemetry types
```yaml
processors:
  transform:
    # Redact PII in span names
    trace_statements:
      - context: span
        statements:
          - set(name, "redacted") where IsMatch(name, ".*user_id.*")
    # Redact PII in log messages
    log_statements:
      - context: log
        statements:
          - set(body, "user email redacted") where IsMatch(body, ".*@.*")
    # Redact PII in metric datapoint labels
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user_id") where attributes["user_id"] != nil
```

Use transform when basic add/delete/update actions aren't enough.
- resource: Manual Metadata Control

While resourcedetection auto-discovers metadata, the resource processor lets you explicitly add or override attributes:

```yaml
processors:
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: service.version
        value: ${env:APP_VERSION}
        action: insert
```

Perfect for enforcing specific tagging policies or adding deployment-specific context.
Choosing the Right Processor
| Purpose | Processor |
|---|---|
| Prevent crashes during telemetry spikes | memory_limiter |
| Auto-attach environment/infrastructure metadata | resourcedetection |
| Manually add or override metadata | resource |
| Basic data cleanup and tagging | attributes |
| Advanced transformations and conditional logic | transform |
Scaling and Performance Optimization
- Concurrency and Threading

The Collector processes telemetry in parallel, and each pipeline component is designed to handle data concurrently. This gives you fine control over throughput and system behavior.
Here’s what that looks like across the pipeline:
- Receivers: Usually single-threaded per endpoint. Keeps protocol handling predictable and resource usage low.
- Processors: Can run in parallel, often scaled to the number of CPU cores. Useful for compute-heavy tasks like parsing logs, modifying spans, or applying sampling.
- Exporters: Support multiple workers for concurrent export. You can tune batch sizes and worker counts to handle spikes in telemetry volume.
Example config:
```yaml
processors:
  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlp:
    sending_queue:
      num_consumers: 16   # Parallel workers
      queue_size: 5000    # Buffer size
```

This setup lets the Collector absorb high-throughput workloads without becoming a bottleneck, especially during spikes or failover events.
For receivers, concurrency options depend on the protocol. For example, the otlp receiver handles incoming gRPC or HTTP requests with built-in concurrency, but it doesn't expose worker config. Some receivers, like prometheus, let you control scrape concurrency or intervals. It's not one-size-fits-all, but tuning receivers usually involves scrape settings, buffer sizes, or connection limits rather than explicit worker counts.
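As an illustration of receiver-side tuning, here's a sketch of a prometheus receiver where the scrape interval and target list are the knobs you'd adjust; the job name and target address are placeholder assumptions.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector-self   # placeholder job name
          scrape_interval: 30s            # longer interval = less load per scrape cycle
          static_configs:
            - targets: ["0.0.0.0:8888"]   # the Collector's own metrics endpoint by default
```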
- Backpressure and Flow Control

Backpressure in a telemetry pipeline isn't just a performance issue; it's a risk factor for data loss. Here's how that pressure propagates:

If the exporter queue fills up, perhaps because the backend is slow or unreachable, that pressure pushes back through the processor queue and eventually blocks the receivers. When receivers can't take in new data, telemetry starts getting dropped at the source.

This is where the memory_limiter earns its keep. It doesn't just cap memory use; it rejects data early to prevent the entire pipeline from stalling.

In production, you'll usually spot backpressure through growing queues, higher memory use, or an increase in dropped data. These are your clues for where to scale or tune before things start breaking.
Extensions: Health Checks, Profiling, and Debugging
Beyond telemetry flow, the Collector supports extensions that provide critical operational hooks for health monitoring, diagnostics, and security. These aren’t part of the main pipeline but are essential in production environments.
- Health Check: For Load Balancers and Orchestration

The health_check extension exposes a simple HTTP endpoint to monitor the Collector's health. It's used by Kubernetes probes, service meshes, or load balancers to verify liveness.

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    path: /health
    check_collector_pipeline:
      enabled: true
      interval: 5s
      exporter_failure_threshold: 5
```

This ensures your Collector isn't just up, but working, with its pipelines and exporters in good shape.
- Performance Profiling (pprof)

Enabling pprof gives you low-level performance insight: CPU usage, memory allocation, and blocking operations.

```yaml
extensions:
  pprof:
    endpoint: localhost:1777
```

Use this during staging or incident debugging. But in production, keep it locked down or disabled to avoid exposure and runtime overhead.
- zPages: Real-Time Debugging

zPages offers real-time visibility into the Collector's internals. It's a web UI that shows span processing, queue stats, and service health.

```yaml
extensions:
  zpages:
    endpoint: localhost:55679
```

Useful endpoints:

- /debug/pipelinez: Status of pipelines
- /debug/servicez: Service health
This can speed up troubleshooting when something feels “off” in the pipeline.
Troubleshooting Production Issues
Configuration Issues
Wrong processor order can break your pipeline in subtle ways:
- Using batch too early: You'll batch data that gets dropped or modified later, wasting compute and memory
- Skipping memory_limiter: Your Collector will hit resource limits and crash without warning
- Putting attributes after batch: Attributes won't be applied to already-batched data
- Missing resourcedetection early: Metadata won't be available for downstream processors that need it
- Golden rule: Always sequence as memory_limiter → detection/enrichment → filtering/sampling → batch (a sketch of this ordering follows the list)
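As a sketch of the golden rule in practice, here's one way a trace pipeline might order its processors. The probabilistic_sampler stands in for whatever filtering or sampling step you use, and the sampling rate, limits, and backend endpoint are arbitrary assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:             # 1. Protect the Collector first
    limit_percentage: 80
    spike_limit_percentage: 20
    check_interval: 2s
  resourcedetection:          # 2. Detection / enrichment
    detectors: [env, system]
  attributes:                 #    ...plus cleanup and tagging
    actions:
      - key: http.request.header.authorization
        action: delete
  probabilistic_sampler:      # 3. Filtering / sampling (25% is arbitrary)
    sampling_percentage: 25
  batch: {}                   # 4. Batch last, right before export

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, attributes, probabilistic_sampler, batch]
      exporters: [otlp]
```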
Runtime Issues
- Memory Leaks and High Usage

What you'll see: Collectors crashing randomly, getting killed, or just slowing down when traffic spikes or after running for a while.

Keep an eye on:

- otelcol_process_memory_rss: tracks actual memory used by the Collector. Sudden jumps? It could be a leak.
- otelcol_process_virtual_memory: watches virtual memory, including mapped files, which can also creep up.

What to do: Use the memory_limiter processor to keep memory usage in check, ideally just below your container or pod limits.

```yaml
processors:
  memory_limiter:
    limit_mib: 2048
    spike_limit_mib: 512
    check_interval: 2s
```

Also, try running heap profiling (pprof) in staging to spot where memory is leaking or hogging resources.
- Export Failures and Retry Overhead

What you'll see: Your backends aren't getting data, or the Collector's CPU spikes because it's hammering retries.

Keep an eye on:

- otelcol_exporter_send_failed_*: Counts how many export attempts failed for each backend.
- otelcol_exporter_queue_size: Growing queues mean data is backing up and not flowing out fast enough.
What to do: Set up retry policies and queues carefully for each exporter. Use circuit breakers if your backend goes down, so the Collector doesn’t drown in retries.
Also, enable debug logging for exporters to figure out if failures are random glitches or something bigger.
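Here's a sketch of retry and queue settings on an OTLP exporter; the specific numbers are assumptions you'd tune per backend, and the endpoint is a placeholder.

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend endpoint
    retry_on_failure:
      enabled: true
      initial_interval: 5s     # wait before the first retry
      max_interval: 30s        # cap the backoff between retries
      max_elapsed_time: 300s   # give up on a batch after 5 minutes
    sending_queue:
      enabled: true
      num_consumers: 10        # parallel export workers
      queue_size: 5000         # how much data to buffer while the backend is slow or down
```

If the queue fills before the backend recovers, new data gets dropped, so size it against the outage window you expect to ride out.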
- Backpressure and Dropped Data

What you'll see: Data disappearing or delayed, receivers slowing down, or outright rejecting telemetry.

Keep an eye on:

- otelcol_receiver_refused_spans, otelcol_receiver_refused_metrics, otelcol_receiver_refused_logs: These show when receivers are overloaded and rejecting data.
- otelcol_exporter_queue_size: Indicates if exporter queues are filling up, which can lead to dropped data or lag.
What to do: Batch and compress telemetry to reduce payload size. Make sure your batch processor is the last step before exporting.
If CPU or memory use stays above 70%, it’s time to scale out horizontally.
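A quick sketch of the batch-then-compress advice, showing only the relevant processor and exporter fragments; the otlphttp endpoint is a placeholder, and gzip is one of the compression options the OTLP exporters accept.

```yaml
processors:
  batch:
    send_batch_size: 1024   # group more telemetry items per outgoing request
    timeout: 5s

exporters:
  otlphttp:
    endpoint: https://otlp.example.com   # placeholder backend
    compression: gzip                    # compress payloads on the wire
```

Larger batches mean fewer, larger requests, which also tend to compress better.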
Debugging Workflow
When things get tricky, turn up the Collector’s logging to debug mode to see what’s going on behind the scenes. Pair this with a logging exporter for plain-text insights.
```yaml
service:
  telemetry:
    logs:
      level: debug

exporters:
  logging:
    loglevel: debug
    sampling_initial: 2
    sampling_thereafter: 500
```
Don’t forget to use sampling so your logs don’t flood with data, but still show enough to catch problems.
Using Last9 with OTel Collector
The OpenTelemetry Collector gives you the control and flexibility you need for production observability. Both direct SDK exports and Collector-based approaches have their place - choose based on your specific needs.
Getting started with Last9 takes about 5 minutes. The Collector integrates easily using Prometheus remote write — just sign up, grab your credentials, and update your Collector config. Follow our OpenTelemetry Collector integration guide to get up and running.
This gives you enterprise-grade observability without the complexity of managing multiple vendor integrations or the performance overhead of direct SDK exports hitting multiple backends.
In our next part, we’ll explore designing reliable telemetry, covering logs, metrics, traces, plus signal modeling, cardinality, correlation, and alert design.