In earlier posts, we looked at what OpenTelemetry is and how it helps you generate traces, metrics, and logs from your code. But collecting telemetry isn’t the hard part. Managing it, without ballooning costs, silos, or manual setup, is where most teams get stuck. This is where infrastructure teams lean on the OpenTelemetry Collector.
You need a way to move telemetry across services, enrich it, and shape it for downstream systems, without overwhelming your network or backend. And with today’s setups spanning batch jobs, AI inference layers, and cross-region components, that middle layer matters more than ever. The Collector is built to handle this.
What is the OpenTelemetry Collector?
The OpenTelemetry Collector is a vendor-agnostic telemetry pipeline that sits between your applications and observability backends. While you could send telemetry directly from your SDKs to backends, you’ll want the Collector when you need to transform data, reduce costs through sampling, or avoid vendor lock-in.
It accepts data from multiple sources, processes it, and ships it to one or more destinations. It’s like a switchboard operator receiving, transforming, and forwarding telemetry without hard-wiring any app to a specific backend.
It supports common protocols and formats — OTLP, Jaeger, Zipkin, Prometheus, FluentBit, and others, and can export to systems like Last9, Kafka, Prometheus, Jaeger, etc. This means your services can stay loosely coupled to the backend stack, which makes upgrades, migrations, and experimentation a lot easier.
What makes the Collector especially useful is its modular design. Instead of a one-size-fits-all agent, it gives you building blocks to create custom pipelines, tailored to your infrastructure, traffic patterns, and cost limits.
Where Does the Collector Fit?
The Collector can run in a few different ways, depending on how your infrastructure is laid out:
- Agent Mode: Deployed alongside each app (usually as a sidecar or daemon). Useful for injecting host-level metadata, filtering junk early, and reducing network chatter. Trade-off: lower latency and better reliability, but higher resource overhead per service.
- Gateway Mode: A central service that receives telemetry from multiple sources. Better suited for batching, routing, and processing at scale. Trade-off: better for compliance and cost control, but creates a single point of failure and potential bottleneck.
- Hybrid Mode: Combines the two. Agents handle local tasks, while the gateway takes care of the heavy lifting. This is the go-to setup for larger, distributed systems where you need both local processing and centralized control (a minimal agent-to-gateway config sketch follows this list).
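To make the agent and gateway roles concrete, here's a minimal sketch of an agent-mode Collector that receives OTLP locally and forwards everything to a central gateway. The gateway address otel-gateway.internal:4317 and the resource limits are placeholder assumptions, not values from this post.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  memory_limiter:
    limit_percentage: 75
    spike_limit_percentage: 15
    check_interval: 2s
  batch:
    timeout: 5s

exporters:
  otlp:
    # Placeholder address: point this at your gateway Collector
    endpoint: otel-gateway.internal:4317
    tls:
      insecure: true   # assumes plaintext inside the cluster; enable TLS in production

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```

The gateway side would run a similar config, but with the heavier processing (sampling, transformation) and the exporters that point at your actual backends.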
How the Collector Works: Pipelines
The OpenTelemetry Collector runs on a simple but flexible pipeline model: receivers bring data in, processors shape it, and exporters send it out. Connectors are optional; they come into play when you need to route data between multiple pipelines, for example, converting traces to metrics or reusing telemetry in different workflows.
Here’s what each stage does:
- Receivers: Entry points for telemetry. They support protocols like OTLP, Jaeger, Zipkin, and Prometheus. Once data is ingested, it's translated into a consistent format.
- Processors: This is where data gets shaped. Processors can be chained together to drop unused or noisy data, enrich or modify metadata, apply sampling to reduce volume, or batch and compress payloads for efficiency. It's the tuning stage, where you optimize signals for clarity, cost, and performance.
- Exporters: Responsible for sending data out. Whether it's Last9, Prometheus, or Kafka, exporters handle delivery, retries, and any protocol conversions.
- Connectors: Route data between multiple pipelines, for example, converting traces to metrics or reusing telemetry across workflows.
- Extensions: Not in the core data path, but essential for operations. Extensions handle things like health checks, TLS, authentication, and debug endpoints like /metrics, /debug/pprof, and zPages.
Put together, these building blocks turn the Collector into your observability control plane, tuned for the kind of modern workloads that break traditional monitoring setups.
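To make the connector idea concrete, here's a sketch using the spanmetrics connector from the contrib distribution: it acts as an exporter on the traces pipeline and as a receiver on the metrics pipeline, deriving request-rate and latency metrics from spans. The backend endpoint is a placeholder.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

connectors:
  spanmetrics: {}   # derives call count and duration metrics from incoming spans

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://otlp.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp, spanmetrics]   # the connector sits on the exporter side here
    metrics:
      receivers: [spanmetrics]             # ...and on the receiver side here
      processors: [batch]
      exporters: [otlphttp]
```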
OTel Collector Pipelines: Types and Boundaries
When working with the Collector, it’s easy to focus on wiring together receivers, processors, and exporters. But an important design choice underpins all of this: traces, metrics, and logs are handled separately.
For example, trace pipelines may use sampling, while metrics pipelines often batch or aggregate data. This separation prevents data loss and instability, especially under load.
Here’s how pipeline isolation plays out in practice:
- Trace pipelines: These deal with high cardinality and volume. You'll often want to enrich spans with metadata, apply sampling to reduce noise, and batch them before export. Done right, this keeps your tracing backend from getting overwhelmed.
- Metrics pipelines: Metrics are more predictable, but they pile up fast, especially with containerized apps or scraping-heavy setups. The focus here tends to be on detecting resources, aggregating time series, and batching for efficient export.
- Log pipelines: Logs are messy: unstructured text, multiline events, and noisy debug statements. These pipelines often need parsing, filtering, formatting, or even redaction processors before the logs are usable.
Keeping pipelines isolated gives you a few key advantages.
For example, if there’s a sudden spike in log volume, say, someone enables verbose debug logs, the log pipeline might slow down or drop data. But your traces and metrics continue to flow just fine, because they’re running in separate pipelines.
You can also scale each pipeline based on its load. If the metrics traffic grows, you scale just the metrics pipeline. And when something breaks, isolation makes it easier to debug; you know exactly where to look without guessing which part of the system caused the issue.
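Here's a minimal sketch of that isolation in a Collector config: three pipelines share one OTLP receiver but run different processor chains, so a slow or noisy signal doesn't stall the others. The probabilistic_sampler rate and the otlphttp endpoint are illustrative assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    limit_percentage: 80
    spike_limit_percentage: 20
    check_interval: 2s
  probabilistic_sampler:
    sampling_percentage: 10   # illustrative: keep 10% of traces
  batch: {}

exporters:
  otlphttp:
    endpoint: https://otlp.example.com   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlphttp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```

If the log pipeline backs up, only the logs pipeline's queues grow; the traces and metrics pipelines keep exporting.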
Now the next step is understanding how those pipelines work — how data flows through them and why the order of each component matters.
Pipeline Overview
The OpenTelemetry Collector’s pipeline is a structured, performance-conscious flow designed to keep up with modern telemetry demands. From how data moves through each stage to how threads are allocated, every part is tuned for reliability under pressure.
Core Processors
NOTE: Processors in a Collector pipeline run in the order you define in your configuration. That order directly affects performance and reliability.
Here’s a typical production sequence:
```yaml
processors: [memory_limiter, resourcedetection, attributes, batch]
```
These aren’t the bare minimum, but they’re usually the first set you add when moving beyond defaults, especially in production workloads.
Here’s what each essential processor does:
- memory_limiter: Prevents the Collector from running out of memory by capping total usage and allowing for controlled spikes.

Absolute Memory Limits:

```yaml
processors:
  memory_limiter:
    limit_mib: 2048       # Hard limit: 2 GiB
    spike_limit_mib: 512  # Soft limit kicks in at limit minus spike = 1536 MiB
    check_interval: 2s
```

Percentage-Based Limits (Recommended):

```yaml
processors:
  memory_limiter:
    limit_percentage: 80        # Hard limit: 80% of available system memory
    spike_limit_percentage: 20  # Soft limit at 60%; the spike value must stay below the limit
    check_interval: 2s
```

Choosing the Right Percentage:
- 70%: More conservative, triggers earlier, good for shared environments
- 80%: Balanced approach, common in production
- 90%: Aggressive, maximizes memory use, but higher risk of OOM
How Memory Limiting Works:
- The Collector starts rejecting new data when limits are hit
- Existing data continues processing
- Memory pressure reduces as data flows out
- Normal operation resumes when below the limits
Why use percentages: They automatically adapt to different deployment environments without manual tuning, making your configuration more portable across different infrastructures.
- resourcedetection: Auto-detects metadata from the environment, like cloud provider, region, or Kubernetes namespace.

```yaml
processors:
  resourcedetection:
    detectors: [env, system, docker, kubernetes]
```

Beyond YAML: Visual Data Transformation
Managing complex transformation rules across environments gets messy with YAML alone. Last9’s Control Plane offers Extract/Remap/Sensitive Data through a visual interface:
- Extract: Pull specific fields from telemetry
- Remap: Transform field names and values
- Sensitive Data: Auto-detect and redact PII, API keys
- Fanout: Route data to different backends
Use YAML for simple, static rules. Use the UI for complex logic, team collaboration, and frequent changes.
- attributes: Gives you direct control over which fields to keep, drop, rename, or enrich.

Typical uses:
- Remove sensitive fields (e.g., auth headers)
- Normalize service names
- Add deployment metadata from environment variables
```yaml
processors:
  attributes:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: deployment.environment
        value: ${env:ENVIRONMENT}
        action: upsert
```

This processor is your main tool for enforcing naming consistency and privacy policies across telemetry data.
- batch: Groups telemetry before export for better throughput and efficiency. Always place this last.

```yaml
processors:
  batch:
    timeout: 5s
    send_batch_size: 512
```
Specialized Processors
Beyond the core processors, the Collector offers specialized processors for complex telemetry scenarios.
- transform: Complex Logic, Declaratively

The transform processor is more flexible and expressive than attributes. It uses the OpenTelemetry Transformation Language (OTTL) to apply conditions, regexes, and logic for advanced manipulation.

Example use cases:
- Redact PII in span names, log messages, and metric labels
- Tag timeouts in error spans
- Modify metric names conditionally
- Filter sensitive data across all telemetry types
```yaml
processors:
  transform:
    # Redact PII in span names
    trace_statements:
      - context: span
        statements:
          - set(name, "redacted") where IsMatch(name, ".*user_id.*")
    # Redact PII in log messages
    log_statements:
      - context: log
        statements:
          - set(body, "user email redacted") where IsMatch(body, ".*@.*")
    # Redact PII in metric datapoint labels
    metric_statements:
      - context: datapoint
        statements:
          - delete_key(attributes, "user_id") where attributes["user_id"] != nil
```

Use transform when basic add/delete/update actions aren't enough.
- resource: Manual Metadata Control

While resourcedetection auto-discovers metadata, the resource processor lets you explicitly add or override attributes:

```yaml
processors:
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
      - key: service.version
        value: ${env:APP_VERSION}
        action: insert
```

Perfect for enforcing specific tagging policies or adding deployment-specific context.
Choosing the Right Processor
| Purpose | Processor |
|---|---|
| Prevent crashes during telemetry spikes | memory_limiter |
| Auto-attach environment/infrastructure metadata | resourcedetection |
| Manually add or override metadata | resource |
| Basic data cleanup and tagging | attributes |
| Advanced transformations and conditional logic | transform |
Scaling and Performance Optimization
- Concurrency and Threading

The Collector processes telemetry in parallel, and each pipeline component is designed to handle data concurrently. This gives you fine control over throughput and system behavior.
Here’s what that looks like across the pipeline:
- Receivers: Usually single-threaded per endpoint. Keeps protocol handling predictable and resource usage low.
- Processors: Can run in parallel, often scaled to the number of CPU cores. Useful for compute-heavy tasks like parsing logs, modifying spans, or applying sampling.
- Exporters: Support multiple workers for concurrent export. You can tune batch sizes and worker counts to handle spikes in telemetry volume.
Example config:
```yaml
processors:
  batch:
    send_batch_size: 8192
    timeout: 200ms

exporters:
  otlp:
    sending_queue:
      num_consumers: 16   # Parallel workers
      queue_size: 5000    # Buffer size
```

This setup lets the Collector absorb high-throughput workloads without becoming a bottleneck, especially during spikes or failover events.
For receivers, concurrency options depend on the protocol. For example, the otlp receiver handles incoming gRPC or HTTP requests with built-in concurrency, but it doesn't expose worker config. Some receivers, like prometheus, let you control scrape concurrency or intervals. It's not one-size-fits-all, but tuning receivers usually involves scrape settings, buffer sizes, or connection limits rather than explicit worker counts.
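As an illustration of receiver-side tuning, here's a sketch of a prometheus receiver where the scrape interval and target list are the knobs you'd adjust; the job name and target address are placeholder assumptions.

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: otel-collector-self   # placeholder job name
          scrape_interval: 30s            # longer interval = less load per scrape cycle
          static_configs:
            - targets: ["0.0.0.0:8888"]   # the Collector's own metrics endpoint by default
```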
- Backpressure and Flow Control

Backpressure in a telemetry pipeline isn't just a performance issue; it's a risk factor for data loss. Here's how that pressure propagates:

If the exporter queue fills up, perhaps because the backend is slow or unreachable, that pressure pushes back through the processor queue and eventually blocks the receivers. When receivers can't take in new data, telemetry starts getting dropped at the source.

This is where the memory_limiter earns its keep. It doesn't just cap memory use; it rejects data early to prevent the entire pipeline from stalling.

In production, you'll usually spot backpressure through growing queues, higher memory use, or an increase in dropped data. These are your clues for where to scale or tune before things start breaking.
Extensions: Health Checks, Profiling, and Debugging
Beyond telemetry flow, the Collector supports extensions that provide critical operational hooks for health monitoring, diagnostics, and security. These aren’t part of the main pipeline but are essential in production environments.
- Health Check: For Load Balancers and Orchestration

The health_check extension exposes a simple HTTP endpoint to monitor the Collector's health. It's used by Kubernetes probes, service meshes, or load balancers to verify liveness.

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    path: /health
    check_collector_pipeline:
      enabled: true
      interval: 5s
      exporter_failure_threshold: 5
```

This ensures your Collector isn't just up, but working, with its pipelines and exporters in good shape.
- Performance Profiling (pprof)

Enabling pprof gives you low-level performance insight: CPU usage, memory allocation, and blocking operations.

```yaml
extensions:
  pprof:
    endpoint: localhost:1777
```

Use this during staging or incident debugging. But in production, keep it locked down or disabled to avoid exposure and runtime overhead.
- zPages: Real-Time Debugging

zPages offers real-time visibility into the Collector's internals. It's a web UI that shows span processing, queue stats, and service health.

```yaml
extensions:
  zpages:
    endpoint: localhost:55679
```

Useful endpoints:

- /debug/pipelinez: Status of pipelines
- /debug/servicez: Service health
This can speed up troubleshooting when something feels “off” in the pipeline.
Troubleshooting Production Issues
Configuration Issues
Wrong processor order can break your pipeline in subtle ways:
- Using batch too early: You'll batch data that gets dropped or modified later, wasting compute and memory
- Skipping memory_limiter: Your Collector will hit resource limits and crash without warning
- Putting attributes after batch: Attributes won't be applied to already-batched data
- Missing resourcedetection early: Metadata won't be available for downstream processors that need it
- Golden rule: Always sequence as memory_limiter → detection/enrichment → filtering/sampling → batch (a sketch of this ordering follows the list)
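As a sketch of the golden rule in practice, here's one way a trace pipeline might order its processors. The probabilistic_sampler stands in for whatever filtering or sampling step you use, and the sampling rate, limits, and backend endpoint are arbitrary assumptions.

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:             # 1. Protect the Collector first
    limit_percentage: 80
    spike_limit_percentage: 20
    check_interval: 2s
  resourcedetection:          # 2. Detection / enrichment
    detectors: [env, system]
  attributes:                 #    ...plus cleanup and tagging
    actions:
      - key: http.request.header.authorization
        action: delete
  probabilistic_sampler:      # 3. Filtering / sampling (25% is arbitrary)
    sampling_percentage: 25
  batch: {}                   # 4. Batch last, right before export

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, attributes, probabilistic_sampler, batch]
      exporters: [otlp]
```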
Runtime Issues
- Memory Leaks and High Usage

What you'll see: Collectors crashing randomly, getting killed, or just slowing down when traffic spikes or after running for a while.

Keep an eye on:

- otelcol_process_memory_rss: tracks actual memory used by the Collector. Sudden jumps? It could be a leak.
- otelcol_process_virtual_memory: watches virtual memory, including mapped files, which can also creep up.

What to do: Use the memory_limiter processor to keep memory usage in check, ideally just below your container or pod limits.

```yaml
processors:
  memory_limiter:
    limit_mib: 2048
    spike_limit_mib: 512
    check_interval: 2s
```

Also, try running heap profiling (pprof) in staging to spot where memory is leaking or hogging resources.
- Export Failures and Retry Overhead

What you'll see: Your backends aren't getting data, or the Collector's CPU spikes because it's hammering retries.

Keep an eye on:

- otelcol_exporter_send_failed_*: Counts how many export attempts failed for each backend.
- otelcol_exporter_queue_size: Growing queues mean data is backing up and not flowing out fast enough.
What to do: Set up retry policies and queues carefully for each exporter. Use circuit breakers if your backend goes down, so the Collector doesn’t drown in retries.
Also, enable debug logging for exporters to figure out if failures are random glitches or something bigger.
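Here's a sketch of retry and queue settings on an OTLP exporter; the specific numbers are assumptions you'd tune per backend, and the endpoint is a placeholder.

```yaml
exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend endpoint
    retry_on_failure:
      enabled: true
      initial_interval: 5s     # wait before the first retry
      max_interval: 30s        # cap the backoff between retries
      max_elapsed_time: 300s   # give up on a batch after 5 minutes
    sending_queue:
      enabled: true
      num_consumers: 10        # parallel export workers
      queue_size: 5000         # how much data to buffer while the backend is slow or down
```

If the queue fills before the backend recovers, new data gets dropped, so size it against the outage window you expect to ride out.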
- Backpressure and Dropped Data

What you'll see: Data disappearing or delayed, receivers slowing down, or outright rejecting telemetry.

Keep an eye on:

- otelcol_receiver_refused_spans, otelcol_receiver_refused_metrics, otelcol_receiver_refused_logs: These show when receivers are overloaded and rejecting data.
- otelcol_exporter_queue_size: Indicates if exporter queues are filling up, which can lead to dropped data or lag.
What to do: Batch and compress telemetry to reduce payload size. Make sure your batch processor is the last step before exporting.
If CPU or memory use stays above 70%, it’s time to scale out horizontally.
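A quick sketch of the batch-then-compress advice, showing only the relevant processor and exporter fragments; the otlphttp endpoint is a placeholder, and gzip is one of the compression options the OTLP exporters accept.

```yaml
processors:
  batch:
    send_batch_size: 1024   # group more telemetry items per outgoing request
    timeout: 5s

exporters:
  otlphttp:
    endpoint: https://otlp.example.com   # placeholder backend
    compression: gzip                    # compress payloads on the wire
```

Larger batches mean fewer, larger requests, which also tend to compress better.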
Debugging Workflow
When things get tricky, turn up the Collector’s logging to debug mode to see what’s going on behind the scenes. Pair this with a logging exporter for plain-text insights.
```yaml
service:
  telemetry:
    logs:
      level: debug

exporters:
  logging:
    loglevel: debug
    sampling_initial: 2
    sampling_thereafter: 500
```
Don’t forget to use sampling so your logs don’t flood with data, but still show enough to catch problems.
Using Last9 with OTel Collector
The OpenTelemetry Collector gives you the control and flexibility you need for production observability. Both direct SDK exports and Collector-based approaches have their place - choose based on your specific needs.
Getting started with Last9 takes about 5 minutes. The Collector integrates easily using Prometheus remote write — just sign up, grab your credentials, and update your Collector config. Follow our OpenTelemetry Collector integration guide to get up and running.
This gives you enterprise-grade observability without the complexity of managing multiple vendor integrations or the performance overhead of direct SDK exports hitting multiple backends.
In our next part, we’ll explore designing reliable telemetry, covering logs, metrics, traces, plus signal modeling, cardinality, correlation, and alert design.