You have started with OpenTelemetry in a single service — maybe a staging API, a background worker, or a low-traffic job. The setup is straightforward: a couple of receivers, a batch processor with safe defaults, and one OTLP exporter to the backend. Memory usage is stable, CPU overhead is predictable, and you can afford to trace everything without sampling.
As you roll it out further, the scope changes. Now, it’s over 50 services across multiple clusters and regions, each with its own unique load profile. Some collectors handle sustained API traffic at tens of thousands of spans per second; others ingest bursty batches from scheduled jobs. Exporters need to send data to multiple destinations — some over mTLS-secured OTLP, others via Kafka for durability.
At this point, new variables appear in the pipeline:
- Configuration drift that drops resource attributes in one region but not another
- Batch processor settings that work for one workload but trigger memory restarts in another
- Exporters that stall under sustained load, leading to dropped spans or growing queues
Solving these case-by-case works in the moment, but over time, small differences between environments compound. Keeping the pipeline reliable at this scale means applying patterns that maintain consistency, handle variable workloads, and scale without sacrificing visibility.
In this part of the OTel series, we talk about the production rollout strategies that hold up across environments, configuration-as-code practices that work for multi-team setups, and techniques for sustaining performance while retaining full telemetry coverage.
The Real Performance Impact
Before deciding how to deploy OpenTelemetry at scale, let’s understand the typical overhead you can expect — so you can size collectors, tune processors, and plan capacity with intent.
Coroot’s load tests, summarized by InfoQ, measured the overhead of continuous tracing on a minimal HTTP service handling ~50k requests/sec:
- CPU — Around 35% higher usage compared to baseline, even with spare CPU capacity on the node.
- Memory — RSS increased by 5–8 MB, mainly from batching and buffering spans before export.
- Latency — P99 latency rose from ~10 ms to ~15 ms under sustained load.
- Network — Approximately 4 MB/s additional outbound traffic when sending unsampled, full request-level traces.
These numbers give you a concrete starting point. With this, you can design your rollout to:
- Set batch sizes that balance compression gains with memory headroom
- Apply sampling strategies where they deliver the most cost/performance benefit
- Decide if an agent-only setup will hold, or if you need gateway tiers for heavier processing
- Model horizontal scale and memory allocation before workloads hit production
OpenTelemetry Deployment Models at Scale
When you plan an OpenTelemetry rollout, you’ll usually see three main deployment approaches in production:
1. All-in-One Collector
In this pattern, all applications send telemetry directly to a single, centralized collector service. That collector is responsible for receiving, processing, and exporting all telemetry — metrics, logs, and traces — to one or more backends.
It’s the easiest to get started with:
- One deployment to manage
- One set of configuration files to maintain
- No coordination between multiple collectors or tiers
Because there’s only a single collector service, it’s well-suited for:
- Early-stage rollouts in staging or pre-production
- Small production environments where traffic patterns are predictable
- Single-team ownership of both applications and the telemetry pipeline
- Workloads that don’t require advanced routing, multi-tenant isolation, or complex sampling
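A minimal all-in-one configuration looks something like the sketch below (the backend endpoint and limits are illustrative placeholders; adjust them to your backend and node sizing):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    limit_mib: 512
    spike_limit_mib: 128
    check_interval: 5s
  batch:
    send_batch_size: 1024
    timeout: 200ms

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend address
    compression: gzip

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```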
As traffic grows, the all-in-one collector becomes the single choke point in your pipeline. Processing, batching, compression, and exporting all happen on the same instance, so CPU, memory, and queue depth all scale together.
Signs you’re approaching the limits
Instead of relying on fixed service counts or throughput numbers, watch for operational signals:
- Queue utilisation — exporter or processor queues consistently running at 60–70% capacity for extended periods
- Memory pressure — the `memory_limiter` processor starts dropping spans or metrics
- CPU saturation — sustained CPU usage leaves little headroom for bursts, often caused by encryption or heavy compression in exporters
- Export delays — flush times get longer, increasing end-to-end latency for telemetry delivery
When queues start backing up or memory pressure becomes routine, scaling a single collector vertically only buys you so much time. At that point, moving to an agent-only or agent–gateway hybrid model lets you distribute load and apply processing closer to the data source.
2. Agent-Only Collectors — Per-host resilience, distributed management
In this pattern, a lightweight collector runs alongside workloads on every node — usually as a Kubernetes DaemonSet or sidecar. Each agent batches, compresses, and enriches telemetry locally before sending it directly to the backend.
Because processing happens close to the workload, this pattern adds resilience against transient network or backend slowdowns. Exporters can retry and buffer locally without introducing a single ingestion bottleneck for the entire cluster.
When it’s a fit
- Host-level metadata is important for analysis (e.g., using the Resource Detection Processor)
- Backends can ingest directly from many nodes without requiring a central routing tier
- You want to isolate telemetry per node to contain the blast radius of any collector failure
Strengths
- Local buffering and retry — `sending_queue` and `memory_limiter` prevent short network blips from causing immediate drops
- Consistent enrichment — resource and environment tags applied at the source
- No single ingestion bottleneck — each agent operates independently
Operational trade-offs
- Configuration changes — sampling, routing, or exporter updates — must be rolled out to every agent, which increases coordination effort in larger fleets
- Without consistent configuration management, drift can emerge between nodes or clusters
- Local queues can still overflow if sustained traffic exceeds the configured limits; monitor queue occupancy and memory consumption to determine safe headroom
Here’s an example of a Kubernetes DaemonSet agent config:
```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 1500
    timeout: 200ms
  memory_limiter:
    limit_mib: 256
    spike_limit_mib: 64
    check_interval: 5s
  resource:
    attributes:
      - key: k8s.cluster.name
        value: ${K8S_CLUSTER_NAME}
        action: upsert
  resourcedetection:
    detectors: [env, system, k8snode]
    override: false

exporters:
  otlp:
    endpoint: gateway-collector.monitoring.svc:4317
    compression: gzip
    tls:
      insecure: false
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s
```
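The snippet above focuses on processors and the exporter; to run it, the agent also needs receivers and a service section wiring the components together. A minimal sketch of that wiring (component names match the config above; processor order matters, with `memory_limiter` first):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resourcedetection, resource, batch]
      exporters: [otlp]
```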
Scaling considerations
Instead of relying on fixed span/sec numbers, base scaling decisions on real telemetry from your agents:
- Queue utilisation — if `sending_queue` usage regularly exceeds 60–70%, either increase capacity or scale out
- Memory pressure — watch for the `memory_limiter` dropping data; if this happens under normal load, adjust limits or batch sizes
- Exporter throughput — monitor exporter send rates vs. backend ingest capacity; saturation here means either sampling more or adding an intermediate tier
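If you scrape the Collector's internal metrics with Prometheus, these signals translate directly into alert rules. A hedged sketch, assuming the self-telemetry metric names used elsewhere in this article and default scraping of the agents:

```yaml
groups:
  - name: otel-agent-scaling
    rules:
      - alert: OtelAgentQueueNearCapacity
        # Sustained queue occupancy above ~70% suggests scaling out or raising queue_size
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.7
        for: 15m
        labels:
          severity: warning
      - alert: OtelAgentRefusingSpans
        # memory_limiter refusing spans under normal load means limits or batch sizes need tuning
        expr: rate(otelcol_processor_refused_spans[5m]) > 0
        for: 10m
        labels:
          severity: warning
```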
3. Agent–Gateway Hybrid — Designed for scale and centralized control
In this model, agent collectors run per host or pod, performing lightweight tasks like batching, compression, and basic enrichment. They forward telemetry to a smaller fleet of gateway collectors that handle heavier, centralized processing: tail-based sampling, cross-service correlation, multi-backend routing, and global redaction/enrichment policies.
When it’s a fit
- You need to apply routing, sampling, or redaction rules across many services without modifying workloads
- Multiple backend destinations require different export formats or protocols (e.g., traces to multiple vendors, metrics via `remote_write`)
- Tail sampling or cross-trace correlation requires a global view of traffic
- You want to keep enrichment and filtering rules consistent across teams and environments
Strengths
- Centralized policy changes can be applied instantly without redeploying workloads
- Supports complex routing — for example, sending traces to two APM vendors while sending metrics to Prometheus
- Easier to test and roll out global enrichment, sampling, or redaction policies
Operational considerations
- Deploy gateways in high-availability pairs per region to prevent ingestion loss during failures or upgrades
- Validate policy and routing changes in a staging environment — a misconfiguration here can affect all incoming telemetry
- Monitor exporter and processor queue utilisation — if gateway-level queues remain consistently high for extended periods, consider scaling out gateways horizontally or increasing queue capacity.
- Watch memory limiter metrics — consistent drops indicate the need to tune processor settings or increase memory allocation
Example: Gateway collector configuration
```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133

processors:
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 5s
  memory_limiter:
    limit_mib: 1500
    spike_limit_mib: 300
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    expected_new_traces_per_sec: 100
    # Policies are evaluated independently; a trace is kept if any of them matches
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample_normal
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp:
    endpoint: https://your-backend.com
    compression: gzip
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s
```
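As with the agent example, the gateway also needs receivers and a service section; the key detail is where `tail_sampling` sits in the pipeline. A minimal sketch (the receiver endpoint is illustrative):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      # memory_limiter first, sampling before batching so only kept traces are batched
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
```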
Why this pattern scales well
By splitting lightweight collection (agents) from heavier, centralized processing (gateways), you:
- Keep per-node collectors small and efficient
- Isolate heavy operations like tail sampling from workloads
- Gain flexibility to scale collection and processing independently
Beyond Agent–Gateway
The agent–gateway pattern gives you a strong baseline: agents handle local batching and enrichment, gateways manage policy and routing. But at high volumes or across globally distributed teams, two extensions can prevent bottlenecks and improve resilience — message queues for durability, and federated collectors for regional scale.
Durable Buffering with Message Queues
When sustained throughput or sudden bursts start to push a gateway’s processing and queue limits, you risk data loss or backpressure on workloads. Placing a durable, high-throughput message bus — such as Kafka, Amazon MSK, Google Pub/Sub, or Azure Event Hubs — between agents and gateways decouples collection from aggregation.
This approach offers several benefits:
- Bursts can be absorbed without dropping data.
- Enables catch-up after downstream slowdowns.
- Lets you scale processing tiers independently of ingestion.
It works best in scenarios such as:
- Traffic regularly reaching tens or hundreds of MB/s.
- Workloads with highly variable p99 latencies or queue depths.
- Multiple backends needing the same telemetry without changing every agent’s configuration.
The trade-offs are operational: queues can still lose telemetry if retention windows expire before processing, so align retention with worst-case recovery times; consumer lag can delay visibility, which can be mitigated by scaling out consumers or optimising processing throughput; and poor topic partitioning can create uneven load, avoidable by choosing partition keys that distribute traffic evenly across brokers.
Here’s how a typical flow looks:
Application → Agent Collector → Kafka → Aggregator Collector → Backend(s)
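A sketch of the two ends of that flow, assuming the `kafka` exporter and receiver from the contrib distribution and an illustrative `otlp_spans` topic:

```yaml
# Agent side: publish spans to Kafka instead of sending to the gateway directly
exporters:
  kafka:
    brokers: ["kafka-0.broker:9092", "kafka-1.broker:9092"]   # illustrative broker addresses
    topic: otlp_spans
    encoding: otlp_proto

# Aggregator side: consume the same topic and continue the pipeline
receivers:
  kafka:
    brokers: ["kafka-0.broker:9092"]
    topic: otlp_spans
    encoding: otlp_proto
    group_id: otel-aggregators
```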
Federated Collectors for Global Scale
In multi-region environments, routing all telemetry to a single central gateway adds cross-region latency, increases transfer costs, and risks turning that gateway into a single choke point. Federated deployments solve this by introducing regional gateways — lightweight aggregation points that process data locally, apply enrichment and filtering, and then forward upstream to a global gateway.
This keeps ingestion latency low for local workloads, respects data sovereignty boundaries, and isolates failures to the affected region. It’s particularly effective in active–active architectures where regions need to operate independently but still feed into a unified backend. The challenges are mainly around governance: ensuring regional policies don’t drift, forwarding rules are correct, and upstream outages don’t silently fill buffers without alerting.
A common flow is:
Application → Agent Collector → Regional Gateway → Global Gateway → Backend(s)
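One way to express the regional hop is a resource attribute stamped at the regional gateway plus an OTLP exporter pointing at the global tier. A sketch with illustrative names:

```yaml
processors:
  resource/region:
    attributes:
      - key: deployment.region
        value: eu-west-1          # illustrative region label
        action: upsert

exporters:
  otlp/global_gateway:
    endpoint: global-gateway.observability.example.com:4317   # illustrative upstream address
    compression: gzip
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: true
      queue_size: 5000
```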
Build Predictable Collector Configurations
As OTel adoption grows, the collector's configuration effectively becomes part of your production infrastructure. Changes to batch processor sizes, memory limits, or exporter settings can have cluster-wide effects, so it's worth managing them with the same discipline as application code — generated, validated, versioned, and deployed through CI/CD.
The Collector's `validate` command is a useful safeguard for syntax and schema correctness before rollout. It won't catch every problem — for example, it can't detect circular dependencies or mismatched component types in pipelines — but when combined with consistent generation and deployment practices, it helps maintain a reliable baseline across environments.
Template-Driven Config Generation
Maintaining consistency across environments is easier when configuration is generated from a single typed source. For example, a Go or Python struct can output final YAML, ensuring that batch sizes, memory limits, exporters, and headers stay aligned without repetitive edits:
```go
type CollectorConfig struct {
	Environment   string
	Team          string
	SamplingRate  float64
	DataRetention string
	BackendURL    string
	MemoryLimitMB int
	BatchSize     int
}
```
This approach is especially valuable when:
- Multiple environments or teams require small variations
- Guardrails are needed on performance-sensitive settings
- Configuration changes must be made centrally and rolled out automatically
Deploy Collectors with Infrastructure as Code
Using IaC tools like Terraform or Helm keeps collector deployments reproducible and auditable. This ensures the same configuration patterns are applied across clusters, regions, or tiers — whether agents, gateways, or federated collectors — and makes scaling more predictable.
resource "kubernetes_deployment" "otel_collector" { count = var.collector_instances metadata { name = "otel-collector-${count.index}" namespace = var.namespace } spec { template { spec { container { name = "otelcol" image = "otel/opentelemetry-collector-contrib:${var.otel_version}" } } } }}
Validate Configurations in CI
Adding configuration validation to CI pipelines catches syntax errors, missing processors, or unintentional changes to critical settings before they reach production. The native validator can be paired with custom static checks — for example, detecting high-cardinality attributes or sensitive fields.
```bash
#!/usr/bin/env bash
CONFIG_FILE="${1:-collector.yaml}"

otelcol validate --config="${CONFIG_FILE}"

grep -E "(user_id|session_id|request_id)" "${CONFIG_FILE}" && echo "⚠️ High-cardinality attribute detected"
```
While `otelcol validate` is not a full semantic validator, it's a low-friction step that, combined with IaC and templating, helps keep your collector fleet's configuration predictable and safe to change at scale.
Reduce Telemetry Costs Without Losing Coverage
Once the pipeline is stable, the next challenge is cost. Network transfer, collector processing, and backend storage can all grow faster than expected as coverage expands. The aim is to keep the same level of visibility while eliminating unnecessary overhead.
Optimize Collector-to-Collector Links
For high-volume internal hops — such as agent → gateway or regional → global — the OTLP Arrow protocol can dramatically reduce network traffic.
In ServiceNow's production deployment, OTLP Arrow delivered a 30–70% reduction in network bandwidth compared to OTLP/gRPC with large batch sizes and zstd compression, and in controlled benchmarks it achieved up to a 10× improvement over uncompressed OTLP.
Arrow achieves this by using a columnar data format with higher compression efficiency. Deployments need Arrow-capable receivers and exporters, along with sufficient CPU and memory to process the format efficiently.
In lower-throughput or resource-constrained collectors, standard OTLP with compression can be the more practical choice — especially if network usage is already within budget and CPU headroom is limited.
When to use Arrow: Ideal for internal collector-to-collector links carrying sustained high-volume traffic, multi-backend fan-out, or long-haul regional aggregation where bandwidth savings directly reduce cost or latency.
When to use standard OTLP: Best suited for moderate-throughput pipelines, edge collectors with tight resource limits, or environments where operational simplicity outweighs maximum compression gains.
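On the wire, enabling Arrow means swapping the exporter on the internal hop (and running a matching Arrow-capable receiver on the other side). A minimal sketch, assuming the `otelarrow` exporter from the contrib distribution — check the component README for the exact options available in your version:

```yaml
# Agent or regional gateway side: send the internal hop over OTLP Arrow
exporters:
  otelarrow:
    endpoint: gateway-collector.monitoring.svc:4317
    tls:
      insecure: false
# The receiving gateway needs the corresponding otelarrow receiver in its traces pipeline
```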
Manage Fleets with OpAMP
OpAMP enables remote configuration updates, health reporting, and effective configuration retrieval for large collector fleets. Key capabilities include `AcceptsRemoteConfig`, `ReportsEffectiveConfig`, and `ReportsHealth`. This is especially valuable for multi-region deployments where consistency is critical.
OpAMP support varies significantly across collector distributions and deployment modes. The protocol specification is stable, but implementations range from beta to development status. The OpenTelemetry Collector’s OpAMP extension and supervisor modes have different maturity levels, so evaluate your specific use case and test thoroughly in non-production environments first.
```yaml
extensions:
  opamp:
    server:
      ws:
        endpoint: wss://opamp-server.company.com/v1/opamp
    capabilities:
      - AcceptsRemoteConfig
      - ReportsEffectiveConfig
      - ReportsHealth
```
Adjust Sampling Dynamically
Adaptive sampling tunes capture rates based on live traffic, backend load, or budget targets, reducing ingestion volume without dropping high-value spans. The probabilistic sampler, with its rate driven by an environment variable you can adjust at deploy time, is one straightforward way to implement this.
```yaml
processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: ${env:OTEL_SAMPLING_RATE:-10}
```
Tier Storage by Data Value
Route telemetry into hot, warm, or cold storage tiers based on retention policy or usage requirements. Recent critical data stays on low-latency storage; older or lower-priority data shifts to cheaper storage.
```yaml
processors:
  routing/storage_tier:
    from_attribute: span.retention_policy
    table:
      - value: hot
        exporters: [otlp/hot-storage]
      - value: warm
        exporters: [otlp/warm-storage]
      - value: cold
        exporters: [otlp/cold-storage]
```
Reduce Volume at the Edge
Dropping low-value telemetry at the source has a compounding cost benefit. Use the filter processor to exclude noise, the transform processor to truncate oversized attributes, and the attributes processor to hash high-cardinality identifiers that aren't needed in full.

```yaml
processors:
  filter/noise_reduction:
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'IsMatch(attributes["user_agent.original"], ".*synthetic.*")'
  transform/cardinality_reduction:
    trace_statements:
      - context: span
        statements:
          - truncate_all(attributes, 100)
  attributes/hash_identifiers:
    actions:
      - key: user.id
        action: hash
```
This keeps your telemetry pipeline lean without sacrificing the insights you rely on. The same architectural patterns that make your collectors predictable also make them cost-efficient — if you apply the right protocols, routing, and filtering from the start.
Secure and Govern Your Telemetry Pipeline
Security and compliance in OpenTelemetry is about embedding trust into every stage of the telemetry lifecycle — from collection to storage. That means encrypting data in transit, stripping or transforming sensitive fields before export, enforcing strict change controls on collector configurations, and continuously verifying that the collectors themselves are healthy.
Encrypt Telemetry in Transit
Every hop between applications, collectors, and backends should use TLS. Without encryption, any point in the network path becomes a potential inspection or injection vector. TLS not only prevents eavesdropping but also ensures integrity — data cannot be modified mid-flight without detection.
In practice, this means configuring TLS settings on both sides of an OTLP gRPC connection:
- Collectors validating backend identities
- Applications validating collector identities
Even within private networks, TLS provides a low-cost safeguard against misconfiguration or insider threats.
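In collector terms, that typically means mutual TLS settings on both the receiver and the exporter side of a hop. A sketch with illustrative certificate paths:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/certs/collector.crt   # illustrative paths
          key_file: /etc/otel/certs/collector.key
          client_ca_file: /etc/otel/certs/ca.crt     # require client certificates (mTLS)

exporters:
  otlp:
    endpoint: backend.example.com:4317               # illustrative backend
    tls:
      ca_file: /etc/otel/certs/backend-ca.crt
      cert_file: /etc/otel/certs/collector.crt
      key_file: /etc/otel/certs/collector.key
```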
Prevent Sensitive Data from Reaching Storage
Not all telemetry fields should be stored. Personally identifiable information (PII), credentials, and other sensitive identifiers should be removed, masked, or hashed before leaving your control.
The `attributes` processor can:
- Replace values with placeholders (`***REDACTED***`)
- Remove entire attributes
- Hash values to preserve correlation without exposing raw data
Example use cases:
- Strip `Authorization` and `Cookie` headers
- Mask `email` or `credit_card` parameters in URLs
- Hash `user.id` or `customer.id`
Applying these transformations at the collector ensures the same compliance rules apply to all services.
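A sketch of those transformations with the `attributes` processor (the attribute keys are illustrative; adjust them to your instrumentation's naming):

```yaml
processors:
  attributes/redact_sensitive:
    actions:
      - key: http.request.header.authorization
        action: delete                     # strip credentials entirely
      - key: http.request.header.cookie
        action: delete
      - key: email
        action: update
        value: "***REDACTED***"            # keep the field, mask the value
      - key: user.id
        action: hash                       # preserve correlation without raw IDs
```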
Control Who Can Modify Collectors
Collectors aren’t passive — their configuration determines what data is collected, processed, and sent. In Kubernetes, RBAC policies should restrict who can:
- Modify collector deployments
- View sensitive configurations
- Restart collector pods
This prevents unauthorized changes — accidental or intentional — from disrupting telemetry pipelines or weakening compliance controls.
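A minimal RBAC sketch for that separation (namespace and group names are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-collector-operator
  namespace: monitoring
rules:
  # Allow managing collector workloads and restarting pods, nothing broader
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]   # delete allows restarting collector pods
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: otel-collector-operator
  namespace: monitoring
subjects:
  - kind: Group
    name: platform-observability        # illustrative group name
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: otel-collector-operator
  apiGroup: rbac.authorization.k8s.io
```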
Monitor Collector Health
If a collector becomes overloaded or misconfigured, observability can degrade without warning. Health monitoring should include both liveness and deep operational metrics:
- Health checks — confirm the collector process is running and responsive
- pprof profiling endpoints — diagnose CPU or memory bottlenecks
- Internal collector metrics — track refused spans, failed exports, queue sizes, and memory usage to detect backpressure early
For memory management, use Go's `GOMEMLIMIT` or equivalent environment variables supported by your collector distribution to keep garbage collection predictable under load. The deprecated `memory_ballast` extension should not be used.
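For example, in a Kubernetes pod spec you can set the limit a little below the container's memory limit (values are illustrative):

```yaml
containers:
  - name: otelcol
    image: otel/opentelemetry-collector-contrib:latest
    env:
      - name: GOMEMLIMIT
        value: "1600MiB"      # keep the Go heap target below the container limit
    resources:
      limits:
        memory: 2Gi
```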
These metrics should feed into your monitoring and alerting stack (such as Prometheus + Alertmanager) with thresholds for:
- High memory usage
- Queue growth
- Span drop rates
Proactively Verify Health
Threshold-based alerts only trigger when limits are crossed — silent degradation can still happen. Periodic verification adds another layer of assurance. Lightweight scheduled jobs (for example, Kubernetes CronJobs) can:
- Confirm health endpoints are reachable
- Compare export rates to expected baselines
- Check queue depths remain within limits
Running these checks every few minutes lets you catch partial outages or misconfigurations before they create a wider visibility gap.
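A sketch of such a check as a Kubernetes CronJob, probing the collector's health_check extension on its default port (the service name and schedule are illustrative):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: otel-collector-healthcheck
  namespace: monitoring
spec:
  schedule: "*/5 * * * *"          # every five minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: healthcheck
              image: curlimages/curl:8.8.0
              args:
                - "--fail"
                - "--max-time"
                - "5"
                - "http://otel-gateway.monitoring.svc:13133/"   # health_check endpoint
```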
Troubleshoot Problems in OpenTelemetry Deployments
In a distributed pipeline, symptoms rarely surface at the layer that caused them: a stalled exporter can hide a backend bottleneck, and a memory spike might be caused by unbounded cardinality upstream.
The fastest way to isolate the real issue is to walk the telemetry path in order — from collection to export to storage — confirming each stage before moving on. This prevents fixing the wrong layer and introducing new instability.
Check the collector fleet health
If ingestion is unstable, any downstream checks will be unreliable. Start with:
- Pod or process state — all running, no unexplained restarts (`kubectl get pods`)
- Version consistency — all instances on the expected collector release
- Recent changes — config updates, scaling events, or deployments during the incident window
If you find drift or instability here, resolve it before moving on.
Verify export path integrity
Once collectors look healthy, confirm they’re successfully sending telemetry to the next hop. Compare:
- `otelcol_exporter_sent_spans` vs `otelcol_exporter_send_failed_spans` over the last 5–10 minutes
- Failure rate — sustained >5% failure rate across gateways often points to a shared dependency issue
- Backend handshake latency and gRPC/HTTP status codes
Consistent failures across multiple gateways typically indicate a backend outage, network problem, or authentication issue.
Assess queue behaviour
Exporter queues are an early warning sign of downstream pressure. Check:
- `otelcol_exporter_queue_size` vs `otelcol_exporter_queue_capacity`
- Healthy — queue occupancy fluctuates below ~60%
- Unhealthy — queues fill and stay >70%, indicating a bottleneck further downstream
If using Kafka or another broker, also monitor consumer lag against your SLA. When queues saturate, consider reducing batch sizes to flush more often, applying filtering earlier, or scaling out consumers in the aggregation tier.
Investigate memory pressure
High memory usage in collectors often points to:
- Tail sampling buffers sized too large for current throughput (`tail_sampling` processor)
- Surges in span or metric cardinality
- Processor backlogs from slow exporters

Use the `memory_limiter` processor metrics like `otelcol_processor_refused_spans` to spot drops. Adjust limits, right-size batch processors, and control unbounded attributes early in the pipeline.
Correlate ingestion with backend performance
If ingestion is stable but queries are slow or timing out:
- Compare active series or span counts to historical baselines
- Confirm retention and storage tiering are routing data as intended
- Check for ingestion peaks that align with query latency spikes
Routing high-cardinality telemetry to warm or cold tiers earlier can protect hot-tier query SLAs without sacrificing retention.
Work in sequence
By moving through these stages — fleet health → export integrity → queue state → memory profile → backend behaviour — you can pinpoint where the deviation starts and act at the correct layer. This keeps the investigation scoped and avoids restarts or broad reconfigurations that mask the root cause.
Turning Deployment Patterns into Operational Intelligence
The deployment patterns above — agent–gateway, federated tiers, message queue buffering — only deliver value when you can see their impact in production. Most teams deploy these patterns blindly, then discover issues during incidents.
Last9 bridges this visibility gap by providing unified observability for your OpenTelemetry deployment:
• Unified telemetry correlation — Traces, metrics, and logs from your collectors appear in a single interface, making it easier to connect deployment changes to system behavior
• High-cardinality support — Rich resource attributes and collector metadata don’t slow down queries or inflate costs, so you can maintain detailed visibility at scale
• Real-time impact assessment — Configuration changes to collectors become visible immediately in data volume, processing patterns, and system performance
• OpenTelemetry-native integration — No additional instrumentation or configuration required to start visualizing collector behavior
The operational advantage: When you modify collector configurations — adjusting batch sizes, changing processors, or updating routing rules — you can immediately assess the impact rather than waiting for the next outage to reveal problems.
To see your deployment design translate into operational visibility, talk to our team, or if you want to get started at your own pace, start for free!
In our next piece, we’ll dig into how to pick the right backend — weighing hosted versus self-hosted setups, understanding vendor lock-in risks, checking compatibility, and breaking down cost models like events-based versus GB-based pricing.