Deploying OpenTelemetry at Scale: Production Patterns That Work

Understand the patterns for deploying OpenTelemetry at scale to achieve reliability, control costs, and maintain operational clarity.

Sep 10th, ‘25

You have started with OpenTelemetry in a single service — maybe a staging API, a background worker, or a low-traffic job. The setup is straightforward: a couple of receivers, a batch processor with safe defaults, and one OTLP exporter to the backend. Memory usage is stable, CPU overhead is predictable, and you can afford to trace everything without sampling.

As you roll it out further, the scope changes. Now, it’s over 50 services across multiple clusters and regions, each with its own unique load profile. Some collectors handle sustained API traffic at tens of thousands of spans per second; others ingest bursty batches from scheduled jobs. Exporters need to send data to multiple destinations — some over mTLS-secured OTLP, others via Kafka for durability.

At this point, new variables appear in the pipeline:

  • Configuration drift that drops resource attributes in one region but not another
  • Batch processor settings that work for one workload but trigger memory restarts in another
  • Exporters that stall under sustained load, leading to dropped spans or growing queues

Solving these case-by-case works in the moment, but over time, small differences between environments compound. Keeping the pipeline reliable at this scale means applying patterns that maintain consistency, handle variable workloads, and scale without sacrificing visibility.

In this part of the OTel series, we talk about the production rollout strategies that hold up across environments, configuration-as-code practices that work for multi-team setups, and techniques for sustaining performance while retaining full telemetry coverage.

The Real Performance Impact

Before deciding how to deploy OpenTelemetry at scale, let’s understand the typical overhead you can expect — so you can size collectors, tune processors, and plan capacity with intent.

Coroot’s load tests, summarized by InfoQ, measured the overhead of continuous tracing on a minimal HTTP service handling ~50k requests/sec:

  • CPU — Around 35% higher usage compared to baseline, even with spare CPU capacity on the node.
  • Memory — RSS increased by 5–8 MB, mainly from batching and buffering spans before export.
  • Latency — P99 latency rose from ~10 ms to ~15 ms under sustained load.
  • Network — Approximately 4 MB/s additional outbound traffic when sending unsampled, full request-level traces.

These numbers give you a concrete starting point. With this, you can design your rollout to:

  • Set batch sizes that balance compression gains with memory headroom
  • Apply sampling strategies where they deliver the most cost/performance benefit
  • Decide if an agent-only setup will hold, or if you need gateway tiers for heavier processing
  • Model horizontal scale and memory allocation before workloads hit production

OpenTelemetry Deployment Models at Scale

When you plan an OpenTelemetry rollout, you’ll usually see three main deployment approaches in production:

1. All-in-One Collector

In this pattern, all applications send telemetry directly to a single, centralized collector service. That collector is responsible for receiving, processing, and exporting all telemetry — metrics, logs, and traces — to one or more backends.

It’s the easiest to get started with:

  • One deployment to manage
  • One set of configuration files to maintain
  • No coordination between multiple collectors or tiers

Because there’s only a single collector service, it’s well-suited for:

  • Early-stage rollouts in staging or pre-production
  • Small production environments where traffic patterns are predictable
  • Single-team ownership of both applications and the telemetry pipeline
  • Workloads that don’t require advanced routing, multi-tenant isolation, or complex sampling
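As a reference point, a minimal all-in-one layout is a single Collector that receives OTLP from every service and exports to one backend. Here is a sketch; the listen addresses and backend endpoint are placeholders to adapt to your environment:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    send_batch_size: 1024
    timeout: 200ms

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend
    compression: gzip

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]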

As traffic grows, the all-in-one collector becomes the single choke point in your pipeline. Processing, batching, compression, and exporting all happen on the same instance, so CPU, memory, and queue depth all scale together.

Signs you’re approaching the limits
Instead of relying on fixed service counts or throughput numbers, watch for operational signals:

  • Queue utilisation — exporter or processor queues consistently running at 60–70% capacity for extended periods
  • Memory pressure — the memory_limiter processor starts dropping spans or metrics
  • CPU saturation — sustained CPU usage leaves little headroom for bursts, often caused by encryption or heavy compression in exporters
  • Export delays — flush times get longer, increasing end-to-end latency for telemetry delivery

When queues start backing up or memory pressure becomes routine, scaling a single collector vertically only buys you so much time. At that point, moving to an agent-only or agent–gateway hybrid model lets you distribute load and apply processing closer to the data source.

2. Agent-Only Collectors — Per-host resilience, distributed management

In this pattern, a lightweight collector runs alongside workloads on every node — usually as a Kubernetes DaemonSet or sidecar. Each agent batches, compresses, and enriches telemetry locally before sending it directly to the backend.

Because processing happens close to the workload, this pattern adds resilience against transient network or backend slowdowns. Exporters can retry and buffer locally without introducing a single ingestion bottleneck for the entire cluster.

When it’s a fit

  • Host-level metadata is important for analysis (e.g., using the Resource Detection Processor)
  • Backends can ingest directly from many nodes without requiring a central routing tier
  • You want to isolate telemetry per node, containing the blast radius of any collector failure

Strengths

  • Local buffering and retry — sending_queue and memory_limiter prevent short network blips from causing immediate drops
  • Consistent enrichment — resource and environment tags applied at the source
  • No single ingestion bottleneck — each agent operates independently

Operational trade-offs

  • Configuration changes (sampling, routing, or exporter updates) must be rolled out to every agent, which increases coordination effort in larger fleets
  • Without consistent configuration management, drift can emerge between nodes or clusters
  • Local queues can still overflow if sustained traffic exceeds the configured limits; monitor queue occupancy and memory consumption to determine safe headroom

Here’s an example of a Kubernetes DaemonSet agent config:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 1500
    timeout: 200ms
  memory_limiter:
    limit_mib: 256
    spike_limit_mib: 64
    check_interval: 5s
  resource:
    attributes:
      - key: k8s.cluster.name
        value: ${K8S_CLUSTER_NAME}
        action: upsert
  resourcedetection:
    detectors: [env, system, k8snode]
    override: false

exporters:
  otlp:
    endpoint: gateway-collector.monitoring.svc:4317
    compression: gzip
    tls:
      insecure: false
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s

Scaling considerations
Instead of relying on fixed span/sec numbers, base scaling decisions on real telemetry from your agents:

  • Queue utilisation — if sending_queue usage regularly exceeds 60–70%, either increase capacity or scale out
  • Memory pressure — watch for the memory_limiter dropping data; if this happens under normal load, adjust limits or batch sizes
  • Exporter throughput — monitor exporter send rates vs. backend ingest capacity; saturation here means either sampling more or adding an intermediate tier
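If Prometheus scrapes the agents' internal metrics, the queue guidance above can be turned into an alert. A sketch, assuming the otelcol_exporter_queue_size and otelcol_exporter_queue_capacity metric names exposed by your collector version:

groups:
  - name: otel-agent-queues
    rules:
      - alert: AgentExporterQueueHigh
        # Queue occupancy above 70% of capacity for 10 minutes
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OTel agent exporter queue above 70% on {{ $labels.instance }}"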

3. Agent–Gateway Hybrid — Designed for scale and centralized control

In this model, agent collectors run per host or pod, performing lightweight tasks like batching, compression, and basic enrichment. They forward telemetry to a smaller fleet of gateway collectors that handle heavier, centralized processing: tail-based sampling, cross-service correlation, multi-backend routing, and global redaction/enrichment policies.

When it’s a fit

  • You need to apply routing, sampling, or redaction rules across many services without modifying workloads
  • Multiple backend destinations require different export formats or protocols (e.g., traces to multiple vendors, metrics via remote_write)
  • Tail sampling or cross-trace correlation requires a global view of traffic
  • You want to keep enrichment and filtering rules consistent across teams and environments

Strengths

  • Centralized policy changes can be applied instantly without redeploying workloads
  • Supports complex routing — for example, sending traces to two APM vendors while sending metrics to Prometheus
  • Easier to test and roll out global enrichment, sampling, or redaction policies

Operational considerations

  • Deploy gateways in high-availability pairs per region to prevent ingestion loss during failures or upgrades
  • Validate policy and routing changes in a staging environment — a misconfiguration here can affect all incoming telemetry
  • Monitor exporter and processor queue utilisation — if gateway-level queues remain consistently high for extended periods, consider scaling out gateways horizontally or increasing queue capacity.
  • Watch memory limiter metrics — consistent drops indicate the need to tune processor settings or increase memory allocation

Example: Gateway collector configuration

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

processors:
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 5s
  memory_limiter:
    limit_mib: 1500
    spike_limit_mib: 300
    check_interval: 1s
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    expected_new_traces_per_sec: 100
    # Top-level policies are OR'd: a trace is kept if any policy matches
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample_normal
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp:
    endpoint: https://your-backend.com
    compression: gzip
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s
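The snippet above focuses on processors and exporters; a complete gateway also needs an OTLP receiver and a service section that wires everything into pipelines. A sketch, with the tail sampler placed between memory_limiter and batch:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]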

Why this pattern scales well
By splitting lightweight collection (agents) from heavier, centralized processing (gateways), you:

  • Keep per-node collectors small and efficient
  • Isolate heavy operations like tail sampling from workloads
  • Gain flexibility to scale collection and processing independently

Beyond Agent–Gateway

The agent–gateway pattern gives you a strong baseline: agents handle local batching and enrichment, gateways manage policy and routing. But at high volumes or across globally distributed teams, two extensions can prevent bottlenecks and improve resilience — message queues for durability, and federated collectors for regional scale.

Durable Buffering with Message Queues

When sustained throughput or sudden bursts start to push a gateway’s processing and queue limits, you risk data loss or backpressure on workloads. Placing a durable, high-throughput message bus — such as Kafka, Amazon MSK, Google Pub/Sub, or Azure Event Hubs — between agents and gateways decouples collection from aggregation.

This approach offers several benefits:

  • Bursts can be absorbed without dropping data.
  • Enables catch-up after downstream slowdowns.
  • Lets you scale processing tiers independently of ingestion.

It works best in scenarios such as:

  • Traffic regularly reaching tens or hundreds of MB/s.
  • Workloads with highly variable p99 latencies or queue depths.
  • Multiple backends needing the same telemetry without changing every agent’s configuration.

The trade-offs are operational: queues can still lose telemetry if retention windows expire before processing, so align retention with worst-case recovery times; consumer lag can delay visibility, which can be mitigated by scaling out consumers or optimising processing throughput; and poor topic partitioning can create uneven load, avoidable by choosing partition keys that distribute traffic evenly across brokers.

Here’s how a typical flow looks:

Application → Agent Collector → Kafka → Aggregator Collector → Backend(s)
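In Collector terms, the agent publishes with the Kafka exporter and the aggregator consumes with the Kafka receiver. A minimal sketch; the broker address and topic name are assumptions:

# Agent collector: publish spans to Kafka instead of exporting directly
exporters:
  kafka:
    brokers: ["kafka.internal:9092"]
    topic: otlp_spans
    encoding: otlp_proto

# Aggregator collector: consume the same topic and continue the pipeline
receivers:
  kafka:
    brokers: ["kafka.internal:9092"]
    topic: otlp_spans
    encoding: otlp_proto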

Federated Collectors for Global Scale

In multi-region environments, routing all telemetry to a single central gateway adds cross-region latency, increases transfer costs, and risks turning that gateway into a single choke point. Federated deployments solve this by introducing regional gateways — lightweight aggregation points that process data locally, apply enrichment and filtering, and then forward upstream to a global gateway.

This keeps ingestion latency low for local workloads, respects data sovereignty boundaries, and isolates failures to the affected region. It’s particularly effective in active–active architectures where regions need to operate independently but still feed into a unified backend. The challenges are mainly around governance: ensuring regional policies don’t drift, forwarding rules are correct, and upstream outages don’t silently fill buffers without alerting.

A common flow is:

Application → Agent Collector → Regional Gateway → Global Gateway → Backend(s)
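At the regional gateway, the main additions over a standard gateway are a region-identifying resource attribute and an upstream exporter pointed at the global tier. A sketch; the attribute key, region value, and endpoint are placeholders:

processors:
  resource/region:
    attributes:
      - key: cloud.region        # example key; align with your naming conventions
        value: eu-west-1
        action: upsert

exporters:
  otlp/global:
    endpoint: global-gateway.observability.internal:4317
    compression: gzip
    tls:
      insecure: false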

Build Predictable Collector Configurations

With OTel adoption, the collector’s configuration effectively becomes part of your production infrastructure. Changes to batch processor sizes, memory limits, or exporter settings can have cluster-wide effects, so it’s worth managing them with the same discipline as application code — generated, validated, versioned, and deployed through CI/CD.

The Collector’s validate command is a useful safeguard for syntax and schema correctness before rollout. It won’t catch every problem — for example, it can’t detect circular dependencies or mismatched component types in pipelines — but when combined with consistent generation and deployment practices, it helps maintain a reliable baseline across environments.

Template-Driven Config Generation

Maintaining consistency across environments is easier when configuration is generated from a single typed source. For example, a Go or Python struct can output final YAML, ensuring that batch sizes, memory limits, exporters, and headers stay aligned without repetitive edits:

type CollectorConfig struct {
    Environment   string
    Team          string
    SamplingRate  float64
    DataRetention string
    BackendURL    string
    MemoryLimitMB int
    BatchSize     int
}
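The struct is only the source of truth; a template renders it into the final Collector YAML. For a hypothetical production instance, the rendered output might include fragments like the following (field-to-setting mapping shown in comments):

# Rendered from CollectorConfig{Environment: "production", Team: "payments", ...}
processors:
  memory_limiter:
    limit_mib: 512            # MemoryLimitMB
  batch:
    send_batch_size: 2048     # BatchSize
  probabilistic_sampler:
    sampling_percentage: 10   # SamplingRate

exporters:
  otlp:
    endpoint: https://backend.example.com   # BackendURL
    headers:
      x-team: payments                      # Team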

This approach is especially valuable when:

  • Multiple environments or teams require small variations
  • Guardrails are needed on performance-sensitive settings
  • Configuration changes must be made centrally and rolled out automatically

Deploy Collectors with Infrastructure as Code

Using IaC tools like Terraform or Helm keeps collector deployments reproducible and auditable. This ensures the same configuration patterns are applied across clusters, regions, or tiers — whether agents, gateways, or federated collectors — and makes scaling more predictable.

resource "kubernetes_deployment" "otel_collector" {
count = var.collector_instances
metadata {
name = "otel-collector-${count.index}"
namespace = var.namespace
}
spec {
template {
spec {
container {
name = "otelcol"
image = "otel/opentelemetry-collector-contrib:${var.otel_version}"
}
}
}
}
}

Validate Configurations in CI

Adding configuration validation to CI pipelines catches syntax errors, missing processors, or unintentional changes to critical settings before they reach production. The native validator can be paired with custom static checks — for example, detecting high-cardinality attributes or sensitive fields.

#!/usr/bin/env bash
# Validate collector config syntax, then flag suspicious high-cardinality attributes
CONFIG_FILE="${1:-collector.yaml}"

otelcol validate --config="${CONFIG_FILE}"

grep -E "(user_id|session_id|request_id)" "${CONFIG_FILE}" &&
  echo "⚠️ High-cardinality attribute detected"

While otelcol validate is not a full semantic validator, it’s a low-friction step that, combined with IaC and templating, helps keep your collector fleet’s configuration predictable and safe to change at scale.

Reduce Telemetry Costs Without Losing Coverage

Once the pipeline is stable, the next challenge is cost. Network transfer, collector processing, and backend storage can all grow faster than expected as coverage expands. The aim is to keep the same level of visibility while eliminating unnecessary overhead.

For high-volume internal hops — such as agent → gateway or regional → global — the OTLP Arrow protocol can dramatically reduce network traffic.

In ServiceNow’s production deployment, OTLP Arrow delivered a 30–70% reduction in network bandwidth compared to OTLP/gRPC with large batch sizes and zstd compression. In controlled benchmarks, it achieved up to a 10× improvement over uncompressed OTLP.

Arrow achieves this by using a columnar data format with higher compression efficiency. Deployments need Arrow-capable receivers and exporters, along with sufficient CPU and memory to process the format efficiently.

In lower-throughput or resource-constrained collectors, standard OTLP with compression can be the more practical choice — especially if network usage is already within budget and CPU headroom is limited.

When to use Arrow: Ideal for internal collector-to-collector links carrying sustained high-volume traffic, multi-backend fan-out, or long-haul regional aggregation where bandwidth savings directly reduce cost or latency.

When to use standard OTLP: Best suited for moderate-throughput pipelines, edge collectors with tight resource limits, or environments where operational simplicity outweighs maximum compression gains.
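In the contrib distribution, Arrow support ships as separate otelarrow exporter and receiver components. A sketch of an internal collector-to-collector hop using them; the endpoint and stream count are assumptions, and field names can vary between versions, so check the component docs for the release you run:

# Sending side (agent or regional gateway)
exporters:
  otelarrow:
    endpoint: gateway-collector.monitoring.svc:4317
    tls:
      insecure: false
    arrow:
      num_streams: 4     # parallel Arrow streams for this link

# Receiving side (gateway)
receivers:
  otelarrow:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317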

Manage Fleets with OpAMP

OpAMP enables remote configuration updates, health reporting, and effective configuration retrieval for large collector fleets. Key capabilities include AcceptsRemoteConfig, ReportsEffectiveConfig, and ReportsHealth. This is especially valuable for multi-region deployments where consistency is critical.

OpAMP support varies significantly across collector distributions and deployment modes. The protocol specification is stable, but implementations range from beta to development status. The OpenTelemetry Collector’s OpAMP extension and supervisor modes have different maturity levels, so evaluate your specific use case and test thoroughly in non-production environments first.

extensions:
  opamp:
    server:
      ws:
        endpoint: wss://opamp-server.company.com/v1/opamp
    capabilities:
      - AcceptsRemoteConfig
      - ReportsEffectiveConfig
      - ReportsHealth

Adjust Sampling Dynamically

Adaptive sampling tunes capture rates based on live traffic, backend load, or budget targets. This can reduce ingestion volume without dropping high-value spans. The probabilistic sampler is one straightforward way to implement this.

processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: ${env:OTEL_SAMPLING_RATE:-10}

Tier Storage by Data Value

Route telemetry into hot, warm, or cold storage tiers based on retention policy or usage requirements. Recent critical data stays on low-latency storage; older or lower-priority data shifts to cheaper storage.

processors:
  routing/storage_tier:
    from_attribute: span.retention_policy
    table:
      - value: hot
        exporters: [otlp/hot-storage]
      - value: warm
        exporters: [otlp/warm-storage]
      - value: cold
        exporters: [otlp/cold-storage]

Reduce Volume at the Edge

Dropping low-value telemetry at the source has a compounding cost benefit. Use the filter processor to exclude noise, the transform processor to trim oversized values, and the attributes processor to hash high-cardinality fields that aren’t needed in full.

processors:
  filter/noise_reduction:
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'IsMatch(attributes["user_agent.original"], ".*synthetic.*")'
  transform/cardinality_reduction:
    trace_statements:
      - context: span
        statements:
          - truncate_all(attributes, 100)
  attributes/hash_ids:
    actions:
      - key: user.id
        action: hash

This keeps your telemetry pipeline lean without sacrificing the insights you rely on. The same architectural patterns that make your collectors predictable also make them cost-efficient — if you apply the right protocols, routing, and filtering from the start.

Secure and Govern Your Telemetry Pipeline

Security and compliance in OpenTelemetry is about embedding trust into every stage of the telemetry lifecycle — from collection to storage. That means encrypting data in transit, stripping or transforming sensitive fields before export, enforcing strict change controls on collector configurations, and continuously verifying that the collectors themselves are healthy.

Encrypt Telemetry in Transit

Every hop between applications, collectors, and backends should use TLS. Without encryption, any point in the network path becomes a potential inspection or injection vector. TLS not only prevents eavesdropping but also ensures integrity — data cannot be modified mid-flight without detection.

In practice, this means configuring TLS settings on both sides of an OTLP gRPC connection:

  • Collectors validating backend identities
  • Applications validating collector identities

Even within private networks, TLS provides a low-cost safeguard against misconfiguration or insider threats.
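On the Collector, that means giving the OTLP receiver a server certificate and pointing exporters at a CA they should trust. A sketch with placeholder certificate paths:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/certs/collector.crt   # server certificate presented to applications
          key_file: /etc/otel/certs/collector.key

exporters:
  otlp:
    endpoint: backend.example.com:4317
    tls:
      ca_file: /etc/otel/certs/ca.crt   # CA used to verify the backend's identity
      insecure: false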

Prevent Sensitive Data from Reaching Storage

Not all telemetry fields should be stored. Personally identifiable information (PII), credentials, and other sensitive identifiers should be removed, masked, or hashed before leaving your control.

The attributes processor can:

  • Replace values with placeholders (***REDACTED***)
  • Remove entire attributes
  • Hash values to preserve correlation without exposing raw data

Example use cases:

  • Strip Authorization and Cookie headers
  • Mask email or credit_card parameters in URLs
  • Hash user.id or customer.id

Applying these transformations at the collector ensures the same compliance rules apply to all services.
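Expressed with the attributes processor, those rules might look like this; the attribute keys are examples to match against your own instrumentation:

processors:
  attributes/scrub_pii:
    actions:
      # Drop credentials outright
      - key: http.request.header.authorization
        action: delete
      - key: http.request.header.cookie
        action: delete
      # Mask values that must remain present but unreadable
      - key: email
        action: update
        value: "***REDACTED***"
      # Hash identifiers to keep correlation without raw values
      - key: user.id
        action: hash
      - key: customer.id
        action: hash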

Control Who Can Modify Collectors

Collectors aren’t passive — their configuration determines what data is collected, processed, and sent. In Kubernetes, RBAC policies should restrict who can:

  • Modify collector deployments
  • View sensitive configurations
  • Restart collector pods

This prevents unauthorized changes — accidental or intentional — from disrupting telemetry pipelines or weakening compliance controls.
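A restrictive Role for the collector namespace might look like the sketch below; names are placeholders, and the Role should be bound only to the team that owns the pipeline:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-pipeline-admin
  namespace: monitoring
rules:
  # Manage collector workloads and their configuration
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch", "update", "patch"]
  # Allow restarting collector pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]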

Monitor Collector Health

If a collector becomes overloaded or misconfigured, observability can degrade without warning. Health monitoring should include both liveness and deep operational metrics:

  • Health checks — confirm the collector process is running and responsive
  • pprof profiling endpoints — diagnose CPU or memory bottlenecks
  • Internal collector metrics — track refused spans, failed exports, queue sizes, and memory usage to detect backpressure early

For memory management, use Go’s GOMEMLIMIT or equivalent environment variables supported by your collector distribution to keep garbage collection predictable under load. The deprecated memory_ballast extension should not be used.
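On the collector side, these signals come from a few extensions plus the internal telemetry settings. A sketch; ports shown are the component defaults:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777

service:
  extensions: [health_check, pprof]
  telemetry:
    metrics:
      level: detailed   # exposes queue sizes, refused spans, export failures, and more
# GOMEMLIMIT is set as an environment variable on the collector container, not in this file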

These metrics should feed into your monitoring and alerting stack (such as Prometheus + Alertmanager) with thresholds for:

  • High memory usage
  • Queue growth
  • Span drop rates

Proactively Verify Health

Threshold-based alerts only trigger when limits are crossed — silent degradation can still happen. Periodic verification adds another layer of assurance. Lightweight scheduled jobs (for example, Kubernetes CronJobs) can:

  • Confirm health endpoints are reachable
  • Compare export rates to expected baselines
  • Check queue depths remain within limits

Running these checks every few minutes lets you catch partial outages or misconfigurations before they create a wider visibility gap.
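A minimal version of such a job, assuming the health_check extension is exposed on port 13133 behind a gateway Service named otel-gateway:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: otel-collector-healthcheck
  namespace: monitoring
spec:
  schedule: "*/5 * * * *"   # every five minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: check
              image: curlimages/curl:8.8.0
              command: ["sh", "-c"]
              args:
                - curl -sf http://otel-gateway.monitoring.svc:13133/ || exit 1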

Troubleshoot Problems in OpenTelemetry Deployments

In a scaled-out pipeline, symptoms rarely point directly at their cause: a stalled exporter can hide a backend bottleneck, and a memory spike might be caused by unbounded cardinality upstream.

The fastest way to isolate the real issue is to walk the telemetry path in order — from collection to export to storage — confirming each stage before moving on. This prevents fixing the wrong layer and introducing new instability.

Check the collector fleet health
If ingestion is unstable, any downstream checks will be unreliable. Start with:

  • Pod or process state — all running, no unexplained restarts (kubectl get pods)
  • Version consistency — all instances on the expected collector release
  • Recent changes — config updates, scaling events, or deployments during the incident window
    If you find drift or instability here, resolve it before moving on.

Verify export path integrity
Once collectors look healthy, confirm they’re successfully sending telemetry to the next hop. Compare:

  • otelcol_exporter_sent_spans vs otelcol_exporter_send_failed_spans over the last 5–10 minutes
  • Failure rate — sustained >5% failure rate across gateways often points to a shared dependency issue
  • Backend handshake latency and gRPC/HTTP status codes
    Consistent failures across multiple gateways typically indicate a backend outage, network problem, or authentication issue.
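That comparison is easy to automate as an alerting rule. A sketch of the >5% check, assuming Prometheus scrapes the gateways’ internal metrics (metric names may carry a _total suffix depending on collector version):

groups:
  - name: otel-export-health
    rules:
      - alert: ExporterFailureRateHigh
        # Share of spans that failed to export over the last 10 minutes
        expr: |
          sum(rate(otelcol_exporter_send_failed_spans[10m]))
            /
          sum(rate(otelcol_exporter_sent_spans[10m]) + rate(otelcol_exporter_send_failed_spans[10m]))
          > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of spans are failing to export"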

Assess queue behaviour
Exporter queues are an early warning sign of downstream pressure. Check:

  • otelcol_exporter_queue_size vs otelcol_exporter_queue_capacity
  • Healthy — queue occupancy fluctuates below ~60%
  • Unhealthy — queues fill and stay >70%, indicating a bottleneck further downstream

If using Kafka or another broker, also monitor consumer lag against your SLA. When queues saturate, consider reducing batch sizes to flush more often, applying filtering earlier, or scaling out consumers in the aggregation tier.

Investigate memory pressure
High memory usage in collectors often points to:

  • Tail sampling buffers sized too large for current throughput (tail_sampling processor)
  • Surges in span or metric cardinality
  • Processor backlogs from slow exporters
    Use the memory_limiter processor metrics like otelcol_processor_refused_spans to spot drops. Adjust limits, right-size batch processors, and control unbounded attributes early in the pipeline.

Correlate ingestion with backend performance
If ingestion is stable but queries are slow or timing out:

  • Compare active series or span counts to historical baselines
  • Confirm retention and storage tiering are routing data as intended
  • Check for ingestion peaks that align with query latency spikes
    Routing high-cardinality telemetry to warm or cold tiers earlier can protect hot-tier query SLAs without sacrificing retention.

Work in sequence
By moving through these stages — fleet health → export integrity → queue state → memory profile → backend behaviour — you can pinpoint where the deviation starts and act at the correct layer. This keeps the investigation scoped and avoids restarts or broad reconfigurations that mask the root cause.

Turning Deployment Patterns into Operational Intelligence

The deployment patterns above — agent–gateway, federated tiers, message queue buffering — only deliver value when you can see their impact in production. Most teams deploy these patterns blindly, then discover issues during incidents.

Last9 bridges this visibility gap by providing unified observability for your OpenTelemetry deployment:

Unified telemetry correlation — Traces, metrics, and logs from your collectors appear in a single interface, making it easier to connect deployment changes to system behavior

High-cardinality support — Rich resource attributes and collector metadata don’t slow down queries or inflate costs, so you can maintain detailed visibility at scale

Real-time impact assessment — Configuration changes to collectors become visible immediately in data volume, processing patterns, and system performance

OpenTelemetry-native integration — No additional instrumentation or configuration required to start visualizing collector behavior

The operational advantage: When you modify collector configurations — adjusting batch sizes, changing processors, or updating routing rules — you can immediately assess the impact rather than waiting for the next outage to reveal problems.

To see your deployment design translate into operational visibility, talk to our team, or if you want to get started at your own pace, start for free!

In our next piece, we’ll dig into how to pick the right backend — weighing hosted versus self-hosted setups, understanding vendor lock-in risks, checking compatibility, and breaking down cost models like events-based versus GB-based pricing.
