Deploying OpenTelemetry at Scale: Production Patterns That Work

Understand the patterns for deploying OpenTelemetry at scale to achieve reliability, control costs, and maintain operational clarity.

Sep 10th, ‘25

You have started with OpenTelemetry in a single service — maybe a staging API, a background worker, or a low-traffic job. The setup is straightforward: a couple of receivers, a batch processor with safe defaults, and one OTLP exporter to the backend. Memory usage is stable, CPU overhead is predictable, and you can afford to trace everything without sampling.

As you roll it out further, the scope changes. Now, it’s over 50 services across multiple clusters and regions, each with its own unique load profile. Some collectors handle sustained API traffic at tens of thousands of spans per second; others ingest bursty batches from scheduled jobs. Exporters need to send data to multiple destinations — some over mTLS-secured OTLP, others via Kafka for durability.

At this point, new variables appear in the pipeline:

  • Configuration drift that drops resource attributes in one region but not another
  • Batch processor settings that work for one workload but trigger memory restarts in another
  • Exporters that stall under sustained load, leading to dropped spans or growing queues

Solving these case-by-case works in the moment, but over time, small differences between environments compound. Keeping the pipeline reliable at this scale means applying patterns that maintain consistency, handle variable workloads, and scale without sacrificing visibility.

In this part of the OTel series, we talk about the production rollout strategies that hold up across environments, configuration-as-code practices that work for multi-team setups, and techniques for sustaining performance while retaining full telemetry coverage.

The Real Performance Impact

Before deciding how to deploy OpenTelemetry at scale, let’s understand the typical overhead you can expect — so you can size collectors, tune processors, and plan capacity with intent.

Coroot’s load tests, summarized by InfoQ, measured the overhead of continuous tracing on a minimal HTTP service handling ~50k requests/sec:

  • CPU — Around 35% higher usage compared to baseline, even with spare CPU capacity on the node.
  • Memory — RSS increased by 5–8 MB, mainly from batching and buffering spans before export.
  • Latency — P99 latency rose from ~10 ms to ~15 ms under sustained load.
  • Network — Approximately 4 MB/s additional outbound traffic when sending unsampled, full request-level traces.

These numbers give you a concrete starting point. With this, you can design your rollout to:

  • Set batch sizes that balance compression gains with memory headroom
  • Apply sampling strategies where they deliver the most cost/performance benefit
  • Decide if an agent-only setup will hold, or if you need gateway tiers for heavier processing
  • Model horizontal scale and memory allocation before workloads hit production

OpenTelemetry Deployment Models at Scale

When you plan an OpenTelemetry rollout, you’ll usually see three main deployment approaches in production:

1. All-in-One Collector

In this pattern, all applications send telemetry directly to a single, centralized collector service. That collector is responsible for receiving, processing, and exporting all telemetry — metrics, logs, and traces — to one or more backends.

It’s the easiest to get started with:

  • One deployment to manage
  • One set of configuration files to maintain
  • No coordination between multiple collectors or tiers

Because there’s only a single collector service, it’s well-suited for:

  • Early-stage rollouts in staging or pre-production
  • Small production environments where traffic patterns are predictable
  • Single-team ownership of both applications and the telemetry pipeline
  • Workloads that don’t require advanced routing, multi-tenant isolation, or complex sampling
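As a reference point, a minimal all-in-one layout is a single Collector that receives OTLP from every service and exports to one backend. Here is a sketch; the listen addresses and backend endpoint are placeholders to adapt to your environment:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
    spike_limit_mib: 128
  batch:
    send_batch_size: 1024
    timeout: 200ms

exporters:
  otlp:
    endpoint: backend.example.com:4317   # placeholder backend
    compression: gzip

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]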

As traffic grows, the all-in-one collector becomes the single choke point in your pipeline. Processing, batching, compression, and exporting all happen on the same instance, so CPU, memory, and queue depth all scale together.

Signs you’re approaching the limits
Instead of relying on fixed service counts or throughput numbers, watch for operational signals:

  • Queue utilisation — exporter or processor queues consistently running at 60–70% capacity for extended periods
  • Memory pressure — the memory_limiter processor starts dropping spans or metrics
  • CPU saturation — sustained CPU usage leaves little headroom for bursts, often caused by encryption or heavy compression in exporters
  • Export delays — flush times get longer, increasing end-to-end latency for telemetry delivery

When queues start backing up or memory pressure becomes routine, scaling a single collector vertically only buys you so much time. At that point, moving to an agent-only or agent–gateway hybrid model lets you distribute load and apply processing closer to the data source.

2. Agent-Only Collectors — Per-host resilience, distributed management

In this pattern, a lightweight collector runs alongside workloads on every node — usually as a Kubernetes DaemonSet or sidecar. Each agent batches, compresses, and enriches telemetry locally before sending it directly to the backend.

Because processing happens close to the workload, this pattern adds resilience against transient network or backend slowdowns. Exporters can retry and buffer locally without introducing a single ingestion bottleneck for the entire cluster.

When it’s a fit

  • Host-level metadata is important for analysis (e.g., using the Resource Detection Processor)
  • Backends can ingest directly from many nodes without requiring a central routing tier
  • You want to isolate telemetry per node, containing the blast radius of any collector failure

Strengths

  • Local buffering and retry — sending_queue and memory_limiter prevent short network blips from causing immediate drops
  • Consistent enrichment — resource and environment tags applied at the source
  • No single ingestion bottleneck — each agent operates independently

Operational trade-offs

  • Configuration changes (sampling, routing, or exporter updates) must be rolled out to every agent, which increases coordination effort in larger fleets
  • Without consistent configuration management, drift can emerge between nodes or clusters
  • Local queues can still overflow if sustained traffic exceeds the configured limits; monitor queue occupancy and memory consumption to determine safe headroom

Here’s an example of a Kubernetes DaemonSet agent config:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

processors:
  batch:
    send_batch_size: 1024
    send_batch_max_size: 1500
    timeout: 200ms
  memory_limiter:
    limit_mib: 256
    spike_limit_mib: 64
    check_interval: 5s
  resource:
    attributes:
      - key: k8s.cluster.name
        value: ${K8S_CLUSTER_NAME}
        action: upsert
  resourcedetection:
    detectors: [env, system, k8snode]
    override: false

exporters:
  otlp:
    endpoint: gateway-collector.monitoring.svc:4317
    compression: gzip
    tls:
      insecure: false
    sending_queue:
      enabled: true
      num_consumers: 4
      queue_size: 1000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s

Scaling considerations
Instead of relying on fixed span/sec numbers, base scaling decisions on real telemetry from your agents:

  • Queue utilisation — if sending_queue usage regularly exceeds 60–70%, either increase capacity or scale out
  • Memory pressure — watch for the memory_limiter dropping data; if this happens under normal load, adjust limits or batch sizes
  • Exporter throughput — monitor exporter send rates vs. backend ingest capacity; saturation here means either sampling more or adding an intermediate tier
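If Prometheus scrapes the agents' internal metrics, the queue guidance above can be turned into an alert. A sketch, assuming the otelcol_exporter_queue_size and otelcol_exporter_queue_capacity metric names exposed by your collector version:

groups:
  - name: otel-agent-queues
    rules:
      - alert: AgentExporterQueueHigh
        # Queue occupancy above 70% of capacity for 10 minutes
        expr: otelcol_exporter_queue_size / otelcol_exporter_queue_capacity > 0.7
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OTel agent exporter queue above 70% on {{ $labels.instance }}"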

3. Agent–Gateway Hybrid — Designed for scale and centralized control

In this model, agent collectors run per host or pod, performing lightweight tasks like batching, compression, and basic enrichment. They forward telemetry to a smaller fleet of gateway collectors that handle heavier, centralized processing: tail-based sampling, cross-service correlation, multi-backend routing, and global redaction/enrichment policies.

When it’s a fit

  • You need to apply routing, sampling, or redaction rules across many services without modifying workloads
  • Multiple backend destinations require different export formats or protocols (e.g., traces to multiple vendors, metrics via remote_write)
  • Tail sampling or cross-trace correlation requires a global view of traffic
  • You want to keep enrichment and filtering rules consistent across teams and environments

Strengths

  • Centralized policy changes can be applied instantly without redeploying workloads
  • Supports complex routing — for example, sending traces to two APM vendors while sending metrics to Prometheus
  • Easier to test and roll out global enrichment, sampling, or redaction policies

Operational considerations

  • Deploy gateways in high-availability pairs per region to prevent ingestion loss during failures or upgrades
  • Validate policy and routing changes in a staging environment — a misconfiguration here can affect all incoming telemetry
  • Monitor exporter and processor queue utilisation — if gateway-level queues remain consistently high for extended periods, consider scaling out gateways horizontally or increasing queue capacity.
  • Watch memory limiter metrics — consistent drops indicate the need to tune processor settings or increase memory allocation

Example: Gateway collector configuration

extensions:
  health_check:
    endpoint: 0.0.0.0:13133

processors:
  batch:
    send_batch_size: 8192
    send_batch_max_size: 8192
    timeout: 5s
  memory_limiter:
    limit_mib: 1500
    spike_limit_mib: 300
    check_interval: 1s
  tail_sampling:
    decision_wait: 30s
    num_traces: 50000
    expected_new_traces_per_sec: 100
    # Top-level policies are OR'd: a trace is kept if any policy matches
    policies:
      - name: errors
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow_requests
        type: latency
        latency: { threshold_ms: 2000 }
      - name: sample_normal
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }

exporters:
  otlp:
    endpoint: https://your-backend.com
    compression: gzip
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 1s
      max_interval: 30s
      max_elapsed_time: 300s
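The snippet above focuses on processors and exporters; a complete gateway also needs an OTLP receiver and a service section that wires everything into pipelines. A sketch, with the tail sampler placed between memory_limiter and batch:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]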

Why this pattern scales well
By splitting lightweight collection (agents) from heavier, centralized processing (gateways), you:

  • Keep per-node collectors small and efficient
  • Isolate heavy operations like tail sampling from workloads
  • Gain flexibility to scale collection and processing independently

Beyond Agent–Gateway

The agent–gateway pattern gives you a strong baseline: agents handle local batching and enrichment, gateways manage policy and routing. But at high volumes or across globally distributed teams, two extensions can prevent bottlenecks and improve resilience — message queues for durability, and federated collectors for regional scale.

Durable Buffering with Message Queues

When sustained throughput or sudden bursts start to push a gateway’s processing and queue limits, you risk data loss or backpressure on workloads. Placing a durable, high-throughput message bus — such as Kafka, Amazon MSK, Google Pub/Sub, or Azure Event Hubs — between agents and gateways decouples collection from aggregation.

This approach offers several benefits:

  • Bursts can be absorbed without dropping data.
  • Enables catch-up after downstream slowdowns.
  • Lets you scale processing tiers independently of ingestion.

It works best in scenarios such as:

  • Traffic regularly reaching tens or hundreds of MB/s.
  • Workloads with highly variable p99 latencies or queue depths.
  • Multiple backends needing the same telemetry without changing every agent’s configuration.

The trade-offs are operational: queues can still lose telemetry if retention windows expire before processing, so align retention with worst-case recovery times; consumer lag can delay visibility, which can be mitigated by scaling out consumers or optimising processing throughput; and poor topic partitioning can create uneven load, avoidable by choosing partition keys that distribute traffic evenly across brokers.

Here’s how a typical flow looks:

Application → Agent Collector → Kafka → Aggregator Collector → Backend(s)
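In Collector terms, the agent publishes with the Kafka exporter and the aggregator consumes with the Kafka receiver. A minimal sketch; the broker address and topic name are assumptions:

# Agent collector: publish spans to Kafka instead of exporting directly
exporters:
  kafka:
    brokers: ["kafka.internal:9092"]
    topic: otlp_spans
    encoding: otlp_proto

# Aggregator collector: consume the same topic and continue the pipeline
receivers:
  kafka:
    brokers: ["kafka.internal:9092"]
    topic: otlp_spans
    encoding: otlp_proto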

Federated Collectors for Global Scale

In multi-region environments, routing all telemetry to a single central gateway adds cross-region latency, increases transfer costs, and risks turning that gateway into a single choke point. Federated deployments solve this by introducing regional gateways — lightweight aggregation points that process data locally, apply enrichment and filtering, and then forward upstream to a global gateway.

This keeps ingestion latency low for local workloads, respects data sovereignty boundaries, and isolates failures to the affected region. It’s particularly effective in active–active architectures where regions need to operate independently but still feed into a unified backend. The challenges are mainly around governance: ensuring regional policies don’t drift, forwarding rules are correct, and upstream outages don’t silently fill buffers without alerting.

A common flow is:

Application → Agent Collector → Regional Gateway → Global Gateway → Backend(s)
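At the regional gateway, the main additions over a standard gateway are a region-identifying resource attribute and an upstream exporter pointed at the global tier. A sketch; the attribute key, region value, and endpoint are placeholders:

processors:
  resource/region:
    attributes:
      - key: cloud.region        # example key; align with your naming conventions
        value: eu-west-1
        action: upsert

exporters:
  otlp/global:
    endpoint: global-gateway.observability.internal:4317
    compression: gzip
    tls:
      insecure: false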

Build Predictable Collector Configurations

With OTel adoption, the collector’s configuration effectively becomes part of your production infrastructure. Changes to batch processor sizes, memory limits, or exporter settings can have cluster-wide effects, so it’s worth managing them with the same discipline as application code — generated, validated, versioned, and deployed through CI/CD.

The Collector’s validate command is a useful safeguard for syntax and schema correctness before rollout. It won’t catch every problem — for example, it can’t detect circular dependencies or mismatched component types in pipelines — but when combined with consistent generation and deployment practices, it helps maintain a reliable baseline across environments.

Template-Driven Config Generation

Maintaining consistency across environments is easier when configuration is generated from a single typed source. For example, a Go or Python struct can output final YAML, ensuring that batch sizes, memory limits, exporters, and headers stay aligned without repetitive edits:

type CollectorConfig struct {
    Environment   string
    Team          string
    SamplingRate  float64
    DataRetention string
    BackendURL    string
    MemoryLimitMB int
    BatchSize     int
}
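The struct is only the source of truth; a template renders it into the final Collector YAML. For a hypothetical production instance, the rendered output might include fragments like the following (field-to-setting mapping shown in comments):

# Rendered from CollectorConfig{Environment: "production", Team: "payments", ...}
processors:
  memory_limiter:
    limit_mib: 512            # MemoryLimitMB
  batch:
    send_batch_size: 2048     # BatchSize
  probabilistic_sampler:
    sampling_percentage: 10   # SamplingRate

exporters:
  otlp:
    endpoint: https://backend.example.com   # BackendURL
    headers:
      x-team: payments                      # Team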

This approach is especially valuable when:

  • Multiple environments or teams require small variations
  • Guardrails are needed on performance-sensitive settings
  • Configuration changes must be made centrally and rolled out automatically

Deploy Collectors with Infrastructure as Code

Using IaC tools like Terraform or Helm keeps collector deployments reproducible and auditable. This ensures the same configuration patterns are applied across clusters, regions, or tiers — whether agents, gateways, or federated collectors — and makes scaling more predictable.

resource "kubernetes_deployment" "otel_collector" {
count = var.collector_instances
metadata {
name = "otel-collector-${count.index}"
namespace = var.namespace
}
spec {
template {
spec {
container {
name = "otelcol"
image = "otel/opentelemetry-collector-contrib:${var.otel_version}"
}
}
}
}
}

Validate Configurations in CI

Adding configuration validation to CI pipelines catches syntax errors, missing processors, or unintentional changes to critical settings before they reach production. The native validator can be paired with custom static checks — for example, detecting high-cardinality attributes or sensitive fields.

#!/usr/bin/env bash
# Validate collector config syntax, then flag suspicious high-cardinality attributes
CONFIG_FILE="${1:-collector.yaml}"

otelcol validate --config="${CONFIG_FILE}"

grep -E "(user_id|session_id|request_id)" "${CONFIG_FILE}" &&
  echo "⚠️ High-cardinality attribute detected"

While otelcol validate is not a full semantic validator, it’s a low-friction step that, combined with IaC and templating, helps keep your collector fleet’s configuration predictable and safe to change at scale.

Reduce Telemetry Costs Without Losing Coverage

Once the pipeline is stable, the next challenge is cost. Network transfer, collector processing, and backend storage can all grow faster than expected as coverage expands. The aim is to keep the same level of visibility while eliminating unnecessary overhead.

For high-volume internal hops — such as agent → gateway or regional → global — the OTLP Arrow protocol can dramatically reduce network traffic.

In ServiceNow’s production deployment, OTLP Arrow delivered a 30–70% reduction in network bandwidth compared to OTLP/gRPC with large batch sizes and zstd compression. In controlled benchmarks, it achieved up to a 10× improvement over uncompressed OTLP.

Arrow achieves this by using a columnar data format with higher compression efficiency. Deployments need Arrow-capable receivers and exporters, along with sufficient CPU and memory to process the format efficiently.

In lower-throughput or resource-constrained collectors, standard OTLP with compression can be the more practical choice — especially if network usage is already within budget and CPU headroom is limited.

When to use Arrow: Ideal for internal collector-to-collector links carrying sustained high-volume traffic, multi-backend fan-out, or long-haul regional aggregation where bandwidth savings directly reduce cost or latency.

When to use standard OTLP: Best suited for moderate-throughput pipelines, edge collectors with tight resource limits, or environments where operational simplicity outweighs maximum compression gains.
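In the contrib distribution, Arrow support ships as separate otelarrow exporter and receiver components. A sketch of an internal collector-to-collector hop using them; the endpoint and stream count are assumptions, and field names can vary between versions, so check the component docs for the release you run:

# Sending side (agent or regional gateway)
exporters:
  otelarrow:
    endpoint: gateway-collector.monitoring.svc:4317
    tls:
      insecure: false
    arrow:
      num_streams: 4     # parallel Arrow streams for this link

# Receiving side (gateway)
receivers:
  otelarrow:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317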

Manage Fleets with OpAMP

OpAMP enables remote configuration updates, health reporting, and effective configuration retrieval for large collector fleets. Key capabilities include AcceptsRemoteConfig, ReportsEffectiveConfig, and ReportsHealth. This is especially valuable for multi-region deployments where consistency is critical.

OpAMP support varies significantly across collector distributions and deployment modes. The protocol specification is stable, but implementations range from beta to development status. The OpenTelemetry Collector’s OpAMP extension and supervisor modes have different maturity levels, so evaluate your specific use case and test thoroughly in non-production environments first.

extensions:
  opamp:
    server:
      ws:
        endpoint: wss://opamp-server.company.com/v1/opamp
    capabilities:
      - AcceptsRemoteConfig
      - ReportsEffectiveConfig
      - ReportsHealth

Adjust Sampling Dynamically

Adaptive sampling tunes capture rates based on live traffic, backend load, or budget targets. This can reduce ingestion volume without dropping high-value spans. The probabilistic sampler is one straightforward way to implement this.

processors:
  probabilistic_sampler:
    hash_seed: 22
    sampling_percentage: ${env:OTEL_SAMPLING_RATE:-10}

Tier Storage by Data Value

Route telemetry into hot, warm, or cold storage tiers based on retention policy or usage requirements. Recent critical data stays on low-latency storage; older or lower-priority data shifts to cheaper storage.

processors:
  routing/storage_tier:
    from_attribute: span.retention_policy
    table:
      - value: hot
        exporters: [otlp/hot-storage]
      - value: warm
        exporters: [otlp/warm-storage]
      - value: cold
        exporters: [otlp/cold-storage]

Reduce Volume at the Edge

Dropping low-value telemetry at the source has a compounding cost benefit. Use the filter processor to exclude noise, the transform processor to trim oversized values, and the attributes processor to hash high-cardinality fields that aren’t needed in full.

processors:
  filter/noise_reduction:
    traces:
      span:
        - 'attributes["http.route"] == "/health"'
        - 'IsMatch(attributes["user_agent.original"], ".*synthetic.*")'
  transform/cardinality_reduction:
    trace_statements:
      - context: span
        statements:
          - truncate_all(attributes, 100)
  attributes/hash_ids:
    actions:
      - key: user.id
        action: hash

This keeps your telemetry pipeline lean without sacrificing the insights you rely on. The same architectural patterns that make your collectors predictable also make them cost-efficient — if you apply the right protocols, routing, and filtering from the start.

Secure and Govern Your Telemetry Pipeline

Security and compliance in OpenTelemetry is about embedding trust into every stage of the telemetry lifecycle — from collection to storage. That means encrypting data in transit, stripping or transforming sensitive fields before export, enforcing strict change controls on collector configurations, and continuously verifying that the collectors themselves are healthy.

Encrypt Telemetry in Transit

Every hop between applications, collectors, and backends should use TLS. Without encryption, any point in the network path becomes a potential inspection or injection vector. TLS not only prevents eavesdropping but also ensures integrity — data cannot be modified mid-flight without detection.

In practice, this means configuring TLS settings on both sides of an OTLP gRPC connection:

  • Collectors validating backend identities
  • Applications validating collector identities

Even within private networks, TLS provides a low-cost safeguard against misconfiguration or insider threats.
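On the Collector, that means giving the OTLP receiver a server certificate and pointing exporters at a CA they should trust. A sketch with placeholder certificate paths:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        tls:
          cert_file: /etc/otel/certs/collector.crt   # server certificate presented to applications
          key_file: /etc/otel/certs/collector.key

exporters:
  otlp:
    endpoint: backend.example.com:4317
    tls:
      ca_file: /etc/otel/certs/ca.crt   # CA used to verify the backend's identity
      insecure: false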

Prevent Sensitive Data from Reaching Storage

Not all telemetry fields should be stored. Personally identifiable information (PII), credentials, and other sensitive identifiers should be removed, masked, or hashed before leaving your control.

The attributes processor can:

  • Replace values with placeholders (***REDACTED***)
  • Remove entire attributes
  • Hash values to preserve correlation without exposing raw data

Example use cases:

  • Strip Authorization and Cookie headers
  • Mask email or credit_card parameters in URLs
  • Hash user.id or customer.id

Applying these transformations at the collector ensures the same compliance rules apply to all services.
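Expressed with the attributes processor, those rules might look like this; the attribute keys are examples to match against your own instrumentation:

processors:
  attributes/scrub_pii:
    actions:
      # Drop credentials outright
      - key: http.request.header.authorization
        action: delete
      - key: http.request.header.cookie
        action: delete
      # Mask values that must remain present but unreadable
      - key: email
        action: update
        value: "***REDACTED***"
      # Hash identifiers to keep correlation without raw values
      - key: user.id
        action: hash
      - key: customer.id
        action: hash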

Control Who Can Modify Collectors

Collectors aren’t passive — their configuration determines what data is collected, processed, and sent. In Kubernetes, RBAC policies should restrict who can:

  • Modify collector deployments
  • View sensitive configurations
  • Restart collector pods

This prevents unauthorized changes — accidental or intentional — from disrupting telemetry pipelines or weakening compliance controls.
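A restrictive Role for the collector namespace might look like the sketch below; names are placeholders, and the Role should be bound only to the team that owns the pipeline:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: otel-pipeline-admin
  namespace: monitoring
rules:
  # Manage collector workloads and their configuration
  - apiGroups: ["apps"]
    resources: ["deployments", "daemonsets"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch", "update", "patch"]
  # Allow restarting collector pods
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "delete"]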

Monitor Collector Health

If a collector becomes overloaded or misconfigured, observability can degrade without warning. Health monitoring should include both liveness and deep operational metrics:

  • Health checks — confirm the collector process is running and responsive
  • pprof profiling endpoints — diagnose CPU or memory bottlenecks
  • Internal collector metrics — track refused spans, failed exports, queue sizes, and memory usage to detect backpressure early

For memory management, use Go’s GOMEMLIMIT or equivalent environment variables supported by your collector distribution to keep garbage collection predictable under load. The deprecated memory_ballast extension should not be used.
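On the collector side, these signals come from a few extensions plus the internal telemetry settings. A sketch; ports shown are the component defaults:

extensions:
  health_check:
    endpoint: 0.0.0.0:13133
  pprof:
    endpoint: 0.0.0.0:1777

service:
  extensions: [health_check, pprof]
  telemetry:
    metrics:
      level: detailed   # exposes queue sizes, refused spans, export failures, and more
# GOMEMLIMIT is set as an environment variable on the collector container, not in this file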

These metrics should feed into your monitoring and alerting stack (such as Prometheus + Alertmanager) with thresholds for:

  • High memory usage
  • Queue growth
  • Span drop rates

Proactively Verify Health

Threshold-based alerts only trigger when limits are crossed — silent degradation can still happen. Periodic verification adds another layer of assurance. Lightweight scheduled jobs (for example, Kubernetes CronJobs) can:

  • Confirm health endpoints are reachable
  • Compare export rates to expected baselines
  • Check queue depths remain within limits

Running these checks every few minutes lets you catch partial outages or misconfigurations before they create a wider visibility gap.
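A minimal version of such a job, assuming the health_check extension is exposed on port 13133 behind a gateway Service named otel-gateway:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: otel-collector-healthcheck
  namespace: monitoring
spec:
  schedule: "*/5 * * * *"   # every five minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: check
              image: curlimages/curl:8.8.0
              command: ["sh", "-c"]
              args:
                - curl -sf http://otel-gateway.monitoring.svc:13133/ || exit 1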

Troubleshoot Problems in OpenTelemetry Deployments

In a scaled-out pipeline, symptoms rarely point directly at their cause: a stalled exporter can hide a backend bottleneck, and a memory spike might be caused by unbounded cardinality upstream.

The fastest way to isolate the real issue is to walk the telemetry path in order — from collection to export to storage — confirming each stage before moving on. This prevents fixing the wrong layer and introducing new instability.

Check the collector fleet health
If ingestion is unstable, any downstream checks will be unreliable. Start with:

  • Pod or process state — all running, no unexplained restarts (kubectl get pods)
  • Version consistency — all instances on the expected collector release
  • Recent changes — config updates, scaling events, or deployments during the incident window
    If you find drift or instability here, resolve it before moving on.

Verify export path integrity
Once collectors look healthy, confirm they’re successfully sending telemetry to the next hop. Compare:

  • otelcol_exporter_sent_spans vs otelcol_exporter_send_failed_spans over the last 5–10 minutes
  • Failure rate — sustained >5% failure rate across gateways often points to a shared dependency issue
  • Backend handshake latency and gRPC/HTTP status codes
    Consistent failures across multiple gateways typically indicate a backend outage, network problem, or authentication issue.
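That comparison is easy to automate as an alerting rule. A sketch of the >5% check, assuming Prometheus scrapes the gateways’ internal metrics (metric names may carry a _total suffix depending on collector version):

groups:
  - name: otel-export-health
    rules:
      - alert: ExporterFailureRateHigh
        # Share of spans that failed to export over the last 10 minutes
        expr: |
          sum(rate(otelcol_exporter_send_failed_spans[10m]))
            /
          sum(rate(otelcol_exporter_sent_spans[10m]) + rate(otelcol_exporter_send_failed_spans[10m]))
          > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of spans are failing to export"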

Assess queue behaviour
Exporter queues are an early warning sign of downstream pressure. Check:

  • otelcol_exporter_queue_size vs otelcol_exporter_queue_capacity
  • Healthy — queue occupancy fluctuates below ~60%
  • Unhealthy — queues fill and stay >70%, indicating a bottleneck further downstream

If using Kafka or another broker, also monitor consumer lag against your SLA. When queues saturate, consider reducing batch sizes to flush more often, applying filtering earlier, or scaling out consumers in the aggregation tier.

Investigate memory pressure
High memory usage in collectors often points to:

  • Tail sampling buffers sized too large for current throughput (tail_sampling processor)
  • Surges in span or metric cardinality
  • Processor backlogs from slow exporters
    Use the memory_limiter processor metrics like otelcol_processor_refused_spans to spot drops. Adjust limits, right-size batch processors, and control unbounded attributes early in the pipeline.

Correlate ingestion with backend performance
If ingestion is stable but queries are slow or timing out:

  • Compare active series or span counts to historical baselines
  • Confirm retention and storage tiering are routing data as intended
  • Check for ingestion peaks that align with query latency spikes
    Routing high-cardinality telemetry to warm or cold tiers earlier can protect hot-tier query SLAs without sacrificing retention.

Work in sequence
By moving through these stages — fleet health → export integrity → queue state → memory profile → backend behaviour — you can pinpoint where the deviation starts and act at the correct layer. This keeps the investigation scoped and avoids restarts or broad reconfigurations that mask the root cause.

Turning Deployment Patterns into Operational Intelligence

The deployment patterns above — agent–gateway, federated tiers, message queue buffering — only deliver value when you can see their impact in production. Most teams deploy these patterns blindly, then discover issues during incidents.

Last9 bridges this visibility gap by providing unified observability for your OpenTelemetry deployment:

Unified telemetry correlation — Traces, metrics, and logs from your collectors appear in a single interface, making it easier to connect deployment changes to system behavior

High-cardinality support — Rich resource attributes and collector metadata don’t slow down queries or inflate costs, so you can maintain detailed visibility at scale

Real-time impact assessment — Configuration changes to collectors become visible immediately in data volume, processing patterns, and system performance

OpenTelemetry-native integration — No additional instrumentation or configuration required to start visualizing collector behavior

The operational advantage: When you modify collector configurations — adjusting batch sizes, changing processors, or updating routing rules — you can immediately assess the impact rather than waiting for the next outage to reveal problems.

To see your deployment design translate into operational visibility, talk to our team, or if you want to get started at your own pace, start for free!

In our next piece, we’ll dig into how to pick the right backend — weighing hosted versus self-hosted setups, understanding vendor lock-in risks, checking compatibility, and breaking down cost models like events-based versus GB-based pricing.
