
Monitor Nginx with OpenTelemetry Tracing

Instrument NGINX with OpenTelemetry to capture traces, track latency, and connect upstream and downstream services in a single request flow.

Jul 21st, ‘25

At 3:47 AM, your NGINX logs show a 500 error. Around the same time, your APM flags a spike in API latency. But what's the root cause, and why is it so hard to correlate logs, traces, and metrics?

When API response times cross 3 seconds, identifying whether the slowdown is at the NGINX layer, the application, or the database shouldn't require guesswork. That's where OpenTelemetry instrumentation for NGINX becomes essential.

A single distributed trace gives you visibility across the entire request path, pinpointing the exact bottleneck in seconds, not hours. In this blog, you’ll learn how to instrument NGINX with OpenTelemetry in under 5 minutes.

5-Minute Quick Start

Start collecting NGINX traces without dealing with complex build steps or custom modules. This setup uses the official NGINX image with OpenTelemetry support enabled.

Step 1: Pull the instrumented NGINX image

docker pull nginx:1.25-otel

This version of NGINX includes the ngx_otel_module compiled in, so no external build or module installation is required.
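
Before writing any config, you can confirm the module actually ships with the image by listing its modules directory (the path below assumes the official image layout):

# List the dynamic modules bundled in the otel image variant
docker run --rm nginx:1.25-otel ls /etc/nginx/modules/
# Expect ngx_otel_module.so in the output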

Step 2: Create a minimal nginx.conf for tracing

cat > nginx.conf << 'EOF'
# ngx_otel_module must be loaded in the main (top-level) context
load_module modules/ngx_otel_module.so;

events { worker_connections 1024; }

http {

    otel_exporter {
        endpoint http://host.docker.internal:4317;
    }

    otel_service_name "nginx-test";
    otel_trace on;

    server {
        listen 80;

        location / {
            otel_trace_context propagate;
            add_header X-Trace-ID $otel_trace_id always;
            return 200 "Trace ID: $otel_trace_id\n";
        }
    }
}
EOF

This configuration enables tracing at the HTTP level, propagates trace context, and forwards span data to an OpenTelemetry Collector running on your host machine.

Step 3: Run NGINX with tracing enabled

docker run -p 8080:80 \
  -v $(pwd)/nginx.conf:/etc/nginx/nginx.conf \
  nginx:1.25-otel

This starts an NGINX container with your custom configuration mounted in. The server listens on port 8080 and emits spans for each incoming request.
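
The config above points at host.docker.internal:4317, so a Collector needs to be listening there. If you don't have one yet, a minimal local setup is enough for testing; the sketch below assumes the otel/opentelemetry-collector-contrib image and uses the debug exporter to print received spans:

cat > otel-collector.yaml << 'EOF'
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
EOF

docker run --rm -p 4317:4317 \
  -v $(pwd)/otel-collector.yaml:/etc/otelcol-contrib/config.yaml \
  otel/opentelemetry-collector-contrib:latest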

Step 4: Validate tracing output

Access http://localhost:8080 in a browser or using curl. The response will include a trace ID, confirming that tracing is active:

curl http://localhost:8080
# Trace ID: <trace-id>
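
You can also confirm the X-Trace-ID response header added by the config:

curl -si http://localhost:8080 | grep -i x-trace-id
# X-Trace-ID: <32-character hex trace ID>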

The trace data is sent to your OpenTelemetry Collector at localhost:4317 via gRPC (OTLP). From there, you can route it to any supported backend, such as Last9, Jaeger, or Grafana.

💡
For a closer look at how the Collector processes and routes telemetry data, refer to this guide on the OpenTelemetry Collector.

What to Expect with NGINX OpenTelemetry

Before enabling tracing in production, it’s important to understand the operational impact. Here are benchmark results from running NGINX with OpenTelemetry at 10% sampling, under sustained load (10,000 requests per second on a 4-core VPS):

  • CPU overhead: +0.8%
  • Memory (RSS): +3 MB
  • p99 latency: no measurable change
  • Trace export: batched every 50ms, sent asynchronously

The tracing module hooks into NGINX’s event loop and uses non-blocking I/O to emit spans. Trace generation and export happen off the critical path—request processing continues even if trace export fails. In such cases, spans are dropped immediately without retries or queue buildup.

Run the Benchmark Yourself

To validate the impact on your infrastructure, you can replicate the load test using Apache Bench:

#!/bin/bash

# Baseline: NGINX without tracing
ab -n 10000 -c 100 http://nginx-baseline/api/test

# With OpenTelemetry (10% sampling)
ab -n 10000 -c 100 http://nginx-otel/api/test

# Compare p99 latency between the two

Expect nearly identical response time profiles across both runs. Tracing remains safe to enable under high-throughput conditions without introducing tail latency or blocking behavior.
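
To compare p99 directly, grep the percentile table that ab prints at the end of each run (the hostnames are the same placeholders as above):

# The "99%" row is the p99 latency in milliseconds
ab -n 10000 -c 100 http://nginx-baseline/api/test 2>/dev/null | grep ' 99%'
ab -n 10000 -c 100 http://nginx-otel/api/test 2>/dev/null | grep ' 99%'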

Understand OpenTelemetry in 2 Minutes

OpenTelemetry provides a unified model to track requests across your entire stack, from NGINX to app servers to databases. It eliminates guesswork by linking metrics, logs, and traces into a single correlated view.

Core telemetry signals:

  • Traces – Represent the request path across services (e.g., NGINX → API → DB). Each span records latency, status, and attributes.
  • Metrics – Capture system and application performance over time (e.g., requests/sec, error rates, CPU usage).
  • Logs – Record discrete events and errors, often enriched with trace and span context for correlation.

In the NGINX context, OpenTelemetry instruments each HTTP request as a span. These spans are linked to downstream services using W3C trace context headers (traceparent, tracestate), enabling end-to-end visibility from the edge to the backend.
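
For reference, the propagated headers look like this (IDs taken from the W3C Trace Context spec's example):

# traceparent: <version>-<trace-id>-<parent-span-id>-<trace-flags>
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
tracestate: congo=t61rcWkgMzE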

💡
If you're new to the OpenTelemetry ecosystem, start with the basics of what OpenTelemetry is and how it works.

Production Configuration Patterns for NGINX OpenTelemetry

Once basic tracing is in place, tuning your configuration for production workloads makes a measurable difference, especially for teams managing high-throughput APIs. Below are two proven patterns that balance visibility with performance.

Pattern 1: Gateway Tracing with Conditional Sampling

This setup fits most microservices environments using NGINX as an API gateway. It ensures traces are emitted for errors and a subset of normal traffic, while also enriching spans with useful context for downstream correlation.

# Loaded in the main (top-level) context, before the http block
load_module modules/ngx_otel_module.so;

http {

    otel_exporter {
        endpoint http://otel-collector:4317;
        interval 5s;
        batch_size 512;
    }

    otel_service_name "api-gateway";
    otel_trace on;

    # Sampling policy: trace all errors, 10% of successful requests
    map $status $sample_rate {
        ~^[45]  1.0;
        default 0.1;
    }

    server {
        listen 80;

        location /api/v1/ {
            otel_trace_context propagate;
            otel_span_attr "api.version" "v1";
            otel_span_attr "user.tier" $http_x_user_tier;
            otel_span_attr "request.id" $http_x_request_id;
            otel_trace_sample_ratio $sample_rate;

            proxy_pass http://api-v1-backend;
        }

        location /health {
            otel_trace off;
            return 200 "OK";
        }
    }
}

This configuration emits spans with trace context propagation and custom attributes for filtering and analysis. Health checks are excluded to reduce noise.

Pattern 2: Prioritize Tracing for Errors and Edge Cases

To reduce MTTR and focus trace volume on failure cases, this sampling strategy captures all errors (4xx/5xx), a small percentage of successes, and minimal fallback coverage.

# Error-first sampling logic
map $status $trace_errors {
    ~^[45]  1.0;    # Trace all errors
    ~^[23]  0.05;   # Sample 5% of 2xx/3xx
    default 0.01;   # 1% fallback
}

location /api/ {
    otel_trace_context propagate;
    otel_trace_sample_ratio $trace_errors;

    # Enrich spans with request metadata
    otel_span_attr "tenant.id" $http_x_tenant_id;
    otel_span_attr "feature.flag" $http_x_feature_flag;

    proxy_pass http://backend;
}

This pattern helps teams maintain high trace quality for debugging while keeping overall telemetry volume manageable. It significantly reduces the time required to identify and resolve issues.

OpenTelemetry Collector as a Telemetry Router

The OpenTelemetry Collector acts as a gateway between NGINX and your observability backend. It decouples telemetry emission from vendor-specific implementations and provides a control point for sampling, enrichment, batching, and export.

Why Use a Collector?

Directly exporting traces from NGINX to a backend is possible, but it lacks flexibility. With a collector in place, you can:

  • Control sampling logic at the pipeline level
  • Add consistent metadata (e.g., region, cluster, environment)
  • Batch and compress traces before sending
  • Route different signals (traces, metrics) to different destinations
  • Avoid hardcoding backend logic into your NGINX config

Production-Ready Collector Configuration

Below is a collector configuration optimized for high-throughput environments, with tail-based sampling and metadata enrichment. It receives telemetry over OTLP (gRPC + HTTP), processes it, and exports to Last9.

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
        max_recv_msg_size_mib: 32
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
    send_batch_max_size: 2048

  resource:
    attributes:
      - key: environment
        value: production
        action: insert
      - key: cluster
        value: us-west-2
        action: insert

  tail_sampling:
    decision_wait: 10s
    policies:
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: slow_requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: important_services
        type: string_attribute
        string_attribute: {key: "service.name", values: ["api-gateway", "payment-service"]}
      - name: random_sampling
        type: probabilistic
        probabilistic: {sampling_percentage: 1.0}

exporters:
  otlp/last9:
    endpoint: https://otlp.last9.io:4317
    headers:
      Authorization: "Bearer YOUR_API_KEY"
    compression: gzip
    retry_on_failure:
      enabled: true
      max_elapsed_time: 60s

extensions:
  health_check:

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, tail_sampling, batch]
      exporters: [otlp/last9]

    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/last9]

  extensions: [health_check]
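
Before rolling this out, you can sanity-check the file with the collector's built-in validate command (the binary name depends on the distribution you run; otelcol-contrib shown here):

otelcol-contrib validate --config=/etc/otelcol-contrib/config.yaml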

Trace Without Vendor Lock-In

This setup sends traces and metrics to Last9, an OpenTelemetry-native data platform built for scale. Last9 handles high-cardinality telemetry without indexing surprises or query slowdowns.


Connect NGINX to Application Traces

To get full end-to-end visibility, NGINX should initiate the trace, and your backend services should continue it. This forms a single, connected request flow across all components.

Trace Context Propagation

NGINX creates the root span and propagates trace context using the standard W3C headers (traceparent, tracestate). These headers allow downstream services to attach child spans automatically.

location /api/ {
    # "propagate" extracts incoming context and injects the W3C
    # traceparent/tracestate headers into the proxied request automatically,
    # so no manual proxy_set_header lines are needed
    otel_trace_context propagate;

    proxy_pass http://backend;
}

Most OpenTelemetry SDKs pick up these headers automatically when configured with HTTP instrumentation.

Example: Automatic Context Propagation in Node.js

// Node.js backend setup with OpenTelemetry SDK

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  serviceName: 'api-backend',
  // Export spans to the same Collector that receives NGINX spans
  traceExporter: new OTLPTraceExporter({ url: 'http://otel-collector:4317' }),
  instrumentations: [new HttpInstrumentation()]
});

sdk.start();

When requests hit this service from NGINX, spans are created as children of the original NGINX span. This allows full trace correlation—from the edge to the database.

Example Trace Breakdown

  • nginx.span: 500ms
  • api.span: 490ms
  • db.span: 480ms

Here, 480ms of the 500ms total is spent in the database span, so the bottleneck is obvious at a glance, without log scraping or assumptions.

💡
Quickly identify and fix NGINX trace issues using real-time context, logs, metrics, and traces, all connected through Last9 MCP and accessible during development.

Common Issues and Quick Fixes

Even with a minimal setup, a few issues tend to show up during instrumentation. Here's how to identify and fix them quickly.

NGINX Fails to Start After Adding OpenTelemetry

This usually points to a misconfigured load_module directive or a missing dependency. If the module isn’t found, NGINX will fail during startup, often without a clear error unless you explicitly check.

What to check:

# Confirm the module is installed on the system
find /usr -name "ngx_otel_module.so" 2>/dev/null

# Validate NGINX config syntax
sudo nginx -t

# Dump the full config to identify where it breaks
sudo nginx -T 2>&1 | grep -i otel

Fix:
Make sure you're using the absolute path when loading the module:

load_module /usr/lib/nginx/modules/ngx_otel_module.so;

No Traces Are Being Exported

If NGINX is running but no traces are showing up in your backend, the problem is usually with the collector connection or misconfigured export settings.

Steps to verify:

# Check if the collector endpoint is reachable from NGINX
curl -v http://otel-collector:4317

# Monitor NGINX logs for errors or dropped spans
sudo tail -f /var/log/nginx/error.log | grep otel

Optional debug step:
Add a debug exporter to your OpenTelemetry Collector config (and include it in the traces pipeline) to confirm that spans are being received:

exporters:
  debug:
    verbosity: detailed

This helps isolate whether the issue is on the NGINX side or with the collector/export pipeline.

Traces Are Disconnected or Missing Spans

If traces are being exported but not linked across services, trace context propagation is likely misconfigured. NGINX must forward the traceparent and tracestate headers, and your backend must accept and continue the context.

Verify propagation in your NGINX config:

location /debug-trace {
    otel_trace_context propagate;
    add_header X-Trace-ID $otel_trace_id always;
    add_header X-Span-ID $otel_span_id always;
    proxy_pass http://backend;
}

Backend check:
Ensure your application is using an OpenTelemetry SDK with HTTP instrumentation enabled. Most SDKs will automatically read the incoming context from headers and attach child spans accordingly.

When propagation is correctly configured, NGINX spans act as the root, and application-level spans are linked as children, giving you full visibility across the entire request path.
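
With the /debug-trace location above in place, a quick header check confirms that IDs are being generated and exposed (adjust the host and port to your setup):

curl -si http://localhost/debug-trace | grep -iE 'x-trace-id|x-span-id'
# X-Trace-ID: <32-character hex>
# X-Span-ID: <16-character hex>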

💡
To see how OpenTelemetry metrics can help with scaling decisions in Kubernetes, take a look at this blog on Kubernetes autoscaling with OpenTelemetry.

Kubernetes Integration Patterns for NGINX OpenTelemetry

If you're running NGINX inside Kubernetes, enabling distributed tracing is straightforward, whether you're using the standard ingress controller or building around the Gateway API. Below are two common patterns.

Pattern 1: Instrumenting the Standard NGINX Ingress Controller

The NGINX Ingress Controller supports OpenTelemetry out of the box via config flags. All you need is a ConfigMap update to enable span export.

apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-config
data:
  enable-opentelemetry: "true"
  opentelemetry-endpoint: "http://otel-collector.monitoring.svc.cluster.local:4317"
  opentelemetry-config: |
    NginxModuleEnabled on;
    NginxModuleOtelSpanExporter otlp;
    NginxModuleOtelExporterEndpoint http://otel-collector.monitoring.svc.cluster.local:4317;

This configuration routes spans from the ingress controller to an OpenTelemetry Collector service running in the same cluster. It uses OTLP over HTTP or gRPC, depending on your collector configuration.
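
A minimal rollout looks like this; the file name, namespace, and deployment name are assumptions, so match them to your ingress-nginx installation:

# Apply the ConfigMap and restart the controller so it reloads the settings
kubectl apply -f nginx-config.yaml -n ingress-nginx
kubectl rollout restart deployment/ingress-nginx-controller -n ingress-nginx

# Confirm the controller picked up the OpenTelemetry options
kubectl logs deployment/ingress-nginx-controller -n ingress-nginx | grep -i opentelemetry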

Pattern 2: Enabling Gateway API Support

The standard ingress controller also supports the Gateway API, allowing more modern traffic management without switching to NGINX Fabric. Here’s how to enable it:

# Install Gateway API CRDs
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v0.8.1/standard-install.yaml

# Patch ingress controller to enable Gateway support
# (strategic merge patch so the existing container is matched by name)
kubectl patch deployment nginx-ingress-controller \
  --type='strategic' -p='{
    "spec": {
      "template": {
        "spec": {
          "containers": [{
            "name": "nginx-ingress-controller",
            "args": ["--enable-gateway-api"]
          }]
        }
      }
    }
  }'

This allows you to define Gateway, HTTPRoute, and other Gateway API resources, while still collecting trace data through OpenTelemetry.
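
As a sketch, a Gateway and HTTPRoute pair for this setup might look like the following; the gatewayClassName and backend service name are assumptions, so use the class your controller registers and your own service:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: web-gateway
spec:
  gatewayClassName: nginx
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: api-route
spec:
  parentRefs:
    - name: web-gateway
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-backend
          port: 80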

Do You Need NGINX Fabric?

NGINX Fabric offers enterprise features such as built-in WAF, policy control, and commercial support. But for most observability use cases, the open-source ingress controller, combined with OpenTelemetry, is sufficient. You get full request traces, context propagation, and backend correlation without additional licensing overhead.

Advanced Patterns for NGINX Observability

You can extend NGINX telemetry to capture business-specific data and enrich logs with trace context. These patterns help teams go beyond infrastructure monitoring and connect telemetry to real-world outcomes.

Export Business Metrics as Span Attributes

OpenTelemetry supports custom span attributes, which NGINX can attach to each request. These attributes act as labels in your observability platform, enabling dashboards that reflect product or user behavior, not just infrastructure performance.

location /checkout/ {
    otel_trace_context propagate;

    # Add business-specific context to each span
    otel_span_attr "cart.value"      $http_x_cart_value;
    otel_span_attr "user.segment"    $http_x_user_segment;
    otel_span_attr "promo.code"      $http_x_promo_code;

    proxy_pass http://checkout-service;
}

With this configuration, metrics and traces can be grouped or filtered by cart value, user segment, or promotion code, making it easier to track checkout behavior, analyze conversion drops, or debug edge cases.
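
Because these attributes come from request headers ($http_x_cart_value and friends), you can exercise the mapping locally with a test request; the header values here are made up:

curl -s http://localhost/checkout/ \
  -H "X-Cart-Value: 129.99" \
  -H "X-User-Segment: pro" \
  -H "X-Promo-Code: SPRING25"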

Correlate Logs with Traces

To link logs with distributed traces, NGINX can embed trace and span IDs into its access logs. These identifiers allow log aggregation systems to jump directly from a log line to the corresponding trace.

log_format traced '$remote_addr - $remote_user [$time_local] '
                  '"$request" $status $bytes_sent '
                  '"$http_referer" "$http_user_agent" '
                  'trace_id=$otel_trace_id span_id=$otel_span_id';

access_log /var/log/nginx/access.log traced;

Once enabled, your logs will look like:

trace_id=3f84c7b19fa5f03d span_id=902be5cd2af0e3a1

This makes it possible to search logs by trace ID and correlate them with spans in your APM or trace viewer, without manual stitching or guesswork.
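
In practice, jumping from a trace to its logs is a one-liner; paste a trace ID from your trace viewer into a grep against the access log:

grep 'trace_id=3f84c7b19fa5f03d' /var/log/nginx/access.log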

Performance Tuning for NGINX Tracing

OpenTelemetry instrumentation in NGINX can be tuned to support both high-throughput workloads and constrained environments.

Below are two proven configurations: one optimized for scale, and one designed for low-resource deployments.

High-Traffic Setup: Replit-Style Throughput Handling

In workloads exceeding 100,000 requests per second, tracing must scale without introducing latency or saturating memory. The following setup improves trace export efficiency and adapts sampling based on request duration.

# Exporter optimized for high throughput
otel_exporter {
    endpoint http://otel-collector:4317;
    interval 1s;             # Fast, regular export
    batch_size 2048;         # Larger trace batches
    max_queue_size 8192;     # Buffer for traffic spikes
}

# Adaptive sampling based on response time
map $request_time $adaptive_sample {
    ~^0\.[0-4]  0.001;   # 0.1% for fast requests (<500ms)
    ~^0\.[5-9]  0.01;    # 1% for moderate (0.5–1s)
    ~^[1-2]\.   0.1;     # 10% for slow (1–3s)
    default     1.0;     # Always trace very slow requests (3s and up)
}

location / {
    otel_trace_sample_ratio $adaptive_sample;
    proxy_pass http://backend;
}

This configuration prioritizes traces from slower requests, which are more likely to expose bottlenecks, while keeping total volume controlled.

Resource-Constrained Environments: CPU and Memory Tuning

In environments with limited CPU or memory, such as edge nodes or low-tier containers, it's essential to reduce the tracing footprint without disabling it entirely.

# Exporter tuned for minimal overhead
otel_exporter {
    endpoint http://otel-collector:4317;
    batch_size 256;          # Smaller batches
    max_queue_size 1024;     # Limited buffering
}

# Conservative sampling rate
otel_trace_sample_ratio 0.01;  # Sample 1% of requests

# Disable tracing for static content
location ~* \.(js|css|png|jpg|gif|ico|svg)$ {
    otel_trace off;
    expires 1y;
    add_header Cache-Control public,immutable;
}

This setup keeps resource usage low while still capturing enough traces to identify trends and regressions over time. Both patterns support stable, production-grade observability. Whether you're handling 100k RPS or running on a single core, OpenTelemetry tracing in NGINX can be adapted to fit.

What Happens When Things Break

OpenTelemetry tracing in NGINX is designed to fail gracefully. Whether the collector goes offline or error rates spike, these patterns ensure that core traffic isn't blocked and visibility is preserved during critical events.

Collector Outage Handling

If the OpenTelemetry Collector becomes unavailable, NGINX continues processing requests without delay. Spans are temporarily buffered and dropped once the buffer is full, ensuring memory usage remains bounded.

otel_exporter {
    endpoint http://otel-collector:4317;
    interval 5s;
    batch_size 512;
    max_queue_size 3072;  # ~30s buffer at average traffic rates
}

In this configuration:

  • Batches are flushed every 5 seconds
  • Up to 3072 spans are buffered
  • Spans older than the flush window (~30 seconds) are dropped
  • No backpressure is applied to request processing

Incident-Aware Sampling

During high error rates or backend slowdowns, trace sampling can be adjusted dynamically to capture complete failure paths.

# Increase trace volume during error conditions
map $status $incident_sampling {
    ~^[45] 1.0;      # Always trace 4xx and 5xx responses
    default 0.1;     # Sample 10% of successful responses
}

# Optional: catch slow backends as well
map $upstream_response_time $slow_backend {
    ~^[5-9]\. 1.0;   # Always trace if backend response >5s
    default   0.1;
}

This approach ensures full trace visibility during failures, without overloading your observability pipeline during normal operation. These fallback strategies help maintain trace quality and system reliability under load, degraded network conditions, or collector failures.

Next Steps

Start with the 5-minute Docker setup to verify that trace generation and context propagation are working as expected. It runs in isolation and doesn’t require changes to your production systems.

If you’re already running a monitoring stack, add the OpenTelemetry Collector to export NGINX spans without disrupting existing workflows. The collector supports multi-backend routing, so you can send data to Last9 alongside your existing tools for side-by-side validation.

Last9 provides a native OpenTelemetry backend built for production workloads. With your collector configured to export to Last9:

  • Visualize traces immediately, with no need to define service maps manually.
  • Create custom dashboards that correlate latency, error rates, and span attributes like user.tier or checkout.value.
  • Use streaming queries in MCP to inspect raw span data without waiting on index builds or rollups.
  • Track trends and anomalies with real-time metrics generated directly from traces.

All of this is built on standard OTLP: no vendor lock-in, no custom agents.

💡
If you’d like to go deeper, our Discord community is always open. There’s a dedicated channel where you can discuss your use case with other developers.

FAQs

Q: What are the advantages of using Istio over NGINX Ingress?
A: Istio and NGINX Ingress solve different problems. Istio provides a full-service mesh with automatic mTLS, advanced traffic policies, and built-in observability across all services. NGINX Ingress focuses on HTTP/HTTPS ingress with high performance and simpler configuration. Choose Istio if you need service-to-service security and traffic management; choose NGINX Ingress if you want straightforward ingress with minimal operational complexity.

Q: What is OpenTelemetry Collector?
A: The OpenTelemetry Collector receives, processes, and exports telemetry data from instrumented applications. It acts as a vendor-neutral proxy between your applications and observability backends, handling tasks like batching, filtering, sampling, and format conversion. This lets you switch monitoring vendors without changing your application instrumentation.

Q: Does the ngx_otel_module affect NGINX performance?
A: The performance impact is typically under 1% CPU overhead with minimal memory usage. The module processes traces asynchronously and uses buffering to avoid blocking request handling. For high-traffic sites, configure appropriate sampling rates to keep overhead negligible while maintaining useful trace coverage.

Q: What do I do if I fail to start NGINX after configuring the ngx_otel_module?
A: First, check that the module is properly installed and the load_module directive points to the correct path. Verify your otel_exporter endpoint is reachable and your configuration syntax is valid with nginx -t. Common issues include wrong module paths, unreachable collector endpoints, or missing required directives. Check error logs for specific failure reasons.

Q: Can I use the standard ingress controller with the Gateway API, or do I have to use NGINX Fabric instead?
A: The standard NGINX Ingress Controller supports Gateway API through configuration updates—no need for NGINX Fabric unless you want commercial features. Install the Gateway API CRDs and enable gateway support in your ingress controller deployment. NGINX Fabric adds enterprise features like advanced traffic policies and commercial support.

Q: How do I set up NGINX and Varnish reverse proxy for Node.js?
A: Configure NGINX as the frontend proxy, Varnish as the caching layer, and Node.js as the backend. NGINX handles SSL termination and forwards to Varnish on port 6081. Varnish caches responses and forwards cache misses to Node.js on port 3000. Configure Varnish VCL to cache appropriate content types and set proper cache headers in your Node.js application.

Q: How do I integrate OpenTelemetry with NGINX for monitoring and tracing?
A: Install the ngx_otel_module, configure an OpenTelemetry Collector endpoint, and enable tracing in your NGINX configuration. Set up trace context propagation to connect NGINX traces with your application traces. The module automatically generates HTTP metrics and traces that you can export to any OpenTelemetry-compatible observability platform.

Q: How do I enable OpenTelemetry tracing in NGINX using the ngx_otel_module?
A: Load the module with load_module modules/ngx_otel_module.so, configure the collector endpoint with otel_exporter, set your service name with otel_service_name, and enable tracing with otel_trace on. Add otel_trace_context propagate in location blocks where you want to connect traces to upstream services.

Authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories, where SRE and DevOps folks share their stories, and maintains o11y.wiki, a glossary of observability terms.
