
Implement Distributed Tracing with OpenTelemetry

Implementing distributed tracing with OpenTelemetry helps track requests across services, providing insights into performance and pinpointing issues.

Dec 26th, ‘24

Consider an e-commerce site running a flash sale. Orders are pouring in, but some customers run into delays or failed payments. With services for authentication, inventory, and payments all working together, figuring out where things slow down isn’t simple.

Distributed tracing makes this easier. It shows how each request moves through your system so you can quickly spot bottlenecks, fix errors, and keep the checkout flow reliable.

OpenTelemetry has matured a lot in 2024–2025. Profiling support is now stable, zero-code instrumentation is available, and production-ready tooling exists across more than 12 languages. This guide walks through the basics and the advanced setups you’ll use in production.

Why OpenTelemetry for Distributed Tracing?

OpenTelemetry has become the go-to standard for distributed tracing, and it’s not hard to see why.

  • Vendor neutrality. You’re not tied to a single vendor or backend. The same instrumentation can send data to Jaeger, Zipkin, Prometheus, or commercial platforms like Last9. If your needs change later, you don’t have to redo all the instrumentation—just point the data somewhere else.
  • Mature and production-ready. By 2024, the core parts of OpenTelemetry—traces, metrics, and logs—are stable across most major languages. Java, .NET, Python, and Node.js already have hundreds of supported libraries and frameworks, so you don’t spend weeks wiring things up.
  • Less friction with zero-code options. Thanks to eBPF-based auto-instrumentation, you can start capturing traces without touching application code. That means faster setup for new services and fewer changes in production environments where code modifications can be risky.
  • Four signals, one standard. OpenTelemetry now brings traces, metrics, logs, and the newer profiling signal together. Having all four in a single framework makes it easier to connect the dots—linking slow traces with CPU hotspots or correlating logs with errors in real time.
💡
If you’re new to the project itself, here’s a clear introduction to what OpenTelemetry is and how it fits into modern observability.

What’s New in OpenTelemetry

Profiling Signal: The Fourth Pillar

In March 2024, OpenTelemetry introduced profiling as its fourth signal, alongside traces, metrics, and logs. This opens up new ways to connect system behavior with code-level performance:

  • Code-level insights. Go from a CPU spike in your metrics directly to the function consuming resources.
  • Trace-to-profile correlation. See not only where latency occurs, but which code paths are responsible.
  • Continuous profiling. Always-on performance monitoring with low overhead, giving you a steady stream of insights into application health.

Spring Boot Starter Now Stable

By September 2024, the OpenTelemetry Spring Boot Starter reached general availability. It gives Java developers more flexibility with features like:

  • Native image support. Works with Spring Boot Native applications where the Java agent cannot.
  • Configuration in-app. Use application.properties or YAML files instead of depending only on agent flags.
  • Lightweight setup. Lower startup overhead compared to the full Java agent.

eBPF Auto-Instrumentation

One of the biggest shifts in OpenTelemetry is eBPF-based auto-instrumentation, which allows you to capture traces without changing your code. Key benefits include:

  • Zero-code setup. No modifications to the application itself.
  • Broad language coverage. Works across C/C++, Go, Rust, Python, Java, Node.js, .NET, PHP, and Ruby.
  • Low resource cost. Typically under 1% CPU and about 250MB memory usage.
  • Kernel-level visibility. Captures activity across system libraries and kernel calls that agents can’t see.

File-Based Configuration

OpenTelemetry now supports YAML and JSON configuration files. This makes it easier to manage complex setups without relying solely on environment variables or application code.

💡
To understand how the Collector shapes and routes telemetry pipelines between your services and downstream systems, see The OpenTelemetry Collector Deep Dive for detailed insight into its processor, exporter, and receiver models.

Distributed Tracing Fundamentals

Before setting up instrumentation, it helps to get comfortable with the core concepts behind distributed tracing.

Traces and Spans: The Building Blocks

  • A trace is the full journey of a request as it moves through your system.
  • A span is a single unit of work in that journey, such as a service call or a database query.

Each span records metadata like start time, end time, and relationships with other spans. Together, traces and spans let you follow a request end-to-end across microservices.

with tracer.start_as_current_span("process_order") as parent_span:
    parent_span.set_attribute("order_id", "12345")
    
    with tracer.start_as_current_span("validate_payment") as child_span:
        child_span.set_attribute("payment_method", "credit_card")
        # Payment validation logic

In this example, process_order is the parent span, and validate_payment is the child span nested within it.

Span Relationships: Parent-Child vs. Links

  • Parent-child spans form the traditional hierarchy where one span directly calls another.
  • Span links are different—they connect spans across traces without requiring a strict hierarchy.

from opentelemetry.trace import Link, SpanContext

# Create a link to relate spans across different traces
source_span_context = SpanContext(trace_id=0x1, span_id=0x2, is_remote=True)

with tracer.start_as_current_span(
    "process_batch_item", 
    links=[Link(context=source_span_context)]
) as span:
    span.set_attribute("item_id", "12345")
    # Processing logic

Span links are especially useful for:

  • Fan-out operations that trigger multiple downstream calls
  • Batch jobs where each item produces its own trace
  • Async messaging where spans need correlation across message boundaries
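
Here's how links come together in a batch consumer like the ones above. A minimal sketch under assumed message shapes: each message carries its producer's trace context in a headers dict, and handle() stands in for your per-item processing.

from opentelemetry import trace
from opentelemetry.propagate import extract
from opentelemetry.trace import Link

tracer = trace.get_tracer(__name__)

def process_batch(messages):
    links = []
    for msg in messages:
        # Pull each producer's trace context out of the message headers (assumed shape)
        producer_ctx = extract(msg["headers"])
        span_context = trace.get_current_span(producer_ctx).get_span_context()
        if span_context.is_valid:
            links.append(Link(span_context))

    # One batch span, linked to every producer trace instead of parented under one of them
    with tracer.start_as_current_span("process_batch", links=links) as span:
        span.set_attribute("batch.size", len(messages))
        for msg in messages:
            handle(msg)  # hypothetical per-item handler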

Span Status: Operation Outcomes

Spans also capture the outcome of operations—whether they succeeded or failed.

from opentelemetry.trace.status import Status, StatusCode

with tracer.start_as_current_span("http_request") as span:
    try:
        response_code = make_request()
        if response_code == 200:
            span.set_status(Status(StatusCode.OK))
        else:
            span.set_status(Status(StatusCode.ERROR, description=f"HTTP {response_code}"))
    except Exception as e:
        span.set_status(Status(StatusCode.ERROR, description=str(e)))

This allows you to quickly spot failing spans in a trace and connect them back to errors in your system.

Zero-Code Instrumentation Options

OpenTelemetry has made big strides in reducing the effort it takes to instrument applications. You don’t always need to touch code or even restart services to start collecting traces.

Two of the most useful options are eBPF-based instrumentation and the new Spring Boot Starter for Java.

eBPF Auto-Instrumentation

eBPF runs inside the Linux kernel and observes system calls directly. That means you can capture requests, database calls, and system interactions without modifying your applications.

Why it matters:

  • No restarts required — works on running processes
  • Works with compiled binaries you can’t change
  • Captures system-level activity, not just application calls
  • Handles multiple languages side by side

Example:

# Start the OpenTelemetry eBPF profiler
sudo ./ebpf-profiler -collection-agent=127.0.0.1:11000 -disable-tls

This starts the profiler and streams collected data to a local OpenTelemetry Collector running on port 11000. Because it runs at the kernel level, it can observe activity across all applications on the host.

Best suited for:

  • Legacy apps that you can’t recompile
  • Third-party binaries
  • Mixed-language microservices environments
  • Low-touch debugging in production

Java: Agent vs. Spring Boot Starter

For Java, you have two paths: the long-standing Java agent or the Spring Boot Starter, which became stable in late 2024.

Java Agent

java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=my-service \
     -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
     -jar myapp.jar

Here, you attach the OpenTelemetry agent at startup with the -javaagent flag. It automatically instruments common libraries and sends traces to your collector. This is the fastest way to get coverage, but it can add overhead during startup.

  • Pros: Maximum coverage out of the box
  • Cons: Heavier on startup time, less flexible to configure

Spring Boot Starter

<dependency>
    <groupId>io.opentelemetry.instrumentation</groupId>
    <artifactId>opentelemetry-spring-boot-starter</artifactId>
</dependency>

# application.yml configuration
otel:
  service:
    name: my-spring-service
  exporter:
    otlp:
      endpoint: http://localhost:4317
  instrumentation:
    logback-appender:
      enabled: true

Here, instrumentation is baked into your Spring Boot app itself. You manage it through the usual application.yml or .properties files, which gives you more flexibility and less overhead compared to the Java agent.

Best when:

    • You’re building Spring Boot Native apps
    • You prefer in-app config over JVM flags
    • Startup performance matters
    • You already rely on another agent in the same JVM

That covers both the "what to run" and the "why" behind each option.

Enhanced Traces with Attributes and Events

Distributed traces become much more useful when you enrich spans with metadata and contextual details. OpenTelemetry gives you two ways to do this: attributes and events.

Attributes: Contextual Metadata

Attributes attach static key–value metadata to spans. They’re useful for describing the request, user, or environment where the span is running.

with tracer.start_as_current_span("process_payment") as span:
    # User context
    span.set_attribute("user_id", "user_67890")
    span.set_attribute("user_tier", "premium")
    
    # Transaction details
    span.set_attribute("order_id", "12345")
    span.set_attribute("payment_method", "credit_card")
    span.set_attribute("amount", 99.99)
    span.set_attribute("currency", "USD")
    
    # Environment info
    span.set_attribute("host.name", "payment-service-1")
    span.set_attribute("deployment.environment", "production")

Here, the span for process_payment is enriched with user info, order details, and environment context. Later, you can filter traces by attributes—for example, finding only failed payments from premium users in production.

Events: Dynamic Timestamped Data

While attributes describe “what” a span is about, events capture “what happened” during the span’s lifetime. They’re timestamped markers you can attach to record key moments.

with tracer.start_as_current_span("process_payment") as span:
    span.add_event("payment_initiated", {
        "timestamp": "2024-12-26T12:00:00Z",
        "gateway": "stripe"
    })
    
    try:
        result = process_payment()
        span.add_event("payment_completed", {
            "transaction_id": result.transaction_id,
            "processing_time_ms": result.duration
        })
    except PaymentException as e:
        span.add_event("payment_failed", {
            "error_code": e.code,
            "retry_count": e.retry_count,
            "error_message": str(e)
        })

In this example, the span records three possible events: when the payment starts, when it completes, or if it fails. This helps you later replay the lifecycle of a request, correlate failures, and understand latency spikes with much finer detail than attributes alone.

How Context Propagation Works

Distributed tracing only works if requests can be followed across service boundaries. That’s what context propagation does: it carries trace metadata (like trace IDs and span IDs) along with each request, so new spans can be tied back to the same trace.

How It Works

  1. Injection – The upstream service attaches trace context to request headers.
  2. Extraction – The downstream service reads that context from incoming headers.
  3. Continuation – Any new spans created downstream are linked back to the original trace.

This allows a single trace to flow across microservices, queues, or even language boundaries.

Example: Service A Sending a Request

from opentelemetry.propagate import inject
import requests

with tracer.start_as_current_span("service_a_operation") as span:
    headers = {}
    inject(headers)  # Injects trace context
    
    response = requests.get(
        "http://service-b.example.com/api", 
        headers=headers
    )

Here, Service A creates a span and injects the trace context into HTTP headers before requesting Service B. That way, Service B knows this request belongs to the same trace.

Example: Service B Receiving a Request

from opentelemetry.propagate import extract
from flask import Flask, request

app = Flask(__name__)

@app.route("/api")
def handle_request():
    context = extract(request.headers)  # Extracts trace context
    
    with tracer.start_as_current_span("service_b_operation", context=context) as span:
        span.set_attribute("received_from", "service_a")
        return process_request()

Service B extracts the context from incoming headers and uses it to create a new span. That span is automatically linked to Service A’s trace, giving you continuity across the two services.

Propagation Formats

OpenTelemetry supports multiple propagation formats so it can integrate with existing systems:

B3 Propagation (used by Zipkin):

X-B3-TraceId: 4bf92f3577b34da6a3ce929d0e0e4736
X-B3-SpanId: 00f067aa0ba902b7
X-B3-Sampled: 1

W3C Trace Context (default):

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
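
If some services still emit B3 headers while others use W3C Trace Context, you can accept and emit both during a migration. A minimal sketch using the Python composite propagator, assuming the opentelemetry-propagator-b3 package is installed:

from opentelemetry import propagate
from opentelemetry.propagators.b3 import B3MultiFormat
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# Inject and extract both the W3C traceparent header and the X-B3-* headers
propagate.set_global_textmap(
    CompositePropagator([TraceContextTextMapPropagator(), B3MultiFormat()])
)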

With propagation in place, every hop of a request is stitched into one coherent trace, making distributed tracing possible at scale.

💡
Once you’ve got things instrumented and running, this guide on Deploying OpenTelemetry at Scale: Production Patterns That Work shows patterns to keep your setup reliable under load.

Step-by-Step Implementation Guide

Here’s a structured approach that takes you from instrumentation to backend configuration, with examples you can adapt to your stack.

1. Choose Your Instrumentation Approach

OpenTelemetry offers three main ways to instrument your applications:

Manual Instrumentation – Full Control, Custom Spans

Install the required packages:

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

This approach gives you maximum flexibility. You explicitly define spans, attributes, and events in your code—ideal when you want fine-grained control.

Auto-Instrumentation – Quick Start with Libraries

Install and bootstrap:

pip install opentelemetry-distro[otlp]
opentelemetry-bootstrap -a install

Auto-instrumentation hooks into supported libraries automatically (HTTP clients, SQL drivers, etc.), so you get visibility fast without changing much code.
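
To run an app under auto-instrumentation, wrap its start command with the opentelemetry-instrument launcher. A sketch, assuming a local Collector on port 4317 and an app normally started with python app.py:

opentelemetry-instrument \
    --service_name my-service \
    --traces_exporter otlp \
    --exporter_otlp_endpoint http://localhost:4317 \
    python app.py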

Zero-Code Instrumentation – eBPF for Existing Applications

For environments where touching code isn’t possible, eBPF provides zero-code tracing.

# No application changes required
sudo ./ebpf-profiler -collection-agent=localhost:11000

Because eBPF operates at the kernel level, it works with compiled binaries and across multiple languages simultaneously. Perfect for legacy apps or third-party binaries.

2. Configure the SDK

Once you’ve chosen an instrumentation approach, configure the SDK to define resources, exporters, and processors.

Python Example

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Configure tracer provider with resource attributes
trace.set_tracer_provider(TracerProvider(
    resource=Resource.create({
        "service.name": "my-service",
        "service.version": "1.0.0",
        "deployment.environment": "production"
    })
))

# Configure OTLP exporter
otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    headers={
        "authorization": "Bearer your-token-here"
    }
)

# Add batch processor for efficient export
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

tracer = trace.get_tracer(__name__)

This setup registers your service with metadata (name, version, environment), sends spans to an OTLP endpoint, and batches them for performance.

File-Based Configuration (New in 2024)

You can now define OpenTelemetry configuration in YAML or JSON, making it easier to manage in containerized or multi-service environments.

# otel-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  otlp:
    endpoint: "http://your-backend:4317"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]

This configuration creates a trace pipeline: receive OTLP data, batch it, and send it to your chosen backend.
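
To try this pipeline locally, you can mount the file into the Collector contrib image. A sketch, assuming Docker and the default config path used by that image:

docker run --rm \
  -p 4317:4317 -p 4318:4318 \
  -v $(pwd)/otel-config.yaml:/etc/otelcol-contrib/config.yaml \
  otel/opentelemetry-collector-contrib:latest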

3. Add Manual Instrumentation Where Needed

Even if you use auto-instrumentation, there are times when custom spans add valuable context.

HTTP Service Instrumentation Example

@app.route("/users/<user_id>")
def get_user(user_id):
    with tracer.start_as_current_span("get_user") as span:
        # Add context
        span.set_attribute("user.id", user_id)
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.url", request.url)
        
        # Add event for request start
        span.add_event("request_started")
        
        try:
            # Database operation
            with tracer.start_as_current_span("db_query") as db_span:
                db_span.set_attribute("db.statement", f"SELECT * FROM users WHERE id = {user_id}")
                db_span.set_attribute("db.name", "userdb")
                
                user = db.get_user(user_id)
                
                if user:
                    span.set_attribute("user.found", True)
                    span.add_event("user_retrieved", {"user_id": user.id})
                    return jsonify(user.to_dict())
                else:
                    span.set_status(Status(StatusCode.ERROR, "User not found"))
                    return jsonify({"error": "User not found"}), 404
                    
        except Exception as e:
            span.record_exception(e)
            span.set_status(Status(StatusCode.ERROR, str(e)))
            return jsonify({"error": "Internal server error"}), 500

Here the get_user span captures request attributes, while a nested db_query span records the database statement. Events log key milestones like when the request starts or when a user is retrieved.

4. Configure Your Backend

Instrumentation data is only useful if it’s stored and visualized. You can export traces to open-source backends like Jaeger or to commercial platforms like Last9.

Using Jaeger v2 (with Built-in OTLP)

docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/jaeger:2.5.0

This runs Jaeger with OTLP gRPC and HTTP receivers enabled, so your OpenTelemetry SDKs can export directly.

OpenTelemetry Collector with Multi-Backend Export

# collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  # Filter sensitive data
  attributes:
    actions:
      - key: user.email
        action: delete
      - key: user.ssn
        action: delete

exporters:
  # Jaeger v2 accepts OTLP directly (the dedicated Jaeger exporter was removed from the Collector)
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  
  # Export to Last9
  otlp/last9:
    endpoint: "https://otlp.last9.io"
    headers:
      authorization: "Bearer YOUR_LAST9_TOKEN"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, attributes]
      exporters: [otlp/jaeger, otlp/last9]

This collector setup does three things:

  1. Receives spans over OTLP (gRPC and HTTP).
  2. Processes data with batching and attribute filtering (to drop sensitive fields).
  3. Exports traces both to Jaeger for visualization and to Last9 for high-cardinality, long-term storage and analysis.
💡
For advice on building trustworthy telemetry—so your logs, metrics, and traces all give you useful, reliable insight—see Logs, Metrics, and Traces: Designing Telemetry You Can Trust.

Correlate Traces with Profiles

With profiling now part of OpenTelemetry, you can directly connect spans in traces with code-level performance data. This makes it easier to move from “a service is slow” to “this function is the bottleneck.”

Metrics to Profiles Correlation

When metrics show a CPU spike, profiling data helps you pinpoint the exact code path responsible.

# When you see a CPU spike in metrics, jump to profiling data
with tracer.start_as_current_span("cpu_intensive_operation") as span:
    span.set_attribute("operation.type", "data_processing")
    
    # This span can now be correlated with profiling data
    result = heavy_computation()
    
    span.set_attribute("records.processed", len(result))

Here the cpu_intensive_operation span is tagged with attributes that can be tied to profiling samples, showing which part of the code consumed CPU.

Traces to Profiles Correlation

Profiling also helps when investigating latency in traces.

# Slow trace spans can be correlated with profiling data
with tracer.start_as_current_span("slow_operation") as span:
    start_time = time.time()
    
    result = complex_algorithm()
    
    duration = time.time() - start_time
    span.set_attribute("operation.duration_ms", duration * 1000)
    
    if duration > 1.0:  # Slow operation
        span.add_event("performance_investigation_needed", {
            "duration_threshold_exceeded": True,
            "profile_correlation_available": True
        })

In this example, a slow span triggers an event that signals you can jump into profiling data to understand which functions caused the slowdown.

Best Practices for Production

Adding traces and profiles together is powerful, but to keep it efficient and secure in production, a few practices help.

1. Smart Sampling Strategies

Sampling ensures you capture the right traces without overwhelming your backend.

Head-based sampling (SDK level):

from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Sample 10% of traces, but always sample if parent was sampled
sampler = ParentBased(TraceIdRatioBased(0.1))

trace.set_tracer_provider(TracerProvider(sampler=sampler))

This samples 10% of new root traces at the point of creation, while child spans always follow their parent's sampling decision.

Tail-based sampling (Collector level):

processors:
  tail_sampling:
    policies:
      # Always sample errors
      - name: error-sampling
        type: status_code
        status_code:
          status_codes: [ERROR]
      
      # Sample slow requests
      - name: latency-sampling
        type: latency
        latency:
          threshold_ms: 1000
      
      # Sample 1% of normal traffic
      - name: probabilistic-sampling
        type: probabilistic
        probabilistic:
          sampling_percentage: 1

This approach makes sampling decisions after spans are collected—ideal for always keeping errors and slow requests.

2. Security and Data Handling

Sensitive attributes must never leak into telemetry. You can sanitize them in code or enforce redaction at the collector.

In-code sanitization:

# Mask sensitive attributes
def sanitize_span_attributes(span, **attributes):
    for key, value in attributes.items():
        if key in ["password", "ssn", "credit_card", "token"]:
            span.set_attribute(key, "***REDACTED***")
        elif "email" in key.lower():
            span.set_attribute(key, mask_email(value))
        else:
            span.set_attribute(key, value)

# Usage
with tracer.start_as_current_span("user_operation") as span:
    sanitize_span_attributes(span,
        user_id="12345",
        user_email="user@example.com",  # Will be masked
        password="secret123"            # Will be redacted
    )

Collector-level redaction:

processors:
  redaction:
    allow_all_keys: false
    blocked_fields:
      - "user.password"
      - "user.ssn"
      - "credit_card.number"
    summary: "debug"

This ensures sensitive fields are removed before leaving your environment.

3. Performance Optimization

Keep tracing overhead low with selective instrumentation and batch tuning.

Conditional span creation:

# Use conditional instrumentation for high-frequency operations
import contextlib

from opentelemetry import trace


class OptimizedTracer:
    def __init__(self, tracer):
        self.tracer = tracer
        
    def start_span_if_sampled(self, name, **kwargs):
        if self.should_sample():
            return self.tracer.start_as_current_span(name, **kwargs)
        return contextlib.nullcontext()
    
    def should_sample(self):
        current_span = trace.get_current_span()
        return current_span.get_span_context().trace_flags.sampled

This avoids unnecessary spans when sampling isn’t enabled.

Batch processor tuning:

# Optimize batch processor for your throughput
span_processor = BatchSpanProcessor(
    otlp_exporter,
    max_queue_size=2048,        # Increase for high-throughput
    export_timeout_millis=5000, # Timeout for export operations
    schedule_delay_millis=1000, # Batch export frequency
    max_export_batch_size=512   # Balance between latency and efficiency
)

Adjust these values to balance throughput and latency for your workload.

4. Monitoring OpenTelemetry Itself

Instrumentation should also be observable. Starting in 2024, SDK self-metrics let you track the instrumentation's own performance.

# Monitor your instrumentation performance
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import ConsoleMetricExporter, PeriodicExportingMetricReader

# Wire a reader and exporter so these self-metrics are actually exported
reader = PeriodicExportingMetricReader(ConsoleMetricExporter())
meter_provider = MeterProvider(metric_readers=[reader])
meter = meter_provider.get_meter("otel-sdk-metrics")

# Track span creation rate
span_counter = meter.create_counter(
    "otel_spans_created_total",
    description="Total number of spans created"
)

# Track export failures
export_error_counter = meter.create_counter(
    "otel_export_errors_total", 
    description="Total export errors"
)

These metrics let you see if instrumentation itself is adding overhead or if exports are failing.

💡
With Last9 MCP, you can pull live traces, metrics, and logs into your development workflow, making it easier to spot bottlenecks and fix issues before they impact production.

Troubleshoot Common Issues

OpenTelemetry setups sometimes don’t behave as expected. Most problems fall into a handful of patterns—propagation gaps, misconfigured exporters, or resource bottlenecks. Let’s look at a few of the usual suspects and how to fix them.

Traces Don’t Connect Across Services

You see spans in your backend, but they aren’t stitched together into a full trace. This usually points to a propagation issue: the upstream service isn’t injecting headers, or the downstream service isn’t extracting them. Sometimes it’s as simple as one team using W3C Trace Context while another is still on B3.

  • Verify that every service both injects outgoing headers and extracts incoming ones.
  • Standardize on one propagation format—W3C Trace Context is the OpenTelemetry default and works well across most stacks.
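
One low-effort way to standardize is the OTEL_PROPAGATORS environment variable, assuming your SDKs pick up the usual OTEL_* settings:

# Same propagators everywhere; add "b3multi" temporarily if some services still emit B3
export OTEL_PROPAGATORS="tracecontext,baggage"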

Performance Overhead After Enabling Tracing

It’s not uncommon to notice slower startup times or higher CPU usage once tracing is turned on. The usual culprits are too many spans being created for high-frequency operations, or a batch processor that isn’t tuned for your workload.

  • Look at hot paths like loops or high-volume endpoints—are you creating a span on every iteration?
  • Use sampling to cut down on unnecessary spans, and adjust batch settings like max_queue_size or schedule_delay_millis to better match your throughput.

Data Not Showing Up in the Backend

Sometimes everything looks fine locally, but nothing lands in Jaeger, Last9, or whichever backend you’re using. In most cases, the exporter is misconfigured. Either the endpoint is wrong, TLS is blocking the connection, or an auth header is missing.

  • Try sending a request directly to the OTLP endpoint with curl to confirm it’s reachable.
  • Double-check exporter URLs, ports, and tokens. Small typos—like http:// instead of https://—are a surprisingly common cause.
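
A quick way to rule out connectivity problems is to hit the OTLP/HTTP endpoint directly. A sketch against a local Collector (adjust host, port, and any auth headers for your backend):

# An empty-but-valid payload is enough to confirm the endpoint answers
curl -i -X POST http://localhost:4318/v1/traces \
  -H "Content-Type: application/json" \
  -d '{"resourceSpans": []}'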

Sampling Feels Off

You either end up with far too much data or not nearly enough. This usually happens when head-based sampling ratios are set too aggressively or tail-sampling rules are too strict.

  • Review your policies—are you only sampling errors and nothing else?
  • Start simple: always keep errors and slow traces, then add probabilistic rules (say 1–5% of normal traffic). Tune gradually instead of big swings.

Sensitive Data Slips Into Spans

Tracing is powerful, but you don’t want emails, tokens, or card numbers showing up in attributes. This happens if spans set attributes directly without sanitization, or if your collector isn’t filtering.

  • Scan a few traces—do you see PII in attributes?
  • Add sanitization in code for common fields, and back it up with collector processors that drop or mask sensitive attributes before export.

Collector Struggles Under Load

When traffic ramps up, the collector itself can become a bottleneck. You might notice dropped spans, backpressure, or even crashes.

  • Check Collector logs and self-metrics (otelcol_exporter_queue_size is a good one).
  • Increase queue sizes, raise batch limits, or shard workloads across multiple collectors. For heavy setups, running collectors close to services (sidecars or node agents) can help distribute load.
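
If the self-metrics show queues filling up, the exporter queue and batch limits are the first knobs to turn. A sketch of Collector settings to start from (values are illustrative, not a recommendation):

exporters:
  otlp:
    endpoint: "http://your-backend:4317"
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000

processors:
  batch:
    send_batch_size: 2048
    send_batch_max_size: 4096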

Most of these problems trace back to either config mismatches or scale tuning. Start small: confirm propagation, check exporter connectivity, and then dig into batch and sampling settings. The good news is that once you sort out these basics, OpenTelemetry tends to be stable and predictable in production.

Final Thoughts

We’ve walked through how tracing works end-to-end: spans and attributes to capture context, events to record milestones, propagation to connect services, and profiles to tie performance issues back to code. Together, these signals give you a detailed picture of how your system behaves in production.

The real question is what happens next. Once you start collecting rich telemetry, you need a place to explore it without losing fidelity or fighting query slowdowns. That’s where platforms like Last9 help.

  • Every attribute you set stays searchable, even during cardinality spikes.
  • Queries stay fast, whether you’re debugging a single payment failure or correlating across billions of spans.
  • Profiles, traces, logs, and metrics all live in one place, so you can move seamlessly from “this request was slow” to “this function burned CPU.”
  • With event-based pricing, you pay for the telemetry you send—not for hosts or user seats.

If you’ve been experimenting with OpenTelemetry, sending some of that data into Last9 alongside your existing setup is a straightforward next step, and one that shows the real value of keeping all that context intact.

Start for free today!

FAQs

What is OpenTelemetry distributed tracing?

OpenTelemetry distributed tracing is a method of tracking and visualizing the journey of a request as it moves through various services in a distributed system. It provides insights into how different components interact and helps identify performance bottlenecks or errors.

What's the difference between OpenTelemetry and Jaeger?

OpenTelemetry is a framework for collecting and exporting telemetry data, while Jaeger is a backend for storing and visualizing traces. OpenTelemetry sends data to Jaeger (among other backends).

Should I use eBPF or manual instrumentation?

  • eBPF: Best for getting started quickly, legacy applications, or when you can't modify code
  • Manual: Best for fine-grained control, custom business logic, and maximum observability depth

How does the new profiling signal work?

Profiling adds continuous performance monitoring that can be correlated with traces, metrics, and logs. You can jump from a slow trace span directly to the profiling data showing which code is consuming resources.

Is the Spring Boot Starter production-ready?

Yes, as of September 2024, the OpenTelemetry Spring Boot Starter is stable and production-ready, offering a lightweight alternative to the Java agent.

What's the performance impact of distributed tracing?

  • eBPF auto-instrumentation: <1% CPU, ~250MB memory
  • Manual instrumentation: 2-5% CPU overhead
  • Java agent: 3-8% CPU, can be higher during startup

Can I use OpenTelemetry with my existing monitoring tools?

Yes, OpenTelemetry is vendor-neutral and can export to virtually any observability backend, including Prometheus, Grafana, Datadog, New Relic, and more.

How do I handle sensitive data in traces?

Use attribute processors to redact sensitive data, implement span-level sanitization, and configure your collector to filter or mask sensitive information before export.

What languages have the best OpenTelemetry support?

Java, .NET, Python, and Node.js have the most mature support with hundreds of auto-instrumentation libraries. Go and Rust support is rapidly improving with eBPF-based solutions.

Authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.
