Advanced OpenTelemetry: Sampling, Filtering, and Enrichment

Introduction

OpenTelemetry has revolutionized the way we approach observability in distributed systems. While basic setups can provide valuable insights, advanced configurations unlock the full potential of OpenTelemetry, allowing for more efficient resource usage and more targeted data collection.

This guide delves into three critical aspects of advanced OpenTelemetry configurations:

Sampling: Reducing data volume while maintaining statistical accuracy
Filtering: Focusing on the most relevant data
Data Enrichment: Adding context to enhance the value of collected telemetry

Whether you're managing a large-scale production environment or optimizing a growing system, these techniques will help you fine-tune your observability pipeline.

1. Sampling Strategies in OpenTelemetry

Sampling is crucial for managing the volume of telemetry data in high-throughput systems. It allows you to reduce the amount of data collected and transmitted while still maintaining a statistically representative view of your system's behavior.

Types of Sampling

a. Head-Based Sampling

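Head-based sampling makes the sampling decision when the root span starts, before anything is known about how the request will turn out. The decision then travels with the trace context, so downstream services keep or drop the same traces.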

Example configuration (in Go):

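import (
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Sample 25% of traces; the ratio is applied to the trace ID of the root span.
sampler := sdktrace.TraceIDRatioBased(0.25)
tp := sdktrace.NewTracerProvider(
    // ParentBased makes child spans follow the decision made at the root.
    sdktrace.WithSampler(sdktrace.ParentBased(sampler)),
)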
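To make the "decide at the start" idea concrete, here is a minimal Python sketch. It is not the SDK's implementation: `head_decision` and its parameters are hypothetical names used only for illustration. A root span gets a fresh decision, and every child inherits its parent's decision so a trace is kept or dropped as a whole.

```python
import random
from typing import Optional


def head_decision(parent_sampled: Optional[bool], ratio: float,
                  rng: random.Random) -> bool:
    """Decide at span start whether to record the span.

    parent_sampled is None for a root span; otherwise it carries the
    decision already made for the trace, which children must follow.
    """
    if parent_sampled is not None:
        return parent_sampled      # inherit: never split a trace
    return rng.random() < ratio    # root span: roll once for the whole trace


rng = random.Random(42)            # seeded so the sketch is repeatable
root = head_decision(None, 0.25, rng)
child = head_decision(root, 0.25, rng)
grandchild = head_decision(child, 0.25, rng)
print(root, child, grandchild)     # all three values are identical
```

Because the decision is made before the request has run, it cannot take errors or latency into account. That limitation is what tail-based sampling addresses.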

b. Tail-Based Sampling

Tail-based sampling makes the decision after the entire trace is complete. This is typically implemented in the collector.

Example collector configuration:

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      [
        {
          name: error-in-trace,
          type: status_code,
          status_code: { status_codes: [ERROR] }
        },
        {
          name: probability-sampler,
          type: probabilistic,
          probabilistic: { sampling_percentage: 10 }
        }
      ]
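The logic behind a configuration like this can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the collector's code: spans are plain dicts with hypothetical `trace_id` and `status` keys, and the function is called once all spans for the buffered traces have arrived.

```python
import random
from collections import defaultdict
from typing import Dict, List


def tail_sample(spans: List[dict], percentage: float,
                rng: random.Random) -> Dict[str, bool]:
    """Return a keep/drop decision per trace once all spans are in."""
    traces = defaultdict(list)
    for span in spans:
        traces[span["trace_id"]].append(span)

    decisions = {}
    for trace_id, trace_spans in traces.items():
        has_error = any(s["status"] == "ERROR" for s in trace_spans)
        # Policies are OR-ed: an error keeps the trace outright,
        # otherwise the trace falls through to the probabilistic policy.
        decisions[trace_id] = has_error or rng.random() * 100 < percentage
    return decisions


spans = [
    {"trace_id": "a", "status": "OK"},
    {"trace_id": "a", "status": "ERROR"},
    {"trace_id": "b", "status": "OK"},
]
decisions = tail_sample(spans, percentage=0, rng=random.Random(0))
print(decisions)  # {'a': True, 'b': False}
```

With the percentage set to 0 only the trace containing an error survives, which shows why tail sampling needs the whole trace before it can decide.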

c. Probabilistic Sampling

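Probabilistic sampling keeps a fixed percentage of traces. In the SDKs the decision is derived from the trace ID, so it is a form of head-based sampling in which every service using the same ratio reaches the same decision for a given trace.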

Example in Python:

from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Sample 50% of traces
sampler = TraceIdRatioBased(0.5)
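The reason a trace-ID-based ratio gives consistent results can be shown with a small sketch. The comparison below is an assumption about the general approach (treat part of the trace ID as a number and compare it to a threshold), not a copy of any SDK's code, and `ratio_sampled` is a hypothetical name.

```python
TRACE_ID_MASK = (1 << 64) - 1  # look only at the low 64 bits of the trace ID


def ratio_sampled(trace_id: int, ratio: float) -> bool:
    """Keep the trace when its low 64 bits fall below ratio * 2**64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


trace_id = 0x4BF92F3577B34DA6A3CE929D0E0E4736
# The same ID gives the same answer every time it is evaluated.
print(ratio_sampled(trace_id, 0.5), ratio_sampled(trace_id, 0.5))  # False False
print(ratio_sampled(trace_id, 0.75))                               # True
```

No random number generator is involved, so two services that see the same trace ID and use the same ratio never disagree.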

d. Rate-Limiting Sampling

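Rate-limiting sampling caps the number of spans kept per unit of time, which protects the backend during traffic spikes. In the Collector, rate limiting is available as a policy of the tail_sampling processor.

Example collector configuration:

processors:
  tail_sampling:
    policies:
      [
        {
          name: rate-limit,
          type: rate_limiting,
          rate_limiting: { spans_per_second: 1000 }
        }
      ]
  batch:
    timeout: 10s
    send_batch_size: 10000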
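A cap on spans per second is commonly implemented as a token bucket. The sketch below is one such implementation, offered as an assumption about how a limiter could work rather than a description of the collector's internals. The clock is passed in explicitly so the behaviour is easy to follow.

```python
class TokenBucket:
    """Allow at most `rate` spans per second, with bursts up to `rate`."""

    def __init__(self, rate: float, now: float = 0.0) -> None:
        self.rate = rate
        self.tokens = rate      # start with a full bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the bucket size.
        self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=2)
# Three spans arrive at t=0: only two fit in the bucket.
first_burst = [bucket.allow(0.0) for _ in range(3)]
# Half a second later one token has been refilled.
later = bucket.allow(0.5)
print(first_burst, later)  # [True, True, False] True
```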

Best Practices and Trade-offs

Start with a higher sampling rate and adjust downwards as needed.
Use different sampling strategies for different services or endpoints.
Monitor the impact of sampling on your ability to detect and diagnose issues.
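One way to apply different sampling rates per endpoint is a simple lookup table consulted before the ratio check. The routes and ratios below are hypothetical values chosen for illustration; the longest matching prefix wins.

```python
from typing import Dict

# Hypothetical per-route sampling ratios; tune them to your own traffic.
ROUTE_RATIOS: Dict[str, float] = {
    "/checkout": 1.0,   # business-critical: keep everything
    "/search": 0.1,
    "/health": 0.0,     # pure noise: keep nothing
}
DEFAULT_RATIO = 0.25


def ratio_for(route: str) -> float:
    """Return the ratio of the longest configured prefix matching the route."""
    matches = [prefix for prefix in ROUTE_RATIOS if route.startswith(prefix)]
    if not matches:
        return DEFAULT_RATIO
    return ROUTE_RATIOS[max(matches, key=len)]


print(ratio_for("/checkout/confirm"), ratio_for("/profile"))  # 1.0 0.25
```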

If you can, utilize the OTLP Ingest Pipelines from last9.io where you can do all of this at runtime, instead of design-time.

2. Filtering Telemetry Data

Filtering allows you to focus on the most relevant data, reducing noise and storage costs.

Filtering at the SDK Level

Example in Java:

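// Wraps an existing exporter and drops health-check spans before export.
// `delegate` is whichever SpanExporter you already use (for example, an OTLP exporter).
SpanExporter filteringExporter = new SpanExporter() {
    @Override
    public CompletableResultCode export(Collection<SpanData> spans) {
        // The incoming collection may be unmodifiable, so build a filtered copy.
        List<SpanData> kept = spans.stream()
            .filter(span -> !span.getName().equals("health_check"))
            .collect(Collectors.toList());
        return delegate.export(kept);
    }

    @Override
    public CompletableResultCode flush() { return delegate.flush(); }

    @Override
    public CompletableResultCode shutdown() { return delegate.shutdown(); }
};

SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(SimpleSpanProcessor.create(filteringExporter))
    .build();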
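The same wrap-and-filter pattern can be sketched in Python without any SDK dependency. `FilteringExporter` and `ListExporter` are hypothetical stand-ins, and spans are plain dicts here, so treat this as an illustration of the pattern rather than SDK code.

```python
from typing import Callable, List, Sequence


class FilteringExporter:
    """Forward only the spans that pass `keep` to the wrapped exporter."""

    def __init__(self, delegate, keep: Callable[[dict], bool]) -> None:
        self.delegate = delegate
        self.keep = keep

    def export(self, spans: Sequence[dict]):
        # Build a new list rather than mutating the caller's collection.
        return self.delegate.export([s for s in spans if self.keep(s)])


class ListExporter:
    """Stand-in backend that just remembers what it was given."""

    def __init__(self) -> None:
        self.exported: List[dict] = []

    def export(self, spans: Sequence[dict]) -> bool:
        self.exported.extend(spans)
        return True


backend = ListExporter()
exporter = FilteringExporter(backend, keep=lambda s: s["name"] != "health_check")
exporter.export([{"name": "health_check"}, {"name": "GET /orders"}])
print(backend.exported)  # [{'name': 'GET /orders'}]
```

Filtering in the exporter still pays the cost of creating the span. Where possible, it is cheaper to avoid instrumenting noisy endpoints in the first place.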

Filtering in the OpenTelemetry Collector

Example collector configuration:

processors:
  filter:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - .*important.*
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: http.url
            value: .*health_check.*
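The include and exclude semantics are easy to mix up, so here is a small Python sketch of what the two rules mean. It assumes unanchored regular-expression matching and uses plain dicts for spans; both are simplifications, and the function names are hypothetical.

```python
import re
from typing import List


def filter_metrics(names: List[str], include: str) -> List[str]:
    """Keep only metric names matching the include pattern."""
    pattern = re.compile(include)
    return [n for n in names if pattern.search(n)]


def filter_spans(spans: List[dict], key: str, exclude: str) -> List[dict]:
    """Drop spans whose attribute `key` matches the exclude pattern."""
    pattern = re.compile(exclude)
    kept = []
    for span in spans:
        value = span["attributes"].get(key)
        # Spans without the attribute cannot match, so they are kept.
        if value is None or not pattern.search(value):
            kept.append(span)
    return kept


metrics = ["important_requests_total", "debug_gc_pause"]
spans = [
    {"name": "GET", "attributes": {"http.url": "http://svc/health_check"}},
    {"name": "GET", "attributes": {"http.url": "http://svc/orders"}},
    {"name": "job", "attributes": {}},
]
print(filter_metrics(metrics, r".*important.*"))                  # ['important_requests_total']
print(len(filter_spans(spans, "http.url", r".*health_check.*")))  # 2
```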

3. Data Enrichment Techniques

Data enrichment adds context to your telemetry data, making it more valuable for analysis.

Adding Metadata from Environment Variables

Example in Python:

import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource attributes are attached to every span this provider produces.
resource = Resource.create({
    "service.name": "my-service",
    "deployment.environment": os.getenv("DEPLOYMENT_ENV", "unknown"),
})

tracer_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer_provider)
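If more than one attribute comes from the environment, it can help to collect them in one function. The sketch below reads `DEPLOYMENT_ENV` as in the example above and also parses the comma-separated `key=value` format of `OTEL_RESOURCE_ATTRIBUTES`. The function name and the fallback values are assumptions for illustration, and the result is a plain dict rather than an SDK `Resource`.

```python
import os
from typing import Dict, Mapping


def resource_from_env(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build resource attributes from environment variables."""
    attrs: Dict[str, str] = {}
    # OTEL_RESOURCE_ATTRIBUTES holds comma-separated key=value pairs.
    for pair in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    attrs.setdefault("service.name", "my-service")
    attrs["deployment.environment"] = env.get("DEPLOYMENT_ENV", "unknown")
    return attrs


example = resource_from_env({
    "DEPLOYMENT_ENV": "staging",
    "OTEL_RESOURCE_ATTRIBUTES": "service.version=1.4.2, team=payments",
})
print(example["deployment.environment"], example["team"])  # staging payments
```

Passing the environment in as a mapping keeps the function easy to test without touching the real process environment.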

Incorporating Data from External Sources

Example using a custom processor in the collector:

import (
    "context"

    "go.opentelemetry.io/collector/pdata/pcommon"
    "go.opentelemetry.io/collector/pdata/ptrace"
)

type ExternalEnricher struct {
    // ... configuration fields
}

// fetchExternalData is a placeholder for your own lookup (HTTP call, cache, database).
func fetchExternalData(attrs pcommon.Map) string {
    // ... query the external source using the resource attributes
    return ""
}

// ProcessTraces attaches the looked-up value to every resource in the batch.
func (e *ExternalEnricher) ProcessTraces(ctx context.Context, td ptrace.Traces) (ptrace.Traces, error) {
    for i := 0; i < td.ResourceSpans().Len(); i++ {
        rs := td.ResourceSpans().At(i)
        // Fetch data from external source based on resource attributes
        externalData := fetchExternalData(rs.Resource().Attributes())
        rs.Resource().Attributes().PutStr("external.data", externalData)
    }
    return td, nil
}

// ... implement other required methods
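Calling an external system for every batch can become the slowest part of the pipeline, so a lookup like this usually sits behind a cache. Here is a minimal time-to-live cache in Python. `CachedLookup` and `fake_fetch` are hypothetical names, and the 60-second TTL is an arbitrary example value.

```python
import time
from typing import Callable, Dict, List, Tuple


class CachedLookup:
    """Memoize an external lookup for `ttl` seconds per key."""

    def __init__(self, fetch: Callable[[str], str], ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch
        self.ttl = ttl
        self.clock = clock
        self.cache: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str) -> str:
        now = self.clock()
        hit = self.cache.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]              # fresh enough: skip the remote call
        value = self.fetch(key)
        self.cache[key] = (now, value)
        return value


calls: List[str] = []


def fake_fetch(service: str) -> str:   # stand-in for a real HTTP or DB call
    calls.append(service)
    return f"owner-of-{service}"


lookup = CachedLookup(fake_fetch, ttl=60, clock=lambda: 0.0)
print(lookup.get("cart"), lookup.get("cart"), len(calls))  # owner-of-cart owner-of-cart 1
```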

4. Advanced Processor Configurations

The OpenTelemetry Collector offers various processors for data manipulation. Here's a deep dive into key processors:

Attribute Processor

processors:
  attributes:
    actions:
      - key: db.statement
        action: delete
      - key: credit_card
        action: hash
      - key: email
        action: extract
        pattern: '^(?P<username>[^@]+)@'

This configuration removes the db.statement attribute, hashes the credit_card attribute for privacy, and extracts the username from email addresses into a new username attribute.
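The three actions can be mimicked on a plain dictionary to see what each one does to a span's attributes. This is a sketch only: the choice of SHA-256 is an assumption made for the example, and `apply_actions` is a hypothetical helper, not part of any OpenTelemetry package.

```python
import hashlib
import re
from typing import Dict


def apply_actions(attrs: Dict[str, str]) -> Dict[str, str]:
    """Apply delete, hash, and extract to a copy of the attributes."""
    out = dict(attrs)
    out.pop("db.statement", None)                    # delete
    if "credit_card" in out:                         # hash: replace value with a digest
        out["credit_card"] = hashlib.sha256(out["credit_card"].encode()).hexdigest()
    if "email" in out:                               # extract: named groups become keys
        match = re.search(r"^(?P<username>[^@]+)@", out["email"])
        if match:
            out.update(match.groupdict())
    return out


result = apply_actions({
    "db.statement": "SELECT * FROM users",
    "credit_card": "4111111111111111",
    "email": "ada@example.com",
})
print(sorted(result))       # ['credit_card', 'email', 'username']
print(result["username"])   # ada
```

Note that extract adds new attributes and leaves the source attribute in place, so the original email is still present afterwards.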

Resource Processor

processors:
  resource:
    attributes:
      - key: cloud.availability_zone
        value: zone-1
        action: upsert
      - key: k8s.cluster.name
        from_attribute: k8s.cluster
        action: insert

This adds a cloud availability zone to all telemetry and copies the Kubernetes cluster name from one attribute to another.
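The difference between insert and upsert matters when an attribute may already exist: insert leaves an existing value alone, while upsert overwrites it. A small Python sketch of those semantics, using a hypothetical `apply_action` helper on a plain dict:

```python
from typing import Dict, Optional


def apply_action(attrs: Dict[str, str], key: str, action: str,
                 value: Optional[str] = None,
                 from_attribute: Optional[str] = None) -> None:
    """Apply one insert or upsert action to `attrs` in place."""
    if from_attribute is not None:
        value = attrs.get(from_attribute)
    if value is None:
        return                          # nothing to copy: leave untouched
    if action == "insert" and key in attrs:
        return                          # insert never overwrites
    attrs[key] = value                  # insert (new key) or upsert


resource = {"cloud.availability_zone": "zone-9", "k8s.cluster": "prod-eu"}
apply_action(resource, "cloud.availability_zone", "upsert", value="zone-1")
apply_action(resource, "k8s.cluster.name", "insert", from_attribute="k8s.cluster")
print(resource["cloud.availability_zone"], resource["k8s.cluster.name"])  # zone-1 prod-eu
```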

Span Processor

processors:
  span:
    name:
      to_attributes:
        rules:
          # Named groups become attribute keys; "resource" and "id" are example names.
          - ^\/api\/v1\/(?P<resource>[^\/]+)\/(?P<id>[^\/]+)
    include:
      match_type: regexp
      attributes:
        - key: http.url
          value: .*

This extracts parts of the span name into attributes and only processes spans with an http.url attribute.
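To see what extracting a span name into attributes looks like, here is a Python sketch. It assumes the rule uses named capture groups (`resource` and `id` are hypothetical names) and that each matched part of the name is replaced by a `{group}` placeholder. Treat both as assumptions about the behaviour rather than a specification.

```python
import re
from typing import Dict, Tuple

RULE = re.compile(r"^/api/v1/(?P<resource>[^/]+)/(?P<id>[^/]+)")


def name_to_attributes(name: str) -> Tuple[str, Dict[str, str]]:
    """Return the rewritten span name and the attributes pulled out of it."""
    match = RULE.search(name)
    if match is None:
        return name, {}
    new_name = name
    # Replace from the rightmost group so earlier offsets stay valid.
    for group in sorted(match.groupdict(), key=match.start, reverse=True):
        start, end = match.start(group), match.end(group)
        new_name = new_name[:start] + "{" + group + "}" + new_name[end:]
    return new_name, match.groupdict()


print(name_to_attributes("/api/v1/orders/42"))
# ('/api/v1/{resource}/{id}', {'resource': 'orders', 'id': '42'})
```

Collapsing high-cardinality names such as `/api/v1/orders/42` into a template keeps the number of distinct span names manageable while preserving the detail as attributes.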

5. Performance Considerations

When implementing advanced configurations, keep these performance considerations in mind:

Monitor the resource usage of your collector instances.
Use batch processing to reduce network overhead.
Handle backpressure with the exporter's sending queue and retry settings so that a slow backend does not stall the pipeline.

Example collector configuration with performance optimizations:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024

exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
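The retry settings are easier to reason about once the waits are written out. The sketch below lists the delays between attempts for the values above. The 1.5 growth multiplier is an assumption, and random jitter is ignored, so real schedules will differ somewhat.

```python
from typing import List


def backoff_schedule(initial: float, max_interval: float,
                     max_elapsed: float, multiplier: float = 1.5) -> List[float]:
    """List the waits between retries until max_elapsed would be exceeded."""
    waits: List[float] = []
    elapsed, interval = 0.0, initial
    while elapsed + interval <= max_elapsed:
        waits.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, max_interval)  # grow, then cap
    return waits


waits = backoff_schedule(initial=5, max_interval=30, max_elapsed=300)
print(waits[:5])               # [5, 7.5, 11.25, 16.875, 25.3125]
print(len(waits), sum(waits))  # 12 275.9375
```

Under these assumptions a batch is retried about a dozen times over roughly five minutes before it is dropped, which is the window the sending queue has to cover during a backend outage.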

6. Practical Examples

Here's a scenario demonstrating these advanced configurations:

Imagine you're running a high-traffic e-commerce platform. You want to:

Sample 10% of all traces, but capture all errors
Filter out health check endpoints
Enrich data with user segments
Ensure efficient resource usage

Here's a collector configuration that achieves this:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
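  tail_sampling:
    decision_wait: 10s
    # Must hold every trace still awaiting a decision:
    # at least expected_new_traces_per_sec x decision_wait, plus headroom.
    num_traces: 20000
    expected_new_traces_per_sec: 1000
    policies:
      [
        {
          name: errors,
          type: status_code,
          status_code: { status_codes: [ERROR] }
        },
        {
          name: probability-sampler,
          type: probabilistic,
          probabilistic: { sampling_percentage: 10 }
        }
      ]
  filter:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: http.url
            value: .*/health
  # Map user IDs to segments with OTTL statements; the IDs are illustrative.
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["user.segment"], "free") where attributes["user.id"] != nil
          - set(attributes["user.segment"], "premium") where attributes["user.id"] == "123"
          - set(attributes["user.segment"], "standard") where attributes["user.id"] == "456"

exporters:
  otlp:
    endpoint: backend.example.com:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Drop noise first, sample whole traces, enrich, then batch for export.
      processors: [filter, tail_sampling, transform, batch]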
      exporters: [otlp]
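Two quick calculations help when sizing a setup like this one. Tail sampling has to buffer every trace that is still waiting for a decision, and the share of traces kept depends on the error rate as well as the sampling percentage. The helper names, the 2x headroom factor, and the 2% error rate below are assumptions chosen for illustration.

```python
def min_num_traces(new_traces_per_sec: int, decision_wait_sec: int,
                   headroom: float = 2.0) -> int:
    """Estimate how many traces must be buffered while decisions are pending."""
    return int(new_traces_per_sec * decision_wait_sec * headroom)


def kept_fraction(error_rate: float, sampling_percentage: float) -> float:
    """Share of traces kept when errors are always kept and the rest are sampled."""
    return error_rate + (1 - error_rate) * sampling_percentage / 100


print(min_num_traces(1000, 10))              # 20000
print(round(kept_fraction(0.02, 10), 3))     # 0.118
```

At 1,000 new traces per second and a 10-second wait, around 10,000 traces are in flight at any moment, so the buffer needs to be at least that large. With a 2% error rate, roughly 12% of traces reach the backend rather than the nominal 10%.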

7. Conclusion

Advanced OpenTelemetry configurations allow you to fine-tune your observability pipeline, balancing data quality with system performance. By implementing sampling, filtering, and data enrichment, you can focus on the most valuable telemetry data while managing costs and resource usage.

Remember to:

Regularly review and adjust your configurations
Monitor the performance impact of your settings
Keep up with the latest OpenTelemetry features and best practices

With these advanced techniques, you'll be well-equipped to handle the observability challenges of complex, distributed systems.

8. Additional Resources

OpenTelemetry Documentation
OpenTelemetry Collector Configuration
OpenTelemetry Governance Committee
OpenTelemetry Discussion Forums
OpenTelemetry Registry for related tools and libraries

Remember, the field of observability is constantly evolving. Stay engaged with the OpenTelemetry community to keep your skills and implementations up-to-date!
