Introduction
OpenTelemetry has revolutionized the way we approach observability in distributed systems. While basic setups can provide valuable insights, advanced configurations unlock the full potential of OpenTelemetry, allowing for more efficient resource usage and more targeted data collection.
This guide delves into three critical aspects of advanced OpenTelemetry configurations:
- Sampling: Reducing data volume while maintaining statistical accuracy
- Filtering: Focusing on the most relevant data
- Data Enrichment: Adding context to enhance the value of collected telemetry
Whether you're managing a large-scale production environment or optimizing a growing system, these techniques will help you fine-tune your observability pipeline.
1. Sampling Strategies in OpenTelemetry
Sampling is crucial for managing the volume of telemetry data in high-throughput systems. It allows you to reduce the amount of data collected and transmitted while still maintaining a statistically representative view of your system's behavior.
Types of Sampling
a. Head-Based Sampling
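Head-based sampling makes the sampling decision at the beginning of a trace, before any of its spans have completed. It is cheap because unsampled traces are never recorded, but the decision cannot depend on what happens later in the trace, such as an error or high latency.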
Example configuration (in Go):
import (
    "go.opentelemetry.io/otel/sdk/trace"
)

// Sample 25% of traces. The samplers live in the sdk/trace package itself.
sampler := trace.TraceIDRatioBased(0.25)
tp := trace.NewTracerProvider(
    trace.WithSampler(sampler),
)
b. Tail-Based Sampling
Tail-based sampling makes the decision after the entire trace is complete. This is typically implemented in the collector.
Example collector configuration:
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100
    expected_new_traces_per_sec: 10
    policies:
      [
        {
          name: error-in-trace,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: probability-sampler,
          type: probabilistic,
          probabilistic: {sampling_percentage: 10}
        }
      ]
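To make the two policies concrete, here is a minimal, library-free Python sketch of how a tail-based decision could be evaluated once a trace is complete. The `Span` type and the `keep_trace` function are hypothetical names used only for illustration. This is not the collector's implementation. The sketch uses a random draw for the probabilistic policy and assumes that a trace is kept as soon as any policy accepts it.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical minimal span record used only for this sketch."""
    name: str
    status: str = "OK"  # "OK" or "ERROR"


def keep_trace(spans, sampling_percentage=10.0, rng=random.random):
    """Return True if a completed trace should be kept.

    Policy 1: keep every trace that contains at least one ERROR span.
    Policy 2: otherwise keep roughly `sampling_percentage` percent of traces.
    """
    if any(span.status == "ERROR" for span in spans):
        return True
    return rng() * 100.0 < sampling_percentage


failed = [Span("checkout"), Span("charge-card", status="ERROR")]
healthy = [Span("checkout"), Span("charge-card")]
print("failed trace kept:", keep_trace(failed))
```

Because the decision needs every span of the trace, the collector has to buffer spans for the whole `decision_wait` period, which is the main memory cost of tail-based sampling.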
c. Probabilistic Sampling
Probabilistic sampling randomly selects a percentage of traces.
Example in Python:
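from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased
# Sample 50% of traces
sampler = TraceIdRatioBased(0.5)
tracer_provider = TracerProvider(sampler=sampler)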
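Ratio-based samplers usually derive the decision from the trace ID rather than from a fresh random number, so every service that sees the same trace reaches the same answer. The following is a simplified, stdlib-only sketch of that idea. The function name `should_sample` is hypothetical, and the code illustrates the technique rather than reproducing any SDK's source.

```python
import random


def should_sample(trace_id: int, ratio: float) -> bool:
    """Deterministically decide from the trace ID whether to sample.

    The low 64 bits of the trace ID are compared against a bound that is
    proportional to `ratio`, so the same trace ID always yields the same answer.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


rng = random.Random(42)  # fixed seed so the sketch is repeatable
trace_ids = [rng.getrandbits(128) for _ in range(10_000)]
sampled = sum(should_sample(t, 0.5) for t in trace_ids)
print(f"sampled {sampled} of {len(trace_ids)} traces")
```

With uniformly distributed trace IDs, the fraction of sampled traces converges on the configured ratio as traffic grows.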
d. Rate-Limiting Sampling
Rate-limiting sampling caps the number of samples per unit of time.
Example collector configuration:
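processors:
  batch:
    timeout: 10s
    send_batch_size: 10000
  # Rate limiting is configured as a policy of the tail_sampling processor,
  # not as a standalone processor.
  tail_sampling:
    policies:
      [
        {
          name: rate-limit,
          type: rate_limiting,
          rate_limiting: {spans_per_second: 1000}
        }
      ]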
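The idea behind a per-second cap can be sketched in a few lines of Python. The `SpanRateLimiter` class below is a hypothetical fixed-window limiter written for illustration only. The collector's own algorithm may differ, and time is passed in explicitly so the behaviour is easy to follow.

```python
class SpanRateLimiter:
    """Allow at most `spans_per_second` spans in each one-second window."""

    def __init__(self, spans_per_second: int):
        self.spans_per_second = spans_per_second
        self._window = None  # integer second of the current window
        self._count = 0      # spans admitted in the current window

    def allow(self, now: float) -> bool:
        """Return True if a span arriving at time `now` (in seconds) is admitted."""
        window = int(now)
        if window != self._window:
            # A new second has started, so reset the budget.
            self._window = window
            self._count = 0
        if self._count < self.spans_per_second:
            self._count += 1
            return True
        return False


limiter = SpanRateLimiter(spans_per_second=3)
# Five spans arrive within the same second, then one arrives a second later.
decisions = [limiter.allow(t) for t in (0.0, 0.1, 0.2, 0.3, 0.4, 1.0)]
print(decisions)
```

In a real service you would pass `time.monotonic()` as `now`. Unlike a percentage, a cap like this keeps the output volume bounded during traffic spikes.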
Best Practices and Trade-offs
- Start with a higher sampling rate and adjust downwards as needed.
- Use different sampling strategies for different services or endpoints.
- Monitor the impact of sampling on your ability to detect and diagnose issues.
If you can, use the OTLP Ingest Pipelines from last9.io, which let you apply all of this at runtime rather than at design time.
2. Filtering Telemetry Data
Filtering allows you to focus on the most relevant data, reducing noise and storage costs.
Filtering at the SDK Level
Example in Java:
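// Any real exporter can be the delegate; OTLP over gRPC is used here as an example.
SpanExporter delegate = OtlpGrpcSpanExporter.builder().build();
SpanExporter filteringExporter = new SpanExporter() {
    @Override
    public CompletableResultCode export(Collection<SpanData> spans) {
        // The incoming collection may be unmodifiable, so filter into a new list.
        List<SpanData> kept = spans.stream()
            .filter(span -> !span.getName().equals("health_check"))
            .collect(Collectors.toList());
        // Export the remaining spans
        return delegate.export(kept);
    }
    @Override public CompletableResultCode flush() { return delegate.flush(); }
    @Override public CompletableResultCode shutdown() { return delegate.shutdown(); }
};
SdkTracerProvider sdkTracerProvider = SdkTracerProvider.builder()
    .addSpanProcessor(SimpleSpanProcessor.create(filteringExporter))
    .build();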
Filtering in the OpenTelemetry Collector
Example collector configuration:
processors:
  filter:
    metrics:
      include:
        match_type: regexp
        metric_names:
          - .*important.*
    spans:
      exclude:
        # match_type is required so that the value is treated as a regular expression
        match_type: regexp
        attributes:
          - key: http.url
            value: .*health_check.*
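The exclude rule above boils down to a regular-expression test on one attribute. Here is a small Python sketch of that logic. The `exclude_spans` function and the dictionary-based span shape are hypothetical, chosen for readability rather than taken from the collector.

```python
import re


def exclude_spans(spans, key, pattern):
    """Drop spans whose attribute `key` matches the regular expression `pattern`.

    Spans that do not carry the attribute are kept.
    """
    regex = re.compile(pattern)
    kept = []
    for span in spans:
        value = span.get("attributes", {}).get(key)
        if value is not None and regex.search(str(value)):
            continue  # matched the exclude rule, so drop it
        kept.append(span)
    return kept


spans = [
    {"name": "GET /orders", "attributes": {"http.url": "https://shop.example.com/orders"}},
    {"name": "GET /health_check", "attributes": {"http.url": "https://shop.example.com/health_check"}},
    {"name": "internal-job", "attributes": {}},
]
kept = exclude_spans(spans, "http.url", r".*health_check.*")
print([span["name"] for span in kept])
```

Note that spans without an `http.url` attribute pass through untouched, which is usually what you want from an exclude rule.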
3. Data Enrichment Techniques
Data enrichment adds context to your telemetry data, making it more valuable for analysis.
Example in Python:
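import os

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider

# Resource attributes are attached to every span this provider produces.
resource = Resource(attributes={
    "service.name": "my-service",
    "deployment.environment": os.getenv("DEPLOYMENT_ENV", "unknown"),
})
# TracerProvider comes from the SDK package; the API package only defines the interface.
tracer_provider = TracerProvider(resource=resource)
trace.set_tracer_provider(tracer_provider)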
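Environment-driven enrichment can also be illustrated without the SDK. The sketch below merges code-level defaults with comma-separated `key=value` pairs in the style of the `OTEL_RESOURCE_ATTRIBUTES` environment variable. The `resource_attributes` helper is hypothetical and deliberately simplified (no escaping or percent-decoding). Letting environment values override the defaults is a choice made for this sketch, not a statement about SDK precedence.

```python
def resource_attributes(env, defaults):
    """Merge default attributes with key=value pairs found in the environment.

    `env` is any mapping, for example os.environ. Malformed pairs are skipped.
    """
    attrs = dict(defaults)
    raw = env.get("OTEL_RESOURCE_ATTRIBUTES", "")
    for pair in raw.split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


defaults = {"service.name": "my-service", "deployment.environment": "unknown"}
env = {"OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=prod, team=checkout"}
print(resource_attributes(env, defaults))
```

Passing the environment in as a plain mapping keeps the helper easy to test, and in an application you would call it with `os.environ`.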
Incorporating Data from External Sources
Example using a custom processor in the collector:
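import (
    "context"

    "go.opentelemetry.io/collector/pdata/pcommon"
    "go.opentelemetry.io/collector/pdata/ptrace"
)

type ExternalEnricher struct {
    // ... configuration fields
}

// ProcessTraces adds one attribute from an external source to every resource in the batch.
func (e *ExternalEnricher) ProcessTraces(ctx context.Context, td ptrace.Traces) (ptrace.Traces, error) {
    for i := 0; i < td.ResourceSpans().Len(); i++ {
        rs := td.ResourceSpans().At(i)
        // Fetch data from external source based on resource attributes
        externalData := fetchExternalData(rs.Resource().Attributes())
        rs.Resource().Attributes().PutStr("external.data", externalData)
    }
    return td, nil
}

// fetchExternalData is a placeholder for your own lookup, keyed on the resource attributes.
func fetchExternalData(attrs pcommon.Map) string {
    // ... query (and ideally cache) the external source here
    return ""
}
// ... implement other required methods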
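A processor like this runs on every batch, so the external lookup should normally be cached. The Python sketch below shows one way to do that with `functools.lru_cache`. The `_SERVICE_OWNERS` table stands in for a hypothetical external system, and `fetch_owner` and `enrich` are illustrative names that are not part of any OpenTelemetry API.

```python
from functools import lru_cache

# Hypothetical stand-in for an external system such as a service catalog.
_SERVICE_OWNERS = {"checkout": "payments-team", "search": "discovery-team"}


@lru_cache(maxsize=1024)
def fetch_owner(service_name: str) -> str:
    """Simulate a slow external lookup; results are cached per service name."""
    return _SERVICE_OWNERS.get(service_name, "unknown")


def enrich(resources):
    """Add an `external.data` attribute to each resource attribute dict."""
    for attrs in resources:
        attrs["external.data"] = fetch_owner(attrs.get("service.name", ""))
    return resources


batch = [
    {"service.name": "checkout"},
    {"service.name": "checkout"},
    {"service.name": "search"},
]
enrich(batch)
info = fetch_owner.cache_info()
print(batch[0]["external.data"], "lookups:", info.misses, "cache hits:", info.hits)
```

A bounded cache keeps memory predictable. If the external data can change, you would also want some form of expiry, which `lru_cache` does not provide on its own.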
4. Advanced Processor Configurations
The OpenTelemetry Collector offers various processors for data manipulation. Here's a deep dive into key processors:
Attribute Processor
processors:
  attributes:
    actions:
      - key: db.statement
        action: delete
      - key: credit_card
        action: hash
      - key: email
        action: extract
        pattern: '^(?P<username>[^@]+)@'
This configuration removes the db.statement attribute, hashes the credit_card attribute for privacy, and extracts the username from email addresses.
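The three actions can be mimicked on a plain dictionary to see what each one does. This is a sketch under stated assumptions. `apply_actions` is a hypothetical helper, and SHA-256 is chosen here for illustration, so the collector's hash algorithm and edge-case behaviour may differ.

```python
import hashlib
import re


def apply_actions(attributes, actions):
    """Apply delete / hash / extract actions to a copy of `attributes`."""
    result = dict(attributes)
    for action in actions:
        key = action["key"]
        if key not in result:
            continue  # nothing to do when the source attribute is absent
        if action["action"] == "delete":
            del result[key]
        elif action["action"] == "hash":
            # SHA-256 is an assumption made for this sketch.
            result[key] = hashlib.sha256(str(result[key]).encode()).hexdigest()
        elif action["action"] == "extract":
            match = re.search(action["pattern"], str(result[key]))
            if match:
                # Each named group becomes an attribute of its own.
                result.update(match.groupdict())
    return result


attrs = {
    "db.statement": "SELECT 1",
    "credit_card": "4111111111111111",
    "email": "alice@example.com",
}
actions = [
    {"key": "db.statement", "action": "delete"},
    {"key": "credit_card", "action": "hash"},
    {"key": "email", "action": "extract", "pattern": r"^(?P<username>[^@]+)@"},
]
cleaned = apply_actions(attrs, actions)
print(sorted(cleaned))
```

Note that extracting the username leaves the original email attribute in place. If the address itself is sensitive, it needs its own delete or hash action.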
Resource Processor
processors:
  resource:
    attributes:
      - key: cloud.availability_zone
        value: zone-1
        action: upsert
      - key: k8s.cluster.name
        from_attribute: k8s.cluster
        action: insert
This adds a cloud availability zone to all telemetry and copies the Kubernetes cluster name from one attribute to another.
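The difference between `insert` and `upsert` is easy to miss, so here is a short Python sketch of the two semantics as described above: `insert` only writes when the key is absent, while `upsert` always writes. The `apply_resource_action` helper is hypothetical and models only these two actions.

```python
def apply_resource_action(attributes, key, action, value=None, from_attribute=None):
    """Return a copy of `attributes` with one insert or upsert action applied."""
    result = dict(attributes)
    if from_attribute is not None:
        if from_attribute not in result:
            return result  # source attribute missing, nothing to copy
        value = result[from_attribute]
    if action == "insert":
        result.setdefault(key, value)  # keep an existing value untouched
    elif action == "upsert":
        result[key] = value            # create or overwrite
    else:
        raise ValueError(f"unsupported action: {action}")
    return result


resource = {"k8s.cluster": "prod-eu", "cloud.availability_zone": "zone-9"}
resource = apply_resource_action(resource, "cloud.availability_zone", "upsert", value="zone-1")
resource = apply_resource_action(resource, "k8s.cluster.name", "insert", from_attribute="k8s.cluster")
print(resource)
```

Choosing `insert` for the cluster name means a value already set upstream, for example by an agent, is never overwritten.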
Span Processor
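processors:
  span:
    name:
      to_attributes:
        rules:
          # Capture groups must be named; each group name becomes an attribute key.
          # The names "resource" and "id" are illustrative.
          - ^\/api\/v1\/(?P<resource>[^\/]+)\/(?P<id>[^\/]+)
    include:
      match_type: regexp
      attributes:
        - key: http.url
          value: .*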
This extracts parts of the span name into attributes and only processes spans with an http.url attribute.
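Python's `re` module uses the same named-group syntax, which makes it convenient for trying out a rule before deploying it. In this sketch the group names `resource` and `id`, and the helper `name_to_attributes`, are illustrative assumptions. It covers only the extraction step, not any renaming of the span.

```python
import re


def name_to_attributes(span_name, rule):
    """Return the named groups of `rule` matched against `span_name`, or {}."""
    match = re.search(rule, span_name)
    return match.groupdict() if match else {}


rule = r"^/api/v1/(?P<resource>[^/]+)/(?P<id>[^/]+)"
print(name_to_attributes("/api/v1/orders/42", rule))
print(name_to_attributes("/healthz", rule))
```

Pulling identifiers out of span names this way keeps them queryable as attributes while letting you group spans by route.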
5. Performance Considerations
When implementing advanced configurations, keep these performance considerations in mind:
- Monitor the resource usage of your collector instances.
- Use batch processing to reduce network overhead.
- Handle backpressure with bounded sending queues and retry limits, so that a slow backend cannot exhaust collector memory.
Example collector configuration with performance optimizations:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
exporters:
  otlp:
    endpoint: backend.example.com:4317
    sending_queue:
      enabled: true
      num_consumers: 10
      queue_size: 5000
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
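To get a feel for what the retry settings mean in practice, the sketch below computes the sequence of waits for `initial_interval: 5s`, `max_interval: 30s`, and `max_elapsed_time: 300s`. The `retry_schedule` helper is hypothetical. The growth multiplier of 1.5 and the absence of random jitter are assumptions of this sketch, so real timings will vary.

```python
def retry_schedule(initial_interval, max_interval, max_elapsed_time, multiplier=1.5):
    """Return the wait times (in seconds) between attempts before giving up.

    Each wait grows by `multiplier` up to `max_interval`; retries stop once the
    next wait would push the total past `max_elapsed_time`.
    """
    waits = []
    elapsed = 0.0
    interval = float(initial_interval)
    while elapsed + interval <= max_elapsed_time:
        waits.append(interval)
        elapsed += interval
        interval = min(interval * multiplier, max_interval)
    return waits


schedule = retry_schedule(initial_interval=5, max_interval=30, max_elapsed_time=300)
print(len(schedule), "retries over", sum(schedule), "seconds")
```

While retries are pending, data waits in the sending queue, so `queue_size` needs to be large enough to absorb the traffic that arrives during an outage of this length.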
6. Practical Examples
Here's a scenario demonstrating these advanced configurations:
Imagine you're running a high-traffic e-commerce platform. You want to:
- Sample 10% of all traces, but capture all errors
- Filter out health check endpoints
- Enrich data with user segments
- Ensure efficient resource usage
Here's a collector configuration that achieves this:
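receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  tail_sampling:
    decision_wait: 10s
    # Must hold the traces that arrive during decision_wait (10 s x 1000/s = 10000), with headroom.
    num_traces: 50000
    expected_new_traces_per_sec: 1000
    policies:
      [
        {
          name: errors,
          type: status_code,
          status_code: {status_codes: [ERROR]}
        },
        {
          name: probability-sampler,
          type: probabilistic,
          probabilistic: {sampling_percentage: 10}
        }
      ]
  filter:
    spans:
      exclude:
        match_type: regexp
        attributes:
          - key: http.url
            value: .*/health
  # The attributes processor cannot map values, so the transform processor (OTTL)
  # derives user.segment from user.id instead.
  transform:
    trace_statements:
      - context: span
        statements:
          - set(attributes["user.segment"], "free") where attributes["user.id"] != nil
          - set(attributes["user.segment"], "premium") where attributes["user.id"] == "123"
          - set(attributes["user.segment"], "standard") where attributes["user.id"] == "456"
exporters:
  otlp:
    endpoint: backend.example.com:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      # Drop health checks first, sample complete traces, enrich what remains, and batch last.
      processors: [filter, tail_sampling, transform, batch]
      exporters: [otlp]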
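When sizing tail sampling for a scenario like this, a rough rule of thumb is that the collector must keep in memory every trace that starts during the decision window. The arithmetic below is a back-of-the-envelope sketch. The `min_num_traces` helper and the 1.5 headroom factor are assumptions for illustration, not official guidance.

```python
import math


def min_num_traces(expected_new_traces_per_sec, decision_wait_seconds, headroom=1.5):
    """Estimate how many traces must fit in memory during the decision window."""
    in_flight = expected_new_traces_per_sec * decision_wait_seconds
    return math.ceil(in_flight * headroom)


# 1000 new traces per second with a 10 second decision window.
print(min_num_traces(1000, 10))
```

If `num_traces` is set well below this estimate, traces can be evicted before a sampling decision is made, which defeats the goal of capturing all errors.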
7. Conclusion
Advanced OpenTelemetry configurations allow you to fine-tune your observability pipeline, balancing data quality with system performance. By implementing sampling, filtering, and data enrichment, you can focus on the most valuable telemetry data while managing costs and resource usage.
Remember to:
- Regularly review and adjust your configurations
- Monitor the performance impact of your settings
- Keep up with the latest OpenTelemetry features and best practices
With these advanced techniques, you'll be well-equipped to handle the observability challenges of complex, distributed systems.
8. Additional Resources
Remember, the field of observability is constantly evolving. Stay engaged with the OpenTelemetry community to keep your skills and implementations up-to-date!