You're monitoring a microservices-based system. Alerts trigger when response times exceed 2 seconds. But when you open Jaeger, you're faced with thousands of traces. Identifying which service or operation is responsible becomes time-consuming.
Jaeger metrics help reduce this friction by exposing aggregated telemetry. Instead of scanning individual traces, you get service-level and operation-level performance metrics (latency, throughput, and error rates) that highlight where the issue lies.
This blog covers how to:
- Enable and export internal Jaeger metrics via Prometheus
- Set up Service Performance Monitoring (SPM) in Jaeger
These provide a faster path from alert to root cause by bridging the gap between high-level monitoring and detailed trace analysis.
Why Jaeger Metrics Matter in Production
In most production setups, debugging with tracing looks like this:
An alert fires → you check your dashboards → you jump into Jaeger to look at traces.
But here’s the problem: traces are granular. They tell you what happened to a request, not what’s happening across requests. If you’ve ever sat staring at hundreds of traces wondering what to even filter for, you’ve felt that gap. That’s where Jaeger metrics help you.
What You Get from Jaeger Metrics
1. Infra-level signals
Jaeger exposes internal metrics about its components, like:
- Dropped spans (due to memory limits or misconfig)
- Queuing delays
- Query latencies
- Storage backend issues
Useful when tracing is set up, but you’re not seeing the expected data. These metrics help answer: Is Jaeger receiving and processing spans, or is something choking?
2. Trace-derived RED metrics
Jaeger can also give you:
- Request rate
- Error rate
- Duration (latency)
And not just per service: you can get operation-level breakdowns. These come directly from trace data, so you get visibility without setting up separate app-level metrics.
Why this matters:
- You can spot which endpoints are slowing down before you dig into traces.
- You get a sense of baseline latency and volume trends.
- Error spikes are easier to correlate with deployments or incidents.
Let’s say latency’s up. Instead of clicking through trace after trace, you can ask:
- Which service is contributing the most to p99 latency?
- Is this a frontend issue or a downstream service bottleneck?
- Are we dropping spans, or is the slowness real?
That’s faster than guessing which tags to filter by.
In short: Traces are great when you know what you’re looking for. Metrics are how you find what to look at. Jaeger metrics give you that missing middle, bridging high-level alerts and low-level trace detail.
Types of Metrics Jaeger Exposes
Jaeger emits two main categories of metrics — one focused on the tracing infrastructure itself, and the other on application performance using trace data.
1. Internal Metrics: Monitoring Jaeger Itself
These metrics give visibility into the health of the Jaeger components — Collector, Agent, Query, and Storage. They’re essential for making sure your tracing pipeline isn’t silently failing or lagging.
Runtime-level metrics:
- Memory and CPU usage
- Uptime, garbage collection, and resource usage per process
- Binary version info for tracking deployments or mismatches
Operational metrics:
- jaeger_collector_spans_received_total: Total spans received
- jaeger_collector_spans_saved_total: Successfully written spans
- jaeger_collector_spans_dropped_total: Spans dropped due to queue or buffer overflow
- jaeger_collector_queue_length: Queue backlog for span processing
- jaeger_query_requests_total: Number of requests to the query service
These metrics help answer:
- Is Jaeger ingesting and persisting spans properly?
- Is anything backing up or dropping under load?
- Are users hitting performance issues in the query layer?
When to use:
Always. If Jaeger is in your critical path, you need this visibility to avoid silent failures in your observability system.
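If you're alerting from Prometheus, the dropped-spans counter is the one to watch. Here's a minimal sketch of an alerting rule, assuming the metric names above; the group name and threshold are placeholders to adapt:

groups:
  - name: jaeger-internal        # placeholder rule group name
    rules:
      - alert: JaegerCollectorDroppingSpans
        # Fires if the collector has been dropping spans continuously for 5 minutes
        expr: rate(jaeger_collector_spans_dropped_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Jaeger collector is dropping spans"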
2. Service Performance Monitoring: Metrics from Traces
Jaeger can aggregate span data into metrics at both the service and operation levels. These span-derived metrics let you monitor system behavior without instrumenting everything twice.
Service-level metrics:
- Request rate: How many spans are emitted per service
- Error rate: Percentage of spans with error tags
- Latency percentiles: P50, P75, P95 response times
Operation-level metrics:
- Same RED metrics, but broken down by operation name
- Impact score: Combines latency and volume to surface high-impact endpoints
This lets you track performance over time, detect regressions, and catch issues before they hit alert thresholds.
When to use:
Use these when you need proactive visibility into service behavior. They're especially useful for teams running multiple microservices, where digging into traces first isn't scalable.
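For example, once SPM is in place (setup and metric names are covered later in this post), a regression check can be a single PromQL comparison. A sketch, with frontend as a placeholder service name:

# p95 latency now vs. the same window one week ago (a result above 1 suggests a regression)
histogram_quantile(0.95, sum(rate(jaeger_spm_duration_bucket{service_name="frontend"}[5m])) by (le))
/
histogram_quantile(0.95, sum(rate(jaeger_spm_duration_bucket{service_name="frontend"}[5m] offset 1w)) by (le))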
Decision Tree: Which Setup Do You Need?
Not every team needs the full Jaeger + SPM + OpenTelemetry Collector stack right away. Your tracing setup should match your system’s complexity and your team’s needs. Here’s how to decide.
If you're just getting started with Jaeger, start simple. Focus on internal metrics:
- Collector health
- Dropped spans
- Query performance
These give you the operational confidence that Jaeger itself is running reliably. Before you dive into trace analysis or RED metrics, make sure the tracing system isn’t silently failing.
If you're running Jaeger in production with multiple services, it's time to layer on Service Performance Monitoring (SPM). At this scale, it’s no longer practical to open up individual traces for every issue. SPM lets you:
- Catch latency spikes and error trends early
- Monitor service-level and operation-level performance
- Prioritize what to investigate before querying traces
For teams dealing with high trace volume or complex microservices, go further: use the OpenTelemetry Collector alongside SPM. The Collector helps you:
- Buffer and batch spans to avoid overwhelming storage
- Apply filters and sampling rules
- Generate aggregated metrics efficiently
This setup makes trace data usable at scale, especially when dealing with thousands of spans per second.
Already have detailed metrics from Prometheus or another observability stack?
Then ask: Does SPM add anything new? If your current metrics miss per-operation detail, like latency and error rates broken down by endpoint, then SPM fills that gap.
Set Up Internal Metrics with Jaeger
Compatibility: Jaeger 1.35+
Internal metrics give you visibility into Jaeger's performance: whether spans are being dropped, queues are backing up, or queries are getting slow. This setup works across all Jaeger deployment types.
1. Configure Jaeger to Expose Prometheus Metrics
All Jaeger components support Prometheus as a metrics backend. Add these flags to expose metrics on an HTTP endpoint:
# All-in-one (dev/test)
jaeger-all-in-one --metrics-backend=prometheus --metrics-http-route=/metrics
# Production components
jaeger-collector --metrics-backend=prometheus --metrics-http-route=/metrics
jaeger-query --metrics-backend=prometheus --metrics-http-route=/metrics
jaeger-agent --metrics-backend=prometheus --metrics-http-route=/metrics
Each component exposes metrics on its own port. For example, jaeger-all-in-one uses port 14269 for metrics.
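A quick way to confirm the endpoint is serving data (using the all-in-one port above):

# Should print a handful of jaeger_collector_* series
curl -s http://localhost:14269/metrics | grep jaeger_collector | head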
2. Docker Compose Example: Jaeger + Prometheus
For local testing or dev environments, here's a minimal setup that wires everything together:
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"   # Jaeger UI
      - "14269:14269"   # Prometheus metrics endpoint
    command:
      - "--metrics-backend=prometheus"
      - "--metrics-http-route=/metrics"
    environment:
      - COLLECTOR_OTLP_ENABLED=true
  prometheus:
    image: prom/prometheus:v2.40.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
3. Prometheus Configuration
Configure Prometheus to scrape Jaeger’s metrics endpoint:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'jaeger'
    static_configs:
      - targets: ['jaeger:14269']
    scrape_interval: 5s
    metrics_path: '/metrics'
What to Expect
Once spans are flowing, you’ll see metrics like:
- jaeger_collector_spans_received_total
- jaeger_collector_spans_dropped_total
- jaeger_collector_queue_length
- jaeger_query_requests_total
The /metrics endpoint is live as soon as Jaeger starts, but you won't see meaningful metrics until spans are being processed. If you see only zeros, double-check that your app is sending data.
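To confirm spans are actually flowing (not just that the endpoint is up), graph the ingest and drop rates in Prometheus:

# Spans ingested per second over the last 5 minutes (non-zero once your app sends data)
rate(jaeger_collector_spans_received_total[5m])
# Should stay at or near zero in a healthy pipeline
rate(jaeger_collector_spans_dropped_total[5m])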
Set Up Service Performance Monitoring (SPM)
Compatibility: Jaeger 1.35+, OpenTelemetry Collector Contrib 0.60+
SPM gives you RED metrics (request rate, error rate, duration) directly from trace data, so there's no need to instrument your code again just for metrics. But it requires more moving parts: you'll need an OpenTelemetry Collector configured to receive spans, generate metrics, and export them to Prometheus.
Why You Need the OpenTelemetry Collector
Jaeger alone stores and serves spans. It doesn’t produce time-series metrics.
To generate metrics from spans, you need something that:
- Reads spans
- Aggregates them by service and operation
- Emits metrics on a schedule
The SpanMetrics connector in the OpenTelemetry Collector does exactly that.
Step 1: Configure the OpenTelemetry Collector
Here's a minimal otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    limit_mib: 1000
    spike_limit_mib: 200
    check_interval: 5s
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms, 500ms, 1s, 2s, 5s]
    dimensions:
      - name: http.method
        default: GET
      - name: http.status_code
        default: "200"
      - name: service.name
      - name: operation
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 30s
exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "jaeger_spm"
    const_labels:
      service_name: "jaeger_spm"
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # liveness endpoint, referenced under service.extensions
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
Notes:
- histogram.explicit.buckets: Adjust the buckets to fit your app's latency patterns.
- dimensions: Only add what you need. More dimensions = more memory.
- metrics_flush_interval: Controls how often metrics get exported.
- dimensions_cache_size: Prevents unbounded cardinality.
Step 2: Connect Jaeger to Prometheus for Metrics
Jaeger Query needs to be configured to use Prometheus as a backend for metrics. Add the following to the Jaeger service:
environment:
  - COLLECTOR_OTLP_ENABLED=true
  - METRICS_STORAGE_TYPE=prometheus
  - PROMETHEUS_SERVER_URL=http://prometheus:9090
  - PROMETHEUS_QUERY_SUPPORT_SPANMETRICS_CONNECTOR=true
command:
  - "--query.max-clock-skew-adjustment=30s"
The max-clock-skew-adjustment setting helps align timestamps between trace spans and the derived metrics.
Step 3: Use Docker Compose to Run Everything
Here’s a full working setup that runs Jaeger, the OpenTelemetry Collector, and Prometheus together:
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.88.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
      - "8889:8889"
      - "13133:13133"
    depends_on:
      - jaeger
      - prometheus
    restart: unless-stopped
  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"
      - "14269:14269"
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - METRICS_STORAGE_TYPE=prometheus
      - PROMETHEUS_SERVER_URL=http://prometheus:9090
      - PROMETHEUS_QUERY_SUPPORT_SPANMETRICS_CONNECTOR=true
    command:
      - "--query.max-clock-skew-adjustment=30s"
    depends_on:
      - prometheus
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:v2.40.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    restart: unless-stopped
Step 4: Set Up Prometheus to Scrape Metrics
Update your Prometheus config to scrape metrics from Jaeger and the OTel Collector:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'jaeger'
    static_configs:
      - targets: ['jaeger:14269']
    scrape_interval: 5s
    metrics_path: '/metrics'
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
    scrape_interval: 5s
    metrics_path: '/metrics'
  - job_name: 'otel-collector-internal'
    static_configs:
      - targets: ['otel-collector:8888']
    scrape_interval: 30s
    metrics_path: '/metrics'
What This Setup Gives You
Once everything is connected and spans start flowing:
- You’ll see service-level RED metrics: requests per second, error rates, and latency percentiles
- You’ll get operation-level breakdowns across dimensions like HTTP method and status
- Jaeger’s UI will show span metrics alongside traces, helping you move from high-level patterns to deep request analysis
This setup turns your trace firehose into actionable metrics. If your team manages more than a handful of services, this is where Jaeger becomes more than just a trace viewer; it becomes an observability system.
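If you don't have an instrumented app handy yet, a short script can push test spans through the pipeline. This is a sketch using the OpenTelemetry Python SDK (it assumes opentelemetry-sdk and opentelemetry-exporter-otlp are installed; the service name is a placeholder):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import time

# Send a handful of spans to the collector's OTLP gRPC port (4317)
provider = TracerProvider(resource=Resource.create({"service.name": "spm-smoke-test"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("spm-smoke-test")

for _ in range(20):
    with tracer.start_as_current_span("GET /api/ping") as span:
        span.set_attribute("http.method", "GET")
        span.set_attribute("http.status_code", 200)
        time.sleep(0.05)  # simulate a little latency

provider.force_flush()

After one metrics flush interval and a Prometheus scrape (roughly 30-60 seconds), the spm-smoke-test service should show up in jaeger_spm_calls_total and in the Jaeger UI's Monitor tab.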
Understand SPM Metrics in Production
Now that Service Performance Monitoring is set up and spans are flowing through the OpenTelemetry Collector, you'll start seeing a series of jaeger_spm_* metrics in Prometheus. These represent service-level and operation-level behavior based entirely on trace data.
Core SPM Metrics
These are the primary metrics generated by the spanmetrics connector:
- jaeger_spm_calls_total: Total number of spans received (used for request rate)
- jaeger_spm_duration_bucket: Duration histogram (used for latency percentiles)
- jaeger_spm_duration_count: Count of durations (used for average/request rate calculation)
- jaeger_spm_duration_sum: Total duration (used to calculate average latency)
Useful PromQL Queries for Dashboards
These are common queries to visualize performance in Grafana or Prometheus dashboards.
Request rate per service (last 5 minutes):
sum(rate(jaeger_spm_calls_total[5m])) by (service_name)
Error rate per service:
sum(rate(jaeger_spm_calls_total{status_code!~"2.."}[5m])) by (service_name)
/
sum(rate(jaeger_spm_calls_total[5m])) by (service_name)
95th percentile latency by service:
histogram_quantile(0.95,
sum(rate(jaeger_spm_duration_bucket[5m])) by (service_name, le)
)
Top 10 highest-impact operations (latency × throughput):
topk(10,
histogram_quantile(0.95,
sum(rate(jaeger_spm_duration_bucket[5m])) by (service_name, operation, le)
)
*
sum(rate(jaeger_spm_calls_total[5m])) by (service_name, operation)
)
How to Interpret the Impact Metric
The “impact” metric is a practical proxy for prioritization. It combines latency and throughput to help identify where optimization will have the greatest effect.
- High impact + High latency → Critical optimization target
- High impact + Low latency → High-throughput path, optimize for efficiency
- Low impact + High latency → Often fine (e.g., batch jobs or admin operations)
- Low impact + Low latency → Likely not worth investigating
Use Jaeger’s REST API to Query Metrics
If you’re not using Prometheus directly, or you want to power custom dashboards or tooling without writing PromQL, you can query SPM data through Jaeger’s built-in REST API. These endpoints mirror what you see in the Jaeger UI but are accessible programmatically.
Example API Calls
1. Get request volume by service:
curl "http://localhost:16686/api/metrics/calls?service=frontend&lookback=1h&step=60s"
2. Get error rate by service:
curl "http://localhost:16686/api/metrics/errors?service=frontend&lookback=1h&step=60s"
3. Get latency quantile (e.g., P95):
curl "http://localhost:16686/api/metrics/latencies?service=frontend&lookback=1h&step=60s&quantile=0.95"
4. Get metrics for a specific operation:
curl "http://localhost:16686/api/metrics/calls?service=frontend&operation=GET%20/api/users&lookback=1h"
Pros and Limitations
Pros
- Simple to integrate with custom dashboards, CLIs, or scripts
- Built-in filtering by service and operation
- Output matches what Jaeger UI graphs display
Limitations
- Limited flexibility compared to PromQL
- No advanced grouping, rate calculations, or histogram quantiles
- Lookback windows and step intervals must be explicitly managed
Connect Metrics to Traces for Faster Debugging
Traces tell you what happened. Metrics tell you how often. But real observability comes from linking the two, using metrics to surface problems, then jumping directly into trace data to debug them.
1. Trigger Trace Analysis from Alerts
Set up alerting rules on Jaeger SPM metrics. When a threshold is breached—like a sudden spike in errors—you can jump directly into Jaeger with pre-filtered trace queries.
Example: Alert on high error rate
- alert: HighServiceErrorRate
  expr: |
    sum(rate(jaeger_spm_calls_total{status_code!~"2.."}[5m])) by (service_name) /
    sum(rate(jaeger_spm_calls_total[5m])) by (service_name) > 0.05
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High error rate for {{ $labels.service_name }}"
    description: "Error rate is {{ $value | humanizePercentage }}"
    jaeger_query: "http://jaeger:16686/search?service={{ $labels.service_name }}&lookback=1h&tags=error%3Atrue"
The jaeger_query annotation can be used in alert receivers or UIs to go directly from an alert to filtered traces.
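How you consume that annotation depends on your alerting stack. As one sketch, an Alertmanager Slack receiver can include the link in the notification text (the receiver name, channel, and webhook URL are placeholders):

receivers:
  - name: 'oncall-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/PLACEHOLDER'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        # Surfaces the jaeger_query annotation so responders can jump straight to filtered traces
        text: "{{ .CommonAnnotations.description }}\nTraces: {{ .CommonAnnotations.jaeger_query }}"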
2. Link Metrics to Traces in Dashboards
Dashboards can do more than visualize metrics. You can wire them up to link back into Jaeger, letting developers move from charts to traces in one click.
Example: Grafana panel with trace links
{
  "title": "Service Performance with Trace Links",
  "panels": [
    {
      "title": "Request Rate by Service",
      "targets": [
        {
          "expr": "sum(rate(jaeger_spm_calls_total[5m])) by (service_name)",
          "legendFormat": "{{service_name}}"
        }
      ],
      "links": [
        {
          "title": "View Traces",
          "url": "http://localhost:16686/search?service=${__field.labels.service_name}&lookback=1h"
        }
      ]
    }
  ]
}
You get quick navigation: a spike in the graph → one click → trace search in Jaeger UI with the correct filters.
3. Correlate Metrics and Traces in Code
When both metrics and traces are emitted from the same instrumentation layer, you gain powerful context: every spike in a metric can be tied to a specific trace or group of traces.
Here’s a simplified example using OpenTelemetry in Python:
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
import time

# Set up tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
trace.get_tracer_provider().add_span_processor(span_processor)

# Set up metrics
metrics.set_meter_provider(MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
        export_interval_millis=5000
    )]
))
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("http_requests_total", unit="1")
request_duration = meter.create_histogram("http_request_duration_seconds", unit="s")

# Instrument a request handler
def handle_request(request):
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.url", request.url)
        start_time = time.time()
        try:
            response = process_request(request)
            span.set_attribute("http.status_code", response.status_code)
            duration = time.time() - start_time
            # Record the same labels on both the counter and the histogram
            request_counter.add(1, {
                "method": request.method,
                "endpoint": request.endpoint,
                "status_code": str(response.status_code)
            })
            request_duration.record(duration, {
                "method": request.method,
                "endpoint": request.endpoint
            })
            return response
        except Exception as e:
            # Mark the span as failed and count the error with matching labels
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            request_counter.add(1, {
                "method": request.method,
                "endpoint": request.endpoint,
                "status_code": "500"
            })
            raise
When tracing and metrics share labels like method, endpoint, or status_code, the same context exists in both worlds, so jumping between dashboards and traces makes sense.
Handle High Cardinality in SPM Metrics
Service Performance Monitoring (SPM) is powerful, but if misconfigured, it can generate a flood of time series that crush your Prometheus setup. This section walks through why that happens and how to stay in control.
Why Cardinality Explodes
Each unique combination of dimension values in the SpanMetrics connector results in a new time series. It adds up quickly:
- 10 services × 5 operations × 10 status codes = 500 series
- Add user_id → 500 × number of users = millions of series
Some labels (like trace.id, user.id, or request.id) are naturally high-cardinality and should be avoided as dimensions. They're useful in traces, not in aggregated metrics.
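Before tuning anything, it helps to measure where you stand. A rough check in Prometheus, assuming the jaeger_spm namespace from the earlier config:

# Total number of jaeger_spm_calls_total series currently tracked
count(jaeger_spm_calls_total)
# Series per service, to find the biggest contributors
count by (service_name) (jaeger_spm_calls_total)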
Strategies to Keep Cardinality Manageable
1. Limit Dimensions to What’s Operationally Useful
Stick to dimensions that help during debugging or SLO monitoring. In your OpenTelemetry Collector config:
connectors:
  spanmetrics:
    dimensions:
      - name: service.name
      - name: operation          # Good for route-level visibility
      - name: http.method        # GET, POST, etc.
      - name: http.status_code   # Useful for error patterns
      # Avoid adding user.id, trace.id, or request.id here
Avoid the temptation to over-tag. If a label isn’t going to be part of a query or dashboard filter, don’t promote it to a dimension.
2. Use Caching and Batching Controls
To prevent memory exhaustion in the collector:
connectors:
  spanmetrics:
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
The dimensions_cache_size setting limits how many combinations will be stored in memory. If the cache is full, new time series will be dropped, avoiding OOM errors.
3. Monitor Collector Memory and Series Growth
Keep an eye on the internal metrics exposed by the collector:
curl http://localhost:8888/metrics | grep otelcol_processor_spanmetrics
Watch for indicators like:
- High spanmetrics_active_timeseries
- Memory spikes in the process metrics
How to Know When You Have a Cardinality Problem
You’ll usually notice cardinality issues in the form of:
- OpenTelemetry Collector using more memory over time
- Prometheus showing spikes in ingestion rate
- Dashboards slowing down or timing out
- Storage usage increasing rapidly, even with relatively stable traffic
If any of these happen, cardinality is a likely culprit.
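Prometheus's own self-metrics give a quick read on this (these are standard Prometheus internals, queried against Prometheus itself):

# Active series in the head block; watch the trend, not the absolute number
prometheus_tsdb_head_series
# Samples ingested per second; a sudden jump often points to a cardinality spike
rate(prometheus_tsdb_head_samples_appended_total[5m])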
Solutions When Limits Are Hit
- Reduce dimensions in the SpanMetrics config. This is the fastest way to cut the series volume.
- Apply sampling upstream for services with very high traffic (see the sketch after this list).
- Use an observability platform built for high-cardinality workloads.
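For the sampling option above, here's a sketch using the Collector's probabilistic_sampler processor (the percentage is a placeholder to tune per service):

processors:
  probabilistic_sampler:
    sampling_percentage: 25   # keep roughly 25% of traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp, spanmetrics]

Keep in mind that sampling before the spanmetrics connector also reduces the counts it sees, so request and error rates will be scaled down by the sampling ratio.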
What to Do When Prometheus Isn’t Enough
If you're already doing all of the above and Prometheus still struggles, consider offloading to a managed backend that’s designed to handle high-cardinality telemetry.
Last9 integrates natively with OpenTelemetry and Prometheus, offering streaming metric aggregation and long-term retention, without blowing up your infra budget.
Teams at Probo, CleverTap, and Replit trust Last9 for better observability and performance. The platform bridges metrics, logs, and traces without the operational complexity of maintaining TSDB tuning, retention policies, or scale-out storage clusters.

Troubleshooting Common Issues
At times, even with a working Jaeger and OpenTelemetry setup, things can break. Dashboards may stop showing data, memory usage might spike, or traces could appear inconsistent.
Here are the most common failure modes and how to fix them:
No Data in the SPM Dashboard
If your SPM dashboard is empty, the issue could be anywhere in the trace-to-metric pipeline. Step through each layer to isolate the problem.
1. Check if the collector is receiving spans
If the collector isn't ingesting any trace data, nothing gets processed into metrics:
curl http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans_total
A 0 here means your apps aren't exporting spans or the collector isn't properly wired up.
2. Check if spanmetrics are being emitted
Once spans are accepted, they need to be transformed into metrics:
curl http://localhost:8889/metrics | grep jaeger_spm_calls_total
No output means the spanmetrics connector might be misconfigured or not in the trace pipeline.
3. Check if Prometheus is scraping
Metrics may be generated but never scraped. Confirm that Prometheus is hitting the right endpoint:
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="otel-collector")'
Look for scrape errors or dropped targets.
4. Check if Jaeger can query Prometheus
SPM visualizations in Jaeger UI depend on Jaeger querying Prometheus directly:
curl http://localhost:14269/metrics | grep jaeger_prometheus_query_duration_seconds
If this metric is missing or flatlined, Jaeger may not be able to reach the Prometheus server, due to a wrong PROMETHEUS_SERVER_URL or network issues.
High Memory Usage in the Collector
The OpenTelemetry Collector is often the first point of failure when dealing with unbounded data, especially from span aggregation.
Symptoms: frequent container restarts, gradual memory creep, or slow metric export.
Here’s how to fix it:
- Apply memory limits to force garbage collection and protect the node:
processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
    spike_limit_mib: 128
- Shrink the dimension cache to reduce memory held for high-cardinality lookups:
connectors:
  spanmetrics:
    dimensions_cache_size: 500   # Lower than the default 1000
- Audit and trim span dimensions. Remove anything high-cardinality unless it's essential:
connectors:
  spanmetrics:
    dimensions:
      - name: service.name
      - name: operation
      # Avoid: user.id, request.id, trace.id
Gaps or Drops in SPM Metrics
If your dashboards show intermittent gaps, the issue may not be data loss—it could be a mismatch in time or sampling window.
Common causes:
- Clock skew between Jaeger and the collector
- Network flakiness or restart events
- Prometheus querying too narrow a window
Fixes:
- Monitor collector health using its built-in endpoint:
curl http://localhost:13133/
This helps confirm whether the collector is up and exporting.
- Use a longer query window to absorb short outages or clock mismatches:
rate(jaeger_spm_calls_total[10m])
- Increase clock skew tolerance in Jaeger to better align timestamps between trace and metric systems:
command:
  - "--query.max-clock-skew-adjustment=60s"
Prometheus Storage Overload
Large trace volumes and high-cardinality metrics can overwhelm Prometheus, especially with default settings.
Symptoms: slow dashboard queries, disk pressure warnings, OOMs.
Remedies:
- Offload to remote storage if you expect cardinality growth:
remote_write:
  - url: "https://your-remote-store/api/v1/write"
- Increase Prometheus resource limits to handle ingestion better:
prometheus:
  deploy:
    resources:
      limits:
        memory: 4G
        cpus: "2"
- Reduce data retention to shrink the working set and free disk:
command:
  - '--storage.tsdb.retention.time=7d'
This also helps with query latency, especially when dashboards default to long lookback ranges.
When in doubt, follow the flow: spans → collector → metrics → Prometheus → Jaeger UI. Debug each layer one at a time, and you’ll quickly find where things are stuck.
Final Thoughts
If you've set up SPM with Jaeger, you already know how powerful it is, right up until Prometheus starts slowing down, dashboards time out, or the collector eats memory trying to keep up. These aren't rare edge cases. They're what happens when tracing meets production scale.
Last9 gives you all the benefits of span-derived metrics, RED metrics, trace exemplars, and per-operation breakdowns, without the maintenance headaches.
No TSDB tuning, no guesswork with cardinality, no fragile dashboards.
And getting started is simple: drop in your existing OpenTelemetry setup, point it to Last9, and you’re up in minutes. No need to re-instrument or change your workflow, just better observability, built to scale.
FAQs
Can Jaeger metrics replace my existing APM solution?
Not entirely. Jaeger metrics are excellent for trace-derived insights and operational monitoring, but you'll still need application and infrastructure metrics from other sources for complete observability.
How much overhead does SPM add to my system?
The OpenTelemetry Collector typically adds 2–5% CPU overhead and 100–500MB memory usage, depending on trace volume and cardinality. The benefit usually outweighs the cost.
What's the difference between Jaeger v1 and v2 for metrics?
Jaeger v1 requires a separate OpenTelemetry Collector deployment. Jaeger v2 (based on OpenTelemetry Collector) will have integrated SPM capabilities, simplifying deployment.
Should I use SPM if I already have Prometheus metrics?
SPM provides operation-level granularity that application metrics often miss. If your existing metrics don't capture per-operation performance, SPM adds significant value.
How do I handle high-cardinality metrics without breaking Prometheus?
Limit dimensions to essential ones, use dimension caching, and monitor collector resource usage. For extreme cardinality, consider managed solutions like Last9.
Can I use SPM with Jaeger deployed in Kubernetes?
Yes, but you'll need to configure the OpenTelemetry Collector as a deployment or daemonset, and ensure proper service discovery for Prometheus scraping.
What happens if the OpenTelemetry Collector goes down?
Traces will still reach Jaeger (if configured with multiple exporters), but SPM metrics generation stops. Design for redundancy if SPM is critical.
How long does it take for SPM metrics to appear?
Typically 30–60 seconds from trace ingestion to metric availability, depending on batch processing and export intervals.
Are there any security considerations with Jaeger metrics?
Metrics endpoints expose operational data but not trace content. Still, secure them appropriately and be cautious about dimension values that might contain sensitive information.