You're monitoring a microservices-based system. Alerts trigger when response times exceed 2 seconds. But when you open Jaeger, you're faced with thousands of traces. Identifying which service or operation is responsible becomes time-consuming.
Jaeger metrics help reduce this friction by exposing aggregated telemetry. Instead of scanning individual traces, you get service-level and operation-level performance metrics (latency, throughput, and error rates) that highlight where the issue lies.
This blog covers how to:
- Enable and export internal Jaeger metrics via Prometheus
- Set up Service Performance Monitoring (SPM) in Jaeger
These provide a faster path from alert to root cause by bridging the gap between high-level monitoring and detailed trace analysis.
Why Jaeger Metrics Matter in Production
In most production setups, debugging with tracing looks like this:
An alert fires → you check your dashboards → you jump into Jaeger to look at traces.
But here’s the problem: traces are granular. They tell you what happened to a request, not what’s happening across requests. If you’ve ever sat staring at hundreds of traces wondering what to even filter for, you’ve felt that gap. That’s where Jaeger metrics help you.
What You Get from Jaeger Metrics
1. Infra-level signals
Jaeger exposes internal metrics about its components, like:
- Dropped spans (due to memory limits or misconfig)
- Queuing delays
- Query latencies
- Storage backend issues
Useful when tracing is set up, but you’re not seeing the expected data. These metrics help answer: Is Jaeger receiving and processing spans, or is something choking?
2. Trace-derived RED metrics
Jaeger can also give you:
- Request rate
- Error rate
- Duration (latency)
And not just per service: you can get operation-level breakdowns. These come directly from trace data, so you get visibility without setting up separate app-level metrics.
Why this matters:
- You can spot which endpoints are slowing down before you dig into traces.
- You get a sense of baseline latency and volume trends.
- Error spikes are easier to correlate with deployments or incidents.
Let’s say latency’s up. Instead of clicking through trace after trace, you can ask:
- Which service is contributing the most to p99 latency?
- Is this a frontend issue or a downstream service bottleneck?
- Are we dropping spans, or is the slowness real?
That’s faster than guessing which tags to filter by.
In short: Traces are great when you know what you’re looking for. Metrics are how you find what to look at. Jaeger metrics give you that missing middle, bridging high-level alerts and low-level trace detail.
Types of Metrics Jaeger Exposes
Jaeger emits two main categories of metrics — one focused on the tracing infrastructure itself, and the other on application performance using trace data.
1. Internal Metrics: Monitoring Jaeger Itself
These metrics give visibility into the health of the Jaeger components — Collector, Agent, Query, and Storage. They’re essential for making sure your tracing pipeline isn’t silently failing or lagging.
Runtime-level metrics:
- Memory and CPU usage
- Uptime, garbage collection, and resource usage per process
- Binary version info for tracking deployments or mismatches
Operational metrics:
- jaeger_collector_spans_received_total: Total spans received
- jaeger_collector_spans_saved_total: Successfully written spans
- jaeger_collector_spans_dropped_total: Spans dropped due to queue or buffer overflow
- jaeger_collector_queue_length: Queue backlog for span processing
- jaeger_query_requests_total: Number of requests to the query service
These metrics help answer:
- Is Jaeger ingesting and persisting spans properly?
- Is anything backing up or dropping under load?
- Are users hitting performance issues in the query layer?
When to use:
Always. If Jaeger is in your critical path, you need this visibility to avoid silent failures in your observability system.
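If you're alerting from Prometheus, the dropped-spans counter is the one to watch. Here's a minimal sketch of an alerting rule, assuming the metric names above; the group name and threshold are placeholders to adapt:

groups:
  - name: jaeger-internal        # placeholder rule group name
    rules:
      - alert: JaegerCollectorDroppingSpans
        # Fires if the collector has been dropping spans continuously for 5 minutes
        expr: rate(jaeger_collector_spans_dropped_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Jaeger collector is dropping spans"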
2. Service Performance Monitoring: Metrics from Traces
Jaeger can aggregate span data into metrics at both the service and operation levels. These span-derived metrics let you monitor system behavior without instrumenting everything twice.
Service-level metrics:
- Request rate: How many spans are emitted per service
- Error rate: Percentage of spans with error tags
- Latency percentiles: P50, P75, P95 response times
Operation-level metrics:
- Same RED metrics, but broken down by operation name
- Impact score: Combines latency and volume to surface high-impact endpoints
This lets you track performance over time, detect regressions, and catch issues before they hit alert thresholds.
When to use:
Use these when you need proactive visibility into service behavior. They're especially useful for teams running multiple microservices, where digging into traces first isn't scalable.
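For example, once SPM is in place (setup and metric names are covered later in this post), a regression check can be a single PromQL comparison. A sketch, with frontend as a placeholder service name:

# p95 latency now vs. the same window one week ago (a result above 1 suggests a regression)
histogram_quantile(0.95, sum(rate(jaeger_spm_duration_bucket{service_name="frontend"}[5m])) by (le))
/
histogram_quantile(0.95, sum(rate(jaeger_spm_duration_bucket{service_name="frontend"}[5m] offset 1w)) by (le))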
Decision Tree: Which Setup Do You Need?
Not every team needs the full Jaeger + SPM + OpenTelemetry Collector stack right away. Your tracing setup should match your system’s complexity and your team’s needs. Here’s how to decide.
If you're just getting started with Jaeger, start simple. Focus on internal metrics:
- Collector health
- Dropped spans
- Query performance
These give you the operational confidence that Jaeger itself is running reliably. Before you dive into trace analysis or RED metrics, make sure the tracing system isn’t silently failing.
If you're running Jaeger in production with multiple services, it's time to layer on Service Performance Monitoring (SPM). At this scale, it’s no longer practical to open up individual traces for every issue. SPM lets you:
- Catch latency spikes and error trends early
- Monitor service-level and operation-level performance
- Prioritize what to investigate before querying traces
For teams dealing with high trace volume or complex microservices, go further: use the OpenTelemetry Collector alongside SPM. The Collector helps you:
- Buffer and batch spans to avoid overwhelming storage
- Apply filters and sampling rules
- Generate aggregated metrics efficiently
This setup makes trace data usable at scale, especially when dealing with thousands of spans per second.
Already have detailed metrics from Prometheus or another observability stack?
Then ask: Does SPM add anything new? If your current metrics miss per-operation detail, like latency and error rates broken down by endpoint, then SPM fills that gap.
Set Up Internal Metrics with Jaeger
Compatibility: Jaeger 1.35+
Internal metrics give you visibility into Jaeger's performance: whether spans are being dropped, queues are backing up, or queries are getting slow. This setup works across all Jaeger deployment types.
1. Configure Jaeger to Expose Prometheus Metrics
All Jaeger components support Prometheus as a metrics backend. Add these flags to expose metrics on an HTTP endpoint:
# All-in-one (dev/test)
jaeger-all-in-one --metrics-backend=prometheus --metrics-http-route=/metrics
# Production components
jaeger-collector --metrics-backend=prometheus --metrics-http-route=/metrics
jaeger-query --metrics-backend=prometheus --metrics-http-route=/metrics
jaeger-agent --metrics-backend=prometheus --metrics-http-route=/metrics
Each component exposes metrics on its own port. For example, jaeger-all-in-one uses port 14269 for metrics.
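A quick way to confirm the endpoint is serving data (using the all-in-one port above):

# Should print a handful of jaeger_collector_* series
curl -s http://localhost:14269/metrics | grep jaeger_collector | head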
2. Docker Compose Example: Jaeger + Prometheus
For local testing or dev environments, here's a minimal setup that wires everything together:
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"   # Jaeger UI
      - "14269:14269"   # Prometheus metrics endpoint
    command:
      - "--metrics-backend=prometheus"
      - "--metrics-http-route=/metrics"
    environment:
      - COLLECTOR_OTLP_ENABLED=true
  prometheus:
    image: prom/prometheus:v2.40.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
3. Prometheus Configuration
Configure Prometheus to scrape Jaeger’s metrics endpoint:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'jaeger'
    static_configs:
      - targets: ['jaeger:14269']
    scrape_interval: 5s
    metrics_path: '/metrics'
What to Expect
Once spans are flowing, you’ll see metrics like:
- jaeger_collector_spans_received_total
- jaeger_collector_spans_dropped_total
- jaeger_collector_queue_length
- jaeger_query_requests_total
The /metrics endpoint is live as soon as Jaeger starts, but you won't see meaningful metrics until spans are being processed. If you see only zeros, double-check that your app is sending data.
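To confirm spans are actually flowing (not just that the endpoint is up), graph the ingest and drop rates in Prometheus:

# Spans ingested per second over the last 5 minutes (non-zero once your app sends data)
rate(jaeger_collector_spans_received_total[5m])
# Should stay at or near zero in a healthy pipeline
rate(jaeger_collector_spans_dropped_total[5m])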
Set Up Service Performance Monitoring (SPM)
Compatibility: Jaeger 1.35+, OpenTelemetry Collector Contrib 0.60+
SPM gives you RED metrics (request rate, error rate, duration) directly from trace data, so there's no need to instrument your code again just for metrics. But it requires more moving parts: you'll need an OpenTelemetry Collector configured to receive spans, generate metrics, and export them to Prometheus.
Why You Need the OpenTelemetry Collector
Jaeger alone stores and serves spans. It doesn’t produce time-series metrics.
To generate metrics from spans, you need something that:
- Reads spans
- Aggregates them by service and operation
- Emits metrics on a schedule
The SpanMetrics connector in the OpenTelemetry Collector does exactly that.
Step 1: Configure the OpenTelemetry Collector
Here's a minimal otel-collector-config.yaml:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  memory_limiter:
    limit_mib: 1000
    spike_limit_mib: 200
    check_interval: 5s
connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [100us, 1ms, 2ms, 6ms, 10ms, 100ms, 250ms, 500ms, 1s, 2s, 5s]
    dimensions:
      - name: http.method
        default: GET
      - name: http.status_code
        default: "200"
      - name: service.name
      - name: operation
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
    metrics_flush_interval: 30s
exporters:
  otlp:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: "jaeger_spm"
    const_labels:
      service_name: "jaeger_spm"
extensions:
  health_check:
    endpoint: 0.0.0.0:13133   # liveness endpoint, referenced under service.extensions
service:
  extensions: [health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp, spanmetrics]
    metrics:
      receivers: [spanmetrics]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
Notes:
- histogram.explicit.buckets: Adjust the buckets to fit your app's latency patterns.
- dimensions: Only add what you need. More dimensions = more memory.
- metrics_flush_interval: Controls how often metrics get exported.
- dimensions_cache_size: Prevents unbounded cardinality.
Step 2: Connect Jaeger to Prometheus for Metrics
Jaeger Query needs to be configured to use Prometheus as a backend for metrics. Add the following to the Jaeger service:
environment:
  - COLLECTOR_OTLP_ENABLED=true
  - METRICS_STORAGE_TYPE=prometheus
  - PROMETHEUS_SERVER_URL=http://prometheus:9090
  - PROMETHEUS_QUERY_SUPPORT_SPANMETRICS_CONNECTOR=true
command:
  - "--query.max-clock-skew-adjustment=30s"
The max-clock-skew-adjustment setting helps align timestamps between trace spans and the derived metrics.
Step 3: Use Docker Compose to Run Everything
Here’s a full working setup that runs Jaeger, the OpenTelemetry Collector, and Prometheus together:
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.88.0
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"
      - "4318:4318"
      - "8889:8889"
      - "13133:13133"
    depends_on:
      - jaeger
      - prometheus
    restart: unless-stopped
  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"
      - "14269:14269"
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - METRICS_STORAGE_TYPE=prometheus
      - PROMETHEUS_SERVER_URL=http://prometheus:9090
      - PROMETHEUS_QUERY_SUPPORT_SPANMETRICS_CONNECTOR=true
    command:
      - "--query.max-clock-skew-adjustment=30s"
    depends_on:
      - prometheus
    restart: unless-stopped
  prometheus:
    image: prom/prometheus:v2.40.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    restart: unless-stopped
Step 4: Set Up Prometheus to Scrape Metrics
Update your Prometheus config to scrape metrics from Jaeger and the OTel Collector:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'jaeger'
    static_configs:
      - targets: ['jaeger:14269']
    scrape_interval: 5s
    metrics_path: '/metrics'
  - job_name: 'otel-collector'
    static_configs:
      - targets: ['otel-collector:8889']
    scrape_interval: 5s
    metrics_path: '/metrics'
  - job_name: 'otel-collector-internal'
    static_configs:
      - targets: ['otel-collector:8888']
    scrape_interval: 30s
    metrics_path: '/metrics'
What This Setup Gives You
Once everything is connected and spans start flowing:
- You’ll see service-level RED metrics: requests per second, error rates, and latency percentiles
- You’ll get operation-level breakdowns across dimensions like HTTP method and status
- Jaeger’s UI will show span metrics alongside traces, helping you move from high-level patterns to deep request analysis
This setup turns your trace firehose into actionable metrics. If your team manages more than a handful of services, this is where Jaeger becomes more than just a trace viewer; it becomes an observability system.
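If you don't have an instrumented app handy yet, a short script can push test spans through the pipeline. This is a sketch using the OpenTelemetry Python SDK (it assumes opentelemetry-sdk and opentelemetry-exporter-otlp are installed; the service name is a placeholder):

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
import time

# Send a handful of spans to the collector's OTLP gRPC port (4317)
provider = TracerProvider(resource=Resource.create({"service.name": "spm-smoke-test"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("spm-smoke-test")

for _ in range(20):
    with tracer.start_as_current_span("GET /api/ping") as span:
        span.set_attribute("http.method", "GET")
        span.set_attribute("http.status_code", 200)
        time.sleep(0.05)  # simulate a little latency

provider.force_flush()

After one metrics flush interval and a Prometheus scrape (roughly 30-60 seconds), the spm-smoke-test service should show up in jaeger_spm_calls_total and in the Jaeger UI's Monitor tab.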
Understand SPM Metrics in Production
Now that Service Performance Monitoring is set up and spans are flowing through the OpenTelemetry Collector, you'll start seeing a series of jaeger_spm_* metrics in Prometheus. These represent service-level and operation-level behavior based entirely on trace data.
Core SPM Metrics
These are the primary metrics generated by the spanmetrics connector:
- jaeger_spm_calls_total: Total number of spans received (used for request rate)
- jaeger_spm_duration_bucket: Duration histogram (used for latency percentiles)
- jaeger_spm_duration_count: Count of durations (used for average/request rate calculation)
- jaeger_spm_duration_sum: Total duration (used to calculate average latency)
Useful PromQL Queries for Dashboards
These are common queries to visualize performance in Grafana or Prometheus dashboards.
Request rate per service (last 5 minutes):
sum(rate(jaeger_spm_calls_total[5m])) by (service_name)
Error rate per service:
sum(rate(jaeger_spm_calls_total{status_code!~"2.."}[5m])) by (service_name)
/
sum(rate(jaeger_spm_calls_total[5m])) by (service_name)
95th percentile latency by service:
histogram_quantile(0.95,
sum(rate(jaeger_spm_duration_bucket[5m])) by (service_name, le)
)
Top 10 highest-impact operations (latency × throughput):
topk(10,
histogram_quantile(0.95,
sum(rate(jaeger_spm_duration_bucket[5m])) by (service_name, operation, le)
)
*
sum(rate(jaeger_spm_calls_total[5m])) by (service_name, operation)
)
How to Interpret the Impact Metric
The “impact” metric is a practical proxy for prioritization. It combines latency and throughput to help identify where optimization will have the greatest effect.
- High impact + High latency → Critical optimization target
- High impact + Low latency → High-throughput path, optimize for efficiency
- Low impact + High latency → Often fine (e.g., batch jobs or admin operations)
- Low impact + Low latency → Likely not worth investigating
Use Jaeger’s REST API to Query Metrics
If you’re not using Prometheus directly, or you want to power custom dashboards or tooling without writing PromQL, you can query SPM data through Jaeger’s built-in REST API. These endpoints mirror what you see in the Jaeger UI but are accessible programmatically.
Example API Calls
1. Get request volume by service:
curl "http://localhost:16686/api/metrics/calls?service=frontend&lookback=1h&step=60s"
2. Get error rate by service:
curl "http://localhost:16686/api/metrics/errors?service=frontend&lookback=1h&step=60s"
3. Get latency quantile (e.g., P95):
curl "http://localhost:16686/api/metrics/latencies?service=frontend&lookback=1h&step=60s&quantile=0.95"
4. Get metrics for a specific operation:
curl "http://localhost:16686/api/metrics/calls?service=frontend&operation=GET%20/api/users&lookback=1h"
Pros and Limitations
Pros
- Simple to integrate with custom dashboards, CLIs, or scripts
- Built-in filtering by service and operation
- Output matches what Jaeger UI graphs display
Limitations
- Limited flexibility compared to PromQL
- No advanced grouping, rate calculations, or histogram quantiles
- Lookback windows and step intervals must be explicitly managed
Connect Metrics to Traces for Faster Debugging
Traces tell you what happened. Metrics tell you how often. But real observability comes from linking the two, using metrics to surface problems, then jumping directly into trace data to debug them.
1. Trigger Trace Analysis from Alerts
Set up alerting rules on Jaeger SPM metrics. When a threshold is breached—like a sudden spike in errors—you can jump directly into Jaeger with pre-filtered trace queries.
Example: Alert on high error rate
- alert: HighServiceErrorRate
  expr: |
    sum(rate(jaeger_spm_calls_total{status_code!~"2.."}[5m])) by (service_name) /
    sum(rate(jaeger_spm_calls_total[5m])) by (service_name) > 0.05
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "High error rate for {{ $labels.service_name }}"
    description: "Error rate is {{ $value | humanizePercentage }}"
    jaeger_query: "http://jaeger:16686/search?service={{ $labels.service_name }}&lookback=1h&tags=error%3Atrue"
The jaeger_query annotation can be used in alert receivers or UIs to go directly from an alert to filtered traces.
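How you consume that annotation depends on your alerting stack. As one sketch, an Alertmanager Slack receiver can include the link in the notification text (the receiver name, channel, and webhook URL are placeholders):

receivers:
  - name: 'oncall-slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/PLACEHOLDER'
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'
        # Surfaces the jaeger_query annotation so responders can jump straight to filtered traces
        text: "{{ .CommonAnnotations.description }}\nTraces: {{ .CommonAnnotations.jaeger_query }}"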
2. Link Metrics to Traces in Dashboards
Dashboards can do more than visualize metrics. You can wire them up to link back into Jaeger, letting developers move from charts to traces in one click.
Example: Grafana panel with trace links
{
  "title": "Service Performance with Trace Links",
  "panels": [
    {
      "title": "Request Rate by Service",
      "targets": [
        {
          "expr": "sum(rate(jaeger_spm_calls_total[5m])) by (service_name)",
          "legendFormat": "{{service_name}}"
        }
      ],
      "links": [
        {
          "title": "View Traces",
          "url": "http://localhost:16686/search?service=${__field.labels.service_name}&lookback=1h"
        }
      ]
    }
  ]
}
You get quick navigation: a spike in the graph → one click → trace search in Jaeger UI with the correct filters.
3. Correlate Metrics and Traces in Code
When both metrics and traces are emitted from the same instrumentation layer, you gain powerful context: every spike in a metric can be tied to a specific trace or group of traces.
Here’s a simplified example using OpenTelemetry in Python:
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
import time

# Set up tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)
span_processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
trace.get_tracer_provider().add_span_processor(span_processor)

# Set up metrics
metrics.set_meter_provider(MeterProvider(
    metric_readers=[PeriodicExportingMetricReader(
        OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
        export_interval_millis=5000
    )]
))
meter = metrics.get_meter(__name__)
request_counter = meter.create_counter("http_requests_total", unit="1")
request_duration = meter.create_histogram("http_request_duration_seconds", unit="s")

# Instrument a request handler
def handle_request(request):
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("http.method", request.method)
        span.set_attribute("http.url", request.url)
        start_time = time.time()
        try:
            response = process_request(request)
            span.set_attribute("http.status_code", response.status_code)
            duration = time.time() - start_time
            # Record the same labels on both the counter and the histogram
            request_counter.add(1, {
                "method": request.method,
                "endpoint": request.endpoint,
                "status_code": str(response.status_code)
            })
            request_duration.record(duration, {
                "method": request.method,
                "endpoint": request.endpoint
            })
            return response
        except Exception as e:
            # Mark the span as failed and count the error with matching labels
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            request_counter.add(1, {
                "method": request.method,
                "endpoint": request.endpoint,
                "status_code": "500"
            })
            raise
When tracing and metrics share labels like method, endpoint, or status_code, the same context exists in both worlds, so jumping between dashboards and traces makes sense.
Handle High Cardinality in SPM Metrics
Service Performance Monitoring (SPM) is powerful, but if misconfigured, it can generate a flood of time series that crush your Prometheus setup. This section walks through why that happens and how to stay in control.
Why Cardinality Explodes
Each unique combination of dimension values in the SpanMetrics connector results in a new time series. It adds up quickly:
- 10 services × 5 operations × 10 status codes = 500 series
- Add user_id → 500 × number of users = millions of series
Some labels (like trace.id, user.id, or request.id) are naturally high-cardinality and should be avoided as dimensions. They're useful in traces, not in aggregated metrics.
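Before tuning anything, it helps to measure where you stand. A rough check in Prometheus, assuming the jaeger_spm namespace from the earlier config:

# Total number of jaeger_spm_calls_total series currently tracked
count(jaeger_spm_calls_total)
# Series per service, to find the biggest contributors
count by (service_name) (jaeger_spm_calls_total)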
Strategies to Keep Cardinality Manageable
1. Limit Dimensions to What’s Operationally Useful
Stick to dimensions that help during debugging or SLO monitoring. In your OpenTelemetry Collector config:
connectors:
  spanmetrics:
    dimensions:
      - name: service.name
      - name: operation          # Good for route-level visibility
      - name: http.method        # GET, POST, etc.
      - name: http.status_code   # Useful for error patterns
      # Avoid adding user.id, trace.id, or request.id here
Avoid the temptation to over-tag. If a label isn’t going to be part of a query or dashboard filter, don’t promote it to a dimension.
2. Use Caching and Batching Controls
To prevent memory exhaustion in the collector:
connectors:
  spanmetrics:
    dimensions_cache_size: 1000
    aggregation_temporality: "AGGREGATION_TEMPORALITY_CUMULATIVE"
The dimensions_cache_size setting limits how many combinations will be stored in memory. If the cache is full, new time series will be dropped, avoiding OOM errors.
3. Monitor Collector Memory and Series Growth
Keep an eye on the internal metrics exposed by the collector:
curl http://localhost:8888/metrics | grep otelcol_processor_spanmetrics
Watch for indicators like:
- High spanmetrics_active_timeseries
- Memory spikes in the process metrics
How to Know When You Have a Cardinality Problem
You’ll usually notice cardinality issues in the form of:
- OpenTelemetry Collector using more memory over time
- Prometheus showing spikes in ingestion rate
- Dashboards slowing down or timing out
- Storage usage increasing rapidly, even with relatively stable traffic
If any of these happen, cardinality is a likely culprit.
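Prometheus's own self-metrics give a quick read on this (these are standard Prometheus internals, queried against Prometheus itself):

# Active series in the head block; watch the trend, not the absolute number
prometheus_tsdb_head_series
# Samples ingested per second; a sudden jump often points to a cardinality spike
rate(prometheus_tsdb_head_samples_appended_total[5m])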
Solutions When Limits Are Hit
- Reduce dimensions in the SpanMetrics config. This is the fastest way to cut the series volume.
- Apply sampling upstream for services with very high traffic (see the sketch after this list).
- Use an observability platform built for high-cardinality workloads.
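For the sampling option above, here's a sketch using the Collector's probabilistic_sampler processor (the percentage is a placeholder to tune per service):

processors:
  probabilistic_sampler:
    sampling_percentage: 25   # keep roughly 25% of traces

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, probabilistic_sampler, batch]
      exporters: [otlp, spanmetrics]

Keep in mind that sampling before the spanmetrics connector also reduces the counts it sees, so request and error rates will be scaled down by the sampling ratio.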
What to Do When Prometheus Isn’t Enough
If you're already doing all of the above and Prometheus still struggles, consider offloading to a managed backend that’s designed to handle high-cardinality telemetry.
Last9 integrates natively with OpenTelemetry and Prometheus, offering streaming metric aggregation and long-term retention, without blowing up your infra budget.
Teams at Probo, CleverTap, and Replit trust Last9 for better observability and performance. The platform bridges metrics, logs, and traces without the operational complexity of maintaining TSDB tuning, retention policies, or scale-out storage clusters.

Troubleshooting Common Issues
At times, even with a working Jaeger and OpenTelemetry setup, things can break. Dashboards may stop showing data, memory usage might spike, or traces could appear inconsistent.
Here are the most common failure modes and how to fix them:
No Data in the SPM Dashboard
If your SPM dashboard is empty, the issue could be anywhere in the trace-to-metric pipeline. Step through each layer to isolate the problem.
1. Check if the collector is receiving spans
If the collector isn't ingesting any trace data, nothing gets processed into metrics:
curl http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans_total
A 0 here means your apps aren't exporting spans or the collector isn't properly wired up.
2. Check if spanmetrics are being emitted
Once spans are accepted, they need to be transformed into metrics:
curl http://localhost:8889/metrics | grep jaeger_spm_calls_total
No output means the spanmetrics connector might be misconfigured or not in the trace pipeline.
3. Check if Prometheus is scraping
Metrics may be generated but never scraped. Confirm that Prometheus is hitting the right endpoint:
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="otel-collector")'
Look for scrape errors or dropped targets.
4. Check if Jaeger can query Prometheus
SPM visualizations in Jaeger UI depend on Jaeger querying Prometheus directly:
curl http://localhost:14269/metrics | grep jaeger_prometheus_query_duration_seconds
If this metric is missing or flatlined, Jaeger may not be able to reach the Prometheus server, due to a wrong PROMETHEUS_SERVER_URL or network issues.
High Memory Usage in the Collector
The OpenTelemetry Collector is often the first point of failure when dealing with unbounded data, especially from span aggregation.
Symptoms: frequent container restarts, gradual memory creep, or slow metric export.
Here’s how to fix it:
- Apply memory limits to force garbage collection and protect the node:
processors:
  memory_limiter:
    check_interval: 5s
    limit_mib: 512
    spike_limit_mib: 128
- Shrink the dimension cache to reduce memory held for high-cardinality lookups:
connectors:
  spanmetrics:
    dimensions_cache_size: 500   # Lower than the default 1000
- Audit and trim span dimensions. Remove anything high-cardinality unless it's essential:
connectors:
  spanmetrics:
    dimensions:
      - name: service.name
      - name: operation
      # Avoid: user.id, request.id, trace.id
Gaps or Drops in SPM Metrics
If your dashboards show intermittent gaps, the issue may not be data loss—it could be a mismatch in time or sampling window.
Common causes:
- Clock skew between Jaeger and the collector
- Network flakiness or restart events
- Prometheus querying too narrow a window
Fixes:
- Monitor collector health using its built-in endpoint:
curl http://localhost:13133/
This helps confirm whether the collector is up and exporting.
- Use a longer query window to absorb short outages or clock mismatches:
rate(jaeger_spm_calls_total[10m])
- Increase clock skew tolerance in Jaeger to better align timestamps between trace and metric systems:
command:
  - "--query.max-clock-skew-adjustment=60s"
Prometheus Storage Overload
Large trace volumes and high-cardinality metrics can overwhelm Prometheus, especially with default settings.
Symptoms: slow dashboard queries, disk pressure warnings, OOMs.
Remedies:
- Offload to remote storage if you expect cardinality growth:
remote_write:
  - url: "https://your-remote-store/api/v1/write"
- Increase Prometheus resource limits to handle ingestion better:
prometheus:
  deploy:
    resources:
      limits:
        memory: 4G
        cpus: "2"
- Reduce data retention to shrink the working set and free disk:
command:
  - '--storage.tsdb.retention.time=7d'
This also helps with query latency, especially when dashboards default to long lookback ranges.
When in doubt, follow the flow: spans → collector → metrics → Prometheus → Jaeger UI. Debug each layer one at a time, and you’ll quickly find where things are stuck.
Final Thoughts
If you've set up SPM with Jaeger, you already know how powerful it is, right up until Prometheus starts slowing down, dashboards time out, or the collector eats memory trying to keep up. These aren't rare edge cases. They're what happens when tracing meets production scale.
Last9 gives you all the benefits of span-derived metrics, RED metrics, trace exemplars, and per-operation breakdowns, without the maintenance headaches.
No TSDB tuning, no guesswork with cardinality, no fragile dashboards.
And getting started is simple: drop in your existing OpenTelemetry setup, point it to Last9, and you’re up in minutes. No need to re-instrument or change your workflow, just better observability, built to scale.
FAQs
Can Jaeger metrics replace my existing APM solution?
Not entirely. Jaeger metrics are excellent for trace-derived insights and operational monitoring, but you'll still need application and infrastructure metrics from other sources for complete observability.
How much overhead does SPM add to my system?
The OpenTelemetry Collector typically adds 2–5% CPU overhead and 100–500MB memory usage, depending on trace volume and cardinality. The benefit usually outweighs the cost.
What's the difference between Jaeger v1 and v2 for metrics?
Jaeger v1 requires a separate OpenTelemetry Collector deployment. Jaeger v2 (based on OpenTelemetry Collector) will have integrated SPM capabilities, simplifying deployment.
Should I use SPM if I already have Prometheus metrics?
SPM provides operation-level granularity that application metrics often miss. If your existing metrics don't capture per-operation performance, SPM adds significant value.
How do I handle high-cardinality metrics without breaking Prometheus?
Limit dimensions to essential ones, use dimension caching, and monitor collector resource usage. For extreme cardinality, consider managed solutions like Last9.
Can I use SPM with Jaeger deployed in Kubernetes?
Yes, but you'll need to configure the OpenTelemetry Collector as a deployment or daemonset, and ensure proper service discovery for Prometheus scraping.
What happens if the OpenTelemetry Collector goes down?
Traces will still reach Jaeger (if configured with multiple exporters), but SPM metrics generation stops. Design for redundancy if SPM is critical.
How long does it take for SPM metrics to appear?
Typically 30–60 seconds from trace ingestion to metric availability, depending on batch processing and export intervals.
Are there any security considerations with Jaeger metrics?
Metrics endpoints expose operational data but not trace content. Still, secure them appropriately and be cautious about dimension values that might contain sensitive information.