Jaeger Monitoring: Essential Metrics and Alerting for Production Tracing Systems

Monitor Jaeger in production with core metrics and alerting rules: track trace completion, queue depth, and storage performance at scale.

Your Jaeger setup is running. Traces are coming in, and the UI is helping you spot slow services or debug broken flows.

But just like any part of your observability stack, Jaeger needs some basic monitoring to stay reliable. If the collector starts queueing spans or the agent runs out of buffer, it can lead to dropped traces, sometimes without any obvious sign in the UI.

This blog focuses on the operational side of Jaeger:

  • Which metrics are worth watching
  • How to identify issues like queue overflows or dropped spans early
  • Simple alerting setups that help keep tracing dependable

Key Jaeger Infrastructure Metrics

Jaeger isn't just a UI for viewing traces; it's a set of services that need to stay healthy for tracing to be reliable. Monitoring should focus on the core components: collectors, query services, and storage backends. Each one fails in different ways, and those failures affect trace visibility differently.

1. Jaeger Collector Metrics

The collector handles trace ingestion. If it gets overwhelmed or misconfigured, trace data can be lost before it’s ever saved. These metrics help track collector health and throughput.

1.1 Span Reception and Processing Rates

# Rate of spans received by collectors
rate(jaeger_collector_spans_received_total[5m])

# Rate of traces successfully saved
rate(jaeger_collector_traces_saved_total[5m])

# Share of received traces that are successfully saved
(
  rate(jaeger_collector_traces_saved_total[5m]) /
  rate(jaeger_collector_traces_received_total[5m])
) * 100

If the completion rate drops below 95%, some traces aren’t making it through. This usually points to issues with the storage backend, network timeouts, or CPU limits on the collectors.
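
A quick first check when the completion rate dips is whether collectors are CPU-bound. This is only a sketch: it assumes collectors expose the standard Go process metrics and are scraped under a jaeger-collector job, so adjust the job label to match your scrape config.

# CPU seconds consumed per second by each collector process
# (the job label value is an assumption — match it to your scrape config)
rate(process_cpu_seconds_total{job="jaeger-collector"}[5m])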

1.2 Queue Depth and Backpressure

# Current queue length in each collector
jaeger_collector_queue_length

# Collector queue usage as a percentage of total capacity
(jaeger_collector_queue_length / jaeger_collector_queue_capacity) * 100

When queue utilization crosses 70%, it’s a signal that collectors are falling behind. At that point, it’s worth checking CPU, memory, and storage pressure—or scaling out collectors to absorb the incoming span load.
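
Since this ratio feeds both dashboards and the alerts later in this post, it can be convenient to precompute it as a Prometheus recording rule. A minimal sketch, assuming a standard rule file loaded by your Prometheus server:

groups:
  - name: jaeger-collector-recording
    rules:
      # Queue utilization as a 0–1 ratio, reusable in dashboards and alerts
      - record: jaeger:collector_queue_utilization:ratio
        expr: jaeger_collector_queue_length / jaeger_collector_queue_capacity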

💡
For a deeper look at what each Jaeger metric means and how to use them, check out this detailed breakdown.

2. Storage Backend Metrics

Your storage backend affects two things: how fast traces are ingested, and how quickly they can be queried. If writes start lagging, collectors back up. If reads slow down, the UI becomes unusable during incident triage.

The exact metrics depend on whether you're using Elasticsearch or Cassandra.

2.1 Elasticsearch: Key Metrics

indices.indexing.index_time_in_millis   # Write latency  
indices.search.query_time_in_millis     # Query latency  
cluster.status                          # Overall health

Jaeger’s default Elasticsearch setup creates daily indices for trace data. That pattern can cause indexing delays as the cluster grows. Monitor disk usage, shard allocation, and indexing throughput, especially if you retain traces for more than a few days.
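
As a rough sketch, disk headroom and indexing throughput can be watched with queries like the ones below. The metric names assume the prometheus-community elasticsearch_exporter; adjust them to whatever exporter you use to scrape Elasticsearch.

# Disk headroom per Elasticsearch data node
elasticsearch_filesystem_data_available_bytes
  / elasticsearch_filesystem_data_size_bytes

# Documents indexed per second (roughly tracks span writes)
rate(elasticsearch_indices_indexing_index_total[5m])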

2.2 Cassandra: Key Metrics

write_latency_99th_percentile      # Ingestion performance  
read_latency_99th_percentile       # Query performance  
pending_compactions                # Compaction backlog

Cassandra handles high write volumes well, but background operations like compaction can stall ingestion if they fall behind. Watch for spikes in pending_compactions, and monitor tombstone accumulation if you’re running frequent deletes or TTL-based cleanup.

3. Query Service Metrics

The query service is what developers interact with, either through the Jaeger UI or APIs. If it's slow, your traces are technically there, but hard to use.

3.1 Key Prometheus Metrics

# P95 latency for trace queries
histogram_quantile(0.95, 
  rate(jaeger_query_request_duration_seconds_bucket[5m])
)

# Query success rate
rate(jaeger_query_requests_total{status_code="200"}[5m]) /
rate(jaeger_query_requests_total[5m])

If query latency goes over 5 seconds consistently, check the storage backend. It usually points to slow disk I/O, overloaded shards, or inefficient trace queries. You can also break down by query type to find out which patterns are most expensive and tune accordingly.
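
To break latency down by query type, the same histogram can be grouped before taking the quantile. The operation label here is an assumption (it's the label used in the topk example later in this post); substitute whatever labels your Jaeger version exposes.

# P95 query latency per operation
histogram_quantile(0.95,
  sum by (operation, le) (
    rate(jaeger_query_request_duration_seconds_bucket[5m])
  )
)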

Alert Definitions for Core Jaeger Metrics

Monitoring tells you what’s happening, and alerts help you act on it. These alert definitions focus on early signals that affect trace ingestion, completeness, and query responsiveness. They’re designed to catch infrastructure pressure before it impacts trace availability or developer workflows.

High-Priority Alerts

1. Drop in Trace Completion Rate

- alert: JaegerTraceDataLoss
  expr: |
    (
      rate(jaeger_collector_traces_saved_total[5m]) /
      rate(jaeger_collector_traces_received_total[5m])
    ) < 0.95
  for: 2m
  annotations:
    summary: "Jaeger trace completion rate below threshold"
    description: "Less than 95% of received traces are being saved successfully"

This alert measures the share of received traces that end up saved to storage. If the ratio drops below 95%, it could point to slow writes, resource constraints in collectors, or configuration issues in sampling or ingestion.

2. Collector Queue Utilization Exceeds 80%

- alert: JaegerCollectorQueueHigh
  expr: jaeger_collector_queue_length > (jaeger_collector_queue_capacity * 0.8)
  for: 1m
  annotations:
    summary: "Jaeger collector queue usage above 80%"
    description: "Queue utilization is {{ $value }}% of total capacity"

This alert tracks how full the collector’s internal queue is. Sustained high usage suggests the system is processing traces slower than they’re arriving. It's often a sign that scaling adjustments or backend checks may be needed.

Medium-Priority Alerts

3. Increased Latency in Query Responses

- alert: JaegerQuerySlow
  expr: |
    histogram_quantile(0.95,
      rate(jaeger_query_request_duration_seconds_bucket[5m])
    ) > 10
  for: 5m
  annotations:
    summary: "High latency in Jaeger query responses"
    description: "P95 query time is {{ $value }}s"

This alert helps identify slow performance in the query service. Long response times can be caused by wide time ranges, storage load, or suboptimal trace indexing. While it doesn’t affect ingestion, it can delay trace access during investigations.

4. Elevated Write Latency in Storage Backend

- alert: JaegerStorageLatencyHigh
  expr: rate(elasticsearch_indices_indexing_index_time_in_millis[5m]) > 500
  for: 3m
  annotations:
    summary: "Write latency detected in storage backend"
    description: "Elasticsearch is spending {{ $value }}ms per second on indexing"

High indexing latency in the storage backend can slow down trace ingestion and eventually lead to collector queue buildup. This alert surfaces storage pressure early so adjustments can be made before it affects overall trace flow. Note that index_time_in_millis is a cumulative counter, so alert on its rate rather than its absolute value, and adjust the metric name to match the exporter you use to scrape Elasticsearch.

💡
If you're comparing tracing backends, this Jaeger vs. Zipkin guide breaks down differences in architecture and operations.

Estimate Storage Requirements Based on Trace Volume

Jaeger’s storage footprint depends on how much trace data you send, how often you sample, and how large each span is. When traces are retained for weeks or longer, even small underestimations can lead to storage pressure down the line.

To get a reasonable estimate, account for:

  • Number of instrumented services
  • Average request rate per service
  • Average spans per trace
  • Sampling rate
  • Average span payload size (tags, logs, attributes, etc.)

Here’s the general formula:

Daily Trace Storage = 
(requests/day) × (sampling rate) × (spans per trace) × (avg span size)

Example:
A system with:

  • 1 million requests per day
  • 1% sampling rate
  • 10 spans per trace
  • 2 KB per span

Would generate:

1,000,000 × 0.01 × 10 × 2 KB = 200,000 KB ≈ 200 MB/day

At 100% sampling, the same workload would produce roughly 20 GB/day, which is where retention planning starts to matter.

Use this estimate to set retention limits, budget for disk usage, or validate actual backend growth over time. If usage patterns shift—like a sudden spike in traffic or a change in span tagging—update the numbers and adjust storage thresholds accordingly.
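
To validate the estimate against reality, you can project current disk growth forward. A sketch using predict_linear, again assuming elasticsearch_exporter metric names; swap in your own disk metrics for Cassandra or other backends.

# Returns a series if any data node is projected to run out of disk within 7 days
predict_linear(elasticsearch_filesystem_data_available_bytes[6h], 7 * 24 * 3600) < 0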

Scale Collectors Based on Queue Pressure

Collector performance depends on how quickly it can accept spans and push them to storage. If storage becomes a bottleneck or traffic increases beyond current capacity, collector queues begin to fill.

To track this:

avg(jaeger_collector_queue_length) / avg(jaeger_collector_queue_capacity) > 0.6

When queue usage exceeds 60% consistently, it’s a sign that ingestion is slower than it should be. You can either scale collectors horizontally or investigate storage performance issues. Co-locating collectors with application services can also help by reducing delivery latency and network overhead.
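
If you want an earlier warning than the 80% alert defined above, a lower-threshold rule can be layered on top. A sketch in the same alert format, with the 60% threshold and 15-minute window as tunable assumptions:

- alert: JaegerCollectorQueueSaturating
  expr: |
    avg(jaeger_collector_queue_length)
      / avg(jaeger_collector_queue_capacity) > 0.6
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Jaeger collector queues trending toward saturation"
    description: "Average queue utilization has stayed above 60% for 15 minutes"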

Identify Slow Queries Using Access Patterns

While the Jaeger query service is stateless, its responsiveness is tied directly to the trace storage backend. If queries start slowing down, it's usually due to long lookback windows, inefficient indexing, or unbounded searches.

To surface the most expensive operations:

topk(5, sum by (operation) (
  rate(jaeger_query_request_duration_seconds_sum[1h]) /
  rate(jaeger_query_request_duration_seconds_count[1h])
))

This shows which operations have the highest average query duration over the past hour. It’s a solid starting point for:

  • Optimizing trace indexing strategy
  • Detecting UI query bottlenecks during high traffic
  • Moving long-tail queries to a more cost-efficient backend
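
Average duration is one lens; raw query volume is another. The _count series from the same histogram shows which operations are hit most often, which helps spot burst traffic from the UI:

# Operations issued most frequently over the past hour
topk(5, sum by (operation) (
  rate(jaeger_query_request_duration_seconds_count[1h])
))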

Advanced Monitoring Patterns Using Jaeger Metrics

Once core metrics like collector health and storage latency are in place, Jaeger can offer deeper operational insight. These patterns focus on trace quality, dependency stability, and whether your sampling strategy is working as intended.

Measure Trace Quality Across Services

A trace isn’t always complete. If it’s missing spans or lacks detail, it won’t help during debugging. These metrics help surface gaps in instrumentation and highlight where span volume may be too low:

# Average spans per trace
rate(jaeger_collector_spans_received_total[5m]) /
rate(jaeger_collector_traces_received_total[5m])

# Ratio of spans with error tags
rate(jaeger_collector_spans_received_total{error="true"}[5m]) /
rate(jaeger_collector_spans_received_total[5m])

Low spans-per-trace numbers often point to services with incomplete instrumentation. A spike in error-tagged spans can indicate service degradation before it shows up in logs or user-facing metrics.
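
To see which services are driving the error ratio, the same query can be broken out per service. This assumes your Jaeger version attaches a svc label to collector metrics; if it doesn't, the grouping simply collapses to a single series.

# Error-tagged span ratio broken out by service (assumes a svc label)
sum by (svc) (rate(jaeger_collector_spans_received_total{error="true"}[5m]))
/
sum by (svc) (rate(jaeger_collector_spans_received_total[5m]))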

Identify Failing Service Dependencies

Jaeger client spans show how services interact with downstream dependencies. If a service is frequently seeing failures when making outbound calls, that’s often visible in these spans:

# Error rate for client spans by service (assumes span-derived metrics are exported to Prometheus)
sum by (service_name) (
  rate(jaeger_spans{status="error", span_kind="client"}[5m])
)
/
sum by (service_name) (
  rate(jaeger_spans{span_kind="client"}[5m])
)

This ratio helps surface problematic dependencies, whether it's an unstable internal service, a third-party API, or a misconfigured endpoint. Tracking these patterns over time can help reduce cascading failures in distributed systems.

Validate Sampling Behavior Per Service

Sampling helps manage cost and system load, but it only works if the right traces are captured. Comparing span ingestion volume against total request volume is a practical way to verify sampling behavior:

# Effective sampling rate per service
rate(jaeger_collector_spans_received_total[5m]) /
rate(application_requests_total[5m])

If this ratio deviates significantly from your intended sampling rate, it could indicate issues like:

  • Misconfigured collectors or agents
  • Traffic imbalances across instances
  • Unintentional trace drops due to queue pressure or resource limits

Monitoring this across services helps ensure you're capturing representative trace data from the systems that matter most.
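
One way to act on this is an alert that fires when the observed ratio drifts from your intended rate. This is only a sketch: application_requests_total is the placeholder application metric used above, and the 1% target and 0.005 tolerance are assumptions you'd replace with your own sampling config.

- alert: JaegerSamplingDrift
  # 0.01 is a placeholder for a 1% intended sampling rate; 0.005 is the tolerance
  expr: |
    abs(
      sum(rate(jaeger_collector_spans_received_total[15m]))
        / sum(rate(application_requests_total[15m]))
      - 0.01
    ) > 0.005
  for: 15m
  annotations:
    summary: "Effective sampling rate is drifting from the configured target"
    description: "Span-to-request ratio is {{ $value }} away from the intended rate"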

💡
If you're evaluating open source tracing tools, this Grafana Tempo vs. Jaeger comparison covers key trade-offs.

Operational Runbooks for Jaeger Infrastructure

When Jaeger alerts fire, having clear response steps helps resolve issues quickly and prevents trace loss. These runbooks cover common production scenarios: trace ingestion failures, slow queries, and storage pressure.

1. Trace Data Loss

If traces aren’t showing up or completion rates drop:

  • Check collector status: Ensure collectors are up, reachable, and not restarting. Look for OOM or backpressure logs.
  • Check storage writes: For Elasticsearch, inspect indexing.index_time_in_millis. For Cassandra, check write latency and pending_compactions.
  • Verify connectivity: Make sure application agents can reach collectors—especially in multi-AZ or multi-region setups.
  • Check sampling config: Confirm sampling strategy and rate haven’t changed in deployment configs.
  • Check queue usage: If queue utilization is >80%, scale collectors or investigate storage bottlenecks.
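
A quick way to narrow the search is to break the completion rate out per collector, since a single unhealthy instance often drags the aggregate down. A sketch using the standard Prometheus instance label:

# Per-collector completion rate
sum by (instance) (rate(jaeger_collector_traces_saved_total[5m]))
/
sum by (instance) (rate(jaeger_collector_traces_received_total[5m]))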

2. Jaeger Query Latency

If the UI or API feels slow:

  • Find slow queries: Use histogram_quantile and topk to surface high-latency operations by service.
  • Check backend metrics: Elasticsearch: shard state, search.query_time_in_millis. Cassandra: read latency, SSTable counts.
  • Review retention windows: Long lookbacks (7+ days) can slow down poorly indexed workloads.
  • Optimize indexes: Tune TTLs, mappings, and trace ID strategies. Consider span-level vs. trace-level indexes based on query patterns.

3. Storage Capacity and Growth

To avoid surprises as trace volume grows:

  • Monitor growth rate: Chart disk usage per index/table. Compare against expected daily ingestion.
  • Review retention policy: For Elasticsearch, check ILM. For Cassandra, confirm TTLs are applied and cleanup is active.
  • Plan ahead: Scale storage before usage exceeds 75%.
  • Use storage tiers: Partition by time (e.g., daily indices). Keep recent traces on fast storage; archive older data separately.

Make Jaeger Monitoring Operationally Actionable

The purpose of monitoring Jaeger isn't just uptime; it's making sure that trace data is available, complete, and usable when something breaks in production. The signal you're looking for isn't just "Jaeger is up," but rather, "Are we losing traces that we’ll need during a future incident?"

To make that actionable, you can structure dashboards and alerting systems to surface trace quality, ingestion bottlenecks, and backend saturation in real time.

Design Dashboards for Operational Clarity

A good Jaeger dashboard doesn't just expose metrics; it answers specific operational questions. Prioritize visualizations that reveal failure modes and help you decide when to scale, tune, or investigate.

Key components to include:

  • End-to-End Ingestion Flow:
    Visualize span throughput at each hop—agent → collector → storage. Use counters for spans_received_total, traces_saved_total, and backend-specific write rates to validate trace flow integrity.
  • Completion Rate Heatmaps:
    Show per-service trace completion rate with thresholds (e.g., <95%) to surface under-instrumented services or ingestion pressure.
  • Collector Queue Pressure Over Time:
    Line charts showing queue depth and capacity per collector node, annotated with deployment or traffic changes.
  • Query Service Load:
    Use request duration histograms and count-per-operation views to spot sudden shifts in usage patterns (e.g., burst queries during an incident).
  • Storage Performance and Saturation:
    Show write latency trends (index_time_in_millis, write_latency_99th_percentile), compaction lag, and disk usage forecasts.

Connect Jaeger Metrics to Incident Workflows

Jaeger should be embedded in your broader incident response and postmortem process:

During an incident:

  • Validate Trace Coverage:
    Check recent trace volume and span counts for affected services. Use traces_saved_total and spans_received_total to confirm that tracing is functional and representative during the incident window.
  • Compare Trace Gaps with Errors/Latency:
    Correlate trace loss or queue saturation with error spikes, increased request latency, or downstream service failures. This often helps identify the first point of pressure in the request path.
  • Ensure Trace Visibility for On-Call Teams:
    Confirm that monitoring systems and dashboards can access trace data even during partial outages (e.g., degraded storage or query service). Include Jaeger health status in Slack alerts or PagerDuty runbooks.

Integrate Jaeger with Existing Observability Systems

Jaeger metrics are valuable on their own, but they’re much more effective when viewed alongside logs, metrics, and other telemetry signals. Instead of managing tracing in isolation, pull Jaeger’s internal metrics into the same workflows you already use for alerting and dashboarding.

With Last9, you can track Jaeger’s health right alongside your application metrics, without juggling separate tools or dashboards. Trusted by teams at Probo, CleverTap, Replit, and more, the platform monitors trace ingestion, queue pressure, and service performance through a single, unified interface.

High-cardinality signals, like per-service queue depth, dropped spans, or trace completion rates, are handled natively at scale. That means better visibility across your systems, without the usual trade-offs in performance or cost.

Correlate Trace Health with Application Signals

Once Jaeger metrics are part of your main telemetry pipeline, you can start asking higher-level questions:

# Application error spikes aligned with trace loss
# (on() lets the two sides match even though their label sets differ)
(
  rate(application_errors_total[5m])
)
and on ()
(
  rate(jaeger_collector_traces_saved_total[5m]) /
  rate(jaeger_collector_traces_received_total[5m]) < 0.95
)

This type of correlation helps you understand:

  • Are trace drops happening during error spikes?
  • Is a traffic surge affecting both app behavior and collector queues?
  • Does the sampling rate need adjustment during peak hours?

Bringing trace infrastructure into your main observability flow gives you better visibility, fewer blind spots, and faster incident response.

Get started with us for free today, or book a time with us to learn more about the product's capabilities!

FAQs

Q: What is the difference between Prometheus and Jaeger? A: Prometheus collects and stores time-series metrics (counters, gauges, histograms) while Jaeger focuses on distributed traces that show request flows across services. Prometheus answers "what happened" with metrics, while Jaeger shows "how it happened" with detailed request paths.

Q: What is Jaeger used for? A: Jaeger traces requests as they flow through distributed systems, helping developers identify performance bottlenecks, debug errors, and understand service dependencies. It's particularly valuable for microservices architectures where a single user request touches multiple services.

Q: What is the Jaeger system? A: Jaeger is an open-source distributed tracing platform with several components: agents that collect traces from applications, collectors that process and store trace data, query services that retrieve traces, and a web UI for visualization. It typically uses backends like Elasticsearch or Cassandra for storage.

Q: How does Jaeger work? A: Applications send trace spans to Jaeger agents, which forward them to collectors. Collectors validate, process, and store traces in a backend database. The query service retrieves traces on demand, and the UI displays them as visual timelines showing request flows and timing across services.

Q: Should you use OpenTelemetry and Jaeger? A: Yes, OpenTelemetry provides vendor-neutral instrumentation APIs while Jaeger handles trace storage and visualization. This combination gives you flexibility: you can instrument once with OpenTelemetry and send traces to Jaeger or other backends without changing application code.

Q: What is Kubernetes? A: Kubernetes is a container orchestration platform that automates deployment, scaling, and management of containerized applications. Many teams run Jaeger on Kubernetes using operators or Helm charts, taking advantage of Kubernetes' service discovery and scaling capabilities.

Q: Can Jaeger show metrics? A: Jaeger focuses on traces, not metrics, but it can derive basic metrics from trace data like request rates and error percentages. For comprehensive metrics, pair Jaeger with dedicated metrics systems like Prometheus or use observability platforms that combine both.

Q: What is Jaeger Tracing? A: Jaeger tracing refers to the practice of instrumenting applications to send trace data to Jaeger. A trace represents a single request's journey through your system, composed of spans that represent individual operations within services.

Q: What tools should you use for distributed tracing with OpenTelemetry? A: Popular options include Jaeger for visualization and storage, Zipkin as an alternative backend, and cloud services like AWS X-Ray. Many teams also use observability platforms like Last9 that provide managed tracing with OpenTelemetry integration.

Q: What is the difference between Zipkin and Jaeger tracing? A: Both are open-source distributed tracing systems with similar capabilities. Jaeger offers more advanced features like adaptive sampling and better Kubernetes integration, while Zipkin has simpler deployment requirements. Jaeger generally handles high-volume production workloads better.

Q: How do I set up Jaeger for monitoring microservices? A: Start by deploying Jaeger's components (collector, query service, storage), then instrument your services using OpenTelemetry SDKs. Configure sampling rates appropriate for your traffic volume, set up dashboards for key metrics, and establish retention policies for trace data storage.

Authors
Anjali Udasi

Helping to make tech a little less intimidating.
