Histogram Buckets in Prometheus Made Simple

Learn how Prometheus histogram buckets work, why they matter, and how to fine-tune them for better observability and smarter alerting.

April 14, 2025

Ever seen an average latency of 200ms on your dashboard while users are still hitting timeouts? That disconnect usually points to one thing: your metrics aren’t telling the full story.

Request durations, payload sizes, and other performance data rarely follow clean, predictable patterns. Averages flatten the spikes, hiding the outliers that often matter most in production.

Prometheus histograms offer a better approach. They let you track how values are distributed across fast, slow, and painfully slow responses. But getting value out of histograms is about choosing the right bucket boundaries, understanding how queries like histogram_quantile() work, and avoiding the common pitfalls that come with high-cardinality setups.

This blog walks through how histogram buckets work, how to configure them properly, and how to use them to surface real performance issues.

Understanding Prometheus Histogram Buckets

Prometheus histograms are used to capture the distribution of observed values across a set of predefined thresholds. Unlike counters or gauges, which give you totals or point-in-time values, histograms let you ask: how many requests fell under 100ms, how many under 500ms, and how many were slower?

This is critical when tracking metrics like HTTP request durations, payload sizes, or queue processing times—anything where the range and shape of the data matters more than a simple average.

How Histogram Buckets Work

A histogram metric in Prometheus is made up of three components:

  • *_bucket{le="<upper_bound>"} — a counter for each bucket, showing how many values were less than or equal to that threshold
  • *_sum — the total sum of all recorded values
  • *_count — the total number of observations

For example, if you define buckets at 0.1, 0.5, and 1.0 seconds for http_request_duration_seconds, you’ll get cumulative counts of how many requests completed in:

  • ≤ 100ms
  • ≤ 500ms
  • ≤ 1s

Prometheus automatically generates all three components when you instrument a histogram metric. These raw series form the basis for percentile estimation using PromQL functions like histogram_quantile()—which we’ll cover shortly.
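
To make that concrete, a single scrape of this metric might return series like the following; the counts are hypothetical, cumulative per boundary, and the implicit +Inf bucket always equals _count:

http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="1"} 318
http_request_duration_seconds_bucket{le="+Inf"} 320
http_request_duration_seconds_sum 45.7
http_request_duration_seconds_count 320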

Why Histogram Buckets Are Important in Production

Histogram buckets are essential when averages stop telling the truth.

In production, what breaks user experience isn’t the average latency, it’s the outliers. Histograms help expose those long-tail behaviors that typical metrics flatten out.

What You Get from Histograms

  • Outlier visibility: A service with a 300ms average might still have 1% of requests taking 5+ seconds. Histograms surface that tail.
  • SLO accuracy: You can define SLOs at the 95th or 99th percentile—based on real distribution data, not just mean values.
  • On-the-fly percentiles: Use histogram_quantile() in PromQL to calculate percentiles dynamically, without needing to predefine them at ingest time.
  • Trend detection: Spot slow drifts in performance, like a creeping p95 latency, even when the average looks stable.
  • Capacity planning: Understand how request durations shift under load, helping you plan for scaling and throttling.
  • User experience correlation: Link slow responses to specific parts of the user journey by breaking down latency into time bands.

Example:

Let’s say your dashboard shows a 1.2s average page load time on your e-commerce site; it seems fine at first glance. But histogram data reveals that during peak traffic, 10% of checkout requests take over 4 seconds. That delay directly maps to a spike in cart abandonment.

Without histogram buckets, this insight is lost. You’d be optimizing for the wrong thing, fixing what looks fine, while ignoring what’s hurting users.

💡
If you're still getting comfortable with the different Prometheus metric types, this guide breaks them down with clear examples to help you pick the right one for the job.

Setting Up Prometheus Histograms

Quick Start: Deploy Histogram Buckets

If you're tracking latency and need meaningful histograms now, this setup gives you reliable visibility without blowing up your cardinality budget.

Example: Bucket Configuration for REST APIs (Go + Prometheus client)

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

// A labeled histogram (HistogramVec) so durations can be sliced by endpoint and status class
var responseTime = promauto.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request duration in seconds",
    Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5, 10}, // 10ms to 10s range
}, []string{"endpoint", "status_class"})

func handleRequest(endpoint string, statusCode int) {
    start := time.Now()
    defer func() {
        duration := time.Since(start).Seconds()
        responseTime.WithLabelValues(endpoint, getStatusClass(statusCode)).Observe(duration)
    }()
    // Handle request
}

func getStatusClass(code int) string {
    switch {
    case code < 300:
        return "2xx"
    case code < 400:
        return "3xx"
    case code < 500:
        return "4xx"
    default:
        return "5xx"
    }
}

Alerting on Latency Spikes

- alert: HighLatencyP95
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
  for: 10m

Why This Works

  • Bucket Coverage: Ranges from ultra-fast cache hits (~10ms) to severe slowdowns (~10s), with higher resolution between 50ms–500ms—ideal for web APIs.
  • Manageable Cardinality: Two dimensions (endpoint, status_class) keep time series counts under control while still offering useful drill-downs.
  • Ready for Aggregation: Histogram data can be combined across services or regions for accurate percentiles at scale.

This setup provides observability that scales with your application—without tuning knobs every sprint.

Implementation Examples with Go and Python

You can start using histograms with just a few lines of code. Prometheus client libraries make it easy to define custom buckets and record observations. Here’s how it works.

Example in Go

import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "time"
)

// Define a histogram with custom bucket boundaries
var responseTimeHistogram = promauto.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request duration in seconds",
    Buckets: []float64{0.1, 0.3, 0.5, 0.7, 1.0, 2.0, 5.0, 10.0}, // in seconds
})

func handleRequest() {
    start := time.Now()
    // ... handle the request ...
    duration := time.Since(start).Seconds()
    responseTimeHistogram.Observe(duration)
}

This setup tracks how many requests fall into each time range—100ms, 300ms, 500ms, and so on. Prometheus automatically updates the appropriate buckets whenever you call Observe().

If you don't want to define buckets manually, you can also use Prometheus' default set:

var responseTimeHistogram = promauto.NewHistogram(prometheus.HistogramOpts{
    Name: "http_request_duration_seconds",
    Help: "HTTP request duration in seconds",
    // Uses default buckets: [0.005, 0.01, 0.025, ..., 10]
})

Example in Python

from prometheus_client import Histogram
import time

# Define a histogram with custom buckets
REQUEST_TIME = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration in seconds',
    buckets=[0.1, 0.3, 0.5, 0.7, 1.0, 2.0, 5.0, 10.0]
)

def process_request():
    start = time.time()
    # ... handle the request ...
    duration = time.time() - start
    REQUEST_TIME.observe(duration)

Both examples follow the same pattern:

  1. Define the histogram with meaningful bucket ranges.
  2. Measure the duration.
  3. Record the observation using Observe().

This gives Prometheus everything it needs to track request distributions and calculate percentiles later using PromQL.

💡
To get more out of your histogram data, it helps to know the Prometheus functions that work best with them, like rate(), histogram_quantile(), and friends.

Configure Histogram Buckets

The Right Histogram Buckets

Bucket boundaries have a big impact on how useful your histogram data turns out to be. The goal is to capture enough detail to support real analysis, without adding unnecessary overhead or noise.

Here’s a structured way to think about it.

Start with Your SLOs

Begin with what matters for your service.

  • If your SLO is “99% of requests under 300ms,” include buckets around 200ms, 300ms, and 400ms.
  • For multiple SLOs (like 95% under 200ms and 99% under 500ms), define boundaries that cover each range.

This makes it easier to evaluate how close you are to SLO thresholds using percentiles like p95 or p99.

Consider User-Perceived Latency

Latency impacts users differently depending on how long they wait. These general ranges often map well to how delays are perceived:

  • ≤ 100ms — feels instant
  • 100–300ms — noticeable but fine
  • 300–1000ms — introduces some friction
  • > 1s — often feels broken

Including buckets around these thresholds helps bridge the gap between system metrics and user experience.

Use Logarithmic or Exponential Scales

Uniform bucket spacing can miss important patterns. Instead, use exponential or logarithmic spacing to capture a wider range with more detail where needed.

Examples:

[]float64{0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8}

Or base-10 with intermediate steps:

[]float64{0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}
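
If you'd rather not hand-write these slices, the Go client library ships helpers that generate them for you:

// 10 exponentially spaced buckets: 0.005, 0.01, 0.02, ... up to ~2.56s
expBuckets := prometheus.ExponentialBuckets(0.005, 2, 10)

// 10 linearly spaced buckets: 0.05, 0.10, ..., 0.50 (only useful for narrow ranges)
linBuckets := prometheus.LinearBuckets(0.05, 0.05, 10)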

Add Resolution Where It Matters

If your SLO is strict, say, 99% under 300ms, it’s helpful to add more granularity around that point:

[]float64{0.25, 0.275, 0.3, 0.325, 0.35}

This makes it easier to see small shifts that could push you past your threshold.

A Practical Starting Point

If you're looking for a default set that works well for most web applications, this one covers both fast and slow responses:

[]float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

It’s balanced enough to catch both quick successes and longer outliers, without overwhelming your storage or query engine.
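
This set is in fact the Go client's default, prometheus.DefBuckets, so in Go you can reference it directly instead of copying the values:

Buckets: prometheus.DefBuckets, // 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10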

Which Buckets Should I Use?

Here's what works based on service type:

| Service Type | Bucket Configuration | Why These Work | Alert Threshold |
|---|---|---|---|
| REST API | [0.01, 0.05, 0.1, 0.5, 1, 5, 10] | Covers fast responses to timeouts | p95 > SLO |
| Database | [0.001, 0.01, 0.1, 0.5, 1, 10, 60] | Cache hits to analytical queries | p95 > 100ms |
| Queue Processing | [0.1, 1, 5, 30, 300, 1800] | Event processing to batch jobs | p99 > timeout |
| File Operations | [0.1, 1, 5, 30, 120, 600] | Small files to large uploads | p95 > user patience |
| External API Calls | [0.05, 0.2, 1, 5, 15, 30] | Network calls with timeouts | p99 > circuit breaker |

Service-specific refinements:

  • High SLO requirements (p99 < 100ms)? Add buckets at [0.025, 0.075, 0.125]
  • Long-running operations? Extend upper range: [..., 1800, 3600, 7200]
  • Microservices with external deps? Focus mid-range: [0.1, 0.5, 2, 10]

Start with the standard config for your service type, run it for a week, then check your actual p95/p99 values and adjust bucket density around those ranges.
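
One way to do that check is to see how observations actually spread across the current boundaries. A query along these lines (using the metric name from the earlier examples) shows which buckets are doing real work; since the counts are cumulative, adjacent le values with nearly identical results are candidates for merging:

sum by (le) (increase(http_request_duration_seconds_bucket[1d]))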

💡
Now, debug high-latency issues with real production context, directly from your local setup. With Last9 MCP, pull in live metrics, traces, and histograms to identify slow endpoints, pinpoint regressions, and validate fixes without waiting for staging or redeploys.

Common Mistakes When Using Histogram Buckets

Histograms are powerful, but misconfigured buckets can quietly cause serious problems—from bloated storage to misleading insights. Here are the most common pitfalls and how to avoid them.

1. Using Too Many Buckets

Adding more buckets might seem like it gives better visibility—but in practice, it often leads to:

  • Excessive time series creation
  • Higher memory and CPU usage
  • Slower queries
  • Increased storage costs

For example, using 50 buckets across dozens of endpoints quickly results in thousands of active series. Most setups don’t need that level of granularity. A more practical approach is to use 10–15 buckets focused on key thresholds (like SLO boundaries and tail latency cutoffs).

2. Wrong Bucket Strategy

Linear buckets (e.g., 100ms, 200ms, 300ms…) rarely align with the shape of latency or payload distributions, which are usually skewed or heavy-tailed.

Using linear buckets can:

  • Overrepresent rarely occurring values
  • Miss details in the critical lower range
  • Flatten useful patterns

Exponential buckets (e.g., 5ms, 10ms, 50ms, 100ms, 500ms…) offer better resolution where it matters, especially around the 95th or 99th percentile.

3. Incomplete Range Coverage

If the upper bound of the histogram is too low, any values beyond that range are grouped into the final bucket, which hides long-tail behavior.

For example, if the largest bucket is 5 seconds, a 15-second timeout gets lumped into the same category. This makes it hard to identify slowdowns or incidents that fall outside the “expected” range.

To avoid this, the largest bucket should comfortably cover at least 2–3× the maximum expected latency.

4. Skewed Bucket Placement

It’s common to define buckets only for the "normal" case, below the SLO target. That leaves no visibility into regressions.

When all buckets end below a 1-second target, any deviation looks flat or capped. This masks performance degradation until it's too late.

Define buckets that extend well past your SLO thresholds. This helps detect early signs of drift before it becomes an incident.

5. High-Cardinality Labels

Each histogram bucket is multiplied across all label combinations. Labels like user_id, session_id, or query_hash can cause a combinatorial explosion in the number of time series.

This leads to:

  • High memory usage
  • Increased cardinality pressure on the TSDB
  • Difficulty querying or aggregating data

Stick to low-cardinality labels like endpoint, region, or status_code. Avoid anything unbounded or user-specific.
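
A quick sanity check before adding a new label is to count how many series a histogram already produces (metric name assumed from the earlier examples):

count(http_request_duration_seconds_bucket)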

💡
This is exactly the kind of optimization challenge Last9 helps with, automatically identifying inefficient bucket layouts and suggesting improvements based on your actual traffic pattern.

Advanced Histogram Techniques

Once you’re comfortable with basic bucket configuration, you can go a step further with programmatically defined buckets and derived quantile analysis.

Programmatic Bucket Generation

In dynamic systems where latency profiles evolve or vary across services, hardcoded bucket boundaries might not be sufficient. You can generate buckets programmatically for more precise control.

Logarithmic Bucket Scaling

Logarithmic spacing is ideal when the metric spans several orders of magnitude—e.g., sub-millisecond to multi-second response times.

func generateLogarithmicBuckets(min, max float64, count int) []float64 {
	buckets := make([]float64, count)
	logMin := math.Log(min)
	logMax := math.Log(max)
	for i := 0; i < count; i++ {
		ratio := float64(i) / float64(count-1)
		buckets[i] = math.Exp(logMin + ratio*(logMax-logMin))
	}
	return buckets
}

// Example: 10 buckets from 1ms to 10s
buckets := generateLogarithmicBuckets(0.001, 10, 10)

This creates exponentially wider buckets as you move up the latency scale, helpful when performance degrades non-linearly.

Clustered Buckets Around SLO Thresholds

To capture detail near a service-level objective (SLO) boundary—e.g., around 300ms—generate more bucket density around that target.

func generateClusteredBuckets(target, spread float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := 0; i < count; i++ {
		position := float64(i)/float64(count-1)*2 - 1 // [-1, 1]
		buckets[i] = target + spread*math.Tanh(position*2)
	}
	sort.Float64s(buckets)
	return buckets
}

// Example: 8 buckets clustered around 0.3s ±0.2s
sloBuckets := generateClusteredBuckets(0.3, 0.2, 8)

This approach provides tighter granularity around the latency threshold that matters most to you.

Quantile Calculation with PromQL

Prometheus exposes quantile approximations using histogram_quantile() over cumulative histogram data. For example:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

This estimates the 95th percentile over a 5-minute sliding window. For deeper analysis:

  • Compare percentile drift across time windows:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) /
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))

A rising ratio suggests short-term latency spikes compared to the hourly baseline.

  • Persist quantiles using recording rules:
groups:
- name: latency
  rules:
  - record: service:http_duration_seconds:p95_5m
    expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Useful for alerting, dashboards, or longer-term trend analysis.

  • Detect skew via percentile ratios:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) /
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

An increasing 99th-to-50th percentile ratio can indicate worsening tail latency while median performance appears stable.

Note: Quantile accuracy is bounded by bucket granularity. For better precision, define buckets more densely around the percentile of interest.

💡
For more PromQL examples that can help you build better alerts and dashboards, check out our PromQL tricks you should know guide.

Histogram Aggregation and Multi-Window Analysis

One of the key advantages of Prometheus histograms is that they aggregate cleanly across dimensions—especially across instances, services, or regions. This makes them ideal for calculating global percentiles in distributed environments.

Aggregating Percentiles Across Instances

Since histogram buckets are cumulative and aligned by le (less-than-or-equal) boundaries, they can be safely summed across instances before applying histogram_quantile().

Example: Global p99 latency for all frontend pods

histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

This query combines data across all time series with the same bucket structure, giving a fleet-wide view of the 99th percentile.
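
If you still need a per-dimension breakdown, keep that label in the by clause alongside le. For example, a per-region p99 (assuming your series carry a region label):

histogram_quantile(
  0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, region)
)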

Multi-Window Percentile Comparison

You can also compare percentiles across different time windows to detect shifts in latency behavior.

Example: p95 change in the last hour vs. daily baseline

histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[1h])) by (le)
)
/
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[1d])) by (le)
)

If this ratio exceeds 1.0, it suggests that p95 latency has increased in the recent window compared to the long-term baseline. A spike in this ratio can be used to trigger alerts or highlight services for deeper investigation.
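
As a sketch, the ratio can be wrapped directly in an alerting rule; the 1.5 threshold and the windows below are illustrative, not a recommendation:

- alert: P95AboveDailyBaseline
  expr: |
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
    /
    histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1d])) by (le))
    > 1.5
  for: 30m
  labels:
    severity: warning
  annotations:
    summary: "p95 latency is 50% above the daily baseline"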

Considerations

  • This technique assumes consistent bucket boundaries across all series. Custom bucket layouts per service will break aggregation.
  • Larger window ranges (like [1d]) require sufficient retention in your TSDB and can be resource-intensive depending on scrape intervals.
  • The closer your buckets are to the percentile of interest, the more accurate the comparison will be.

Histogram vs. Summary: Choosing the Right Distribution Metric

Prometheus provides two options for capturing distributions: histograms and summaries. While they seem similar on the surface, they behave very differently, especially when it comes to aggregation and percentile accuracy.

Here's a technical comparison:

| Feature | Histogram | Summary |
|---|---|---|
| Server-side aggregation | Yes (can aggregate across instances) | No (percentiles are not mergeable) |
| Client-side percentiles | No (calculated at query time) | Yes (calculated during collection) |
| Calculation flexibility | High (query-time percentiles) | Low (fixed percentiles only) |
| Accuracy | Depends on bucket layout | Higher (exact quantiles within time window) |
| CPU usage | Lower on client side | Higher on client side |
| Memory usage | Lower on clients | Higher on clients |
| Storage usage | Higher (more time series per metric) | Lower (fewer time series) |
| Query performance | Can be slower (more complex queries) | Faster (pre-aggregated quantiles) |
| Alerting and dashboard support | Better ecosystem support | More limited tooling support |

When to Use Histograms

Opt for histograms if:

  • You need global percentiles across multiple instances
    (e.g., p99 across all frontend pods)
  • You want flexibility in choosing percentiles at query time
  • Your use case involves heatmaps or distribution charts
  • You expect percentile definitions to evolve over time
  • You're optimizing for client-side performance (low CPU/memory)

When to Use Summaries

Use summaries when:

  • You need high-accuracy percentiles without approximation
  • Aggregation across instances is not required
  • You can define fixed percentiles upfront (e.g., only p95 and p99)
  • You're working with a small number of targets
  • Query performance is a top concern
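
For contrast, here's a minimal summary in the Go client; the quantiles and their allowed error have to be fixed up front via Objectives:

requestDurations := promauto.NewSummary(prometheus.SummaryOpts{
    Name:       "http_request_duration_seconds",
    Help:       "HTTP request duration in seconds",
    Objectives: map[float64]float64{0.5: 0.05, 0.95: 0.01, 0.99: 0.001}, // quantile: allowed absolute error
})

requestDurations.Observe(0.42) // record one observation, in seconds
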
💡
When you're working with histogram buckets at scale, ensuring high availability is key. Learn more about high availability in Prometheus to keep your monitoring reliable and robust.

Histogram Bucket Configurations for Common Services

Proper bucket configuration is essential for capturing accurate latency distributions. Below are commonly used configurations optimized for different types of workloads.

REST API Latency Buckets

// Target: 200ms SLO for RESTful endpoints
apiLatencyBuckets := []float64{
    0.005, 0.025, 0.050, 0.100,
    0.150, 0.200, 0.300, 0.500,
    1.000, 2.500, 5.000, 10.000,
}

Rationale:
This layout increases resolution near the 200-ms threshold and extends coverage to 10s for long-tail degradations.

Database Query Latency Buckets

// Target: Sub-100ms for OLTP, but up to 60s for analytical queries
dbQueryBuckets := []float64{
    0.001, 0.005, 0.010, 0.025, 0.050, 0.100,
    0.250, 0.500, 1.000, 2.500, 5.000, 10.000,
    30.000, 60.000,
}

Rationale:
Captures a wide latency spectrum from fast key lookups to slower analytical workloads.

Background Job Execution Buckets

// Target: Tracks jobs from 1s to 2h
jobProcessingBuckets := []float64{
    1, 5, 15, 30, 60,
    180, 300, 600, 1200,
    1800, 3600, 7200,
}

Rationale:
Designed for batch workloads with long runtime variance. Covers both short-lived and long-running tasks.

Event Processing Pipeline Buckets

// Target: Microservice pipelines, including external service interactions
eventPipelineBuckets := []float64{
    0.010, 0.050, 0.100, 0.250,
    0.500, 1.000, 2.500, 5.000,
    10.000, 15.000,
}

Rationale:
Spans fast internal processing and multi-second delays from downstream service dependencies.

File Upload/Download Buckets

// Target: File transfer operations
fileTransferBuckets := []float64{
    0.100, 0.500, 1.000, 2.500,
    5.000, 10.000, 30.000, 60.000,
    120.000, 300.000,
}

Rationale:
Latency is proportional to file size; this configuration scales to handle small transfers through to multi-minute media operations.

💡
For a deeper look into how histogram buckets can impact resource usage and storage efficiency, check out this guide on common Prometheus pitfalls.

Implement Effective Alerting Using Histogram Metrics

Histograms enable more accurate alerting by capturing percentiles rather than relying on averages. This allows systems to trigger alerts only when a real user-facing impact is likely.

Below are practical techniques to structure a robust alerting strategy based on histogram data in Prometheus.

1. Multi-level Percentile Alerting

Use percentile-based thresholds to classify latency issues by severity.

groups:
- name: LatencyAlerts
  rules:
  # Warning: P95 exceeds SLO
  - alert: HighP95Latency
    expr: |
      histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High latency on {{ $labels.service }}"
      description: "P95 latency above 500ms for {{ $labels.service }} over the last 10 minutes"
      dashboard: "https://grafana.example.com/d/latency/service-latency?var-service={{ $labels.service }}"
      runbook: "https://wiki.example.com/sre/runbooks/high-latency"

  # Critical: P99 exceeds SLO
  - alert: CriticalP99Latency
    expr: |
      histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1.0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Critical latency on {{ $labels.service }}"
      description: "P99 latency above 1s for {{ $labels.service }} over the last 5 minutes"
      dashboard: "https://grafana.example.com/d/latency/service-latency?var-service={{ $labels.service }}"
      runbook: "https://wiki.example.com/sre/runbooks/critical-latency"

This setup separates moderate degradation from severe impact and helps prioritize incidents effectively.

2. Detecting Latency Distribution Skew

Monitoring percentiles alone may miss problems where only the tail worsens. Track skew by comparing tail and median values:

- alert: LatencyDistributionSkew
  expr: |
    histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) /
    histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 10
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Latency distribution skew detected"
    description: "P99/P50 ratio exceeds 10x for {{ $labels.service }}, suggesting significant tail latency growth"

This captures scenarios where the median is healthy but a growing tail indicates unstable performance.

3. Alerting on SLO Burn Rate

Burn rate-based alerting focuses on how fast your error budget is being consumed, not just static thresholds.

# Recording rule to capture SLO violations
- record: service:requests_exceeding_slo:ratio_5m
  expr: |
    (
      sum(rate(http_request_duration_seconds_count[5m])) by (service)
      -
      sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
    )
    /
    sum(rate(http_request_duration_seconds_count[5m])) by (service)
# Burn rate alert for 4x the allowable error budget (example: 5% allowed)
- alert: SLOBurnRateTooHigh
  expr: service:requests_exceeding_slo:ratio_5m > 0.20
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "SLO burn rate high for {{ $labels.service }}"
    description: "Error budget is being consumed at >4x the allowed rate for {{ $labels.service }}"

This pattern allows early detection before you breach SLO targets over a longer period.

4. Multi-Window Alerting for Stability

Reduce false positives by validating metrics over short and long windows simultaneously.

- alert: SustainedLatencyIncrease
  expr: |
    (
      histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5
    )
    and
    (
      histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le, service)) > 0.5
    )
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Sustained latency increase"
    description: "P95 latency consistently above 500ms in both 5-minute and 1-hour windows for {{ $labels.service }}"

By aligning short-term spikes with long-term trends, this reduces noise and increases signal clarity.

Histogram Bucket Performance Optimization Techniques

For high-traffic services, histogram buckets can generate significant storage needs and performance overhead.

Here are advanced optimization techniques to keep your Prometheus deployment efficient:

1. Strategic Label Usage to Control Cardinality

Labels multiply the number of time series dramatically. Each unique combination of label values creates a complete set of histogram buckets:

// BAD: High cardinality - a full bucket set for every endpoint × method × status code combination
httpRequestDuration.WithLabelValues(endpoint, method, statusCode).Observe(duration)

// BETTER: Group by meaningful dimensions only
httpRequestDuration.WithLabelValues(endpoint, statusCode).Observe(duration)

Impact calculation: With 10 buckets, 100 endpoints, 4 methods, and 5 status codes:

  • Bad approach: 10 × 100 × 4 × 5 = 20,000 time series
  • Better approach: 10 × 100 × 5 = 5,000 time series (75% reduction)

Consider creating separate histograms for different dimensions rather than using labels when appropriate.
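
A rough sketch of that idea: instead of one histogram labeled by both endpoint and method, keep two narrower histograms so bucket counts don't multiply across every combination (metric names here are illustrative):

var (
    durationByEndpoint = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_by_endpoint_seconds",
        Help:    "Request duration broken down by endpoint",
        Buckets: prometheus.DefBuckets,
    }, []string{"endpoint"})

    durationByMethod = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_by_method_seconds",
        Help:    "Request duration broken down by HTTP method",
        Buckets: prometheus.DefBuckets,
    }, []string{"method"})
)

func observeRequest(endpoint, method string, seconds float64) {
    durationByEndpoint.WithLabelValues(endpoint).Observe(seconds)
    durationByMethod.WithLabelValues(method).Observe(seconds)
}

With 100 endpoints and 4 methods, this produces 104 label sets instead of 400, at the cost of not being able to slice one dimension by the other.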

2. Implementing Client-side Aggregation

For services with many instances, perform client-side aggregation:

// Use PushGateway for batch processing jobs
func submitHistogramOnCompletion() {
    registry := prometheus.NewRegistry()
    registry.MustRegister(jobDurationHistogram)
    
    pusher := push.New("pushgateway:9091", "batch_job").
        Gatherer(registry)
        
    // Push metrics once at the end of the job
    if err := pusher.Push(); err != nil {
        log.Errorf("Could not push to Pushgateway: %v", err)
    }
}

Or use a pull approach with metric aggregation:

# prometheus.yml for an agent-mode Prometheus that forwards metrics to a central server
# (agent mode itself is enabled with the --enable-feature=agent flag, not in the config file)
global:
  scrape_interval: 15s
remote_write:
  - url: "https://prometheus-central:9090/api/v1/write"
    name: central_prometheus

3. Bucket Selection Optimization

Remove unnecessary buckets that don't provide valuable insights:

// Original buckets
originalBuckets := []float64{0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10}

// Optimized buckets that still cover key thresholds but with fewer points
optimizedBuckets := []float64{0.01, 0.05, 0.1, 0.5, 1, 5, 10}

Storage impact: Going from 14 buckets to 7 can reduce TSDB storage requirements by 50% for that histogram.

4. Sampling and Filtering Techniques

For ultra-high-volume metrics, consider implementing sampling:

func shouldSample() bool {
    return rand.Float64() < 0.1 // 10% sampling rate
}

func handleRequest() {
    // Always measure the duration
    start := time.Now()
    // ... handle the request ...
    duration := time.Since(start).Seconds()
    
    // But only record to histogram for a percentage of requests
    if shouldSample() {
        requestDurationHistogram.Observe(duration)
    }
}

This works well for services with thousands of requests per second where you don't need to measure every request.

5. Time Series Retention and Downsampling

Configure appropriate retention periods based on access patterns:

# Retention is set via a Prometheus command-line flag rather than in prometheus.yml:
#   --storage.tsdb.retention.time=15d   (keep raw histogram data for 15 days)

# Use recording rules for downsampled long-term storage
- record: job:request_duration:histogram_p95_1h
  expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le, job))

This pattern keeps full-resolution histogram data for recent analysis while preserving key metrics for longer-term trending.

💡
To make the most of histogram-based alerts, you might also find these Prometheus alerting examples useful.

Troubleshooting Common Prometheus Histogram Implementation Issues

Here's a detailed troubleshooting guide for common problems:

Diagnosing Inaccurate Percentile Calculations

When your histogram_quantile calculations produce unexpected or seemingly wrong results:

Problem: Percentiles jumping erratically between queries

  • Root cause: Insufficient data in the time window or poor bucket selection

Solution: Increase the time window in rate() function:

# More stable with a longer window
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))

Problem: Percentiles always land exactly on bucket boundaries

  • Root cause: Linear interpolation assumes even distribution within buckets

Solution: Add more buckets around critical percentiles:

// Add fine-grained buckets around the p95 target (300ms)
[]float64{0.25, 0.27, 0.29, 0.3, 0.31, 0.33, 0.35}

Problem: Percentiles reporting lower than minimum observed values

  • Root cause: Often occurs with low traffic and rate() calculations

Solution: Use increase() instead of rate() for low-volume services:

histogram_quantile(0.95, sum(increase(http_request_duration_seconds_bucket[10m])) by (le))

Resolving High Cardinality Explosions

When histograms cause excessive storage or memory usage:

Problem: Prometheus crashes or slows dramatically after adding histograms

  • Root cause: Too many label combinations multiplying bucket cardinality
  • Solution: Implement one or more of these fixes:
    1. Reduce label dimensions on high-cardinality histograms
    2. Increase Prometheus storage allocation
    3. Shard your Prometheus instances by metric type

Check: Run this query to identify the worst offenders:

topk(10, count by (__name__, job) ({__name__=~".+_bucket"}))

Problem: Queries on histogram data become extremely slow

  • Root cause: Too many histogram buckets across too many services

Solution: Create recording rules for common percentile calculations:

- record: job:http_request_duration:p95_5m
  expr: histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))

💡
If you're looking to automate or integrate with Prometheus, the Prometheus API guide will show you how to interact with your data programmatically.

Fixing Missing or Incomplete Histogram Data

When your histograms aren't capturing all the data you expect:

Problem: Some requests don't appear in any bucket

  • Root cause: Values above the largest boundary still land in the automatic +Inf bucket, so genuinely missing observations usually mean dropped or relabeled series, or code paths that never call Observe()
  • Solution: Check scrape and relabel configs and confirm every request path is instrumented; the +Inf bucket is added by the client library, so you never need to define it yourself

Diagnosis: Check the difference between count and sum of bucket counts:

sum(http_request_duration_seconds_count) - sum(http_request_duration_seconds_bucket{le="+Inf"})

(Should be zero; if not, there's a problem)

Problem: Histogram data disappears after service restarts

  • Root cause: Counter reset behavior with incorrect query formulation

Solution: Use increase() or rate() instead of raw counters:

# Handles counter resets properly
sum(increase(http_request_duration_seconds_bucket[5m])) by (le)

Problem: Inconsistent histogram data across service instances

  • Root cause: Different bucket configurations between instances
  • Solution: Standardize histogram bucket definitions in a shared configuration or library
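
One lightweight way to do that in Go is a small shared package that every service imports for its bucket slices (package and variable names here are hypothetical):

// metricsconventions/buckets.go - shared bucket layouts, imported by every service
package metricsconventions

// HTTPLatencyBuckets is the agreed-upon layout for HTTP request durations.
var HTTPLatencyBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// DBLatencyBuckets is the agreed-upon layout for database query durations.
var DBLatencyBuckets = []float64{0.001, 0.01, 0.1, 0.5, 1, 10, 60}

Services then set Buckets: metricsconventions.HTTPLatencyBuckets, so every instance exposes identical le boundaries and cross-instance aggregation keeps working.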

Resolving Resource Consumption Issues

When histograms consume excessive resources:

Problem: Prometheus storage growing too quickly

  • Root cause: Too many histograms with too many buckets

Solution: Implement a histogram bucket reduction strategy:

// Before: 14 buckets
[]float64{0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20}

// After: 7 strategically chosen buckets
[]float64{0.001, 0.01, 0.1, 0.5, 1, 10, 20}

Problem: Histogram instrumentation adds too much overhead

  • Root cause: High-frequency observations in critical paths

Solution: Implement adaptive sampling:

sampleRate := 0.01 // Sample 1% by default
if duration > 1.0 { // But sample 100% of slow requests
    sampleRate = 1.0
}
if rand.Float64() <= sampleRate {
    requestDurationHistogram.Observe(duration)
}

Making Histogram Data Actionable at Scale

The techniques in this guide work great until you're managing dozens of services with hundreds of histograms. That's when you need tooling that handles the complexity for you.

Where Teams Hit Walls

  • Query complexity: Writing correct histogram_quantile() queries for every use case
  • Bucket optimization: Manually tuning buckets across different service types
  • SLO management: Tracking error budgets and burn rates across multiple teams
  • Incident correlation: Connecting latency spikes to deployments, infrastructure changes, or external dependencies

How Last9 Extends Prometheus Histograms

  • Visual SLO tracking: See error budget consumption and burn rates without writing PromQL
  • Smart alerting: Get notified about percentile degradation before users complain
  • Automated optimization: Identify inefficient bucket configurations and high-cardinality issues
  • Correlation engine: Automatically link latency spikes to deployments, infrastructure metrics, or external service issues

Built for teams using these exact techniques but needing to scale beyond manual PromQL management.

Start for free or talk to us about your use case. We'd be happy to show our platform capabilities and how it can help!

💡
If you've any questions or experiences to share about working with Prometheus histogram buckets, join our Discord Community to connect with other engineers tackling similar challenges!

FAQs

What exactly is the difference between histograms and summary metrics in Prometheus?

Histograms and summaries track distribution data differently:

Histograms:

  • Store observations in configurable buckets (counters of values ≤ each threshold)
  • Calculate percentiles at query time using histogram_quantile()
  • Allow aggregation across multiple instances (crucial for distributed systems)
  • Provide flexibility to calculate any percentile without pre-configuration
  • Take less client-side CPU but more storage space
  • Work well with Prometheus recording rules and alerting

Summaries:

  • Pre-calculate percentiles in the client application
  • Store specific quantiles (e.g., 0.5, 0.9, 0.99) directly
  • Provide more accurate percentiles within single instances
  • Cannot be meaningfully aggregated across instances
  • Use more client-side resources but less storage
  • Have fixed percentiles that can't be changed after collection

Choose histograms when you need cross-instance aggregation or flexible percentile selection. Choose summaries when you need exact percentiles on single instances.

How do I determine the optimal number of buckets for a Prometheus histogram?

The optimal bucket count balances accuracy against resource usage:

  • General guideline: 10-15 buckets work well for most applications
  • Minimum effective number: At least 7 buckets (to cover 2-3 orders of magnitude)
  • Resource-constrained systems: Stick to 7-10 strategically placed buckets
  • High-precision requirements: Up to 20-25 buckets, focusing resolution where needed

Focus bucket density around:

  1. Your SLO thresholds (e.g., more buckets around your p95 target)
  2. User experience breakpoints (e.g., 100ms, 300ms, 1s)
  3. Expected operational ranges for your specific service

Remember that each bucket creates a separate time series, so costs grow linearly with bucket count.

When changing histogram bucket definitions, what happens to historical data?

When you modify histogram bucket definitions:

  • New time series creation: Prometheus creates entirely new time series for the new buckets
  • Historical data limitation: Historical data won't be retroactively available in the new buckets
  • Dual maintenance period: You'll need to maintain both old and new histograms during transition
  • Recording rule approach: For critical metrics, create recording rules with the old buckets before changing

Best practices for bucket changes:

  1. Plan bucket layouts carefully before going to production
  2. When changing is necessary, keep the old metric name for twice your retention period
  3. Use a new metric name for the new bucket layout (e.g., http_request_duration_seconds_v2)
  4. Create a recording rule that combines old and new data during the transition

What's the most effective way to calculate accurate percentiles from Prometheus histogram buckets?

For accurate percentile calculations:

# Basic p99 calculation
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# For more stability in low-traffic services
histogram_quantile(0.99, sum(increase(http_request_duration_seconds_bucket[10m])) by (le))

# Aggregating across job instances while preserving endpoint dimension
histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))

To improve accuracy:

  1. Use more buckets around the percentile you're calculating
  2. Use longer time windows for stability (5-10m instead of 1m)
  3. For critical percentiles, create recording rules to ensure calculation consistency

Remember that percentile accuracy is always limited by your bucket layout—more buckets around key percentiles yield better accuracy.

How can high cardinality with Prometheus histograms be effectively managed?

High cardinality management strategies:

  1. Label discipline:
    • Limit high-cardinality labels (like user_id, request_id) from histograms
    • Use no more than 2-3 label dimensions per histogram
    • Move high-cardinality dimensions to separate metrics when needed
  2. Bucket optimization:
    • Use only necessary buckets (8-12 is often sufficient)
    • Standardize bucket layouts across services
    • Remove buckets that don't provide actionable insights
  3. Architecture approaches:
    • Implement client-side aggregation for high-volume services
    • Use federation or hierarchical Prometheus for large-scale deployments
    • Create recording rules for commonly queried percentiles
  4. Sampling techniques:
    • Implement probabilistic sampling for ultra-high-volume services
    • Use higher sampling rates for outliers and errors
    • Consider exemplar-based approaches for detailed analysis

How can I improve the accuracy of percentiles calculated from histogram buckets?

Percentile accuracy depends on your bucket configuration:

  1. Add targeted bucket density:
    • Place more buckets around critical percentiles (e.g., your p95 or p99 target)
    • Example: For a p95 target of 300ms, add buckets at 250ms, 275ms, 300ms, 325ms, 350ms
  2. Use logarithmic distribution:
    • Linear buckets create poor resolution; use exponential/logarithmic spacing
    • Evenly distribute bucket density in log-space, not linear space
  3. Incorporate historical performance:
    • Analyze several weeks of data to identify your actual distribution
    • Place buckets based on observed percentiles, not theoretical ones
  4. Evaluate specific service patterns:
    • Services with bimodal distributions need buckets covering both modes
    • Cache-heavy services need extra resolution in lower latency ranges

Because histogram_quantile() interpolates linearly within a bucket, the worst-case error is bounded by the width of the bucket containing your percentile—narrower buckets around that percentile directly improve accuracy.

Beyond request timing, what other metrics benefit from histogram bucket analysis?

Histograms are valuable for many distributions beyond request duration:

  • Resource utilization: Memory usage, CPU utilization, disk IOPS
  • Queue metrics: Queue depth, time in queue, batch sizes
  • Network performance: Packet sizes, network latency, throughput
  • Database metrics: Query execution time, connection pool usage, row counts
  • Cache performance: Cache hit ratios, time-to-cache, object sizes
  • User behavior: Session duration, items per cart, clicks per session
  • Message processing: Message size, processing latency, retry counts
  • Batch job metrics: Records processed per second, job duration, error rates
  • API response sizes: Payload sizes for requests and responses
  • Thread pool metrics: Thread usage, task execution time, queue wait time

The pattern applies whenever you need to understand a distribution rather than just averages or totals.

How do I implement histogram bucket monitoring for non-time measurements like request sizes?

For non-time measurements:

  1. Adjust bucket scales to match data characteristics:
    • Memory usage: Consider MB-scale buckets like [128, 256, 512, 1024, 2048, 4096]
    • Queue depth: Use application-appropriate buckets like [1, 5, 10, 50, 100, 500]
    • Message counts: Linear buckets might work better, e.g., [10, 20, 50, 100, 250, 500]

Create SLOs on size distributions when appropriate:

# Alert when p95 message size exceeds 500KB
histogram_quantile(0.95, sum(rate(message_size_bytes_bucket[5m])) by (le, topic)) > 512000

Observe distribution patterns:

// Track API response sizes
sizeBytes := float64(len(responseData))
responseSizeHistogram.Observe(sizeBytes)

Choose appropriate units and scale:

// For request sizes in bytes, using powers-of-10 scale
requestSizeHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_size_bytes",
    Help:    "HTTP request size in bytes",
    Buckets: []float64{10, 100, 1000, 10000, 100000, 1000000, 10000000},
})

Authors

Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.