Ever seen an average latency of 200ms on your dashboard while users are still hitting timeouts? That disconnect usually points to one thing: your metrics aren’t telling the full story.
Request durations, payload sizes, and other performance data rarely follow clean, predictable patterns. Averages flatten the spikes, hiding the outliers that often matter most in production.
Prometheus histograms offer a better approach. They let you track how values are distributed across fast, slow, and painfully slow responses. But getting value out of histograms is about choosing the right bucket boundaries, understanding how queries like histogram_quantile() work, and avoiding the common pitfalls that come with high-cardinality setups.
This blog walks through how histogram buckets work, how to configure them properly, and how to use them to surface real performance issues.
Understanding Prometheus Histogram Buckets
Prometheus histograms are used to capture the distribution of observed values across a set of predefined thresholds. Unlike counters or gauges, which give you totals or point-in-time values, histograms let you ask: how many requests fell under 100ms, how many under 500ms, and how many were slower?
This is critical when tracking metrics like HTTP request durations, payload sizes, or queue processing times—anything where the range and shape of the data matters more than a simple average.
How Histogram Buckets Work
A histogram metric in Prometheus is made up of three components:
- *_bucket{le="<upper_bound>"} — a counter for each bucket, showing how many values were less than or equal to that threshold
- *_sum — the total sum of all recorded values
- *_count — the total number of observations
For example, if you define buckets at 0.1, 0.5, and 1.0 seconds for http_request_duration_seconds, you’ll get cumulative counts of how many requests completed in:
- ≤ 100ms
- ≤ 500ms
- ≤ 1s
Prometheus automatically generates all three components when you instrument a histogram metric. These raw series form the basis for percentile estimation using PromQL functions like histogram_quantile()—which we’ll cover shortly.
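As a rough illustration (the counts here are made up), a single scrape of such a histogram exposes series like these; each le bucket is cumulative, and the +Inf bucket always equals the total count:
http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 310
http_request_duration_seconds_bucket{le="1"} 318
http_request_duration_seconds_bucket{le="+Inf"} 320
http_request_duration_seconds_sum 97.4
http_request_duration_seconds_count 320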
Why Histogram Buckets Are Important in Production
Histogram buckets are essential when averages stop telling the truth.
In production, what breaks user experience isn’t the average latency, it’s the outliers. Histograms help expose those long-tail behaviors that typical metrics flatten out.
What You Get from Histograms
- Outlier visibility: A service with a 300ms average might still have 1% of requests taking 5+ seconds. Histograms surface that tail.
- SLO accuracy: You can define SLOs at the 95th or 99th percentile—based on real distribution data, not just mean values.
- On-the-fly percentiles: Use histogram_quantile() in PromQL to calculate percentiles dynamically, without needing to predefine them at ingest time.
- Trend detection: Spot slow drifts in performance, like a creeping p95 latency, even when the average looks stable.
- Capacity planning: Understand how request durations shift under load, helping you plan for scaling and throttling.
- User experience correlation: Link slow responses to specific parts of the user journey by breaking down latency into time bands.
Example:
Let’s say your dashboard shows a 1.2s average page load time on your e-commerce site; it seems fine at first glance. But histogram data reveals that during peak traffic, 10% of checkout requests take over 4 seconds. That delay directly maps to a spike in cart abandonment.
Without histogram buckets, this insight is lost. You’d be optimizing for the wrong thing, fixing what looks fine, while ignoring what’s hurting users.
Setting Up Prometheus Histograms
Quick Start: Deploy Histogram Buckets
If you're tracking latency and need meaningful histograms now, this setup gives you reliable visibility without blowing up your cardinality budget.
Example: Bucket Configuration for REST APIs (Go + Prometheus client)
responseTime := promauto.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request duration in seconds",
    Buckets: []float64{0.01, 0.05, 0.1, 0.5, 1, 5, 10}, // 10ms to 10s range
}, []string{"endpoint", "status_class"}) // labels consumed by WithLabelValues below
func handleRequest(endpoint string, statusCode int) {
start := time.Now()
defer func() {
duration := time.Since(start).Seconds()
responseTime.WithLabelValues(endpoint, getStatusClass(statusCode)).Observe(duration)
}()
// Handle request
}
func getStatusClass(code int) string {
switch {
case code < 300:
return "2xx"
case code < 400:
return "3xx"
case code < 500:
return "4xx"
default:
return "5xx"
}
}
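To make these observations scrapeable, the process also needs to expose a /metrics endpoint. A minimal sketch using the Go client's promhttp handler (the port is a placeholder):
import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // promauto registers metrics on the default registry, which promhttp serves
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}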
Alerting on Latency Spikes
- alert: HighLatencyP95
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5
for: 10m
Why This Works
- Bucket Coverage: Ranges from ultra-fast cache hits (~10ms) to severe slowdowns (~10s), with higher resolution between 50ms–500ms—ideal for web APIs.
- Manageable Cardinality: Two dimensions (endpoint, status_class) keep time series counts under control while still offering useful drill-downs.
- Ready for Aggregation: Histogram data can be combined across services or regions for accurate percentiles at scale.
This setup provides observability that scales with your application—without tuning knobs every sprint.
Implementation Examples with Go and Python
You can start using histograms with just a few lines of code. Prometheus client libraries make it easy to define custom buckets and record observations. Here’s how it works.
Example in Go
import (
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promauto"
"time"
)
// Define a histogram with custom bucket boundaries
var responseTimeHistogram = promauto.NewHistogram(prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: []float64{0.1, 0.3, 0.5, 0.7, 1.0, 2.0, 5.0, 10.0}, // in seconds
})
func handleRequest() {
start := time.Now()
// ... handle the request ...
duration := time.Since(start).Seconds()
responseTimeHistogram.Observe(duration)
}
This setup tracks how many requests fall into each time range—100ms, 300ms, 500ms, and so on. Prometheus automatically updates the appropriate buckets whenever you call Observe().
If you don't want to define buckets manually, you can also use Prometheus' default set:
var responseTimeHistogram = promauto.NewHistogram(prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
// Uses default buckets: [0.005, 0.01, 0.025, ..., 10]
})
Example in Python
from prometheus_client import Histogram
import time
# Define a histogram with custom buckets
REQUEST_TIME = Histogram(
'http_request_duration_seconds',
'HTTP request duration in seconds',
buckets=[0.1, 0.3, 0.5, 0.7, 1.0, 2.0, 5.0, 10.0]
)
def process_request():
start = time.time()
# ... handle the request ...
duration = time.time() - start
REQUEST_TIME.observe(duration)
Both examples follow the same pattern:
- Define the histogram with meaningful bucket ranges.
- Measure the duration.
- Record the observation using Observe().
This gives Prometheus everything it needs to track request distributions and calculate percentiles later using PromQL: rate(), histogram_quantile(), and friends.
Configure Histogram Buckets
Choosing the Right Histogram Buckets
Bucket boundaries have a big impact on how useful your histogram data turns out to be. The goal is to capture enough detail to support real analysis, without adding unnecessary overhead or noise.
Here’s a structured way to think about it.
Start with Your SLOs
Begin with what matters for your service.
- If your SLO is “99% of requests under 300ms,” include buckets around 200ms, 300ms, and 400ms.
- For multiple SLOs (like 95% under 200ms and 99% under 500ms), define boundaries that cover each range.
This makes it easier to evaluate how close you are to SLO thresholds using percentiles like p95 or p99.
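For instance, a bucket layout aimed at a 300ms SLO might look like the sketch below; the exact boundaries are illustrative, not prescriptive:
// Denser around the 300ms SLO target, with coverage for the long tail
sloLatencyBuckets := []float64{0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 1, 2.5, 5}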
Consider User-Perceived Latency
Latency impacts users differently depending on how long they wait. These general ranges often map well to how delays are perceived:
- ≤ 100ms — feels instant
- 100–300ms — noticeable but fine
- 300–1000ms — introduces some friction
- > 1s — often feels broken
Including buckets around these thresholds helps bridge the gap between system metrics and user experience.
Use Logarithmic or Exponential Scales
Uniform bucket spacing can miss important patterns. Instead, use exponential or logarithmic spacing to capture a wider range with more detail where needed.
Examples:
[]float64{0.0625, 0.125, 0.25, 0.5, 1, 2, 4, 8}
Or base-10 with intermediate steps:
[]float64{0.001, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}
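If you'd rather not hand-write these, the Go client ships helpers that generate the spacing for you; a small sketch (assumes the github.com/prometheus/client_golang/prometheus import):
// 8 exponential buckets starting at 62.5ms, doubling each step: 0.0625 ... 8
expBuckets := prometheus.ExponentialBuckets(0.0625, 2, 8)

// 10 linear buckets from 100ms to 1s in 100ms steps (rarely ideal for latency)
linBuckets := prometheus.LinearBuckets(0.1, 0.1, 10)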
Add Resolution Where It Matters
If your SLO is strict, say, 99% under 300ms, it’s helpful to add more granularity around that point:
[]float64{0.25, 0.275, 0.3, 0.325, 0.35}
This makes it easier to see small shifts that could push you past your threshold.
A Practical Starting Point
If you're looking for a default set that works well for most web applications, this one covers both fast and slow responses:
[]float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
It’s balanced enough to catch both quick successes and longer outliers, without overwhelming your storage or query engine.
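In the Go client this is exactly the default bucket set, so you can reference prometheus.DefBuckets instead of copying the slice:
// Same layout as above: prometheus.DefBuckets covers 5ms through 10s
var defaultLatency = promauto.NewHistogram(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request duration in seconds",
    Buckets: prometheus.DefBuckets,
})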
Which Buckets Should I Use?
Here's what works based on service type:
Service Type | Bucket Configuration | Why These Work | Alert Threshold |
---|---|---|---|
REST API | [0.01, 0.05, 0.1, 0.5, 1, 5, 10] | Covers fast responses to timeouts | p95 > SLO |
Database | [0.001, 0.01, 0.1, 0.5, 1, 10, 60] | Cache hits to analytical queries | p95 > 100ms |
Queue Processing | [0.1, 1, 5, 30, 300, 1800] | Event processing to batch jobs | p99 > timeout |
File Operations | [0.1, 1, 5, 30, 120, 600] | Small files to large uploads | p95 > user patience |
External API Calls | [0.05, 0.2, 1, 5, 15, 30] | Network calls with timeouts | p99 > circuit breaker |
Service-specific refinements:
- High SLO requirements (p99 < 100ms)? Add buckets at [0.025, 0.075, 0.125]
- Long-running operations? Extend upper range: [..., 1800, 3600, 7200]
- Microservices with external deps? Focus mid-range: [0.1, 0.5, 2, 10]
Start with the standard config for your service type, run it for a week, then check your actual p95/p99 values and adjust bucket density around those ranges.
Common Mistakes When Using Histogram Buckets
Histograms are powerful, but misconfigured buckets can quietly cause serious problems—from bloated storage to misleading insights. Here are the most common pitfalls and how to avoid them.
1. Using Too Many Buckets
Adding more buckets might seem like it gives better visibility—but in practice, it often leads to:
- Excessive time series creation
- Higher memory and CPU usage
- Slower queries
- Increased storage costs
For example, using 50 buckets across dozens of endpoints quickly results in thousands of active series. Most setups don’t need that level of granularity. A more practical approach is to use 10–15 buckets focused on key thresholds (like SLO boundaries and tail latency cutoffs).
2. Wrong Bucket Strategy
Linear buckets (e.g., 100ms, 200ms, 300ms…) rarely align with the shape of latency or payload distributions, which are usually skewed or heavy-tailed.
Using linear buckets can:
- Overrepresent rarely occurring values
- Miss details in the critical lower range
- Flatten useful patterns
Exponential buckets (e.g., 5ms, 10ms, 50ms, 100ms, 500ms…) offer better resolution where it matters, especially around the 95th or 99th percentile.
3. Incomplete Range Coverage
If the upper bound of the histogram is too low, any values beyond that range are grouped into the final bucket, which hides long-tail behavior.
For example, if the largest bucket is 5 seconds, a 15-second timeout gets lumped into the same category. This makes it hard to identify slowdowns or incidents that fall outside the “expected” range.
To avoid this, the largest bucket should comfortably cover at least 2–3× the maximum expected latency.
4. Skewed Bucket Placement
It’s common to define buckets only for the "normal" case, below the SLO target. That leaves no visibility into regressions.
When all buckets end below a 1-second target, any deviation looks flat or capped. This masks performance degradation until it's too late.
Define buckets that extend well past your SLO thresholds. This helps detect early signs of drift before it becomes an incident.
5. High-Cardinality Labels
Each histogram bucket is multiplied across all label combinations. Labels like user_id, session_id, or query_hash can cause a combinatorial explosion in the number of time series.
This leads to:
- High memory usage
- Increased cardinality pressure on the TSDB
- Difficulty querying or aggregating data
Stick to low-cardinality labels like endpoint, region, or status_code. Avoid anything unbounded or user-specific.
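One practical way to keep an endpoint-style label bounded is to record the matched route template rather than the raw URL path; a minimal sketch (variable and route names here are illustrative):
var httpDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "http_request_duration_seconds",
    Help:    "HTTP request duration in seconds",
    Buckets: prometheus.DefBuckets,
}, []string{"route", "status_class"})

// Record "/orders/{id}" (one series) rather than "/orders/12345" (one series per ID)
func observeRequest(routeTemplate, statusClass string, seconds float64) {
    httpDuration.WithLabelValues(routeTemplate, statusClass).Observe(seconds)
}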
Advanced Histogram Techniques
Once you’re comfortable with basic bucket configuration, you can go a step further with programmatically defined buckets and derived quantile analysis.
Programmatic Bucket Generation
In dynamic systems where latency profiles evolve or vary across services, hardcoded bucket boundaries might not be sufficient. You can generate buckets programmatically for more precise control.
Logarithmic Bucket Scaling
Logarithmic spacing is ideal when the metric spans several orders of magnitude—e.g., sub-millisecond to multi-second response times.
func generateLogarithmicBuckets(min, max float64, count int) []float64 {
buckets := make([]float64, count)
logMin := math.Log(min)
logMax := math.Log(max)
for i := 0; i < count; i++ {
ratio := float64(i) / float64(count-1)
buckets[i] = math.Exp(logMin + ratio*(logMax-logMin))
}
return buckets
}
// Example: 10 buckets from 1ms to 10s
buckets := generateLogarithmicBuckets(0.001, 10, 10)
This creates exponentially wider buckets as you move up the latency scale, helpful when performance degrades non-linearly.
Clustered Buckets Around SLO Thresholds
To capture detail near a service-level objective (SLO) boundary—e.g., around 300ms—generate more bucket density around that target.
func generateClusteredBuckets(target, spread float64, count int) []float64 {
buckets := make([]float64, count)
for i := 0; i < count; i++ {
position := float64(i)/float64(count-1)*2 - 1 // [-1, 1]
buckets[i] = target + spread*math.Tanh(position*2)
}
sort.Float64s(buckets)
return buckets
}
// Example: 8 buckets clustered around 0.3s ±0.2s
sloBuckets := generateClusteredBuckets(0.3, 0.2, 8)
This approach provides tighter granularity around the latency threshold that matters most to you.
Quantile Calculation with PromQL
Prometheus exposes quantile approximations using histogram_quantile() over cumulative histogram data. For example:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
This estimates the 95th percentile over a 5-minute sliding window. For deeper analysis:
- Compare percentile drift across time windows:
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) /
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le))
A rising ratio suggests short-term latency spikes compared to the hourly baseline.
- Persist quantiles using recording rules:
groups:
- name: latency
rules:
- record: service:http_duration_seconds:p95_5m
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Useful for alerting, dashboards, or longer-term trend analysis.
- Detect skew via percentile ratios:
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) /
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
An increasing 99th-to-50th percentile ratio can indicate worsening tail latency while median performance appears stable.
Note: Quantile accuracy is bounded by bucket granularity. For better precision, define buckets more densely around the percentile of interest.
Histogram Aggregation and Multi-Window Analysis
One of the key advantages of Prometheus histograms is that they aggregate cleanly across dimensions—especially across instances, services, or regions. This makes them ideal for calculating global percentiles in distributed environments.
Aggregating Percentiles Across Instances
Since histogram buckets are cumulative and aligned by le (less-than-or-equal) boundaries, they can be safely summed across instances before applying histogram_quantile().
Example: Global p99 latency for all frontend pods
histogram_quantile(
0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
This query combines data across all time series with the same bucket structure, giving a fleet-wide view of the 99th percentile.
Multi-Window Percentile Comparison
You can also compare percentiles across different time windows to detect shifts in latency behavior.
Example: p95 change in the last hour vs. daily baseline
histogram_quantile(
0.95,
sum(rate(http_request_duration_seconds_bucket[1h])) by (le)
)
/
histogram_quantile(
0.95,
sum(rate(http_request_duration_seconds_bucket[1d])) by (le)
)
If this ratio exceeds 1.0, it suggests that p95 latency has increased in the recent window compared to the long-term baseline. A spike in this ratio can be used to trigger alerts or highlight services for deeper investigation.
Considerations
- This technique assumes consistent bucket boundaries across all series. Custom bucket layouts per service will break aggregation (see the sketch after this list).
- Larger window ranges (like [1d]) require sufficient retention in your TSDB and can be resource-intensive depending on scrape intervals.
- The closer your buckets are to the percentile of interest, the more accurate the comparison will be.
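One way to guard against drifting bucket layouts is to define them once in a shared module that every service imports; a minimal sketch, assuming a hypothetical internal metrics package:
// package metrics — imported by every service so bucket boundaries stay identical
package metrics

// StandardLatencyBuckets is the single source of truth for HTTP latency buckets.
var StandardLatencyBuckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}
Each service then passes metrics.StandardLatencyBuckets into its HistogramOpts, which keeps fleet-wide histogram_quantile() aggregation valid.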
Histogram vs. Summary: Choosing the Right Distribution Metric
Prometheus provides two options for capturing distributions: histograms and summaries. While they seem similar on the surface, they behave very differently, especially when it comes to aggregation and percentile accuracy.
Here's a technical comparison:
Feature | Histogram | Summary |
---|---|---|
Server-side aggregation | Yes (can aggregate across instances) | No (percentiles are not mergeable) |
Client-side percentiles | No (calculated during query time) | Yes (calculated during collection) |
Calculation flexibility | High (query-time percentiles) | Low (fixed percentiles only) |
Accuracy | Depends on bucket layout | Higher (exact quantiles within time window) |
CPU usage | Lower on client side | Higher on client side |
Memory usage | Lower on clients | Higher on clients |
Storage usage | Higher (more time series per metric) | Lower (fewer time series) |
Query performance | Can be slower (more complex queries) | Faster (pre-aggregated quantiles) |
Alerting and dashboard support | Better ecosystem support | More limited tooling support |
When to Use Histograms
Opt for histograms if:
- You need global percentiles across multiple instances (e.g., p99 across all frontend pods)
- You want flexibility in choosing percentiles at query time
- Your use case involves heatmaps or distribution charts
- You expect percentile definitions to evolve over time
- You're optimizing for client-side performance (low CPU/memory)
When to Use Summaries
Use summaries when:
- You need high-accuracy percentiles without approximation
- Aggregation across instances is not required
- You can define fixed percentiles upfront (e.g., only p95 and p99)
- You're working with a small number of targets
- Query performance is a top concern
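For contrast, this is roughly what the summary equivalent looks like in the Go client: quantiles are fixed up front via Objectives (the targets and error tolerances here are illustrative), and the resulting series cannot be merged across instances:
var requestDurations = promauto.NewSummary(prometheus.SummaryOpts{
    Name: "http_request_duration_seconds",
    Help: "HTTP request duration in seconds",
    // Quantile -> allowed absolute error; fixed at collection time
    Objectives: map[float64]float64{0.5: 0.05, 0.95: 0.01, 0.99: 0.001},
})

func observeDuration(seconds float64) {
    requestDurations.Observe(seconds)
}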
Histogram Bucket Configurations for Common Services
Proper bucket configuration is essential for capturing accurate latency distributions. Below are commonly used configurations optimized for different types of workloads.
REST API Latency Buckets
// Target: 200ms SLO for RESTful endpoints
apiLatencyBuckets := []float64{
0.005, 0.025, 0.050, 0.100,
0.150, 0.200, 0.300, 0.500,
1.000, 2.500, 5.000, 10.000,
}
Rationale:
This layout increases resolution near the 200-ms threshold and extends coverage to 10s for long-tail degradations.
Database Query Latency Buckets
// Target: Sub-100ms for OLTP, but up to 60s for analytical queries
dbQueryBuckets := []float64{
0.001, 0.005, 0.010, 0.025, 0.050, 0.100,
0.250, 0.500, 1.000, 2.500, 5.000, 10.000,
30.000, 60.000,
}
Rationale:
Captures a wide latency spectrum from fast key lookups to slower analytical workloads.
Background Job Execution Buckets
// Target: Tracks jobs from 1s to 2h
jobProcessingBuckets := []float64{
1, 5, 15, 30, 60,
180, 300, 600, 1200,
1800, 3600, 7200,
}
Rationale:
Designed for batch workloads with long runtime variance. Covers both short-lived and long-running tasks.
Event Processing Pipeline Buckets
// Target: Microservice pipelines, including external service interactions
eventPipelineBuckets := []float64{
0.010, 0.050, 0.100, 0.250,
0.500, 1.000, 2.500, 5.000,
10.000, 15.000,
}
Rationale:
Spans fast internal processing and multi-second delays from downstream service dependencies.
File Upload/Download Buckets
// Target: File transfer operations
fileTransferBuckets := []float64{
0.100, 0.500, 1.000, 2.500,
5.000, 10.000, 30.000, 60.000,
120.000, 300.000,
}
Rationale:
Latency is proportional to file size; this configuration scales to handle small transfers through to multi-minute media operations.
Implement Effective Alerting Using Histogram Metrics
Histograms enable more accurate alerting by capturing percentiles rather than relying on averages. This allows systems to trigger alerts only when a real user-facing impact is likely.
Below are practical techniques to structure a robust alerting strategy based on histogram data in Prometheus.
1. Multi-level Percentile Alerting
Use percentile-based thresholds to classify latency issues by severity.
groups:
- name: LatencyAlerts
rules:
# Warning: P95 exceeds SLO
- alert: HighP95Latency
expr: |
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5
for: 10m
labels:
severity: warning
annotations:
summary: "High latency on {{ $labels.service }}"
description: "P95 latency above 500ms for {{ $labels.service }} over the last 10 minutes"
dashboard: "https://grafana.example.com/d/latency/service-latency?var-service={{ $labels.service }}"
runbook: "https://wiki.example.com/sre/runbooks/high-latency"
# Critical: P99 exceeds SLO
- alert: CriticalP99Latency
expr: |
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 1.0
for: 5m
labels:
severity: critical
annotations:
summary: "Critical latency on {{ $labels.service }}"
description: "P99 latency above 1s for {{ $labels.service }} over the last 5 minutes"
dashboard: "https://grafana.example.com/d/latency/service-latency?var-service={{ $labels.service }}"
runbook: "https://wiki.example.com/sre/runbooks/critical-latency"
This setup separates moderate degradation from severe impact and helps prioritize incidents effectively.
2. Detecting Latency Distribution Skew
Monitoring percentiles alone may miss problems where only the tail worsens. Track skew by comparing tail and median values:
- alert: LatencyDistributionSkew
expr: |
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) /
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 10
for: 15m
labels:
severity: warning
annotations:
summary: "Latency distribution skew detected"
description: "P99/P50 ratio exceeds 10x for {{ $labels.service }}, suggesting significant tail latency growth"
This captures scenarios where the median is healthy but a growing tail indicates unstable performance.
3. Alerting on SLO Burn Rate
Burn rate-based alerting focuses on how fast your error budget is being consumed, not just static thresholds.
# Recording rule to capture SLO violations
- record: service:requests_exceeding_slo:ratio_5m
expr: |
(
sum(rate(http_request_duration_seconds_count[5m])) by (service)
-
sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m])) by (service)
)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
# Burn rate alert for 4x the allowable error budget (example: 5% allowed)
- alert: SLOBurnRateTooHigh
expr: service:requests_exceeding_slo:ratio_5m > 0.20
for: 15m
labels:
severity: warning
annotations:
summary: "SLO burn rate high for {{ $labels.service }}"
description: "Error budget is being consumed at >4x the allowed rate for {{ $labels.service }}"
This pattern allows early detection before you breach SLO targets over a longer period.
4. Multi-Window Alerting for Stability
Reduce false positives by validating metrics over short and long windows simultaneously.
- alert: SustainedLatencyIncrease
expr: |
(
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) > 0.5
)
and
(
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le, service)) > 0.5
)
for: 5m
labels:
severity: warning
annotations:
summary: "Sustained latency increase"
description: "P95 latency consistently above 500ms in both 5-minute and 1-hour windows for {{ $labels.service }}"
By aligning short-term spikes with long-term trends, this reduces noise and increases signal clarity.
Histogram Bucket Performance Optimization Techniques
For high-traffic services, histogram buckets can generate significant storage needs and performance overhead.
Here are advanced optimization techniques to keep your Prometheus deployment efficient:
1. Strategic Label Usage to Control Cardinality
Labels multiply the number of time series dramatically. Each unique combination of label values creates a complete set of histogram buckets:
// BAD: High cardinality - creates unique buckets per endpoint AND method
httpRequestDuration.WithLabelValues(endpoint, method, statusCode).Observe(duration)
// BETTER: Group by meaningful dimensions only
httpRequestDuration.WithLabelValues(endpoint, statusCode).Observe(duration)
Impact calculation: With 10 buckets, 100 endpoints, 4 methods, and 5 status codes:
- Bad approach: 10 × 100 × 4 × 5 = 20,000 time series
- Better approach: 10 × 100 × 5 = 5,000 time series (75% reduction)
Consider creating separate histograms for different dimensions rather than using labels when appropriate.
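For instance, if you only ever look at per-endpoint latency and per-status latency separately, two narrower histograms can be cheaper than one histogram carrying both labels; a hedged sketch (metric names are placeholders):
// Two independent histograms: series count is (endpoints + status classes) x buckets,
// instead of (endpoints x status classes) x buckets for a single combined vector.
var (
    durationByEndpoint = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_by_endpoint_seconds",
        Help:    "HTTP request duration in seconds, by endpoint",
        Buckets: prometheus.DefBuckets,
    }, []string{"endpoint"})

    durationByStatus = promauto.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "http_request_duration_by_status_seconds",
        Help:    "HTTP request duration in seconds, by status class",
        Buckets: prometheus.DefBuckets,
    }, []string{"status_class"})
)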
2. Implementing Client-side Aggregation
For services with many instances, perform client-side aggregation:
// Use PushGateway for batch processing jobs
func submitHistogramOnCompletion() {
registry := prometheus.NewRegistry()
registry.MustRegister(jobDurationHistogram)
pusher := push.New("pushgateway:9091", "batch_job").
Gatherer(registry)
// Push metrics once at the end of the job
if err := pusher.Push(); err != nil {
log.Errorf("Could not push to Pushgateway: %v", err)
}
}
Or keep the pull model and run Prometheus in agent mode, forwarding scraped samples to a central instance via remote_write:
# prometheus.yml for an agent-mode Prometheus
# (agent mode itself is enabled with the --enable-feature=agent server flag)
global:
  scrape_interval: 15s
remote_write:
  - url: "https://prometheus-central:9090/api/v1/write"
    name: central_prometheus
3. Bucket Selection Optimization
Remove unnecessary buckets that don't provide valuable insights:
// Original buckets
originalBuckets := []float64{0.005, 0.01, 0.025, 0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10}
// Optimized buckets that still cover key thresholds but with fewer points
optimizedBuckets := []float64{0.01, 0.05, 0.1, 0.5, 1, 5, 10}
Storage impact: Going from 14 buckets to 7 can reduce TSDB storage requirements by 50% for that histogram.
4. Sampling and Filtering Techniques
For ultra-high-volume metrics, consider implementing sampling:
func shouldSample() bool {
return rand.Float64() < 0.1 // 10% sampling rate
}
func handleRequest() {
// Always measure the duration
start := time.Now()
// ... handle the request ...
duration := time.Since(start).Seconds()
// But only record to histogram for a percentage of requests
if shouldSample() {
requestDurationHistogram.Observe(duration)
}
}
This works well for services with thousands of requests per second where you don't need to measure every request.
5. Time Series Retention and Downsampling
Configure appropriate retention periods based on access patterns:
# Retention is set with a Prometheus server flag, not in prometheus.yml
# Retain raw histogram data for 15 days
--storage.tsdb.retention.time=15d
# Use recording rules for downsampled long-term storage
- record: job:request_duration:histogram_p95_1h
expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[1h])) by (le, job))
This pattern keeps full-resolution histogram data for recent analysis while preserving key metrics for longer-term trending.
Troubleshooting Common Prometheus Histogram Implementation Issues
Here's a detailed troubleshooting guide for common problems:
Diagnosing Inaccurate Percentile Calculations
When your histogram_quantile calculations produce unexpected or seemingly wrong results:
Problem: Percentiles jumping erratically between queries
- Root cause: Insufficient data in the time window or poor bucket selection
Solution: Increase the time window in the rate() function:
# More stable with a longer window
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[10m])) by (le))
Problem: Percentiles always land exactly on bucket boundaries
- Root cause: Linear interpolation assumes even distribution within buckets
Solution: Add more buckets around critical percentiles:
// Add fine-grained buckets around p95 target (300ms)
[]float64{0.25, 0.27, 0.29, 0.3, 0.31, 0.33, 0.35}
Problem: Percentiles reporting lower than minimum observed values
- Root cause: Often occurs with low traffic and rate() calculations
Solution: Use increase() instead of rate() for low-volume services:
histogram_quantile(0.95, sum(increase(http_request_duration_seconds_bucket[10m])) by (le))
Resolving High Cardinality Explosions
When histograms cause excessive storage or memory usage:
Problem: Prometheus crashes or slows dramatically after adding histograms
- Root cause: Too many label combinations multiplying bucket cardinality
- Solution: Implement one or more of these fixes:
- Reduce label dimensions on high-cardinality histograms
- Increase Prometheus storage allocation
- Shard your Prometheus instances by metric type
Check: Run this query to identify the worst offenders:
topk(10, count by (__name__, job) ({__name__=~".+_bucket"}))
Problem: Queries on histogram data become extremely slow
- Root cause: Too many histogram buckets across too many services
Solution: Create recording rules for common percentile calculations:
- record: job:http_request_duration:p95_5m
  expr: histogram_quantile(0.95, sum by (le, job) (rate(http_request_duration_seconds_bucket[5m])))
Fixing Missing or Incomplete Histogram Data
When your histograms aren't capturing all the data you expect:
Problem: Some requests don't appear in any bucket
- Root cause: The bucket range doesn't cover all values, so everything above the largest explicit boundary lands only in +Inf
- Solution: Extend the upper bucket range to cover worst-case values (the +Inf bucket is added automatically) and check for dropped metrics
Diagnosis: Compare the total count against the +Inf bucket:
sum(http_request_duration_seconds_count) - sum(http_request_duration_seconds_bucket{le="+Inf"})
(Should be zero; if not, there's a problem)
Problem: Histogram data disappears after service restarts
- Root cause: Counter reset behavior with incorrect query formulation
Solution: Use increase() or rate() instead of raw counters:
# Handles counter resets properly
sum(increase(http_request_duration_seconds_bucket[5m])) by (le)
Problem: Inconsistent histogram data across service instances
- Root cause: Different bucket configurations between instances
- Solution: Standardize histogram bucket definitions in a shared configuration or library
Resolving Resource Consumption Issues
When histograms consume excessive resources:
Problem: Prometheus storage growing too quickly
- Root cause: Too many histograms with too many buckets
Solution: Implement a histogram bucket reduction strategy:
// Before: 14 buckets
[]float64{0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20}
// After: 7 strategically chosen buckets
[]float64{0.001, 0.01, 0.1, 0.5, 1, 10, 20}
Problem: Histogram instrumentation adds too much overhead
- Root cause: High-frequency observations in critical paths
Solution: Implement adaptive sampling:
sampleRate := 0.01 // Sample 1% by default
if duration > 1.0 { // But sample 100% of slow requests
    sampleRate = 1.0
}
if rand.Float64() <= sampleRate {
    requestDurationHistogram.Observe(duration)
}
Making Histogram Data Actionable at Scale
The techniques in this guide work great until you're managing dozens of services with hundreds of histograms. That's when you need tooling that handles the complexity for you.
Where Teams Hit Walls
- Query complexity: Writing correct histogram_quantile() queries for every use case
- Bucket optimization: Manually tuning buckets across different service types
- SLO management: Tracking error budgets and burn rates across multiple teams
- Incident correlation: Connecting latency spikes to deployments, infrastructure changes, or external dependencies
How Last9 Extends Prometheus Histograms
- Visual SLO tracking: See error budget consumption and burn rates without writing PromQL
- Smart alerting: Get notified about percentile degradation before users complain
- Automated optimization: Identify inefficient bucket configurations and high-cardinality issues
- Correlation engine: Automatically link latency spikes to deployments, infrastructure metrics, or external service issues
Built for teams using these exact techniques but needing to scale beyond manual PromQL management.
Start for free or talk to us about your use case. We'd be happy to show you the platform and how it can help!
FAQs
What exactly is the difference between histograms and summary metrics in Prometheus?
Histograms and summaries track distribution data differently:
Histograms:
- Store observations in configurable buckets (counters of values ≤ each threshold)
- Calculate percentiles at query time using histogram_quantile()
- Allow aggregation across multiple instances (crucial for distributed systems)
- Provide flexibility to calculate any percentile without pre-configuration
- Take less client-side CPU but more storage space
- Work well with Prometheus recording rules and alerting
Summaries:
- Pre-calculate percentiles in the client application
- Store specific quantiles (e.g., 0.5, 0.9, 0.99) directly
- Provide more accurate percentiles within single instances
- Cannot be meaningfully aggregated across instances
- Use more client-side resources but less storage
- Have fixed percentiles that can't be changed after collection
Choose histograms when you need cross-instance aggregation or flexible percentile selection. Choose summaries when you need exact percentiles on single instances.
How do I determine the optimal number of buckets for a Prometheus histogram?
The optimal bucket count balances accuracy against resource usage:
- General guideline: 10-15 buckets work well for most applications
- Minimum effective number: At least 7 buckets (to cover 2-3 orders of magnitude)
- Resource-constrained systems: Stick to 7-10 strategically placed buckets
- High-precision requirements: Up to 20-25 buckets, focusing resolution where needed
Focus bucket density around:
- Your SLO thresholds (e.g., more buckets around your p95 target)
- User experience breakpoints (e.g., 100ms, 300ms, 1s)
- Expected operational ranges for your specific service
Remember that each bucket creates a separate time series, so costs grow linearly with bucket count.
When changing histogram bucket definitions, what happens to historical data?
When you modify histogram bucket definitions:
- New time series creation: Prometheus creates entirely new time series for the new buckets
- Historical data limitation: Historical data won't be retroactively available in the new buckets
- Dual maintenance period: You'll need to maintain both old and new histograms during transition
- Recording rule approach: For critical metrics, create recording rules with the old buckets before changing
Best practices for bucket changes:
- Plan bucket layouts carefully before going to production
- When changing is necessary, keep the old metric name for twice your retention period
- Use a new metric name for the new bucket layout (e.g., http_request_duration_seconds_v2)
- Create a recording rule that combines old and new data during the transition
What's the most effective way to calculate accurate percentiles from Prometheus histogram buckets?
For accurate percentile calculations:
# Basic p99 calculation
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# For more stability in low-traffic services
histogram_quantile(0.99, sum(increase(http_request_duration_seconds_bucket[10m])) by (le))
# Aggregating across job instances while preserving endpoint dimension
histogram_quantile(0.95, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))
To improve accuracy:
- Use more buckets around the percentile you're calculating
- Use longer time windows for stability (5-10m instead of 1m)
- For critical percentiles, create recording rules to ensure calculation consistency
Remember that percentile accuracy is always limited by your bucket layout—more buckets around key percentiles yield better accuracy.
How can high cardinality with Prometheus histograms be effectively managed?
High cardinality management strategies:
- Label discipline:
- Keep high-cardinality labels (like user_id, request_id) out of histograms
- Use no more than 2-3 label dimensions per histogram
- Move high-cardinality dimensions to separate metrics when needed
- Bucket optimization:
- Use only necessary buckets (8-12 is often sufficient)
- Standardize bucket layouts across services
- Remove buckets that don't provide actionable insights
- Architecture approaches:
- Implement client-side aggregation for high-volume services
- Use federation or hierarchical Prometheus for large-scale deployments
- Create recording rules for commonly queried percentiles
- Sampling techniques:
- Implement probabilistic sampling for ultra-high-volume services
- Use higher sampling rates for outliers and errors
- Consider exemplar-based approaches for detailed analysis
How can I improve the accuracy of percentiles calculated from histogram buckets?
Percentile accuracy depends on your bucket configuration:
- Add targeted bucket density:
- Place more buckets around critical percentiles (e.g., your p95 or p99 target)
- Example: For a p95 target of 300ms, add buckets at 250ms, 275ms, 300ms, 325ms, 350ms
- Use logarithmic distribution:
- Linear buckets create poor resolution; use exponential/logarithmic spacing
- Evenly distribute bucket density in log-space, not linear space
- Incorporate historical performance:
- Analyze several weeks of data to identify your actual distribution
- Place buckets based on observed percentiles, not theoretical ones
- Evaluate specific service patterns:
- Services with bimodal distributions need buckets covering both modes
- Cache-heavy services need extra resolution in lower latency ranges
The theoretical maximum accuracy is ±(upper_bound - lower_bound)/2 for the bucket containing your percentile.
Beyond request timing, what other metrics benefit from histogram bucket analysis?
Histograms are valuable for many distributions beyond request duration:
- Resource utilization: Memory usage, CPU utilization, disk IOPS
- Queue metrics: Queue depth, time in queue, batch sizes
- Network performance: Packet sizes, network latency, throughput
- Database metrics: Query execution time, connection pool usage, row counts
- Cache performance: Cache hit ratios, time-to-cache, object sizes
- User behavior: Session duration, items per cart, clicks per session
- Message processing: Message size, processing latency, retry counts
- Batch job metrics: Records processed per second, job duration, error rates
- API response sizes: Payload sizes for requests and responses
- Thread pool metrics: Thread usage, task execution time, queue wait time
The pattern applies whenever you need to understand a distribution rather than just averages or totals.
How do I implement histogram bucket monitoring for non-time measurements like request sizes?
For non-time measurements:
- Adjust bucket scales to match data characteristics:
  - Memory usage: Consider MB-scale buckets like [128, 256, 512, 1024, 2048, 4096]
  - Queue depth: Use application-appropriate buckets like [1, 5, 10, 50, 100, 500]
  - Message counts: Linear buckets might work better, e.g., [10, 20, 50, 100, 250, 500]
Create SLOs on size distributions when appropriate:
# Alert when p95 message size exceeds 500KB
histogram_quantile(0.95, sum(rate(message_size_bytes_bucket[5m])) by (le, topic)) > 512000
Observe distribution patterns:
// Track API response sizes
sizeBytes := float64(len(responseData))
responseSizeHistogram.Observe(sizeBytes)
Choose appropriate units and scale:
// For request sizes in bytes, using powers-of-10 scale
requestSizeHistogram := prometheus.NewHistogram(prometheus.HistogramOpts{
Name: "http_request_size_bytes",
Help: "HTTP request size in bytes",
Buckets: []float64{10, 100, 1000, 10000, 100000, 1000000, 10000000},
})