The sum_over_time() function in Prometheus gives you a way to aggregate counter values, gauge readings, and histogram samples across specific time windows. Instead of seeing point-in-time values, you get the cumulative total of all data points within your chosen range—useful for calculating totals from rate data, tracking accumulated errors, or understanding resource consumption patterns over custom intervals.
This function works with range vectors, not instant vectors, so you'll always pair it with a time selector like [5m] or [1h]. The result is a single aggregated value per time series that represents the sum of all raw samples in that window.
sum_over_time() Syntax and Behavior
The basic syntax for sum_over_time() follows this pattern:
sum_over_time(<range_vector>)
The function requires a range vector, which means you need a time selector in square brackets. When you use a subquery, the resolution step you choose affects both precision and performance:
# High precision - every 30 seconds over 1 hour
sum_over_time(metric_name[1h:30s])
# Standard precision - every 1 minute over 1 hour
sum_over_time(metric_name[1h:1m])
# Lower precision - every 5 minutes over 1 hour
sum_over_time(metric_name[1h:5m])
Choose your step size based on your scrape interval and analysis needs. A good rule of thumb is to use a step size that's equal to or a multiple of your scrape interval.
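If you're unsure what a job's effective scrape interval actually is, one rough way to sanity-check it is to count samples of the up metric over a known window; the job label below is just a placeholder for your own target:
# ~20 samples in 10 minutes implies a ~30s scrape interval (replace the job label with your own)
count_over_time(up{job="node"}[10m])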
To see how sum_over_time() fits into the bigger picture, this Prometheus functions blog breaks it down with practical examples.
How sum_over_time() Works with Different Metric Types
Counter Metrics
Counters naturally increment over time, but sum_over_time() adds up the raw counter values, not the rate of increase. This becomes useful when you want totals from already-calculated rates:
# Sum the per-second HTTP request rates over the last hour
sum_over_time(rate(http_requests_total[1m])[1h:1m])
# Total bytes sent across all instances in the last 5 minutes
sum_over_time(rate(network_bytes_sent_total[1m])[5m:1m])
The key here is that you're typically summing rate calculations, not raw counter values. Raw counters give you cumulative totals since the process started, which rarely match what you need for time-based analysis.
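If you want an approximate request count rather than a sum of per-second rates, one common sketch is to multiply the summed rates by the subquery step (60 seconds here):
# Approximate total requests in the last hour: sum of per-second rates x 60s step
sum_over_time(rate(http_requests_total[1m])[1h:1m]) * 60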
Gauge Metrics
Gauges represent current values that can go up or down. With sum_over_time(), you're adding up all the individual measurements:
# Total CPU usage across all samples in the last 10 minutes
sum_over_time(cpu_usage_percent[10m])
# Accumulated memory pressure over the last hour
sum_over_time(memory_pressure_bytes[1h])
This works well for understanding cumulative resource consumption or total utilization across a time window, rather than just the current state.
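To turn those accumulated samples into a time-weighted figure, you can sample the gauge at a fixed step and multiply by that step; this is a sketch, and the queue_depth metric name is hypothetical:
# Approximate "queue-depth seconds" over the last hour (hypothetical gauge, 60s step)
sum_over_time(queue_depth[1h:1m]) * 60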
Histogram Metrics
Histograms track distributions of values. You can use sum_over_time() on histogram bucket series or on the _sum and _count series:
# Sum of the le="0.5" bucket's samples over the last 30 minutes
sum_over_time(http_request_duration_seconds_bucket{le="0.5"}[30m])
# Sum of all histogram observations (using _sum metric)
sum_over_time(http_request_duration_seconds_sum[15m])
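If you want an average observation value over the window rather than raw totals, one sketch is to sum the rates of the _sum and _count series and divide; with matching steps, the ratio approximates the mean request duration over the last 15 minutes:
# Approximate mean request duration over the last 15 minutes
sum_over_time(rate(http_request_duration_seconds_sum[1m])[15m:1m]) / sum_over_time(rate(http_request_duration_seconds_count[1m])[15m:1m])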
Practical Use Cases for Time-Based Aggregation
sum_over_time() becomes essential when you need to calculate totals or cumulative values across specific time windows. Below are a few common scenarios where it adds real value.
1. Calculating Custom Totals from Rate Data
You often need to measure activity within specific windows—like business hours, deployment periods, or peak load intervals.
# Total requests during business hours (9 AM to 5 PM)
sum_over_time(rate(http_requests_total[1m])[8h:1m] @ start())
This query sums the per-second request rate, sampled every minute, across an 8-hour window (multiply by the 60-second step to approximate a raw request count). The @ start() modifier ensures that the window starts at a predictable point (e.g., 9 AM) for alignment with business hours.
# Error count during the last deployment window
sum_over_time(rate(http_errors_total[1m])[30m:1m] offset 2h)
Here, we’re looking at error volume during a 30-minute deployment window that occurred 2 hours ago. Using offset lets you shift the aggregation window back in time—useful for comparing before/after effects of deployments.
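To make that before/after comparison concrete, you can divide the deployment-window total by the total from the 30 minutes immediately before it; treat this as a sketch built on the same hypothetical metric:
# Error volume during the deployment window vs. the 30 minutes before it
sum_over_time(rate(http_errors_total[1m])[30m:1m] offset 2h) / sum_over_time(rate(http_errors_total[1m])[30m:1m] offset 2h30m)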
2. Resource Consumption Analysis
Want to know how much disk or network traffic occurred over a period? Combine rate with sum_over_time for cumulative totals.
# Total disk I/O over the last hour, sampled every minute
sum_over_time(rate(disk_io_bytes_total[1m])[1h:1m])
This approximates total disk I/O across the last hour (a sum of per-second rates at one-minute steps), with fine-grained sampling to catch spikes.
# Accumulated network traffic during peak hours
sum_over_time(rate(network_bytes_total[1m])[3h:1m] @ start())
This query aligns with a fixed 3-hour peak usage window (e.g., 6 PM to 9 PM), so you're not averaging across misaligned periods.
3. SLA and Budget Calculations
When calculating SLAs or tracking error budgets, precision matters—especially when you’re aggregating over days or weeks.
# Total successful requests in the SLA measurement window
sum_over_time(rate(http_requests_total{status=~"2.."}[1m])[24h:1m])
This totals successful HTTP requests (status codes 2xx) over the past 24 hours, a key metric for uptime and reliability reporting.
# Total error budget consumption over the last week
sum_over_time(rate(sli_errors_total[5m])[7d:5m])
This aggregates SLI failures over the last 7 days to show how much of the error budget has been used.
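If you also track total requests at the SLI level, you can turn this into a burn ratio. The sli_requests_total metric below is an assumption, not something from the examples above:
# Fraction of requests that failed over the 7-day SLO window (sli_requests_total is hypothetical)
sum_over_time(rate(sli_errors_total[5m])[7d:5m]) / sum_over_time(rate(sli_requests_total[5m])[7d:5m])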
For more hands-on examples with sum_over_time(), check out this Prometheus query examples blog.
sum_over_time() vs Other Aggregation Functions
Choosing the right time-based aggregation function depends on the kind of analysis you're doing—whether it's tracking totals, smoothing trends, or analyzing data frequency.
Here's how sum_over_time() compares to related functions:
sum_over_time() vs avg_over_time()
# Total accumulated CPU usage over 10 minutes
sum_over_time(cpu_usage_percent[10m])
# Average CPU usage across the same window
avg_over_time(cpu_usage_percent[10m])
- Use **sum_over_time()** when you need cumulative totals, e.g., total usage, total traffic, or total events in a window.
- Use **avg_over_time()** when you're looking to smooth out short-term spikes and get a sense of the typical value over time.
sum_over_time() vs increase()
# Sum of all raw samples over the last 5 minutes
sum_over_time(http_requests_total[5m])
# Net increase in counter value over the last 5 minutes
increase(http_requests_total[5m])
- **increase()** is built for counter metrics; it accounts for resets and provides the true delta.
- **sum_over_time()** simply adds up raw samples. With counters, this can lead to misleading results, especially if there's a reset in the window.
Use increase() for most counter-based use cases.
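In practice, the counter-friendly pattern usually takes the reset-aware delta first and then aggregates across series if you need a single number:
# Total requests served in the last 5 minutes, summed across all series
sum(increase(http_requests_total[5m]))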
sum_over_time() vs count_over_time()
# Sum of response times over the last 5 minutes
sum_over_time(response_time_seconds[5m])
# Number of data points collected in that window
count_over_time(response_time_seconds[5m])
- **count_over_time()** is useful for understanding data availability or scrape frequency.
- You can combine both to calculate averages manually:
sum_over_time(metric[5m]) / count_over_time(metric[5m])
This is handy when avg_over_time() isn't flexible enough for derived metrics.
Advanced Patterns with sum_over_time()
Once you're comfortable using sum_over_time() for basic totals, there are a few advanced patterns that unlock more flexibility and insight. Here’s how to get more out of it:
1. Combining with Other Aggregations
You can wrap sum_over_time() inside other functions to analyze totals across dimensions or services:
# Average windowed CPU total for each instance
avg(sum_over_time(cpu_usage[10m])) by (instance)
# Highest windowed error total for each service over the last hour
max(sum_over_time(rate(errors_total[1m])[1h:1m])) by (service)
This pattern is useful when you want to group or compare time-windowed totals across labels like instance, region, or service.
2. Using Subqueries for Complex Windows
Subqueries allow you to define rolling windows and perform calculations at regular steps, enabling richer temporal analysis.
# Rolling 6-hour sum, evaluated every hour
sum_over_time(
  rate(requests_total[5m])[6h:1h]
)
# Peak 95th percentile latency in 15-minute windows over the last day
max_over_time(
  sum_over_time(
    histogram_quantile(0.95, rate(latency_bucket[5m]))[15m:5m]
  )[24h:15m]
)
These patterns help surface long-term trends, spot outliers, and understand peak behavior during specific intervals.
3. Building Recording Rules
If you rely on sum_over_time() in multiple places (dashboards, alerts, or ad-hoc queries), recording rules let you precompute those results for faster access.
groups:
  - name: business_metrics
    interval: 1m
    rules:
      - record: requests:sum_5m
        expr: sum_over_time(rate(http_requests_total[1m])[5m:1m])
      - record: errors:sum_hourly
        expr: sum_over_time(rate(http_errors_total[1m])[1h:1m])
Now you can reference the recorded metrics directly:
# Use in dashboards
requests:sum_5m
# Alert if request volume spikes
requests:sum_5m > 1000
This reduces load on your Prometheus server and improves query responsiveness, especially with longer time ranges.
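As a rough sketch, the same recorded series can back an alerting rule directly; the group name, alert name, and threshold below are placeholders:
groups:
  - name: business_alerts
    rules:
      - alert: HighRequestVolume
        expr: requests:sum_5m > 1000
        for: 10m
        labels:
          severity: warning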
4. Time Alignment and Precision
The step parameter in range subqueries (e.g., [1h:30s]) controls the sampling resolution. Smaller steps give you higher granularity, but at a higher cost.
# High precision - 30-second resolution
sum_over_time(rate(metric[1m])[1h:30s])
# Lower precision - 5-minute resolution
sum_over_time(rate(metric[1m])[1h:5m])
For most setups, choose a step size that aligns with your scrape interval or a multiple of it. This ensures predictable and consistent query behavior without unnecessary overhead.
Want to use sum_over_time() in scripts or dashboards? This Prometheus API blog walks through how to query it programmatically.
Handle Gaps and Stale Series in sum_over_time()
Prometheus marks a time series as stale when it stops receiving new samples (immediately via staleness markers when a target or series disappears, or after the default 5-minute lookback window). When this happens, functions like sum_over_time() have nothing to add for those ranges, which can lead to missing or incomplete results.
This behavior matters when working with metrics that report intermittently—cron jobs, batch processes, or remote exporters.
Staleness Behavior
sum_over_time(metric_name[5m])
This returns a value only if there’s at least one data point in the 5-minute window. If the time series is stale or hasn’t been reported recently, the range vector is empty, and the result is excluded.
Practical Workarounds
To avoid issues from missing samples, apply one of these strategies:
# Expand the time window to account for infrequent sampling
sum_over_time(metric_name[15m])
# Return 0 if the series doesn't exist in the window
sum_over_time(metric_name[5m]) or vector(0)
# Check if the series was present in the window
present_over_time(metric_name[5m])
- Use a wider window ([15m], [30m]) if you expect delayed samples.
- or vector(0) ensures the query doesn't return nothing when the time series is missing.
- present_over_time() returns 1 if any value was present, allowing you to detect stale vs. inactive metrics.
These techniques are especially useful in dashboards, where a lack of data can be misinterpreted as a drop to zero. Use them to build more accurate panels and alerts when metric intervals aren't consistent.
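For example, a minimal sketch of an alert expression that distinguishes "no data" from "did nothing", assuming a hypothetical batch_records_processed metric, could look like this:
# Fire only when the job reported samples but the windowed total is zero (hypothetical metric)
present_over_time(batch_records_processed[30m]) and sum_over_time(batch_records_processed[30m]) == 0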
Performance Considerations and Optimization
Using sum_over_time() at scale? A few query patterns can quickly chew through memory and slow down your dashboards. Let’s break down what affects performance—and how to optimize it.
Time Range and Step Size
How many data points you're aggregating directly impacts query time.
# Expensive: 720 points (12h window, 1m step)
sum_over_time(metric[12h:1m])
# More efficient: 144 points (12h window, 5m step)
sum_over_time(metric[12h:5m])
Larger ranges + smaller steps = more data in memory. Unless you really need fine-grained resolution, increase your step size to reduce load.
Label Cardinality
The more unique label combinations you query over, the harder Prometheus has to work.
# High cardinality — slow and memory-heavy
sum(sum_over_time(requests_total[5m])) by (user_id)
# Lower cardinality — much faster
sum(sum_over_time(requests_total[5m])) by (service)
Stick to group by labels with fewer unique values (like service, region, or env) when possible. Avoid using user identifiers, IPs, or request-level labels unless absolutely necessary.
Memory Usage Patterns
Prometheus doesn’t stream values—it loads the entire time range into memory before computing sum_over_time(). A few patterns to watch out for:
- Wide time ranges + high-cardinality metrics = memory spikes
- Nested aggregations (e.g., sum(sum_over_time(...))) multiply memory usage
- Step sizes smaller than your scrape interval won’t increase accuracy—but they will increase cost
If queries start failing or dashboards time out, revisit these patterns first.
Optimization Strategies
Tuning sum_over_time() isn’t just about writing shorter queries—it’s about writing smarter ones. These small adjustments can save you serious compute.
1. Use Recording Rules for Expensive Aggregations
If you're calculating the same thing over and over (especially across dashboards), record it once and query the result.
# Precompute 5-minute request sums
- record: http:requests_sum_5m
  expr: sum_over_time(rate(http_requests_total[1m])[5m:1m])
Recording rules move the heavy lifting to rule-evaluation time—so your dashboards stay snappy.
2. Filter Early, Then Aggregate
Don’t wait until after aggregation to apply filters. Reduce the dataset before doing the math:
# Filter before summing
sum_over_time(rate(http_requests_total{service="api"}[1m])[5m:1m])
This cuts down the number of time series being fetched and processed.
3. Match Step Size to Scrape Interval
Step size controls how much data you include in each aggregation chunk. Too small, and you burn memory; too large, and you lose visibility.
Rule of thumb: use your scrape interval or a small multiple of it (e.g., 1m, 2m, 5m).
# Reasonable step size
sum_over_time(metric[30m:2m])
Unless you’re troubleshooting spikes at sub-minute precision, there's rarely a need to go smaller.
Here’s what to check when sum_over_time() queries aren't returning what you expect.
Troubleshooting sum_over_time() Queries
No Data Points Returned
- Check if the metric exists within the selected time range using a simple query like metric_name (see the quick checks below).
- Make sure your query window falls within Prometheus’ data retention period.
- Ensure there’s at least one data point in the selected time window.
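A couple of quick checks, where metric_name stands in for your actual metric:
# Does the series exist right now, and how many samples landed in the window?
count(metric_name)
count_over_time(metric_name[5m])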
Unexpectedly Low Values
- Use count_over_time(metric[5m]) to check for gaps in scraped data.
- Confirm you’re not mixing counters and gauges in the same query.
- For counter metrics, account for resets that may affect totals (see the check below).
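The built-in resets() function makes the counter-reset check straightforward; a minimal sketch:
# Count how many times the counter reset inside the last hour
resets(http_requests_total[1h])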
Memory or Timeout Errors
- Reduce the time range or increase the step size in your range vector.
- Add more specific label filters to reduce the number of time series.
- Use recording rules to offload complex or repeated calculations.
Inconsistent Results Across Runs
- Check for clock drift between Prometheus and your scrape targets.
- Ensure scrape intervals are consistent across your jobs.
- Use the @ start() modifier to align query time windows precisely.
Final Thoughts
sum_over_time() is essential when you need to reason about trends, rates, or cumulative behavior over fixed time windows. But as your metrics grow—especially with high-cardinality dimensions—these queries can become slow, memory-intensive, or inconsistent.
Last9 is built to solve that problem. Our platform ingests Prometheus metrics, applies real-time streaming aggregation, and stores precomputed views so queries like sum_over_time() return instantly, even across millions of time series. That means you don’t have to trade off between granularity and performance.
You can plug in your existing Prometheus exporters, define recording rules, and start querying aggregated metrics with consistent latency. No scraping changes, no config rewrites.
For example:
groups:
  - name: business_metrics
    rules:
      - record: api:requests:1h
        expr: sum_over_time(rate(api_requests_total[1m])[1h:1m])
Once recorded, this metric can power fast dashboards, SLOs, and alerts—all without overloading your Prometheus setup.
If you're running into limits with native Prometheus or just want observability that scales with your telemetry volume, it's worth exploring how Last9 handles this out of the box.
FAQs
Q: What's the difference between sum_over_time() and sum()?
A: sum() aggregates across different time series at a single point in time. sum_over_time() aggregates across multiple time points for the same time series.
Q: Can I use sum_over_time() with instant vectors?
A: No, sum_over_time() only works with range vectors. You need a time selector like [5m] in your query.
Q: How does sum_over_time() handle missing data points?
A: Missing data points are ignored. Once Prometheus marks a series stale (via staleness markers, or after the default 5-minute lookback), those samples simply don't appear in the range and contribute nothing to the sum.
Q: What happens if I use sum_over_time() on a counter that resets?
A: It sums the raw values—including post-reset values—so totals may be inaccurate. For counters, use rate() or increase() first, then apply sum_over_time() if needed.
Q: How do I choose the right step size for my range vector?
A: Start with your scrape interval or a multiple of it. Smaller steps increase resolution but consume more resources. A 1m to 5m step size works for most cases.
Q: Can sum_over_time() cause high memory usage?
A: Yes. Long time ranges + small step sizes + high-cardinality metrics = memory spikes. Reduce query size or pre-aggregate with recording rules.
Q: When should I use recording rules with sum_over_time()?
A: Use them when running the same heavy sum_over_time() queries in dashboards or alerts. Recording rules precompute the results and speed up response times.
Q: How do I debug slow sum_over_time() queries?
A:
- Use count_over_time() to estimate data points
- Limit the time range
- Increase the step size
- Filter labels to reduce cardinality
- Start with simpler queries and scale up