So you've got Prometheus up and running, but now you're scratching your head looking at those queries.
PromQL (Prometheus Query Language) looks simple on the surface, but it packs some serious power once you know how to wield it.
Whether you're debugging production issues at 2 AM or building dashboards that actually tell you something useful, these PromQL tricks will upgrade your monitoring game.
What Makes PromQL Different?
PromQL isn't just another query language – it's built specifically for time series data, making it uniquely suited for monitoring metrics. Unlike SQL, PromQL thinks in vectors and instants, not rows and tables.
When you run a PromQL query, you're usually getting back one of these:
- An instant vector (a set of time series, each with a single sample at the same timestamp)
- A range vector (a set of time series with a range of samples over time)
- A scalar (a simple numeric value)
- A string (rarely used, but it's there)
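To make those concrete, here's roughly what each of the first three looks like in practice (using the same example metric as the rest of this post):
# Instant vector: the latest sample of every matching series
http_requests_total
# Range vector: the last 5 minutes of samples for every matching series
http_requests_total[5m]
# Scalar: a single number, here the total request count summed across all series
scalar(sum(http_requests_total))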
Now let's get into the good stuff.
Trick 1: Master the Rate Function
The rate() function is your bread and butter for counter metrics. It calculates how fast a counter is increasing per second.
rate(http_requests_total[5m])
This gives you the per-second rate of HTTP requests over the last 5 minutes. But here's the clever part – rate() handles counter resets gracefully. If your application restarts and the counter goes back to zero, rate() still gives you accurate numbers.
Pro tip: Pair rate() with a longer timeframe for stable metrics, and shorter timeframes when you need to spot quick changes.
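For example, a wide window gives you a smooth line for dashboards, while a narrow one reacts faster when something breaks:
# Smooth, slow-moving view for dashboards
rate(http_requests_total[1h])
# Jumpier, faster-reacting view for spotting incidents
rate(http_requests_total[1m])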
Trick 2: Use Increase() for Cleaner Numbers
Want to know how many requests you've received in the last hour without doing mental math? That's what increase() is for:
increase(http_requests_total[1h])
This gives you the total increase in the counter over the specified time – much easier to reason about than per-second rates in some cases.
Trick 3: Turn Gauges into Rates When Needed
While rate() is meant for counters rather than gauges, you can still track how a gauge changes over time:
deriv(process_resident_memory_bytes[1h])
This shows you how your memory usage is trending – useful for catching slow memory leaks.
Trick 4: Label Filtering Shortcuts
Filter metrics like a boss with these shorthand tricks:
# Select only production environments
http_requests_total{env="production"}
# Select everything except production
http_requests_total{env!="production"}
# Regex match: all environments starting with "prod"
http_requests_total{env=~"prod.*"}
# Regex exclude: no testing environments
http_requests_total{env!~"test.*"}
Trick 5: The Power of By and Without
Group metrics and clean up your results with by and without:
# Group request count by endpoint, dropping other labels
sum by(endpoint) (http_requests_total)
# Sum requests but remove the method label
sum without(method) (http_requests_total)
This keeps your graphs clean and your dashboards meaningful.
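One pattern worth calling out: aggregate a rate() rather than the raw counter, so restarts and differing counter ages don't distort the totals. For example:
# Per-second request rate per endpoint, summed across all instances
sum by(endpoint) (rate(http_requests_total[5m]))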
Trick 6: Offset for Better Comparisons
Want to compare metrics to last week? Use offset:
# Current request rate
rate(http_requests_total[5m])
# Request rate one week ago
rate(http_requests_total[5m] offset 1w)
You can even calculate the difference directly:
rate(http_requests_total[5m]) -
rate(http_requests_total[5m] offset 1w)
Trick 7: Use delta() for Gauge Changes
For gauge metrics, delta() shows you how much the value changed over a period:
# How much did CPU temp change in the last 30m?
delta(cpu_temp_celsius[30m])
This works great for metrics that both increase and decrease.
Trick 8: Alerting on Absent Metrics
What if your metric just... disappears? That's often worse than a bad value. Catch it with:
absent(up{job="api-server"})
This returns 1 if the metric is missing, making it perfect for alerting.
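If a single missed scrape shouldn't wake anyone up, the related absent_over_time() function only fires once the series has been gone for a whole window:
# Returns 1 only if no up samples for this job exist in the last 10 minutes
absent_over_time(up{job="api-server"}[10m])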
Trick 9: Convert Between Time Units
Need to see results in minutes rather than seconds? Just multiply:
# Request rate per minute instead of per second
rate(http_requests_total[5m]) * 60
Or for hours:
rate(http_requests_total[5m]) * 3600
Trick 10: Calculate Percentiles the Right Way
Don't calculate percentiles from already-aggregated data. Use Prometheus's built-in histogram quantiles:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
This gives you accurate 95th percentile latency from histogram metrics.
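When the histogram is scraped from several instances, sum the buckets first and keep the le label, otherwise the quantile math falls apart:
# 95th percentile latency aggregated across every instance of the service
histogram_quantile(0.95, sum by(le) (rate(http_request_duration_seconds_bucket[5m])))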
Trick 11: Binary Operators for Complex Comparisons
Mix and match metrics with binary operators:
# Find when error rates exceed 5% of total requests
rate(http_requests_error_total[5m]) > 0.05 * rate(http_requests_total[5m])
Trick 12: Use Subqueries for Moving Averages
Smooth out noisy metrics with a moving average:
avg_over_time(rate(http_requests_total[5m])[1h:5m])
This gives you the average rate calculated over a sliding 1-hour window, sampled every 5 minutes.
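The same subquery shape works with the other *_over_time functions. For instance, to catch short spikes that an hourly average would hide:
# Highest 5-minute request rate observed at any point in the last hour
max_over_time(rate(http_requests_total[5m])[1h:5m])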
Trick 13: The Unless Operator
The unless operator is your friend for filtering out expected cases:
# Find instances that are down, unless they're in maintenance mode
up == 0 unless maintenance == 1
Trick 14: Time() for Dynamic Thresholds
PromQL's date and time functions, such as time() and hour(), let you build dynamic, time-based checks:
# Alert on high disk usage only during business hours (hour() reports UTC)
disk_used_percent > 80 and on() (hour() >= 9 and hour() < 17)
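To actually vary the threshold between business hours and overnight, combine two branches with or (a sketch using the same hypothetical disk_used_percent metric; adjust the hours for your timezone):
# Stricter threshold during business hours, looser one overnight
(disk_used_percent > 80 and on() (hour() >= 9 and hour() < 17))
or
(disk_used_percent > 90 and on() (hour() < 9 or hour() >= 17))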
Trick 15: Create On-the-Fly Metrics with Vector Matching
Need a custom metric that doesn't exist? Create it by matching two metrics:
# Calculate error percentage on the fly
rate(http_requests_error_total[5m]) /
ignoring(status)
rate(http_requests_total[5m]) * 100
The ignoring(status) part tells Prometheus to leave the status label out of the matching, which helps when the two metrics' labels don't line up exactly.
Trick 16: Sort and Limit for Top-N Queries
Focus on your biggest consumers with sorting:
# Top 5 memory-hungry pods
topk(5, container_memory_usage_bytes{namespace="production"})
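For the sorting half of this trick, sort_desc() orders series by value, and bottomk() is topk()'s counterpart for finding the smallest values:
# All production pods, from most to least memory used
sort_desc(container_memory_usage_bytes{namespace="production"})
# The 3 pods using the least memory
bottomk(3, container_memory_usage_bytes{namespace="production"})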
Trick 17: Use predict_linear() for Trend Forecasting
Want to know if you'll run out of disk space in the next 24 hours?
# Predict disk free in 24 hours based on 6h of data
predict_linear(node_filesystem_free_bytes[6h], 86400) < 0
The comparison acts as a filter: only filesystems projected to drop below zero free space show up in the result, making it perfect for alerting.
Trick 18: Dealing with Counter Resets Manually
Sometimes you need more control than rate() provides. The resets() function counts counter resets explicitly:
# How many times did the counter reset in the last hour?
resets(http_requests_total[1h]) > 1
This helps you identify when counters are being reset more often than expected.
Trick 19: Label_replace for Dynamic Relabeling
Transform your labels on the fly:
# Extract service name from a longer identifier
label_replace(metric_name, "service", "$1", "pod", "(.*)-[a-z0-9]+-[a-z0-9]+")
Trick 20: Use clamp_min() and clamp_max() for Cleaner Graphs
Outliers can make graphs unreadable. Tame them with:
# Cap CPU usage visualization at 100%
clamp_max(cpu_usage_percent, 100)
# Ensure values don't go below zero
clamp_min(temperature_celsius, 0)
Trick 21: Holt Winters for Smarter Predictions
For a smoothed view that takes the underlying trend into account, not just the most recent samples:
holt_winters(rate(http_requests_total[1d])[30d:1d], 0.3, 0.3)
This gives you a trend-aware weighted smoothing of the series (the two 0.3 values are the smoothing factor and the trend factor), which is often a steadier signal to graph or alert on than the raw rate.
How to Put These PromQL Tricks to Work
The real power comes when you combine these techniques. For example:
Scenario | PromQL Query |
---|---|
Alert on Error Spike | rate(errors[5m]) > 3 * rate(errors[5m] offset 1h) |
Track Weekly Patterns | rate(requests[1h]) / rate(requests[1h] offset 7d) |
Forecast Resource Needs | predict_linear(cpu_usage[12h], 24 * 3600) / cpu_limit |
These combinations help you build dashboards that tell stories, not just display numbers.
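As one last combined sketch, reusing the example metrics from the earlier tricks, here's a query that pulls several of these ideas together to surface the five slowest endpoints by 95th-percentile latency:
# Top 5 endpoints by p95 latency, aggregated across instances
topk(5, histogram_quantile(0.95, sum by(le, endpoint) (rate(http_request_duration_seconds_bucket[5m]))))
Drop a query like this into a dashboard panel, or compare it against an offset version of itself to alert on week-over-week regressions.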
And as a quick reference for which function fits which metric type:
Metric Type | Best PromQL Function |
---|---|
Counters | rate(), increase() |
Gauges | avg_over_time(), delta() |
Histograms | histogram_quantile() |
Wrapping Up
PromQL might seem strange at first if you're coming from SQL or other query languages, but its unique approach makes it incredibly powerful for time-series monitoring.