Look, we've all been there - staring at a Prometheus dashboard, trying to figure out why our system's acting up. PromQL can be a pain, but it's also incredibly powerful when you know how to use it.
I've spent countless hours fumbling through queries, and I want to save you some of that hassle. Here's a collection of PromQL snippets that have helped me in the trenches.
Table of Contents
Quick Queries for Common Scenarios
Digging Deeper with Advanced Queries
Real-World Problems and How to Solve Them
Wrapping Up
Quick Queries for Common Scenarios
What's Happening Right Now
When everything's on fire and you need to know what's going on ASAP:
sum(rate(http_requests_total[5m])) by (status_code)
This gives you a quick snapshot of your HTTP requests, broken down by status code. It's saved my bacon more times than I can count.
If you're drowning in data, narrow it down:
topk(5, sum(rate(http_requests_total[5m])) by (status_code))
Now you're just looking at the top 5 status codes. It's like noise-canceling headphones for your metrics.
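And if it's specifically the errors you're hunting, filter before you rank. A quick sketch, assuming the status code lives in a label called status_code like the query above:
topk(5, sum(rate(http_requests_total{status_code=~"4..|5.."}[5m])) by (status_code))
Same idea, but now the top 5 only includes the codes you actually lose sleep over.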
Tracking System Trends with PromQL
When your boss asks, "How's the system been doing lately?", try this:
avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100
This shows you the average available memory as a percentage over the last hour. It's a quick way to see if you're headed for trouble.
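The same ratio works as an alert expression if you'd rather let Prometheus do the watching. A sketch, with 10% as a placeholder threshold:
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
Swap the 10 for whatever makes your pager worth answering.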
Want to impress them with a day's worth of data? Just tweak it a bit:
avg_over_time(
(avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100)[1d:1h]
)
Now you're looking at hourly averages for the past day. It's like a time-lapse for your memory usage.
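And if the worst hour matters more to you than the average one, swap the outer avg_over_time for min_over_time:
min_over_time(
(avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100)[1d:1h]
)
That's the hour in the last day when available memory dipped the lowest.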
Identifying Resource Hogs in Prometheus
When you need to know which service is hogging all the resources:
topk(5, sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores))
This ranks your services by CPU usage. It's great for finding out which service needs optimization (or which team you need to have a friendly chat with).
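CPU isn't the only thing that gets hogged. The same trick works for memory; a sketch, assuming your container metrics carry the same service label as the CPU query above:
topk(5, sum by (service) (container_memory_working_set_bytes))
That's the top 5 services by working-set memory right now.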
Want to compare current usage to last week? Here's a nifty trick:
topk(5, (
sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores)
) / (
sum by (service) (rate(container_cpu_usage_seconds_total[5m] offset 1w)) / sum by (service) (machine_cpu_cores offset 1w)
))
This shows you which services have had the biggest change in CPU usage compared to last week. Perfect for catching performance regressions or unexpected improvements.
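And if you'd rather alert than eyeball, the same shape takes a threshold. A sketch that flags services whose CPU usage has more than doubled since last week (the 2 is an arbitrary cutoff, and it skips the core-count normalization for simplicity):
(
sum by (service) (rate(container_cpu_usage_seconds_total[5m]))
/
sum by (service) (rate(container_cpu_usage_seconds_total[5m] offset 1w))
) > 2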
Digging Deeper with Advanced Queries
The "Agg and Match" Technique
Ever needed to find out which client is causing the most errors? Try this:
sum by (client) (
rate(http_requests_total{status=~"5.."}[5m])
)
/
sum by (client) (
rate(http_requests_total[5m])
)
This gives you the error rate per client. It's like having x-ray vision for your API troubles.
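If you only want the actual offenders, bolt a threshold onto the end; say, anything above a 5% error rate (the 0.05 is a placeholder, tune it to taste):
(
sum by (client) (rate(http_requests_total{status=~"5.."}[5m]))
/
sum by (client) (rate(http_requests_total[5m]))
) > 0.05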
The Subquery Shuffle
Here's a beast of a query that finds CPU usage spikes:
max_over_time(
(
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
count by (instance) (node_cpu_seconds_total{mode="idle"})
)[1h:5m]
)
-
min_over_time(
(
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
count by (instance) (node_cpu_seconds_total{mode="idle"})
)[1h:5m]
)
This finds the difference between the max and min CPU usage over the last hour, calculated every 5 minutes. It's like a CPU usage roller coaster detector.
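If max-minus-min feels too twitchy (one bad scrape and it lights up), quantile_over_time over the same subquery gives you a steadier read. A sketch using the 95th percentile:
quantile_over_time(0.95,
(
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
count by (instance) (node_cpu_seconds_total{mode="idle"})
)[1h:5m]
)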
Real-World Problems and How to Solve Them
Prometheus SLO Tracking and SLI Calculations
When management starts talking about 9's, here's how to keep track:
(sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100
This calculates the availability percentage over the last 30 days, treating 2xx and 3xx responses as successes.
Need an alert when availability drops below 99.9%? Drop the percentage conversion and compare the raw ratio:
(sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) < 0.999
Note that this compares against 0.999, not 99.9, because we never multiplied by 100.
Detecting outages with absent():
absent(up{job="my-critical-service"})
absent() returns 1 only when no matching series exists at all, so this fires when Prometheus has lost every target for the job. That's a different failure from up == 0, which means the targets are there but their scrapes are failing.
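Once you're tracking an SLO, the next question from management is how much error budget is left. A sketch for a 99.9% target over the same 30-day window (the 0.999 is the target, swap in your own):
1 - (
(1 - sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d])))
/
(1 - 0.999)
)
A value of 1 means the budget is untouched, 0 means it's spent, and negative means you're writing a postmortem.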
Dealing with High Cardinality
When your queries start timing out, it's time to get smart about cardinality:
avg by (endpoint) (request_duration_seconds)
This averages away the high-cardinality labels (instance, pod, whatever), leaving one series per endpoint instead of thousands.
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le, endpoint))
This gives the 95th percentile request duration per endpoint from histogram buckets. Keep le in the by clause; histogram_quantile needs it to do its job.
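To figure out where the cardinality is coming from in the first place, you can ask Prometheus to count series per metric name. Fair warning: this one is expensive, so run it when things are quiet:
topk(10, count by (__name__) ({__name__=~".+"}))
The top entries are usually your labeling sins coming home to roost.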
Capacity Planning
Want to look like a fortune teller? Use these for capacity planning:
Predict CPU usage:
predict_linear(node_cpu_seconds_total[30d], 86400 * 7)
This fits a linear trend to the last 30 days and extrapolates one week (86400 * 7 seconds) into the future. Strictly speaking predict_linear is meant for gauges, so on a raw counter like this you're extrapolating the counter's value rather than usage, but it still makes a decent trend line.
deriv(node_filesystem_free_bytes{}[7d])
This gives the per-second rate of change of free disk space over the last week. A negative number means the disk is filling up, and how negative tells you how fast.
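The classic move with this pair is predicting when a disk actually fills up. A sketch that fires if the root filesystem is on track to hit zero within four hours (mountpoint is the label node_exporter exposes):
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 4 * 3600) < 0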
Cluster-wide predictions:
sum(predict_linear(node_filesystem_free_bytes{}[30d], 86400 * 30)) by (fstype)
This projects free disk space 30 days out, summed per filesystem type (node_exporter calls the label fstype). Group by your cluster label instead if you want the per-cluster view.
Combining Metrics and Logs
When you need to correlate metrics with log events:
# Count of error logs
sum(rate(log_messages_total{level="error"}[5m])) by (service)
# Correlation: HTTP 500 errors vs error log spikes
sum(rate(http_requests_total{status="500"}[5m])) by (service)
/
sum(rate(log_messages_total{level="error"}[5m])) by (service)
PromQL won't join metrics the way SQL does, but vector matching gets you most of the way. For example, to see the request rate only for services that are currently logging errors:
sum(rate(http_requests_total[5m])) by (service)
and
sum(rate(log_messages_total{level="error"}[5m])) by (service) > 0
Multi-Cluster Queries
For those juggling multiple clusters:
# Query across all clusters
# Node availability per cluster (assumes a global view with a cluster label, e.g. via federation or Thanos)
avg(up{job="kubernetes-nodes"}) by (cluster)
# Compare resource usage across clusters
topk(3,
sum(
rate(container_cpu_usage_seconds_total[5m])
) by (cluster)
/
sum(machine_cpu_cores) by (cluster)
)
Pro move: Use recording rules for better performance:
# Recording rule (lives in a rule group in your Prometheus rules file)
groups:
  - name: cluster_cpu    # group name is up to you
    rules:
      - record: cluster:container_cpu_usage:percent
        expr: |
          100 * sum(rate(container_cpu_usage_seconds_total[5m])) by (cluster)
            / sum(machine_cpu_cores) by (cluster)
# Query the pre-computed metric
topk(3, cluster:container_cpu_usage:percent)
Wrapping Up
Look, PromQL isn't always fun, but it's a powerful tool when you know how to use it. These queries have helped me out of some tight spots, and I hope they do the same for you. Remember:
- Use instant vectors for "right now" data
- Range vectors are your friend for trends
- Comparisons help you spot what's changed
- SLOs and SLIs keep you honest (and employed)
- High cardinality is the final boss - aggregation is your weapon
- Capacity planning is just educated guessing, but it impresses management
- Metrics and logs are better together
- In a multi-cluster world, federation and recording rules are lifesavers
Keep these in your back pocket, and the next time someone asks, "What's going on with our system?", you'll have the answers at your fingertips.
Good luck, and happy querying!