Look, we've all been there - staring at a Prometheus dashboard, trying to figure out why our system's acting up. PromQL can be a pain, but it's also incredibly powerful when you know how to use it.
I've spent countless hours fumbling through queries, and I want to save you some of that hassle.
Here's a collection of PromQL snippets that have helped me in the trenches.
Quick Queries for Common Scenarios
What's Happening Right Now
When everything's on fire and you need to know what's going on ASAP:
sum(rate(http_requests_total[5m])) by (status_code)
This gives you a quick snapshot of your HTTP requests, broken down by status code. It's saved my bacon more times than I can count.
If you're drowning in data, narrow it down:
topk(5, sum(rate(http_requests_total[5m])) by (status_code))
Now you're just looking at the top 5 status codes. It's like noise-canceling headphones for your metrics.
Tracking System Trends with PromQL
When your boss asks, "How's the system been doing lately?", try this:
avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100
This shows you the average available memory as a percentage over the last hour. It's a quick way to see if you're headed for trouble.
Want to impress them with a day's worth of data? Just tweak it a bit:
avg_over_time(
(avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100)[1d:1h]
)
Now you're looking at hourly averages for the past day. It's like a time-lapse for your memory usage.
Identifying Resource Hogs in Prometheus
When you need to know which service is hogging all the resources:
topk(5, sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores))
This ranks your services by CPU usage. It's great for finding out which service needs optimization (or which team you need to have a friendly chat with).
Want to compare current usage to last week? Here's a nifty trick:
topk(5, (
sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores)
) / (
sum by (service) (rate(container_cpu_usage_seconds_total[5m] offset 1w)) / sum by (service) (machine_cpu_cores offset 1w)
))
This shows you which services have had the biggest change in CPU usage compared to last week. Perfect for catching performance regressions or unexpected improvements.
Digging Deeper with Advanced Queries
The "Agg and Match" Technique
Ever needed to find out which client is causing the most errors? Try this:
sum by (client) (
rate(http_requests_total{status=~"5.."}[5m])
)
/
sum by (client) (
rate(http_requests_total[5m])
)
This gives you the error rate per client. It's like having x-ray vision for your API troubles.
The Subquery Shuffle
Here's a beast of a query that finds CPU usage spikes:
max_over_time(
(
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
count by (instance) (node_cpu_seconds_total{mode="idle"})
)[1h:5m]
)
-
min_over_time(
(
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
count by (instance) (node_cpu_seconds_total{mode="idle"})
)[1h:5m]
)
This finds the difference between the max and min CPU usage over the last hour, calculated every 5 minutes. It's like a CPU usage roller coaster detector.
Aggregation Functions in PromQL
PromQL offers several specialized aggregation functions to help you analyze and summarize your time series data efficiently.
These functions are perfect for working with data in groups and applying different types of mathematical or statistical operations.
1. avg()
The avg() function computes the average value across a set of time series at each point in time. (To average a single series over a time range, use avg_over_time() instead.)
Example:
avg(http_requests_total{status="200"})
This query returns the average value of the http_requests_total counter for status 200, taken across all matching series.
2. count()
The count() function counts the number of time series in a group. It's useful when you want to know how many series match a given filter.
Example:
count(http_requests_total)
This query counts how many http_requests_total series are currently being recorded.
3. max()
The max() function returns the maximum value across a group of time series. It's handy when you want to track the highest value within a group.
Example:
max(cpu_usage_seconds_total)
This query shows the highest CPU usage across all available time series.
4. min()
The min() function returns the minimum value across a group of time series. It's great for finding the lowest measurement.
Example:
min(memory_usage_bytes)
This query provides the minimum memory usage across all instances.
5. sum()
The sum() function calculates the total sum of values across a group of time series. It's often used to get an overall total for metrics like bytes or requests.
Example:
sum(network_bytes_total)
This query sums up the total network traffic across all instances.
6. stddev()
The stddev() function calculates the standard deviation across a group of time series. It's useful for measuring the variability or spread of values.
Example:
stddev(request_duration_seconds)
This query computes the standard deviation of request durations to measure their variability.
7. stdvar()
The stdvar() function returns the variance across a set of time series. Variance is the square of the standard deviation and measures the spread of values.
Example:
stdvar(request_duration_seconds)
This query returns the variance of the request durations.
8. last_over_time()
The last_over_time() function returns the most recent sample of each series within a given time window. (Strictly speaking it's an _over_time function rather than an aggregation operator, but it fills a similar role.) It's useful when you want the most recent value.
Example:
last_over_time(http_requests_total[1h])
This query gives the most recent HTTP request count recorded within the last hour.
9. quantile()
The quantile() function calculates a specified quantile (percentile) across a set of time series at each instant. For percentiles derived from histogram buckets, use histogram_quantile() instead (shown later in this post).
Example:
quantile(0.95, request_duration_seconds)
This query calculates the 95th percentile of request durations.
Label Manipulation Techniques in PromQL
In PromQL, labels play a key role in identifying and grouping time series. Label manipulation allows you to work with time series based on specific labels or even modify them to perform more refined queries.
Here are some techniques for manipulating labels in PromQL:
1. Label Filtering
Label filtering allows you to select time series that match specific label values. You can filter based on one or more labels, enabling targeted queries.
Example:
http_requests_total{job="api", status="500"}
This query retrieves all time series where the job label is api and the status label is 500.
2. Label Matching with Regular Expressions
You can use regular expressions to match label values that fit a particular pattern. This is useful when you want to query multiple label values that follow a common naming scheme.
Example:
http_requests_total{status=~"5.*"}
This query matches all time series where the status label starts with 5 (e.g., 500, 503, etc.).
3. Label Exclusion
To exclude time series with a specific label value, use the != operator.
Example:
http_requests_total{status!="200"}
This query returns every series whose status label is anything other than 200.
4. Adding or Removing Labels
PromQL doesn't allow you to add new labels to a time series directly, but you can simulate this using the label_replace() function, which can also be used to modify existing labels or effectively remove them.
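As a minimal sketch (the dc label name and us-east value here are invented for illustration), you can attach a static label to every series like this:
# Add a synthetic dc="us-east" label to every series;
# the empty source label and empty regex always match, so it applies unconditionally
label_replace(http_requests_total, "dc", "us-east", "", "")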
Label Join Functionality in PromQL
PromQL's label join functionality allows you to combine or match time series from different metrics based on shared labels. This is done using the on() keyword together with the group_left() / group_right() modifiers.
1. Joining on Specific Labels
The on() keyword lets you specify which labels to match when joining time series. It ensures that only series with the same values for those labels are combined.
Example:
http_errors_total{job="api"} / on(instance) http_requests_total{job="api"}
This query joins the http_errors_total and http_requests_total time series where the instance label matches, dividing errors by requests for each instance to give an error ratio.
2. Using group_left() and group_right()
Sometimes the two sides of a join don't match one-to-one. In that case, use group_left() or group_right() to tell PromQL which side may have multiple series per match:
- group_left() allows many series on the left-hand side to match a single series on the right.
- group_right() is the mirror image, allowing many series on the right-hand side.
Either modifier can also list extra labels to copy over from the "one" side.
Example:
up{job="api"} * on(instance) group_left(region) http_requests_total{job="api"}
This query joins the up metric with the http_requests_total metric on the instance label, and copies the region label from the http_requests_total series into the result.
Label Extraction with Examples
Label extraction in PromQL is performed with the label_replace() function, which lets you manipulate label values, extract substrings, or reformat them.
Syntax of label_replace():
label_replace(metric, "label_name", "replacement", "source_label", "regex")
Where:
- metric: the time series to operate on.
- label_name: the label to create or modify.
- replacement: the new value to assign (it can reference capture groups such as $1).
- source_label: the label from which to extract data.
- regex: a regular expression that must match the entire source label value; capture groups mark the parts to extract.
1. Extracting Parts of a Label
You can extract part of an existing label value and store it in a new label. For example, if your path label embeds the HTTP verb at the start, you can pull that verb out into a method label:
Example:
label_replace(http_requests_total, "method", "$1", "path", "(GET|POST).*")
Note the trailing .*: label_replace's regex has to match the whole source value, not just a prefix.
2. Reformatting a Label
If you need to modify the format of a label value, you can use regular expressions with label_replace(). For example, to normalize lowercase HTTP verbs from the path label into a method label:
Example:
label_replace(http_requests_total, "method", "$1", "path", "(get|post|put|delete).*")
This copies the matched verb into the method label; series whose path doesn't match the regex pass through unchanged.
3. Removing a Label
You can also effectively remove a label by using label_replace() with an empty replacement value, since a label with an empty value is treated as absent.
Example:
label_replace(http_requests_total, "method", "", "path", ".*")
This clears the method label on http_requests_total; the .* regex matches every value, so the empty replacement is applied to every series.
Real-World Problems and How to Solve Them
Prometheus SLO tracking and SLI calculations
When management starts talking about 9's, here's how to keep track:
- Availability percentage:
(sum(rate(http_requests_total{status=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100
This calculates the percentage of requests over the last 30 days that returned a 2xx or 3xx status.
- SLO alert if availability drops below 99.9%:
(sum(rate(http_requests_total{status=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) < 0.999
This expression fires when availability drops below 99.9% (note it compares the raw ratio, so there's no * 100 here).
- Detecting outages with absent():
absent(up{job="my-critical-service"})
This returns 1 when no up series exists for the critical service at all, i.e., Prometheus isn't even scraping it.
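For faster feedback than a 30-day window gives, a common companion is an error-budget burn-rate query. Here's a sketch assuming the same http_requests_total counter and a 99.9% SLO (a 0.1% error budget):
# Burn rate over the last hour: values above 1 mean you're spending
# error budget faster than a 99.9% SLO allows
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  /
  sum(rate(http_requests_total[1h]))
) / 0.001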
Dealing with High Cardinality
When your queries start timing out, it's time to get smart about cardinality:
- Reduce cardinality:
avg by (endpoint) (request_duration_seconds)
This averages away high-cardinality labels such as instance or pod, keeping only endpoint.
- Percentiles:
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le, endpoint))
This calculates the 95th percentile of request durations per endpoint while keeping only the le and endpoint labels.
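Before aggregating, it helps to know where the cardinality lives. Here's an ad-hoc sketch for finding your noisiest metric names; run it interactively rather than in a dashboard, since it touches every series:
# Top 10 metric names by series count
topk(10, count by (__name__) ({__name__=~".+"}))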
Capacity Planning
Want to look like a fortune teller? Use these for capacity planning:
Predict CPU usage:
- 7-day prediction:
predict_linear(node_cpu_seconds_total[30d], 86400 * 7)
This fits a linear trend to 30 days of data and projects the value 7 days (86400 * 7 seconds) ahead. Since node_cpu_seconds_total is a counter, this projects the raw counter value; for utilization trends, run predict_linear over a gauge or a subquery of a rate.
- Disk usage growth rate:
deriv(node_filesystem_free_bytes[7d])
This gives the per-second rate of change of free disk space over the last week; negative values mean the disk is filling up.
Cluster-wide predictions:
sum(predict_linear(node_filesystem_free_bytes[30d], 86400 * 30)) by (fstype)
This projects free disk space 30 days out, aggregated by filesystem type (node_exporter's label is fstype).
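A related sketch: estimate how many days until each filesystem fills up, assuming the past week's trend continues (same node_exporter gauge as above):
# Days of free space left at the current consumption rate;
# deriv() is negative while the disk is filling, so flip the sign
node_filesystem_free_bytes
/ (deriv(node_filesystem_free_bytes[7d]) * -1)
/ 86400
Filesystems that are gaining free space come out negative, so filter those out before alerting on this.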
Combining Metrics and Logs
When you need to correlate metrics with log events:
# Count of error logs
sum(rate(log_messages_total{level="error"}[5m])) by (service)
# Correlation: HTTP 500 errors vs error log spikes
sum(rate(http_requests_total{status="500"}[5m])) by (service)
/
sum(rate(log_messages_total{level="error"}[5m])) by (service)
Note that PromQL's group() aggregation takes a single expression, so it can't merge two queries. To see both series in one result, give each a distinguishing label and union them with or:
# Tag each expression with a synthetic source label, then union them
label_replace(sum(rate(http_requests_total[5m])) by (service), "source", "http", "", "")
or
label_replace(sum(rate(log_messages_total{level="error"}[5m])) by (service), "source", "logs", "", "")
Multi-Cluster Queries
For those juggling multiple clusters:
# Health check across all clusters: fraction of nodes up per cluster
avg(up{job="kubernetes-nodes"}) by (cluster)
# Compare resource usage across clusters
topk(3,
sum(
rate(container_cpu_usage_seconds_total[5m])
) by (cluster)
/
sum(machine_cpu_cores) by (cluster)
)
Pro move: Use recording rules for better performance:
# Recording rule (Prometheus 2.x YAML rule file)
groups:
  - name: cluster_cpu
    rules:
      - record: cluster:container_cpu_usage:ratio
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (cluster)
          / sum(machine_cpu_cores) by (cluster)
# Query the pre-computed metric
topk(3, cluster:container_cpu_usage:ratio)
Wrapping Up
Look, PromQL isn't always fun, but it's a powerful tool when you know how to use it. These queries have helped me out of some tight spots, and I hope they do the same for you. Remember:
- Use instant vectors for "right now" data
- Range vectors are your friend for trends
- Comparisons help you spot what's changed
- SLOs and SLIs keep you honest (and employed)
- High cardinality is the final boss - aggregation is your weapon
- Capacity planning is just educated guessing, but it impresses management
- Metrics and logs are better together
- In a multi-cluster world, federation and recording rules are lifesavers
Keep these in your back pocket, and the next time someone asks, "What's going on with our system?", you'll have the answers at your fingertips.
Good luck, and happy querying!