Sep 12th, ‘24 / 2 min read

PromQL Cheat Sheet: Must-Know PromQL Queries

This cheat sheet provides practical guidance for diagnosing issues and understanding trends.

Look, we've all been there - staring at a Prometheus dashboard, trying to figure out why our system's acting up. PromQL can be a pain, but it's also incredibly powerful when you know how to use it.

I've spent countless hours fumbling through queries, and I want to save you some of that hassle.

Here's a collection of PromQL snippets that have helped me in the trenches.

Quick Queries for Common Scenarios

What's Happening Right Now

When everything's on fire and you need to know what's going on ASAP:

sum(rate(http_requests_total[5m])) by (status_code)

This gives you a quick snapshot of your HTTP requests, broken down by status code. It's saved my bacon more times than I can count.

If you're drowning in data, narrow it down:

topk(5, sum(rate(http_requests_total[5m])) by (status_code))

Now you're just looking at the top 5 status codes. It's like noise-canceling headphones for your metrics.

📂
Also explore our guide on PromQL: A Developer's Guide to Mastering Prometheus Queries.

When your boss asks, "How's the system been doing lately?", try this:

avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100

This shows you the average available memory as a percentage over the last hour. It's a quick way to see if you're headed for trouble.

Want to impress them with a day's worth of data? Just tweak it a bit:

avg_over_time(
  (avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100)[1d:1h]
)

Now you're looking at hourly averages for the past day. It's like a time-lapse for your memory usage.

Identifying Resource Hogs in Prometheus

When you need to know which service is hogging all the resources:

topk(5, sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores))

This ranks your services by CPU usage. It's great for finding out which service needs optimization (or which team you need to have a friendly chat with).

Want to compare current usage to last week? Here's a nifty trick:

topk(5, (
  sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores)
) / (
  sum by (service) (rate(container_cpu_usage_seconds_total[5m] offset 1w)) / sum by (service) (machine_cpu_cores offset 1w)
))

This shows you which services have had the biggest change in CPU usage compared to last week. Perfect for catching performance regressions or unexpected improvements.

📑
Also read: Troubleshooting Common Prometheus Pitfalls—Cardinality, Resource Utilization, and Storage Issues

Digging Deeper with Advanced Queries

The "Agg and Match" Technique

Ever needed to find out which client is causing the most errors? Try this:

sum by (client) (
  rate(http_requests_total{status=~"5.."}[5m])
)
/
sum by (client) (
  rate(http_requests_total[5m])
)

This gives you the error rate per client. It's like having x-ray vision for your API troubles.

The Subquery Shuffle

Here's a beast of a query that finds CPU usage spikes:

max_over_time(
  (
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
    /
    count by (instance) (node_cpu_seconds_total{mode="idle"})
  )[1h:5m]
)
-
min_over_time(
  (
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
    /
    count by (instance) (node_cpu_seconds_total{mode="idle"})
  )[1h:5m]
)

This finds the difference between the max and min CPU usage over the last hour, calculated every 5 minutes. It's like a CPU usage roller coaster detector.

Aggregation Functions in PromQL

PromQL offers several specialized aggregation functions to help you analyze and summarize your time series data efficiently.

These functions are perfect for working with data in groups and applying different types of mathematical or statistical operations.

1. avg()

The avg() function computes the average of the current values across a group of time series. (To average a single series over time, use avg_over_time() instead.)

Example:

avg(http_requests_total{status="200"})

This query returns the average value of the http_requests_total counter across all series with a status of 200.

2. count()

The count() function counts the number of elements in a time series group. It's useful when you want to know how many time series match a given filter.

Example:

count(http_requests_total)

This query counts how many HTTP request metrics are being recorded.

3. max()

The max() function returns the maximum value across a group of time series. It’s handy when you want to track the highest value within a group.

Example:

max(cpu_usage_seconds_total)

This query shows the highest CPU usage across all available time series.

4. min()

The min() function returns the minimum value across a group of time series. It’s great for finding the lowest measurement.

Example:

min(memory_usage_bytes)

This query provides the minimum memory usage across all instances.

5. sum()

The sum() function calculates the total sum of values across a group of time series. It's often used to get an overall sum of metrics like bytes or requests.

Example:

sum(network_bytes_total)

This query sums up the total network traffic across all instances.

6. stddev()

The stddev() function calculates the standard deviation of a time series. It's useful for measuring the variability or spread of data.

Example:

stddev(request_duration_seconds)

This query computes the standard deviation of request durations to measure their variability.

📝
Read more about PromQL Macros here!

7. stdvar()

The stdvar() function returns the variance of a set of time series. Variance is the square of the standard deviation and measures the spread of values.

Example:

stdvar(request_duration_seconds)

This query returns the variance of the request durations.

8. last_over_time()

The last_over_time() function returns the last recorded value of a time series within a given time window. It’s useful when you want the most recent value.

Example:

last_over_time(http_requests_total[1h])

This query returns the most recent http_requests_total sample recorded within the last hour.

9. quantile()

The quantile() function calculates a specified quantile (percentile) of a time series, helping you find values at specific percentiles.

Example:

quantile(0.95, request_duration_seconds)

This query calculates the 95th percentile of request durations.

Label Manipulation Techniques in PromQL

In PromQL, labels play a key role in identifying and grouping time series. Label manipulation allows you to work with time series based on specific labels or even modify them to perform more refined queries.

Here are some techniques for manipulating labels in PromQL:

1. Label Filtering

Label filtering allows you to select time series that match specific label values. You can filter based on one or more labels, enabling targeted queries.

Example:

http_requests_total{job="api", status="500"}

This query retrieves all time series where the job label is api and the status label is 500.

2. Label Matching with Regular Expressions

You can use regular expressions to match label values that fit a particular pattern. This is useful when you want to query multiple label values that follow a common naming scheme.

Example:

http_requests_total{status=~"5.*"}

This query matches all time series where the status label starts with 5 (e.g., 500, 503, etc.).

📖
You can check out our Scaling Prometheus: Tips, Tricks, and Proven Strategies for more insights on optimizing your Prometheus setup.

3. Label Exclusion

To exclude time series with a specific label value, use the != operator.

Example:

http_requests_total{status!="200"}

4. Adding or Removing Labels

PromQL doesn’t allow you to add new labels to the time series directly, but you can simulate this using the label_replace() function. This can also be used to remove labels or modify existing ones.
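For instance, here's a minimal sketch of synthesizing a label with label_replace() (the env label and its production value are hypothetical, purely for illustration):

label_replace(http_requests_total, "env", "production", "job", ".*")

This adds env="production" to every series whose job label matches the regex (here, all of them), since the replacement string contains no capture-group references.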

Label Join Functionality in PromQL

PromQL’s label join functionality allows you to combine or match time series from different metrics based on shared labels. This can be done using the on() and group_left() / group_right() operators.

1. Joining on Specific Labels

The on() keyword allows you to specify which labels to match when joining a time series. It ensures that only time series with the same label values are combined.

Example:

http_errors_total{job="api"} / on(instance) http_requests_total{job="api"}

This query joins the http_errors_total and http_requests_total time series where the instance label matches, dividing errors by total requests for each instance to give a per-instance error ratio.

2. Using group_left() and group_right()

Sometimes the two sides of a join don't match one-to-one; one side has several series per match group. In that case, use group_left() or group_right() to tell PromQL which side is allowed to have multiple series.

  • group_left() allows the left-hand side series to have multiple time series for each matching label.
  • group_right() does the same for the right-hand side series.

Example:

http_requests_total{job="api"} * on(instance) group_left(region) up{job="api"}

This query joins http_requests_total (which can have many series per instance) with the single up series for each instance; group_left(region) copies the region label from the up side into the result (assuming up carries a region label, e.g. via relabeling).
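Separate from vector matching, PromQL also offers a label_join() function that concatenates several source labels into one. A quick sketch (the host_job label name is arbitrary):

label_join(up, "host_job", "-", "instance", "job")

This creates a host_job label whose value is the instance and job values joined by a dash.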

Label Extraction with Examples

Label extraction in PromQL can be performed using the label_replace() function. This function allows you to manipulate label values, extract substrings, or even reformat them.

Syntax of label_replace():

label_replace(metric, "label_name", "replacement", "source_label", "regex")

Where:

  • metric: The time series to operate on.
  • label_name: The label to create or modify.
  • replacement: The new value to assign.
  • source_label: The label from which to extract data.
  • regex: A regular expression to match the part of the source label to extract.

1. Extracting Parts of a Label

You can extract part of an existing label value and store it in a new label. For example, to pull the host out of an instance label of the form host:port:

Example:

label_replace(up, "host", "$1", "instance", "([^:]+):.*")

This creates a host label containing everything before the colon in instance.

2. Reformatting a Label

If you need to change the format of a label value, you can rearrange captured groups with label_replace(). For example, to rewrite an instance value like host:9100 as host-9100:

Example:

label_replace(up, "instance_id", "$1-$2", "instance", "(.+):(.+)")

This creates an instance_id label that joins the host and port with a dash instead of a colon.

📝
For a deeper dive into managing alerts with Prometheus, visit our Prometheus Alertmanager blog.

3. Removing a Label

You can also remove a label by using label_replace() with an empty replacement value. This essentially drops the label from the metric.

Example:

label_replace(http_requests_total, "method", "", "path", ".*")

This sets the method label to the empty string, which Prometheus treats as removing it from the result.

Real-World Problems and How to Solve Them

Prometheus SLO tracking and SLI calculations

When management starts talking about 9's, here's how to keep track:

  • Availability percentage:
(sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100

This calculates the availability percentage.

  • SLO alert if availability drops below 99.9%:
(sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) < 0.999

This sets an alert if availability is below 99.9%.

  • Detecting outages with absent():
absent(up{job="my-critical-service"}) == 1

This checks if a critical service is absent.

Dealing with High Cardinality

When your queries start timing out, it's time to get smart about cardinality:

  • Reduce cardinality:
avg(request_duration_seconds) by (endpoint)

This aggregates away high-cardinality labels (instance, pod, and so on), keeping only endpoint.

  • Percentiles:
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le, endpoint))

This calculates the 95th percentile of request durations.

Capacity Planning

Want to look like a fortune teller? Use these for capacity planning:

Predict CPU usage:

  • 7-day prediction:
predict_linear(node_cpu_seconds_total[30d], 86400 * 7)

This extrapolates the cumulative CPU counter 7 days ahead, using a linear fit over the last 30 days.

  • Disk usage growth rate:
deriv(node_filesystem_free_bytes[7d])

This calculates the per-second rate of change of free disk space; negative values mean the disk is filling up.

Cluster-wide predictions:

sum(predict_linear(node_filesystem_free_bytes[30d], 86400 * 30)) by (fstype)

This predicts free disk space 30 days out across the fleet, broken down by filesystem type.


Combining Metrics and Logs

When you need to correlate metrics with log events:

# Count of error logs
sum(rate(log_messages_total{level="error"}[5m])) by (service)

# Correlation: HTTP 500 errors vs error log spikes
sum(rate(http_requests_total{status="500"}[5m])) by (service)
/
sum(rate(log_messages_total{level="error"}[5m])) by (service)

Note that group() in PromQL is a regular aggregation operator (it returns 1 per group), not a two-argument join; the division above is the usual way to correlate the two signals. To simply list which services are emitting error logs:

group by (service) (log_messages_total{level="error"})

Multi-Cluster Queries

For those juggling multiple clusters:

# Health across all clusters: fraction of nodes up, per cluster
avg(up{job="kubernetes-nodes"}) by (cluster)

# Compare resource usage across clusters
topk(3,
  sum(
    rate(container_cpu_usage_seconds_total[5m])
  ) by (cluster)
  /
  sum(machine_cpu_cores) by (cluster)
)

Pro move: Use recording rules for better performance:

# Recording rule (rules-file YAML, Prometheus 2.x syntax)
groups:
  - name: cluster_cpu
    rules:
      - record: cluster:container_cpu_usage:percent
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (cluster)
          / sum(machine_cpu_cores) by (cluster)

# Query the pre-computed metric
topk(3, cluster:container_cpu_usage:percent)

Wrapping Up

Look, PromQL isn't always fun, but it's a powerful tool when you know how to use it. These queries have helped me out of some tight spots, and I hope they do the same for you. Remember:

  • Use instant vectors for "right now" data
  • Range vectors are your friend for trends
  • Comparisons help you spot what's changed
  • SLOs and SLIs keep you honest (and employed)
  • High cardinality is the final boss - aggregation is your weapon
  • Capacity planning is just educated guessing, but it impresses management
  • Metrics and logs are better together
  • In a multi-cluster world, federation and recording rules are lifesavers

Keep these in your back pocket, and the next time someone asks, "What's going on with our system?", you'll have the answers at your fingertips.

Good luck, and happy querying!

Authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Anjali Udasi

Helping to make tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.

Topics