PromQL Cheat Sheet: Must-Know PromQL Queries

A practical PromQL cheat sheet covering quick diagnostic queries, aggregation functions, label manipulation, SLO tracking, capacity planning, and multi-cluster patterns for Prometheus users.

Prometheus is the standard tool for monitoring cloud-native systems, and PromQL is its query language. Knowing the right queries saves time when diagnosing issues, tracking trends, or building dashboards. This cheat sheet collects practical PromQL snippets organized by use case — from quick incident queries to advanced aggregation, label manipulation, and capacity planning.

Quick Queries for Common Scenarios

What’s Happening Right Now

When everything’s on fire and you need to know what’s going on ASAP:

sum(rate(http_requests_total[5m])) by (status_code)

This gives you a quick snapshot of HTTP requests broken down by status code.

If you’re drowning in data, narrow it down:

topk(5, sum(rate(http_requests_total[5m])) by (status_code))

This limits the result to the top 5 status codes, reducing noise in busy environments.

To view average available memory as a percentage over the last hour:

avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100

The result is the ratio of the two one-hour averages for each instance, scaled to a percentage.

To average those hourly percentages over the past day, wrap the expression in a subquery that evaluates it once per hour:

avg_over_time(
(avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100)[1d:1h]
)

Identifying Resource Hogs in Prometheus

To rank services by CPU usage:

topk(5, sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores))

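This ranks services by CPU usage, normalized by available cores. It assumes both metrics carry a service label (for example, one added through relabeling), since cAdvisor does not attach one by default.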

To compare current CPU usage to last week:

topk(5, (
sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores)
) / (
sum by (service) (rate(container_cpu_usage_seconds_total[5m] offset 1w)) / sum by (service) (machine_cpu_cores offset 1w)
))

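This returns the ratio of current CPU usage to usage at the same time last week. A value above 1 means the service uses more CPU than it did a week ago, and topk returns the five services with the largest increase.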

Digging Deeper with Advanced Queries

The “Agg and Match” Technique

To find the error rate per client:

sum by (client) (
rate(http_requests_total{status=~"5.."}[5m])
)
/
sum by (client) (
rate(http_requests_total[5m])
)

This gives you the fraction of each client's requests that ended in a 5xx response. It's like having x-ray vision for your API troubles.

The Subquery Shuffle

Here’s a beast of a query that finds CPU usage spikes:

max_over_time(
(
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
count by (instance) (node_cpu_seconds_total{mode="idle"})
)[1h:5m]
)
-
min_over_time(
(
sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
/
count by (instance) (node_cpu_seconds_total{mode="idle"})
)[1h:5m]
)

This finds the spread between max and min CPU usage over the last hour, calculated every 5 minutes. Useful for detecting CPU spikes within a time window.

Aggregation Functions in PromQL

PromQL offers several specialized aggregation functions to help you analyze and summarize your time series data efficiently.

These functions are perfect for working with data in groups and applying different types of mathematical or statistical operations.
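Grouping is controlled with the by and without clauses. As a minimal sketch, assuming http_requests_total carries job and instance labels (as it does for most scraped targets):

```promql
# Per-job request rate: keep only the job label
sum by (job) (rate(http_requests_total[5m]))

# Same aggregation, but drop only the instance label and keep all others
sum without (instance) (rate(http_requests_total[5m]))
```

The examples in the list below omit the clause, so each one collapses all matching series into a single result.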

1. avg()

The avg() function computes the mean of the current sample values across a set of time series at each evaluation time. To average a single series over a time window, use avg_over_time() instead.

Example:

avg(http_requests_total{status="200"})

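This query returns the mean of the current counter values across all series with a status of 200. Because http_requests_total is a counter, you would normally wrap it in rate() before averaging.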

2. count()

The count() function counts the number of elements in a time series group. It’s useful when you want to know how many time series match a given filter.

Example:

count(http_requests_total)

This query counts how many time series currently exist for the http_requests_total metric.

3. max()

The max() function returns the maximum value across a group of time series. It’s handy when you want to track the highest value within a group.

Example:

max(cpu_usage_seconds_total)

This query shows the highest CPU usage across all available time series.

4. min()

The min() function returns the minimum value across a group of time series. It’s great for finding the lowest measurement.

Example:

min(memory_usage_bytes)

This query provides the minimum memory usage across all instances.

5. sum()

The sum() function adds up the values across a group of time series. It’s often used to get an overall total for metrics like bytes or requests.

Example:

sum(network_bytes_total)

This query sums up the total network traffic across all instances.

6. stddev()

The stddev() function calculates the population standard deviation of the values across a group of time series. It’s useful for measuring how much the series differ from one another.

Example:

stddev(request_duration_seconds)

This query computes the standard deviation of request_duration_seconds across all matching series.

7. stdvar()

The stdvar() function returns the population variance of the values across a group of time series. Variance is the square of the standard deviation and measures the spread of values.

Example:

stdvar(request_duration_seconds)

This query returns the variance of the request durations.

8. last_over_time()

The last_over_time() function returns the most recent sample of each time series within a given time window. Unlike the functions above, it takes a range vector and returns one value per series rather than aggregating across series.

Example:

last_over_time(http_requests_total[1h])

This query gives the most recent HTTP request count over the last hour.

9. quantile()

The quantile() function calculates a specified quantile (a value between 0 and 1) of the sample values across a group of time series at each evaluation time.

Example:

quantile(0.95, request_duration_seconds)

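This query returns the 95th percentile of the current request_duration_seconds values across all matching series. For percentiles of a histogram metric, use histogram_quantile() instead.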
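For a percentile over time within each series, rather than across series, the related function is quantile_over_time(). A minimal sketch, assuming node_exporter's node_load1 gauge is being scraped:

```promql
# 95th percentile of each instance's 1-minute load average over the past day
quantile_over_time(0.95, node_load1[1d])
```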

Label Manipulation Techniques in PromQL

In PromQL, labels play a key role in identifying and grouping time series. Label manipulation allows you to work with time series based on specific labels or even modify them to perform more refined queries.

Here are some techniques for manipulating labels in PromQL:

1. Label Filtering

Label filtering allows you to select time series that match specific label values. You can filter based on one or more labels, enabling targeted queries.

Example:

http_requests_total{job="api", status="500"}

This query retrieves all time series where the job label is api and the status label is 500.

2. Label Matching with Regular Expressions

You can use regular expressions to match label values that fit a particular pattern. This is useful when you want to query multiple label values that follow a common naming scheme.

Example:

http_requests_total{status=~"5.*"}

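This query matches all time series where the status label starts with 5 (e.g., 500, 503). Regex matchers are fully anchored, so the trailing .* is required; use 5.. to match exactly three characters.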

3. Label Exclusion

To exclude time series with a specific label value, use the != operator.

Example:

http_requests_total{status!="200"}

This query returns every series whose status label is anything other than 200.
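To exclude a pattern rather than a single value, the negative regex matcher !~ works the same way. A small sketch using the same metric and label:

```promql
# Everything except 2xx and 3xx responses
http_requests_total{status!~"2..|3.."}
```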

4. Adding or Removing Labels

PromQL doesn’t allow you to add new labels to the time series directly, but you can simulate this using the label_replace() function. This can also be used to remove labels or modify existing ones.
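As a sketch of the "add a label" case, a common idiom passes an empty source label and an empty regex so that the replacement is applied to every series. The env label and its value are illustrative only:

```promql
# Attach a static env="production" label to every matching series.
# An empty source label reads as an empty string, which the empty regex matches.
label_replace(up{job="api"}, "env", "production", "", "")
```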

Label Join Functionality in PromQL

PromQL’s vector matching allows you to combine or match time series from different metrics based on shared labels. This is done with the on() modifier, optionally together with group_left() or group_right().

1. Joining on Specific Labels

The on() keyword allows you to specify which labels to match when joining a time series. It ensures that only time series with the same label values are combined.

Example:

rate(http_errors_total{job="api"}[5m]) / on(instance) rate(http_requests_total{job="api"}[5m])

This query joins the http_errors_total and http_requests_total time series where the instance label matches. It divides errors by requests for the same instance, giving a per-instance error rate. Matching is one-to-one, so each side must have exactly one series per instance.

2. Using group_left() and group_right()

By default, each series on one side must match exactly one series on the other. When one side has several series per matching label set, you can use group_left() or group_right() to allow many-to-one or one-to-many matching.

  • group_left() allows the left-hand side to have multiple time series for each matching label set.
  • group_right() does the same for the right-hand side.

Any labels listed inside the parentheses are copied from the "one" side onto the result.

Example:

http_requests_total{job="api"} * on(instance) group_left(region) instance_info

This query joins http_requests_total with instance_info on the instance label and adds the region label from instance_info to each result. Here instance_info is a hypothetical metric with a constant value of 1 and one series per instance, so the multiplication leaves the request counts unchanged.
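The ignoring() modifier is the counterpart of on(): it matches on every label except the ones listed. A sketch, assuming http_requests_total has a status label as in the earlier examples:

```promql
# Share of traffic that returned a 500, per remaining label set.
# The right-hand side has no status label, so status is ignored when matching.
rate(http_requests_total{status="500"}[5m])
  / ignoring(status)
sum without (status) (rate(http_requests_total[5m]))
```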

Label Extraction with Examples

Label extraction in PromQL can be performed using the label_replace() function. This function allows you to manipulate label values, extract substrings, or even reformat them.

Syntax of label_replace():

label_replace(metric, "label_name", "replacement", "source_label", "regex")

Where:

  • metric: The time series to operate on.
  • label_name: The label to create or modify.
  • replacement: The new value to assign.
  • source_label: The label from which to extract data.
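  • regex: A regular expression that must match the entire value of the source label. Capture groups can be referenced in the replacement as $1, $2, and so on. If the regex does not match, the series is returned unchanged.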

1. Extracting Parts of a Label

You can extract part of an existing label value and store it in a new label. For example, if the path label holds values such as "GET /api/users" (an assumed format) and you only want the leading HTTP method:

Example:

label_replace(http_requests_total, "method", "$1", "path", "(GET|POST) .*")

2. Reformatting a Label

If you need a label value in a different format, you can use a regular expression with label_replace() to capture the part you want. For example, to strip the port from an instance value such as host:9100:

Example:

label_replace(http_requests_total, "host", "$1", "instance", "(.*):.*")

This query stores the part of instance before the last colon in a new host label. label_replace() can only rearrange text captured by the regex; it cannot change letter case.

3. Removing a Label

You can also remove a label by using label_replace() with an empty replacement value. This essentially drops the label from the metric.

Example:

label_replace(http_requests_total, "method", "", "path", ".*")

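This sets method to an empty string for every series, which removes the label. The regex .* matches any path value, including a missing one. The query fails if removing the label leaves two series with identical label sets.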
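When the goal is simply to get rid of a label, aggregating it away is often the more direct route. A sketch, assuming method is a label on http_requests_total:

```promql
# Drop the method label by summing over it; all other labels are kept
sum without (method) (rate(http_requests_total[5m]))
```

Unlike label_replace(), this merges series that differ only in method instead of raising a duplicate-series error.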

Real-World Problems and How to Solve Them

Prometheus SLO tracking and SLI calculations

Key queries for SLO and SLI tracking:

  • Availability percentage:
(sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100

This calculates the percentage of requests over the last 30 days that returned a 2xx or 3xx status.

  • SLO alert if availability drops below 99.9%:
(sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) < 0.999

This expression returns a value only when 30-day availability is below 99.9%, so it can serve as an alert condition.

  • Detecting outages with absent():
absent(up{job="my-critical-service"}) == 1

absent() returns 1 when no up series exists for the job, for example when the target has dropped out of service discovery.
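Building on the availability ratio, the share of error budget already spent can be sketched as follows. The 99.9% target and the code label are carried over from the queries above; adjust both to match your own SLO.

```promql
# Fraction of the 30-day error budget consumed (1 means fully spent)
(
  1 - (
    sum(rate(http_requests_total{code=~"2..|3.."}[30d]))
    /
    sum(rate(http_requests_total[30d]))
  )
)
/ (1 - 0.999)
```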

Dealing with High Cardinality

When your queries start timing out, it’s time to get smart about cardinality:

  • Reduce cardinality:
avg by (endpoint) (request_duration_seconds)

This collapses all series into one per endpoint at query time. It shrinks the query result, not the number of series Prometheus stores. See also how to find and fix high cardinality in Prometheus.

  • Percentiles:
histogram_quantile(0.95, sum(rate(request_duration_bucket[5m])) by (le, endpoint))

This calculates the 95th percentile of request durations.
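Before reducing cardinality, it helps to know which metrics contribute the most series. One commonly used sketch is below; it touches every series on the server, so it can be slow on large installations.

```promql
# Ten metric names with the most time series
topk(10, count by (__name__) ({__name__=~".+"}))
```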

Capacity Planning

Capacity planning queries:

Predict CPU usage:

  • 7-day prediction:
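predict_linear((1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))[30d:1h], 86400 * 7)

This predicts each instance's CPU utilization 7 days from now. predict_linear() expects a gauge, so the counter is first converted to a utilization ratio and sampled hourly with a subquery.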

  • Disk usage growth rate:
deriv(node_filesystem_free_bytes{}[7d])

This returns the per-second change in free space over the last 7 days. A negative value means disk usage is growing.

Cluster-wide predictions:

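sum(predict_linear(node_filesystem_free_bytes{}[30d], 86400 * 30)) by (fstype)

This predicts free disk space 30 days from now, summed per filesystem type across all nodes. The label is fstype, as exposed by node_exporter.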
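The same function can drive a "disk will fill soon" alert condition. A sketch, assuming node_exporter's node_filesystem_avail_bytes metric; the 6-hour lookback and 4-hour horizon are illustrative values, not recommendations:

```promql
# Filesystems projected to run out of space within 4 hours,
# based on the trend over the last 6 hours
predict_linear(node_filesystem_avail_bytes{fstype!="tmpfs"}[6h], 4 * 3600) < 0
```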

Combining Metrics and Logs

When you need to correlate metrics with log events:

# Count of error logs
sum(rate(log_messages_total{level="error"}[5m])) by (service)

# Correlation: HTTP 500 errors vs error log spikes
sum(rate(http_requests_total{status="500"}[5m])) by (service)
/
sum(rate(log_messages_total{level="error"}[5m])) by (service)

To see request rates only for services that are also logging errors, combine the two expressions with the and operator:

sum(rate(http_requests_total[5m])) by (service)
and on (service)
sum(rate(log_messages_total{level="error"}[5m])) by (service) > 0

Multi-Cluster Queries

For those juggling multiple clusters:

# Fraction of nodes that are up, per cluster
avg(up{job="kubernetes-nodes"}) by (cluster)

# Compare resource usage across clusters
topk(3,
sum(
rate(container_cpu_usage_seconds_total[5m])
) by (cluster)
/
sum(machine_cpu_cores) by (cluster)
)

Pro move: Use recording rules for better performance:

# Recording rule (YAML rules file; the group name is illustrative)
groups:
  - name: cluster_cpu
    rules:
      - record: cluster:container_cpu_usage:percent
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (cluster)
          /
          sum(machine_cpu_cores) by (cluster)

# Query the pre-computed metric
topk(3, cluster:container_cpu_usage:percent)

Frequently Asked Questions

What is PromQL?

PromQL (Prometheus Query Language) is the query language built into Prometheus for selecting, filtering, and aggregating time series data. You use it to write expressions that power dashboards, alerts, and ad-hoc metric analysis.

What is the difference between instant vectors and range vectors in PromQL?

An instant vector returns a single value per time series at a specific point in time (e.g., http_requests_total). A range vector returns a set of values over a time window (e.g., http_requests_total[5m]). Range vectors are required by functions like rate() and avg_over_time().
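The difference is easiest to see side by side. A short sketch, with the job="api" selector used only for illustration:

```promql
# Instant vector: one sample per series at the evaluation time
http_requests_total{job="api"}

# Range vector: all samples from the last 5 minutes, per series
http_requests_total{job="api"}[5m]

# A range vector passed to a function yields an instant vector again
rate(http_requests_total{job="api"}[5m])
```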

How do I calculate a per-second rate in PromQL?

Use the rate() function with a counter metric and a time window: rate(http_requests_total[5m]). This returns the per-second average rate over the past 5 minutes. Use rate() only with counters; for gauges, use delta() or deriv() instead.

What is the difference between rate() and irate() in PromQL?

rate() calculates the per-second average rate over the full time window, smoothing out spikes. irate() uses only the last two data points, making it more responsive to sudden changes. Use rate() for dashboards and alerting; use irate() for detecting instantaneous spikes.
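Both functions take the same arguments, so switching between them is a one-word change:

```promql
# Smoothed per-second rate over the whole 5-minute window
rate(http_requests_total[5m])

# Per-second rate from only the last two samples in the window
irate(http_requests_total[5m])
```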

How do I filter by label values in PromQL?

Use curly brace selectors: http_requests_total{job="api", status="500"} for exact matches, {status=~"5.."} for regex matches, and {status!="200"} to exclude a value. Multiple label filters combine with AND logic.

How do I find the top N series by a metric in PromQL?

Use topk(N, expr). For example, topk(5, sum(rate(http_requests_total[5m])) by (service)) returns the 5 services with the highest request rate. Use bottomk() for the lowest values.

What does histogram_quantile() do in PromQL?

histogram_quantile(phi, metric) calculates a quantile from a Prometheus histogram. For example, histogram_quantile(0.95, sum(rate(request_duration_bucket[5m])) by (le)) returns the 95th percentile request duration. The le label is required: it defines the histogram bucket upper bounds.
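A classic histogram also exposes _sum and _count series, which give the mean alongside the percentile. A sketch, assuming request_duration_sum and request_duration_count accompany the request_duration_bucket series used above:

```promql
# Mean request duration over the last 5 minutes
sum(rate(request_duration_sum[5m])) / sum(rate(request_duration_count[5m]))
```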

How do I detect when a target is down in PromQL?

Use absent(up{job="my-service"}) == 1 to fire when no time series match the selector, for example when the target has disappeared from service discovery. Alternatively, up{job="my-service"} == 0 fires when the target is still configured but its scrape fails. The two conditions are complementary, so alerting on both gives the fullest coverage.
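The two checks can be combined into one expression with the or operator. A sketch using the same placeholder job name:

```promql
# Returns a result if any target of the job fails its scrape,
# or if the job has no up series at all
up{job="my-service"} == 0 or absent(up{job="my-service"})
```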

When should I use recording rules in Prometheus?

Use recording rules for expensive queries that run frequently — particularly aggregations over large label sets or queries used in multiple dashboards. Recording rules precompute results on a schedule and store them as new metrics, reducing query latency at read time.

How do I predict future resource usage in PromQL?

Use predict_linear(metric[window], seconds). For example, predict_linear(node_filesystem_free_bytes[30d], 86400 * 7) predicts disk space 7 days from now based on the last 30-day trend. Use a long lookback window for more stable predictions.

Conclusion

PromQL covers a wide range of use cases: instant snapshots, rate calculations, aggregations, label filtering, SLO tracking, and capacity planning. The queries in this cheat sheet cover the patterns that come up most in day-to-day Prometheus work.

Last9 is a managed observability platform compatible with Prometheus and OpenTelemetry. Run PromQL against your metrics without operating Prometheus storage yourself. Try Last9 free.

About the authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

Anjali Udasi

Helping to make the tech a little less intimidating. I
