Monitoring your infrastructure doesn't have to be a headache. With Prometheus, you've got a powerful ally in your corner—but like any tool, knowing how to use it makes all the difference.
Let's cut through the noise and get straight to the good stuff: practical Prometheus query examples that extract exactly the insights you need when you need them most.
What Makes Prometheus Queries So Valuable?
Prometheus query language (PromQL) is the secret weapon in your monitoring arsenal. It's what lets you slice and dice metrics to spot issues before they become outages.
Whether you're tracking CPU usage across a Kubernetes cluster or monitoring API latency, mastering PromQL transforms raw data into actionable insights.
For DevOps engineers and SREs, this means fewer 3 AM wake-up calls and more confident deployments.
Getting Started: Basic Prometheus Query Examples
Let's kick things off with some foundational queries that every monitoring stack should include:
Simple Metric Selection
http_requests_total
This query returns all time series with the metric name http_requests_total. It's as straightforward as it gets—simply naming the metric you want to see. You'll get back every label combination Prometheus has stored for this metric.
Filtering with Labels
http_requests_total{status="500"}
Now we're getting specific. This query filters the http_requests_total metric to only show requests that resulted in a 500 error code. The curly braces let you filter by any label attached to your metrics, making it easy to zero in on exactly what you need.
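Matchers also stack, and you're not limited to exact equality. A quick sketch (the job and method labels are assumptions about how your application labels its metrics):

http_requests_total{job="api-server", status=~"5..", method!="GET"}

This combines an exact match, a regex match (=~) for any 5xx status, and a negative match (!=) in a single selector.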
Rate Function for Counter Metrics
rate(http_requests_total{job="api-server"}[5m])
Here's where things get interesting. This query calculates the per-second rate of HTTP requests to your API server over the last 5 minutes. The rate() function is your go-to for counter metrics that only increase over time. It helps you see velocity rather than just cumulative totals.
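If you'd rather see a total than a per-second velocity, increase() answers the companion question (same assumed job label as above):

increase(http_requests_total{job="api-server"}[1h])

This estimates how many requests the API server handled over the last hour, using the same reset-aware math as rate(), so a restart partway through doesn't throw off the total.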
Intermediate Techniques to Level Up Your Monitoring
These intermediate examples will help you build more sophisticated dashboards:
Aggregating Metrics by Label
sum(rate(http_requests_total[5m])) by (status)
This query takes the rate of all HTTP requests over 5 minutes and adds them together, but keeps them separated by status code. The result? A clean view of your error rates compared to successful requests—perfect for spotting when things start to go sideways.
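The without modifier is the mirror image of by: you list the labels to drop instead of the ones to keep. A minimal sketch, assuming an instance label you want to collapse:

sum(rate(http_requests_total[5m])) without (instance)

Every other label survives the aggregation, which is handy when you don't want to enumerate them all.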
Finding Top Consumers
topk(5, sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance))
Who's hogging all the CPU? This query shows you the 5 instances with the highest CPU usage. The topk() function ranks your results, making it easy to identify resource hogs at a glance.
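Its counterpart bottomk() ranks from the other end. For example, to surface the nodes with the least free space on their root filesystem (using the standard node_exporter metric):

bottomk(3, node_filesystem_avail_bytes{mountpoint="/"})

The three instances closest to filling up come back first.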
Calculating Service Uptime
(sum(up{job="api-server"}) / count(up{job="api-server"})) * 100
This query gives you a clean percentage of your API servers that are currently up. It divides the number of "up" instances by the total count and multiplies by 100. Simple, but incredibly useful for SLO tracking.
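To turn that into an alert condition, add a threshold; a minimal sketch that fires when fewer than 90% of instances are up:

(sum(up{job="api-server"}) / count(up{job="api-server"})) * 100 < 90

Because comparisons filter rather than return booleans, the expression only produces a value while the condition holds, which is exactly what alerting rules watch for.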
Advanced PromQL: Become a Query Expert
Now let's push the envelope with some advanced techniques that separate the pros from the amateurs:
Predicting Future Values
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 4 * 3600)
This query predicts how much disk space you'll have in 4 hours based on the usage pattern over the last 6 hours. The predict_linear() function is your crystal ball for capacity planning—catch problems before they happen.
You can extend this to create early warning systems for disk capacity issues:
predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24 * 3600) < 10 * 1024 * 1024 * 1024
This alerts when the root filesystem is predicted to have less than 10GB free within the next 24 hours (drop the mountpoint matcher to watch every filesystem), giving you ample time to add capacity or clean up before users notice any problems.
Complex Alerting Conditions
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
This query calculates your error ratio by dividing the rate of 5xx errors by the total request rate. If more than 1% of requests are failing, the expression returns a non-empty result, which is exactly what alerting rules fire on when things go south.
For multi-window analysis to prevent alert flapping, use:
(
  sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 0.01
)
This only triggers when both the 5-minute and 1-hour error rates exceed 1%, reducing false alarms from brief spikes while still catching persistent issues.
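Wired into an alerting rule, the same expression might look like the sketch below (group, alert, and label names are placeholders):

groups:
  - name: api_error_alerts
    rules:
      - alert: HighErrorRatio
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01)
          and
          (sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 0.01)
        for: 10m
        labels:
          severity: page

The for: clause adds one more layer of damping: the condition has to hold for ten consecutive minutes before the alert actually fires.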
Histogram Quantiles for Latency Monitoring
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Want to know what your 95th percentile latency looks like? This query takes your request duration histogram and calculates exactly that. The histogram_quantile() function is essential for performance monitoring and SLO adherence.
For more detailed analysis, run the same query at several quantiles side by side (for example, as separate dashboard panels):
# p50
histogram_quantile(0.5, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# p90
histogram_quantile(0.9, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# p95
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# p99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
This creates a complete latency profile that helps you distinguish between general slowness and outlier requests affecting a small percentage of users.
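To see which service is dragging the percentile up, keep the service label in the aggregation (assuming your instrumentation attaches one):

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

The le label must always survive the aggregation, since histogram_quantile() needs the bucket boundaries to interpolate.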
Rate of Change Detection
deriv(process_resident_memory_bytes{job="api-server"}[30m]) > 1024 * 1024
This detects memory leaks by alerting when the memory usage is growing faster than 1MB per second over a 30-minute window—catching gradual resource exhaustion before it becomes critical.
Dynamic Baseline Comparison
sum(rate(http_requests_total[5m]))
<
avg_over_time(sum(rate(http_requests_total[5m]))[7d:1h] offset 1d) * 0.7
This detects traffic drops by comparing the current request rate against the average rate over the week ending one day ago (the offset keeps the current dip out of its own baseline). It triggers when traffic falls below 70% of that baseline, which could indicate routing issues or upstream service failures.
Practical Prometheus Query Scenarios
Let's look at how these queries work in real-world situations with detailed examples that you'll encounter in production environments:
Monitoring Kubernetes Pod Resource Usage
sum(rate(container_cpu_usage_seconds_total{pod=~"api-.*"}[5m])) by (pod)
This query shows CPU usage rates for all API pods in your Kubernetes cluster. It uses regex matching (=~) to select pods whose names start with "api-", then groups the results by pod name. This helps you spot which specific pods might need more resources or might be experiencing unusual load patterns.
For more comprehensive Kubernetes monitoring, combine this with memory usage tracking:
sum(container_memory_working_set_bytes{pod=~"api-.*"}) by (pod) / (1024 * 1024)
This gives you memory usage in MB per pod, making it easy to identify memory leaks or pods approaching their limits. Pairing CPU and memory metrics gives you the full resource utilization picture.
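To see how close each pod is to its configured limit, you can divide by the limit reported by kube-state-metrics; a sketch assuming kube-state-metrics v2 is installed (its kube_pod_container_resource_limits metric carries a resource label) and the cAdvisor series are filtered to real containers:

sum(container_memory_working_set_bytes{pod=~"api-.*", container!=""}) by (pod)
  / sum(kube_pod_container_resource_limits{resource="memory", pod=~"api-.*"}) by (pod)

Values approaching 1 mean the pod is close to its memory limit and a potential OOM kill.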
Detecting Slow Database Queries
max_over_time(mysql_global_status_slow_queries[1h]) - min_over_time(mysql_global_status_slow_queries[1h])
This query shows how many new slow queries have been logged in the past hour. By subtracting the minimum counter value from the maximum in a time window, you can see the increment even for constantly increasing counters, provided the counter isn't reset by a MySQL restart during that window.
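If you'd rather let Prometheus handle that reset case for you, increase() gives a close equivalent:

increase(mysql_global_status_slow_queries[1h])

The result is extrapolated to the full hour, so it can differ fractionally from the raw max-minus-min difference, but it stays correct across restarts.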
You can extend this to monitor database connections and identify potential connection pool issues:
mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100
This percentage tells you how close you are to maxing out your database connections – critical for avoiding application timeouts during traffic spikes.
Tracking API Errors by Endpoint
sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) * 100
This query gives you the error percentage broken down by API endpoint. It divides the 5xx error rate by the total request rate for each path, multiplied by 100 to get a percentage. This helps you quickly pinpoint which specific endpoints are causing problems rather than just seeing an overall error rate spike.
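Wrapping the same expression in topk() keeps the view readable when you have hundreds of endpoints:

topk(10, sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) * 100)

Only the ten paths with the worst error percentage survive, instead of one line per endpoint.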
Alerting on Service Level Objective (SLO) Breaches
sum(rate(http_request_duration_seconds_count{status!~"5.."}[5m])) by (service) / sum(rate(http_request_duration_seconds_count[5m])) by (service) < 0.995
This query alerts when your service availability drops below 99.5% (your SLO). It calculates the ratio of successful requests to total requests over a 5-minute window. Perfect for monitoring compliance with customer SLAs.
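Latency SLOs follow the same shape, counting "fast enough" requests straight from the histogram buckets. A sketch that assumes your histogram actually has a 0.5-second bucket boundary:

sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (service)
  / sum(rate(http_request_duration_seconds_count[5m])) by (service) < 0.95

This flags any service where fewer than 95% of requests complete within 500ms over the last five minutes.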
Network Traffic Anomaly Detection
abs(
rate(node_network_transmit_bytes_total[5m])
- avg_over_time(rate(node_network_transmit_bytes_total[5m])[1d:5m])
) / avg_over_time(rate(node_network_transmit_bytes_total[5m])[1d:5m]) > 0.3
This complex query detects when your current network traffic deviates more than 30% from the typical pattern over the past day. It's fantastic for catching unexpected data transfers that might indicate a security issue or misconfigured application.
How to Optimize Your Prometheus Queries
Even the most powerful query isn't helpful if it brings your monitoring system to its knees. Here are some tips for keeping things fast:
Use Time Ranges Wisely
# Less efficient for long periods
rate(http_requests_total[7d])
# More efficient approach
avg_over_time(rate(http_requests_total[5m])[7d:5m])
The second query asks Prometheus to calculate 5-minute rates first, then average them over 7 days, which preserves short-term variation that a single 7-day rate averages away. The real speed-up comes when the inner rate() is served by a recording rule (covered below): averaging a pre-computed series over 7 days takes seconds where the raw query can take minutes.
Limit Cardinality
# High cardinality - could be hundreds of thousands of series
http_requests_total{path=~"/api/v1/users/.*/profile"}
# Lower cardinality - grouped by status code instead of individual paths
sum(http_requests_total) by (status, method)
High cardinality is the silent killer of monitoring systems. The second query groups metrics by status code and method rather than tracking every unique URL path, dramatically reducing the number of time series from potentially millions to just dozens.
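When you suspect a cardinality problem, you can ask Prometheus directly which metric names have the most series:

topk(10, count by (__name__) ({__name__=~".+"}))

Run it sparingly, since it touches every series, but it quickly points at the exporter or label responsible for an explosion.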
Pre-calculate Expensive Queries with Recording Rules
Instead of repeatedly running expensive queries in dashboards, create recording rules in your Prometheus configuration:
groups:
  - name: api_slos
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)
Then your dashboard can use the pre-calculated metric:
job:http_requests_total:rate5m{job="api-server"}
This approach can reduce dashboard load times by orders of magnitude and prevent your Prometheus server from becoming overwhelmed during peak usage.
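The same trick applies to the error-ratio alert from earlier. Recording the ratio once keeps every alert and panel cheap; the rule below is a sketch that could be appended to the api_slos group above (the rule name is just a naming convention):

- record: job:http_requests:error_ratio5m
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      / sum(rate(http_requests_total[5m])) by (job)

Alerts and dashboards can then check job:http_requests:error_ratio5m{job="api-server"} > 0.01 instead of recomputing both rates on every evaluation.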
Favor Subqueries Over Long Range Vectors
# Potentially expensive for high-cardinality metrics
max_over_time(http_requests_total{job="api"}[7d])
# Subquery: take an hourly maximum, then the max of those hourly results
max_over_time(max_over_time(http_requests_total{job="api"}[1h])[7d:1h])
A subquery evaluates the inner expression at a fixed step (hourly here) and then aggregates those results, so each inner evaluation only has to look at one hour of samples. The full seven-day range still gets read, though, so for queries that run on every dashboard refresh the recording rules above remain the more dependable optimization.
| Function Type | Common Use Cases | Performance Impact | Optimization Strategy |
|---|---|---|---|
| Aggregations (sum, avg, max) | Dashboard overviews, alerting | Low to medium | Use when filtering high-cardinality dimensions |
| Range vectors [1h] | Rate calculations, trends | Medium to high | Keep timespan as short as practical |
| Join operations | Cross-metric correlations | High | Pre-compute with recording rules |
| Regular expressions | Dynamic filtering | Very high | Replace with explicit label matching when possible |
| Subqueries | Long-term trends, forecasting | Very high | Use recording rules or federated metrics |
Debugging Common PromQL Issues
Let's explore solutions to common pitfalls that can save you hours of troubleshooting:
No Data Points Issue
# Might return no data points
rate(some_counter[5m])
# Diagnostic: shift the window back to just before the gap
rate(some_counter[5m] offset 5m)
Adding an offset can help you see data that was collected even if there's been a recent gap in metrics collection. It's a great way to diagnose whether something stopped reporting or truly dropped to zero.
For alerting scenarios, use the absent() function to detect missing metrics:
absent(up{job="api-server"})
This returns a 1 if the metric is missing entirely, making it perfect for alerting when a service stops reporting metrics completely—often a sign of more serious problems than just high error rates.
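As an alerting rule, absent() pairs naturally with a short for: delay so a single missed scrape doesn't page anyone (a sketch; group and alert names are placeholders):

groups:
  - name: availability
    rules:
      - alert: ApiServerMetricsMissing
        expr: absent(up{job="api-server"})
        for: 5m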
Counter Resets
# Vulnerable to counter resets
http_requests_total - http_requests_total offset 1h
# Handles counter resets
increase(http_requests_total[1h])
Manually subtracting the value from an hour ago goes negative (or just wrong) whenever the counter resets to zero on a restart. The increase() and rate() functions detect resets and compensate for them, which is crucial for containers and pods that frequently restart in orchestrated environments.
Dealing with Gaps in Time Series
# Might have gaps when service restarts
sum(rate(http_requests_total[5m])) by (service)
# Guarantees at least one (zero-valued) series when no data matches
sum(rate(http_requests_total[5m])) by (service) or vector(0)
The or vector(0) fallback doesn't carry forward the last known value; it returns a single zero series (with no service label) whenever the left-hand side is empty, so graphs and alert expressions don't go completely blank during brief service restarts or metric collection issues.
Fixing "No Data Points" in Rate Calculations
# Might fail if the time range isn't long enough
rate(http_requests_total[1m])
# More reliable with shorter scrape intervals
rate(http_requests_total[5m])
Always make sure your rate() time range includes at least two scrape intervals. If your Prometheus scrapes every 15s, a 1m range should be sufficient, but a 5m range provides better reliability, especially during high-load periods when scrapes might be delayed.
Debug Metric Existence and Dimensions
count({__name__=~"node_.*"}) by (__name__)
This meta-query helps you discover what metrics are available in your Prometheus instance. It's incredibly useful when working with a new exporter or trying to find the exact name of a metric you need.
count(node_cpu_seconds_total) by (mode, cpu)
This tells you what label combinations exist for a specific metric, helping you understand the dimensions available for filtering or aggregation.
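The same pattern answers quick inventory questions, such as how many cores a particular node reports (the instance value here is a placeholder):

count(count by (cpu) (node_cpu_seconds_total{instance="node-1:9100"}))

The inner count collapses the per-mode series for each cpu value; the outer count tallies the distinct cores.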
Wrapping Up
Prometheus queries are like any skill—they get better with practice. Start with the basics, experiment in your test environment, and gradually work your way up to the more complex examples.