
Mar 13th, ‘25 / 9 min read

Essential Prometheus Queries: Simple to Advanced

Learn essential Prometheus queries, from simple to advanced, to monitor, troubleshoot, and optimize your systems with confidence.


Monitoring your infrastructure doesn't have to be a headache. With Prometheus, you've got a powerful ally in your corner—but like any tool, knowing how to use it makes all the difference.

Let's cut through the noise and get straight to the good stuff: practical Prometheus query examples that extract exactly the insights you need when you need them most.

What Makes Prometheus Queries So Valuable?

Prometheus query language (PromQL) is the secret weapon in your monitoring arsenal. It's what lets you slice and dice metrics to spot issues before they become outages.

Whether you're tracking CPU usage across a Kubernetes cluster or monitoring API latency, mastering PromQL transforms raw data into actionable insights.

For DevOps engineers and SREs, this means fewer 3 AM wake-up calls and more confident deployments.

💡
If you want to get more out of Prometheus, here’s a handy guide on Prometheus functions to help with your queries.

Getting Started: Basic Prometheus Query Examples

Let's kick things off with some foundational queries that every monitoring stack should include:

Simple Metric Selection

http_requests_total

This query returns all time series with the metric name http_requests_total. It's as straightforward as it gets—simply naming the metric you want to see. You'll get back every label combination Prometheus has stored for this metric.
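
The result is one series per unique label set. A single entry might look something like this (the exact labels depend on your own instrumentation; these are purely illustrative):

http_requests_total{instance="10.0.0.5:8080", job="api-server", method="GET", status="200"}  10234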

Filtering with Labels

http_requests_total{status="500"}

Now we're getting specific. This query filters the http_requests_total metric to only show requests that resulted in a 500 error code. The curly braces let you filter by any label attached to your metrics, making it easy to zero in on exactly what you need.
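
You can combine several matchers in the same selector, and regex matching (=~) covers whole classes of values. For example, every 5xx response from one job (the job name here is just a placeholder):

http_requests_total{job="api-server", status=~"5.."}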

Rate Function for Counter Metrics

rate(http_requests_total{job="api-server"}[5m])

Here's where things get interesting. This query calculates the per-second rate of HTTP requests to your API server over the last 5 minutes. The rate() function is your go-to for counter metrics that only increase over time. It helps you see velocity rather than just cumulative totals.

💡
If you want to do more with Prometheus, check out our guide on the Prometheus API and how it can help with your queries.

Intermediate Techniques to Level Up Your Monitoring

These intermediate examples will help you build more sophisticated dashboards:

Aggregating Metrics by Label

sum(rate(http_requests_total[5m])) by (status)

This query takes the rate of all HTTP requests over 5 minutes and adds them together, but keeps them separated by status code. The result? A clean view of your error rates compared to successful requests—perfect for spotting when things start to go sideways.
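
The by clause keeps only the labels you name. If you'd rather drop just the noisy labels and keep everything else, without is its mirror image; for instance, this removes the per-instance dimension while preserving all other labels:

sum(rate(http_requests_total[5m])) without (instance)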

Finding Top Consumers

topk(5, sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance))

Who's hogging all the CPU? This query shows you the 5 instances with the highest CPU usage. The topk() function ranks your results, making it easy to identify resource hogs at a glance.

Calculating Service Uptime

(sum(up{job="api-server"}) / count(up{job="api-server"})) * 100

This query gives you a clean percentage of your API servers that are currently up. It divides the number of "up" instances by the total count and multiplies by 100. Simple, but incredibly useful for SLO tracking.
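
If you prefer a shorter form, avg() gives you the same fleet-wide percentage in one step, and avg_over_time() extends the idea to per-instance availability over a longer window (the 30-day range here is just an example):

avg(up{job="api-server"}) * 100

avg_over_time(up{job="api-server"}[30d]) * 100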

Advanced PromQL: Become a Query Expert

Now let's push the envelope with some advanced techniques that separate the pros from the amateurs:

Predicting Future Values

predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 4 * 3600)

This query predicts how much disk space you'll have in 4 hours based on the usage pattern over the last 6 hours. The predict_linear() function is your crystal ball for capacity planning—catch problems before they happen.

You can extend this to create early warning systems for disk capacity issues:

predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24 * 3600) < 10 * 1024 * 1024 * 1024

This alerts when any filesystem is predicted to have less than 10GB free within the next 24 hours, giving you ample time to add capacity or clean up before users notice any problems.
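
To turn that expression into an actual alert, drop it into an alerting rule. Here's a rough sketch; the group name, duration, and severity label are placeholders you'd adapt to your own setup:

groups:
  - name: disk_capacity
    rules:
      - alert: FilesystemFillingUp
        expr: predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 24 * 3600) < 10 * 1024 * 1024 * 1024
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem on {{ $labels.instance }} predicted to drop below 10GB within 24 hours"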

💡
If you're setting up Prometheus, check out our guide on Prometheus port configuration to get it right.

Complex Alerting Conditions

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01

This query calculates your error ratio (the input to your error budget) by dividing the rate of 5xx errors by the total request rate. If more than 1% of requests are failing, this expression will evaluate to true—perfect for triggering alerts when things go south.

For multi-window analysis to prevent alert flapping, use:

(
  sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
)
and
(
  sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > 0.01
)

This only triggers when both the 5-minute and 1-hour error rates exceed 1%, reducing false alarms from brief spikes while still catching persistent issues.

Histogram Quantiles for Latency Monitoring

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Want to know what your 95th percentile latency looks like? This query takes your request duration histogram and calculates exactly that. The histogram_quantile() function is essential for performance monitoring and SLO adherence.

For more detailed analysis, run several percentiles side by side, for example as separate queries on the same dashboard panel:

# p50
histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# p90
histogram_quantile(0.90, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# p95
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# p99
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

Viewed together, these give you a complete latency profile that helps you distinguish between general slowness and outlier requests affecting a small percentage of users.
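
Keep an extra label next to le in the aggregation and you get the same percentile broken down per dimension, say per service (assuming your histogram carries a service label):

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))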

Rate of Change Detection

deriv(process_resident_memory_bytes{job="api-server"}[30m]) > 1024 * 1024

This detects memory leaks by alerting when the memory usage is growing faster than 1MB per second over a 30-minute window—catching gradual resource exhaustion before it becomes critical.

Dynamic Baseline Comparison

sum(rate(http_requests_total[5m])) 
  < 
avg_over_time(sum(rate(http_requests_total[5m]))[7d:1h] offset 1d) * 0.7

This detects traffic drops by comparing the current request rate against its average over the seven days ending one day ago. It triggers when traffic falls below 70% of that baseline, which could indicate routing issues or upstream service failures.

💡
If you’re working with Prometheus, understanding metric types is key. Here’s a guide on Prometheus metric types to help you out.

Practical Prometheus Query Scenarios

Let's look at how these queries work in real-world situations with detailed examples that you'll encounter in production environments:

Monitoring Kubernetes Pod Resource Usage

sum(rate(container_cpu_usage_seconds_total{pod=~"api-.*"}[5m])) by (pod)

This query shows CPU usage rates for all API pods in your Kubernetes cluster. It uses regex matching (=~) to select pods whose names start with "api-", then groups the results by pod name. This helps you spot which specific pods might need more resources or might be experiencing unusual load patterns.

For more comprehensive Kubernetes monitoring, combine this with memory usage tracking:

sum(container_memory_working_set_bytes{pod=~"api-.*"}) by (pod) / (1024 * 1024)

This gives you memory usage in MB per pod, making it easy to identify memory leaks or pods approaching their limits. Pairing CPU and memory metrics gives you the full resource utilization picture.
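
If you also run kube-state-metrics, you can compare working-set memory against the configured limits to see how much headroom each pod has. A rough sketch, assuming the standard kube_pod_container_resource_limits metric is available:

sum(container_memory_working_set_bytes{pod=~"api-.*"}) by (pod)
  / sum(kube_pod_container_resource_limits{resource="memory", pod=~"api-.*"}) by (pod)

Values close to 1 mean the pod is about to hit its memory limit.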

Detecting Slow Database Queries

max_over_time(mysql_global_status_slow_queries[1h]) - min_over_time(mysql_global_status_slow_queries[1h])

This query shows how many new slow queries have been logged in the past hour. By subtracting the minimum counter value from the maximum in a time window, you can see the increment even for constantly increasing counters.
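
If you'd rather not do the subtraction yourself, increase() gets you to the same number in one step and also compensates for counter resets if the database restarts mid-window:

increase(mysql_global_status_slow_queries[1h])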

You can extend this to monitor database connections and identify potential connection pool issues:

mysql_global_status_threads_connected / mysql_global_variables_max_connections * 100

This percentage tells you how close you are to maxing out your database connections – critical for avoiding application timeouts during traffic spikes.

Tracking API Errors by Endpoint

sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) * 100

This query gives you the error percentage broken down by API endpoint. It divides the 5xx error rate by the total request rate for each path, multiplied by 100 to get a percentage. This helps you quickly pinpoint which specific endpoints are causing problems rather than just seeing an overall error rate spike.
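
Wrap it in topk() to surface just the worst offenders instead of scrolling through every path:

topk(5, sum(rate(http_requests_total{status=~"5.."}[5m])) by (path) / sum(rate(http_requests_total[5m])) by (path) * 100)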

Alerting on Service Level Objective (SLO) Breaches

sum(rate(http_request_duration_seconds_count{status!~"5.."}[5m])) by (service) / sum(rate(http_request_duration_seconds_count[5m])) by (service) < 0.995

This query alerts when your service availability drops below 99.5% (your SLO). It calculates the ratio of successful requests to total requests over a 5-minute window. Perfect for monitoring compliance with customer SLAs.

Network Traffic Anomaly Detection

abs(
  rate(node_network_transmit_bytes_total[5m])
  - avg_over_time(rate(node_network_transmit_bytes_total[5m])[1d:5m])
) / avg_over_time(rate(node_network_transmit_bytes_total[5m])[1d:5m]) > 0.3

This complex query detects when your current network traffic deviates more than 30% from the typical pattern over the past day. It's fantastic for catching unexpected data transfers that might indicate a security issue or misconfigured application.

💡
If you need to push metrics in Prometheus, check out our guide on the Prometheus Pushgateway and how it works.

How to Optimize Your Prometheus Queries

Even the most powerful query isn't helpful if it brings your monitoring system to its knees. Here are some tips for keeping things fast:

Use Time Ranges Wisely

# Less efficient for long periods
rate(http_requests_total[7d])

# More efficient approach
avg_over_time(rate(http_requests_total[5m])[7d:5m])

The second query is much more efficient because it asks Prometheus to calculate the 5-minute rates first, then average those pre-calculated rates over 7 days. Your Prometheus instance will thank you. This approach can reduce query execution time from minutes to seconds for long time ranges.

Limit Cardinality

# High cardinality - a selector over unbounded URL paths can match hundreds of thousands of series
http_requests_total{path=~"/api/v1/users/.+/profile"}

# Lower cardinality - grouped by status code instead of individual paths
sum(http_requests_total) by (status, method)

High cardinality is the silent killer of monitoring systems. The second query groups metrics by status code and method rather than tracking every unique URL path, dramatically reducing the number of time series from potentially millions to just dozens.
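
Not sure how bad your cardinality already is? A couple of quick counts tell you how many active series a metric has before you build anything expensive on top of it (the path label here is just an example):

# Total number of active series for the metric
count(http_requests_total)

# Number of distinct values for a single label
count(count(http_requests_total) by (path))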

Pre-calculate Expensive Queries with Recording Rules

Instead of repeatedly running expensive queries in dashboards, create recording rules in your Prometheus configuration:

groups:
  - name: api_slos
    interval: 30s
    rules:
      - record: job:http_requests_total:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

Then your dashboard can use the pre-calculated metric:

job:http_requests_total:rate5m{job="api-server"}

This approach can reduce dashboard load times by orders of magnitude and prevent your Prometheus server from becoming overwhelmed during peak usage.

💡
If you're running Prometheus at scale, here are some tips and strategies to keep it efficient.

Favor Subqueries Over Long Range Vectors

# Potentially expensive for high-cardinality metrics
max_over_time(http_requests_total{job="api"}[7d])

# More efficient approach: evaluate hour-sized windows via a subquery
max_over_time(
  max_over_time(http_requests_total{job="api"}[1h])[7d:1h]
)

Breaking a long-range query into smaller chunks with a subquery keeps each evaluation window short, which can dramatically improve performance; for the heaviest cases, pre-compute the inner expression with a recording rule as shown above.

| Function Type | Common Use Cases | Performance Impact | Optimization Strategy |
| --- | --- | --- | --- |
| Aggregations (sum, avg, max) | Dashboard overviews, alerting | Low to medium | Use when filtering high-cardinality dimensions |
| Range vectors [1h] | Rate calculations, trends | Medium to high | Keep timespan as short as practical |
| Join operations | Cross-metric correlations | High | Pre-compute with recording rules |
| Regular expressions | Dynamic filtering | Very high | Replace with explicit label matching when possible |
| Subqueries | Long-term trends, forecasting | Very high | Use recording rules or federated metrics |

Debugging Common PromQL Issues

Let's explore solutions to common pitfalls that can save you hours of troubleshooting:

No Data Points Issue

# Might return no data points
rate(some_counter[5m])

# More resilient approach
rate(some_counter[5m] offset 5m)

Adding an offset can help you see data that was collected even if there's been a recent gap in metrics collection. It's a great way to diagnose whether something stopped reporting or truly dropped to zero.

For alerting scenarios, use the absent() function to detect missing metrics:

absent(up{job="api-server"})

This returns a 1 if the metric is missing entirely, making it perfect for alerting when a service stops reporting metrics completely—often a sign of more serious problems than just high error rates.
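
For a less twitchy version, absent_over_time() only fires once the metric has been missing for the entire window, which is handy when a single skipped scrape shouldn't page anyone:

absent_over_time(up{job="api-server"}[10m])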

Counter Resets

# Vulnerable to counter resets
http_requests_total - http_requests_total offset 1h

# Handles counter resets automatically
increase(http_requests_total[1h])

Both increase() and rate() automatically compensate for counter resets, making them far more reliable than subtracting raw counter values across a window. This is crucial for containers and pods that frequently restart in orchestrated environments.

💡
If you want to manage alerts better, check out our guide on Prometheus Alertmanager and how it helps.

Dealing with Gaps in Time Series

# Might have gaps when service restarts
sum(rate(http_requests_total[5m])) by (service)

# Falls back to an explicit zero when nothing is returned
sum(rate(http_requests_total[5m])) by (service) or vector(0)

The "or vector(0)" fallback returns a zero-valued series whenever the left-hand query comes back empty, so your graphs show an explicit zero instead of a gap during brief service restarts or metric collection issues, preserving visual continuity for easier pattern recognition.

Fixing "No Data Points" in Rate Calculations

# Might fail if the time range isn't long enough
rate(http_requests_total[1m])

# More reliable with shorter scrape intervals
rate(http_requests_total[5m])

Always make sure your rate() time range includes at least two scrape intervals. If your Prometheus scrapes every 15s, a 1m range should be sufficient, but a 5m range provides better reliability, especially during high-load periods when scrapes might be delayed.

Debug Metric Existence and Dimensions

count({__name__=~"node_.*"}) by (__name__)

This meta-query helps you discover what metrics are available in your Prometheus instance. It's incredibly useful when working with a new exporter or trying to find the exact name of a metric you need.

count(node_cpu_seconds_total) by (mode, cpu)

This tells you what label combinations exist for a specific metric, helping you understand the dimensions available for filtering or aggregation.

Wrapping Up

Prometheus queries are like any skill—they get better with practice. Start with the basics, experiment in your test environment, and gradually work your way up to the more complex examples.

💡
What Prometheus queries have helped you out during an incident? Share your experiences and learn from others in our Discord community!
Authors
Anjali Udasi


Helping to make tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.