Sep 12, 2024 · 2 min read

PromQL Cheat Sheet: Must-Know PromQL Queries

This cheat sheet rounds up practical PromQL queries for diagnosing issues and understanding trends.


Look, we've all been there - staring at a Prometheus dashboard, trying to figure out why our system's acting up. PromQL can be a pain, but it's also incredibly powerful when you know how to use it.

I've spent countless hours fumbling through queries, and I want to save you some of that hassle. Here's a collection of PromQL snippets that have helped me in the trenches.

Table of Contents

  • Quick Queries for Common Scenarios
  • Digging Deeper with Advanced Queries
  • Real-World Problems and How to Solve Them
  • Wrapping Up

Quick Queries for Common Scenarios

    What's Happening Right Now

    When everything's on fire and you need to know what's going on ASAP:

    sum(rate(http_requests_total[5m])) by (status_code)
    

    This gives you a quick snapshot of your HTTP requests, broken down by status code. It's saved my bacon more times than I can count.

    If you're drowning in data, narrow it down:

    topk(5, sum(rate(http_requests_total[5m])) by (status_code))
    

    Now you're just looking at the top 5 status codes. It's like noise-canceling headphones for your metrics.
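
If it's errors specifically you're chasing, filter before you aggregate. This is just a variation on the query above, assuming your http_requests_total carries the same status_code label:

topk(5, sum(rate(http_requests_total{status_code=~"5.."}[5m])) by (status_code))

Same idea, but now the top 5 only includes 5xx responses.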

    📂
    Also explore our guide on PromQL: A Developer's Guide to Mastering Prometheus Queries.

    When your boss asks, "How's the system been doing lately?", try this:

    avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100
    

    This shows you the average available memory as a percentage over the last hour. It's a quick way to see if you're headed for trouble.

    Want to impress them with a day's worth of data? Just tweak it a bit:

    avg_over_time(
      (avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100)[1d:1h]
    )
    

    Now you're looking at hourly averages for the past day. It's like a time-lapse for your memory usage.
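
And if the average hides the bad moments, swap the outer function for min_over_time to see the worst hour of the day instead. Same metrics, same subquery, just a different rollup:

min_over_time(
  (avg_over_time(node_memory_MemAvailable_bytes[1h]) / avg_over_time(node_memory_MemTotal_bytes[1h]) * 100)[1d:1h]
)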

    Identifying Resource Hogs in Prometheus

    When you need to know which service is hogging all the resources:

    topk(5, sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores))
    

    This ranks your services by CPU usage. It's great for finding out which service needs optimization (or which team you need to have a friendly chat with).

    Want to compare current usage to last week? Here's a nifty trick:

    topk(5, (
      sum by (service) (rate(container_cpu_usage_seconds_total[5m])) / sum by (service) (machine_cpu_cores)
    ) / (
      sum by (service) (rate(container_cpu_usage_seconds_total[5m] offset 1w)) / sum by (service) (machine_cpu_cores offset 1w)
    ))
    

    This shows you which services have had the biggest change in CPU usage compared to last week. Perfect for catching performance regressions or unexpected improvements.

    📑
    Also read: Troubleshooting Common Prometheus Pitfalls—Cardinality, Resource Utilization, and Storage Issues

    Digging Deeper with Advanced Queries

    The "Agg and Match" Technique

    Ever needed to find out which client is causing the most errors? Try this:

    sum by (client) (
      rate(http_requests_total{status=~"5.."}[5m])
    )
    /
    sum by (client) (
      rate(http_requests_total[5m])
    )
    

    This gives you the error rate per client. It's like having x-ray vision for your API troubles.
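
To surface only the worst offenders, wrap that ratio in topk, the same trick from earlier:

topk(5,
  sum by (client) (rate(http_requests_total{status=~"5.."}[5m]))
  /
  sum by (client) (rate(http_requests_total[5m]))
)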

    The Subquery Shuffle

    Here's a beast of a query that finds CPU usage spikes:

    max_over_time(
      (
        sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
        /
        count by (instance) (node_cpu_seconds_total{mode="idle"})
      )[1h:5m]
    )
    -
    min_over_time(
      (
        sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
        /
        count by (instance) (node_cpu_seconds_total{mode="idle"})
      )[1h:5m]
    )
    

    This finds the difference between the max and min CPU usage over the last hour, calculated every 5 minutes. It's like a CPU usage roller coaster detector.
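
If a single volatility number is more useful than the max-to-min swing, stddev_over_time on the same subquery is one way to get it:

stddev_over_time(
  (
    sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))
    /
    count by (instance) (node_cpu_seconds_total{mode="idle"})
  )[1h:5m]
)

A high standard deviation means the instance's CPU usage has been bouncing around for the past hour.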

    Real-World Problems and How to Solve Them

Prometheus SLO Tracking and SLI Calculations

    When management starts talking about 9's, here's how to keep track:

    • Availability percentage:
    (sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100
    

    This calculates the availability percentage.

    • SLO alert if availability drops below 99.9%:
    (sum(rate(http_requests_total{code=~"2..|3.."}[30d])) / sum(rate(http_requests_total[30d]))) < 0.999
    

Use this as the alert expression; it fires when availability dips below 99.9%. A sketch of the full alerting rule follows this list.

    • Detecting outages with absent():
    absent(up{job="my-critical-service"}) == 1
    

absent() returns 1 when no matching series exist, so this fires when the critical service stops reporting at all.
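
Here's a rough sketch of how the availability check could be wired up as an alerting rule. The group name, alert name, severity label, and the 5-minute "for" duration are placeholders to adapt to your setup:

groups:
  - name: slo-alerts                 # placeholder group name
    rules:
      - alert: AvailabilityBelowSLO  # placeholder alert name
        expr: |
          (sum(rate(http_requests_total{code=~"2..|3.."}[30d]))
            / sum(rate(http_requests_total[30d]))) < 0.999
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "30-day availability has dropped below the 99.9% SLO"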

    Dealing with High Cardinality

    When your queries start timing out, it's time to get smart about cardinality:

    • Reduce cardinality:
avg by (endpoint) (request_duration_seconds)


Aggregating down to the endpoint label, instead of keeping every raw series, is the simplest way to cut the number of series a query has to touch.

    • Percentiles:
histogram_quantile(0.95, sum(rate(request_duration_seconds_bucket[5m])) by (le, endpoint))
    

    This calculates the 95th percentile of request durations.

    Capacity Planning

    Want to look like a fortune teller? Use these for capacity planning:

A couple of trend lines to start with:

    • 7-day prediction:
    predict_linear(node_cpu_seconds_total[30d], 86400 * 7)
    

This extrapolates the CPU time counter 7 days into the future, based on the last 30 days of data.

    • Disk usage growth rate:
    deriv(node_filesystem_free_bytes{}[7d])
    

This gives the per-second rate of change of free disk space; a negative value means the disk is filling up.

    Cluster-wide predictions:

sum(predict_linear(node_filesystem_free_bytes{}[30d], 86400 * 30)) by (fstype)


This projects free disk space 30 days out, grouped by filesystem type, across the whole cluster.
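
Where predict_linear really earns its keep is early-warning alerts. A minimal sketch, with an arbitrary 6-hour lookback, a 4-hour horizon, and the root filesystem picked purely as an example:

predict_linear(node_filesystem_free_bytes{mountpoint="/"}[6h], 4 * 3600) < 0

This fires when the linear trend says the filesystem will run out of space within the next four hours.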


    Combining Metrics and Logs

    When you need to correlate metrics with log events:

    # Count of error logs
    sum(rate(log_messages_total{level="error"}[5m])) by (service)
    
    # Correlation: HTTP 500 errors vs error log spikes
    sum(rate(http_requests_total{status="500"}[5m])) by (service)
    /
    sum(rate(log_messages_total{level="error"}[5m])) by (service)
    

PromQL won't merge two unrelated expressions into a single result set, but vector matching gets you most of the way there. For example, the and operator narrows the request-rate query down to just the services that are currently logging errors:

sum(rate(http_requests_total[5m])) by (service)
and
sum(rate(log_messages_total{level="error"}[5m])) by (service)
    

    Multi-Cluster Queries

    For those juggling multiple clusters:

    # Query across all clusters
    sum(
      avg(up{job="kubernetes-nodes"}) by (cluster)
    )
    
    # Compare resource usage across clusters
    topk(3,
      sum(
        rate(container_cpu_usage_seconds_total[5m])
      ) by (cluster)
      /
      sum(machine_cpu_cores) by (cluster)
    )
    

    Pro move: Use recording rules for better performance:

# Recording rule (lives in a Prometheus rules file; the group name is arbitrary)
groups:
  - name: cluster-cpu
    rules:
      - record: cluster:container_cpu_usage:percent
        expr: |
          sum(rate(container_cpu_usage_seconds_total[5m])) by (cluster)
          /
          sum(machine_cpu_cores) by (cluster)

# Query the pre-computed metric
topk(3, cluster:container_cpu_usage:percent)
    

    Wrapping Up

    Look, PromQL isn't always fun, but it's a powerful tool when you know how to use it. These queries have helped me out of some tight spots, and I hope they do the same for you. Remember:

    • Use instant vectors for "right now" data
    • Range vectors are your friend for trends
    • Comparisons help you spot what's changed
    • SLOs and SLIs keep you honest (and employed)
    • High cardinality is the final boss - aggregation is your weapon
    • Capacity planning is just educated guessing, but it impresses management
    • Metrics and logs are better together
• In a multi-cluster world, federation and recording rules are lifesavers

    Keep these in your back pocket, and the next time someone asks, "What's going on with our system?", you'll have the answers at your fingertips.

    Good luck, and happy querying!


    Authors

    Prathamesh Sonpatki

    Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.

    Anjali Udasi

Helping to make tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.
