Sep 4th, ‘24 / 9 min read

PromQL: A Developer's Guide to Prometheus Query Language

Our developer’s guide breaks down Prometheus Query Language in an easy-to-understand way, helping you monitor and analyze your metrics like a pro.

As an engineer who has spent countless hours working with monitoring and metric systems, I can say that Prometheus Query Language (PromQL) is one of the most capable tools I've used for querying metric data.

In this post, I'll walk you through the ins and outs of PromQL, sharing my experiences and the lessons I've learned along the way.

What is PromQL?

PromQL is the query language used by Prometheus, an open-source monitoring and alerting toolkit. It's designed specifically for working with time series data, making it an essential tool for anyone dealing with metrics and observability in modern software systems.

Prometheus, and by extension PromQL, has become a cornerstone of cloud-native observability, especially in environments using Kubernetes. Its power lies in its ability to process and analyze vast amounts of time series data efficiently, providing real-time insights into system behavior.

Understanding Time Series Data

Before we dive into PromQL syntax, it's crucial to understand the nature of time series data.

In Prometheus, a time series is identified by its metric name and a set of key-value pairs called labels. Each data point in a time series consists of:

  1. A float64 value
  2. A millisecond-precision timestamp

For example, a time series for HTTP requests might look like this:

http_requests_total{status="200", method="GET"} 1234 1623456789000

Here, http_requests_total is the metric name, {status="200", method="GET"} are the labels, 1234 is the value, and 1623456789000 is the timestamp.
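To make that anatomy concrete, here is a minimal Python sketch that splits a sample line like the one above into its four parts. The parse_sample helper and its regular expressions are my own illustration, not part of any Prometheus library, and they ignore details such as escaped quotes inside label values.

```python
import re

# Hypothetical helper for illustration only; real parsers handle escapes and more.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>.*)\})?'               # optional {label="value", ...}
    r'\s+(?P<value>\S+)'                     # sample value
    r'(?:\s+(?P<ts>-?\d+))?$'                # optional millisecond timestamp
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_sample(line):
    """Return (metric_name, labels, value, timestamp_ms) for one sample line."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    timestamp = int(match.group("ts")) if match.group("ts") else None
    return match.group("name"), labels, float(match.group("value")), timestamp


sample = parse_sample(
    'http_requests_total{status="200", method="GET"} 1234 1623456789000'
)
```

The metric name plus the full label set is what identifies the series; the value and timestamp are just one data point in it.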

I have covered the basics of PromQL in another post that walks through the anatomy of a metric and some basic queries.

Getting Started with PromQL Syntax

When I first encountered PromQL, I was struck by its elegance and power. Unlike SQL, which I was more familiar with, PromQL is tailored for time series data. Let's start with a basic example:

http_requests_total

This simple query returns an instant vector containing the most recent sample of every time series named http_requests_total, across all scraped targets. But PromQL's real power comes from its ability to select, filter, and aggregate data.

Selectors and Labels

One of the first things I learned was how to use selectors to filter metrics based on their labels:

http_requests_total{status="200", method="GET"}

This query selects only the HTTP requests with a status code of 200 and GET method. Labels are key-value pairs that provide additional context to metrics, and they're incredibly useful for drilling down into specific data.

You can use various matching operators with labels:

  • =: Exact match
  • !=: Not equal
  • =~: Regex match
  • !~: Regex not match

For example, to select all HTTP requests with status codes in the 4xx range:

http_requests_total{status=~"4.."}
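The four matchers can be sketched in a few lines of Python. This is only an approximation of the behavior, under two assumptions worth calling out: PromQL regexes are fully anchored (so "4.." must match the whole label value), and a label that is absent is treated as an empty string. The matches function is a hypothetical helper, and Python's re module stands in for the RE2 syntax Prometheus actually uses.

```python
import re


def matches(labels, name, op, pattern):
    """Approximate one PromQL label matcher against a dict of labels."""
    value = labels.get(name, "")  # an absent label behaves like ""
    if op == "=":
        return value == pattern
    if op == "!=":
        return value != pattern
    if op == "=~":
        return re.fullmatch(pattern, value) is not None  # fully anchored
    if op == "!~":
        return re.fullmatch(pattern, value) is None
    raise ValueError(f"unknown operator: {op}")


series = [
    {"status": "200", "method": "GET"},
    {"status": "404", "method": "GET"},
    {"status": "4040", "method": "POST"},  # four characters, so "4.." does not match
    {"method": "GET"},                     # no status label at all
]

client_errors = [s for s in series if matches(s, "status", "=~", "4..")]
not_ok = [s for s in series if matches(s, "status", "!=", "200")]
```

Note that not_ok includes the series without a status label, which is one of the classic surprises with negative matchers.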

Range Vectors

While instant vectors give you a snapshot of metrics at a single point in time, range vectors allow you to work with data over a time range. Here's an example:

http_requests_total[5m]

This query returns all the values of http_requests_total over the last 5 minutes. Range vectors are particularly useful when you need to calculate rates or trends.

For more detail, see the Prometheus data types article.

📑
Learn more about Prometheus Metric Types in this detailed guide.

PromQL Data Types

As you dive deeper into PromQL, you'll encounter four main data types:

  1. Instant Vector: A set of time series, each with a single sample at a specific time.
  2. Range Vector: A set of time series with a range of samples over time.
  3. Scalar: A simple numeric floating-point value.
  4. String: A simple string value (currently unused in PromQL).

Understanding these data types is crucial for writing effective queries and avoiding common pitfalls.

Functions and Operators

PromQL provides a rich set of functions and operators to manipulate and analyze time series data. Let's explore some of the most commonly used ones:

Rate and Increase

The rate() function is one of the most frequently used in PromQL. It calculates the per-second average rate of increase of the time series in a range vector. For example:

rate(http_requests_total[5m])

This query calculates the per-second rate of HTTP requests over the last 5 minutes.

The increase() function is similar but returns the total increase in the counter over the time range. It's often more intuitive for counters that increase by integers:

increase(http_requests_total[1h])

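This gives you the approximate number of HTTP requests in the last hour. Because Prometheus extrapolates the result to cover the full window, the value can be a non-integer even for a counter that only increases in whole steps.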
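To build intuition for what these two functions do, here is a simplified Python sketch. It handles counter resets the way I understand Prometheus does (a drop in value is treated as a restart from zero), but it deliberately skips the extrapolation to the window boundaries, so real results will differ slightly. The function names are mine.

```python
def simple_increase(samples):
    """Sum the increases between consecutive (timestamp_s, value) samples."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The counter went down, so assume it reset and restarted from zero.
            total += curr
    return total


def simple_rate(samples):
    """Per-second average increase between the first and last sample."""
    if len(samples) < 2:
        return None  # not enough data to compute a rate
    elapsed = samples[-1][0] - samples[0][0]
    return simple_increase(samples) / elapsed


# Five scrapes, 15 seconds apart, with a counter reset before the fourth one.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
total_increase = simple_increase(samples)
per_second = simple_rate(samples)
```

Here the increase is 30 + 30 + 10 + 30 = 100 requests over 60 seconds, so the rate works out to roughly 1.67 requests per second.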

Aggregation Operators

PromQL shines when it comes to aggregating data. Here are some common aggregation operators:

  • sum: Calculate the sum over dimensions
  • avg: Calculate the average over dimensions
  • min and max: Find the minimum or maximum over dimensions
  • count: Count the number of elements in a vector

For example, to get the total HTTP requests across all endpoints:

sum(http_requests_total)

You can also aggregate by specific labels:

sum(http_requests_total) by (status)

This gives you the total requests grouped by status code.
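Conceptually, sum with a by clause is a group-by on the listed labels. A rough Python equivalent, with a hypothetical sum_by helper and made-up sample values, might look like this:

```python
from collections import defaultdict


def sum_by(vector, by=()):
    """Sum (labels, value) pairs, keeping only the labels listed in `by`."""
    groups = defaultdict(float)
    for labels, value in vector:
        # The group key is built from the preserved labels only.
        key = tuple((name, labels.get(name, "")) for name in by)
        groups[key] += value
    return dict(groups)


# Illustrative instant vector; the values are invented for the example.
vector = [
    ({"status": "200", "method": "GET"}, 120.0),
    ({"status": "200", "method": "POST"}, 30.0),
    ({"status": "500", "method": "GET"}, 5.0),
]

overall = sum_by(vector)                    # like sum(http_requests_total)
per_status = sum_by(vector, by=("status",))  # like sum(...) by (status)
```

With no by clause every series lands in the same group, which is why the plain sum collapses to a single value.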

📑
Explore practical strategies for downsampling and aggregating metrics in Prometheus to manage cardinality and improve query performance.

Histograms and Summaries

Prometheus supports histogram and summary metric types, which are crucial for measuring distributions of values, like request durations. Here's an example that calculates the 95th percentile of request durations:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

This query uses the histogram_quantile function along with sum and rate to give us valuable insights into our application's performance.
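It helps to know that the result is an estimate: the function finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below is my simplified reading of that idea for classic histograms, assuming cumulative bucket counts, a final +Inf bucket, and a lowest bucket that starts at 0. It is not the actual Prometheus implementation.

```python
import math


def simple_histogram_quantile(q, buckets):
    """Estimate quantile q from (upper_bound, cumulative_count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Invented bucket counts for illustration: 100 requests in total.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = simple_histogram_quantile(0.95, buckets)
```

With these numbers the 95th request falls in the 0.5 to 1.0 second bucket, and interpolation puts the estimate at about 0.81 seconds. The accuracy depends entirely on how well the bucket boundaries fit your latency distribution.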

Kubernetes Monitoring

In our deployments at Last9, we use Levitate, and by extension PromQL, extensively for monitoring all workloads on Kubernetes. Here's a query I wrote to track CPU usage across all nodes:

sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)

This query aggregates CPU usage for all containers, grouped by node, giving us a clear picture of resource utilization across the cluster.

Another useful query for Kubernetes monitoring is tracking the number of pods per node:

sum(kube_pod_info) by (node)

These queries have been very helpful in monitoring the health and performance of our Kubernetes clusters.

Integration with Grafana

One of the things I love about PromQL is how well it integrates with visualization tools like Grafana. I often use PromQL queries directly in Grafana dashboards to create real-time visualizations of our metrics.

💡
Here's a tip: When creating Grafana dashboards, use variables to make your queries more dynamic.

For example:

sum(rate(http_requests_total{status=~"$status"}[5m])) by (method)

By using the $status variable, you can create interactive dashboards that allow users to filter data on the fly.

Common Pitfalls and Best Practices

Through my journey with PromQL, I've encountered a few gotchas that are worth sharing:

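  1. Label Matching: Be careful with label matching. The selector http_requests_total{job="api"} matches only series whose job label is exactly "api", while http_requests_total{job!="api"} matches series whose job label has any other value and also series with no job label at all, because a missing label is treated as an empty string. Note that a selector needs a metric name or at least one matcher that cannot match the empty string, so {job!="api"} on its own is rejected.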
  2. Rate vs. Increase: The rate() function is great for per-second averages, but for counters that increase by integers, increase() often gives more intuitive results.
  3. Aggregation and Labels: When using aggregation operators, be mindful of the labels. The query sum(http_requests_total) will sum across all labels, potentially giving you a single value. If you want to preserve certain labels, use the by clause: sum(http_requests_total) by (status).
  4. Range Vector Selector: Always use an appropriate time range in your range vector selector. Too short, and you might miss important data; too long, and your queries might become slow or less relevant.
  5. Subquery Pitfalls: Subqueries can be powerful but also confusing. Make sure you understand the evaluation order and be cautious with nested subqueries, as they can significantly impact performance.

Advanced PromQL Techniques

As you become more comfortable with basic PromQL, you'll want to leverage its more advanced features to gain deeper insights into your Prometheus metrics.

Subqueries and Time Duration

Subqueries allow you to perform query operations over a range of evaluation times. They're particularly useful for calculating "moving" averages or detecting slow-moving trends. Here's an example using a time duration:

max_over_time(rate(http_requests_total[5m])[1h:5m])

This query calculates the maximum rate of HTTP requests over the past hour, evaluating the rate every 5 minutes. The [1h:5m] syntax specifies a time duration of 1 hour with a step of 5 minutes.

Working with Timestamps

PromQL allows you to query data at specific timestamps. This is useful for historical analysis:

http_requests_total @ 1609459200

This query returns the value of http_requests_total at the Unix timestamp 1609459200 (January 1, 2021, 00:00:00 UTC).
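If you prefer to build the timestamp from a calendar date rather than work it out by hand, a small Python sketch does the conversion. The at_modifier helper is just an illustration of assembling the query string.

```python
from datetime import datetime, timezone


def at_modifier(metric, when):
    """Build a PromQL expression pinned to a timezone-aware datetime."""
    return f"{metric} @ {int(when.timestamp())}"


query = at_modifier(
    "http_requests_total",
    datetime(2021, 1, 1, tzinfo=timezone.utc),
)
```

Passing an explicit UTC timezone matters here; a naive datetime would be interpreted in the local timezone of the machine running the script.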

Arithmetic Operators

PromQL supports various arithmetic operators that allow you to perform calculations on metric values:

(instance_memory_limit_bytes - instance_memory_usage_bytes) / instance_memory_limit_bytes

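This query calculates the fraction of memory that is still free on each instance. Subtract the result from 1, or divide instance_memory_usage_bytes by instance_memory_limit_bytes directly, to get the usage ratio.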
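Binary operators between two vectors pair up series that have identical label sets, and series without a partner on the other side drop out of the result. Because the numerator here is limit minus usage, each result is the fraction of the limit still unused. A minimal Python sketch of that matching, with invented values and a hypothetical free_memory_ratio function:

```python
def free_memory_ratio(limits, usage):
    """Compute (limit - usage) / limit for series present on both sides.

    Both arguments map a frozenset of (label, value) pairs to a sample value.
    """
    result = {}
    for labels, limit in limits.items():
        if labels not in usage:
            continue  # no matching series on the other side, so no output
        if limit == 0:
            continue  # sketch only: PromQL would return Inf or NaN here
        result[labels] = (limit - usage[labels]) / limit
    return result


instance_a = frozenset({("instance", "a")})
instance_b = frozenset({("instance", "b")})
instance_c = frozenset({("instance", "c")})

limits = {instance_a: 8e9, instance_b: 4e9}
usage = {instance_a: 6e9, instance_c: 1e9}

ratios = free_memory_ratio(limits, usage)
```

Only instance a appears in both inputs, so it is the only series in the output, with a quarter of its memory free.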

Complex Query Examples

Let's look at some more complex queries that demonstrate the power of PromQL in real-world scenarios.

Calculating Average and Standard Deviation

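To calculate the average request duration, divide the rate of the _sum series by the rate of the _count series. The avg and stddev aggregators then summarize that per-series average across all the series (for example, instances) that share a path:

avg(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])) by (path)
stddev(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])) by (path)

These queries give you the mean and the standard deviation, across series, of the average request duration for each path over the last 5 minutes. Note that this is the spread between series, not the spread of individual request durations.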

Working with Label Values

Label values are crucial in PromQL for filtering and grouping data. Here's an example that filters based on label values:

sum(rate(http_requests_total{status=~"5.."}[5m])) by (path)

This query sums the rate of 5xx errors, grouped by the path label value.

Range Queries

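A range query evaluates a PromQL expression at regular steps over a span of time, which is what the HTTP API's query_range endpoint and graphing tools do. Inside an expression, the subquery syntax produces the same kind of result:

rate(http_requests_total[5m])[30m:1m]

This subquery evaluates the 5-minute rate of HTTP requests at 1-minute steps across the last 30 minutes and returns the results as a range vector.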
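When you call Prometheus directly, the same evaluation can be requested through the query_range endpoint of the HTTP API, which takes query, start, end, and step parameters. Here is a small sketch that only builds the request URL; the base URL assumes a Prometheus server on its default local port, and range_query_url is a helper I made up for the example.

```python
from urllib.parse import urlencode


def range_query_url(base_url, query, start, end, step):
    """Build a URL for the Prometheus /api/v1/query_range endpoint."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


# Assumed server address; start and end are Unix timestamps 30 minutes apart.
url = range_query_url(
    "http://localhost:9090",
    "rate(http_requests_total[5m])",
    start=1609459200,
    end=1609461000,
    step="1m",
)
```

Fetching that URL with any HTTP client should return a JSON matrix with one value per step for each matching series.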

📖
Discover the ultimate Prometheus toolkit — your go-to resource to jumpstart effective monitoring.

PromQL and Its Quirks

While PromQL is powerful, it has some quirks that can trip up even experienced users. Let's explore a few of these, focusing on metric types and comparison operators.

Understanding Metric Types

Prometheus supports several metric types, including counters, gauges, and histograms. Each type has its own characteristics and appropriate usage. For example, you should use the rate() function with counters, but not with gauges:

rate(http_requests_total[5m]) # Correct for a counter
rate(cpu_usage_percent[5m]) # Incorrect for a gauge

Comparison Operators

PromQL supports various comparison operators, but they work slightly differently than in other languages. For example:

http_requests_total > 100

This query returns all time series where the current value is greater than 100, not just a boolean result.
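If you do want a 0 or 1 for every series instead of a filter, PromQL offers the bool modifier, as in http_requests_total > bool 100. The difference between the two behaviors can be sketched in Python like this, using a hypothetical greater_than function and invented values:

```python
def greater_than(vector, threshold, as_bool=False):
    """Mimic `vector > threshold`, optionally with the `bool` modifier."""
    if as_bool:
        # Every series is kept; the value becomes 1.0 or 0.0.
        return [(labels, 1.0 if value > threshold else 0.0) for labels, value in vector]
    # Default behavior: act as a filter and keep the original values.
    return [(labels, value) for labels, value in vector if value > threshold]


vector = [({"instance": "a"}, 250.0), ({"instance": "b"}, 40.0)]

filtered = greater_than(vector, 100)
as_flags = greater_than(vector, 100, as_bool=True)
```

The filtering form is what makes comparison expressions so convenient for alerts: any series that survives the filter is a series that needs attention.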

Performance Optimization

As your Prometheus monitoring system grows, you might encounter performance issues. Here are some tips to keep your queries fast:

  1. Avoid High Cardinality: Queries that generate a large number of time series can be slow. Be cautious with label combinations that produce many unique series.
  2. Optimize Read Operations: Minimize the amount of data you need to read. Use appropriate time ranges and avoid unnecessary high-resolution queries.
  3. Use Recording Rules: For complex queries that you run frequently, consider setting up recording rules to pre-compute the results.
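To get a feel for why cardinality matters, remember that every unique combination of label values is its own time series. A quick back-of-the-envelope estimate, with made-up label counts, shows how fast the upper bound grows:

```python
from math import prod


def max_series(label_value_counts):
    """Upper bound on series for one metric: the product of label value counts."""
    return prod(label_value_counts.values())


# Invented numbers for illustration only.
http_labels = {"method": 5, "status": 10, "path": 200, "instance": 20}
worst_case = max_series(http_labels)
```

In practice not every combination occurs, but a single unbounded label such as a user ID or a raw URL can push this number far beyond what a query can scan quickly.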

PromQL in Practice: A Real-World Example

Let's walk through a real-world scenario involving Docker containers. We want to monitor the CPU usage of our containers and alert when it exceeds a threshold:

100 * sum(rate(container_cpu_usage_seconds_total{image!=""}[5m])) by (container_name)
  / sum(container_spec_cpu_quota{image!=""} / container_spec_cpu_period{image!=""}) by (container_name)
> 80

This query:

  1. Calculates the CPU usage rate for each container over 5 minutes.
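  2. Divides it by the number of CPUs the container is allowed to use (quota divided by period) and multiplies by 100 to get a percentage.
  3. Keeps only the containers whose CPU usage exceeds 80% of their quota, so the expression can serve as the condition of an alerting rule.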
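The arithmetic behind those steps is easy to check by hand. In the sketch below, the numbers are invented: a container with a quota of 50000 microseconds per 100000 microsecond period may use half a CPU, and it is currently consuming 0.45 CPU-seconds per second.

```python
def cpu_percent_of_quota(cpu_seconds_per_second, quota_us, period_us):
    """Express a CPU usage rate as a percentage of the container's CPU quota."""
    allowed_cpus = quota_us / period_us  # e.g. 50000 / 100000 = 0.5 CPUs
    return 100 * cpu_seconds_per_second / allowed_cpus


usage_percent = cpu_percent_of_quota(0.45, quota_us=50000, period_us=100000)
over_threshold = usage_percent > 80
```

That container is at about 90% of its quota, so it would pass the > 80 filter and show up in the query result.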

Extending Prometheus: Custom Exporters

While the Prometheus ecosystem offers many ready-made exporters, you might need to monitor systems or applications that don't have existing exporters. In these cases, writing a custom exporter can be incredibly valuable.

Here's a simplified example of a custom exporter that exposes application-specific metrics:

from prometheus_client import start_http_server, Gauge
import random
import time

# Create metrics: a Gauge holds a value that can go up or down
ACTIVE_USERS = Gauge('app_active_users', 'Number of active users')
REQUEST_LATENCY = Gauge('app_request_latency_seconds', 'Request latency in seconds')

# Simulate metrics with random values; a real exporter would read them from the application
def generate_metrics():
    while True:
        ACTIVE_USERS.set(random.randint(50, 100))
        REQUEST_LATENCY.set(random.uniform(0.1, 2.0))
        time.sleep(5)  # refresh the values every 5 seconds

if __name__ == '__main__':
    # Expose the metrics over HTTP on port 8000 so Prometheus can scrape them
    start_http_server(8000)
    generate_metrics()

This exporter creates two custom metrics, app_active_users and app_request_latency_seconds, and serves them over HTTP on port 8000 for Prometheus to scrape.

Conclusion

PromQL has become an indispensable tool in my observability toolkit. Its power and flexibility make it perfect for monitoring complex, distributed systems. As we've explored in this guide, PromQL offers a wide range of capabilities, from simple metric queries to complex aggregations and subqueries.

Key takeaways from our deep dive into PromQL include:

  1. Versatility: PromQL can handle a wide range of monitoring scenarios, from simple health checks to complex performance analyses.
  2. Integration: Its seamless integration with tools like Grafana makes it a cornerstone of modern observability stacks.
  3. Scalability: With proper optimization, PromQL can scale to handle large volumes of time series data efficiently.
  4. Extensibility: Custom exporters allow you to bring Prometheus-style monitoring to any system or application.
  5. Continuous Learning: The PromQL ecosystem is constantly evolving, with new features and best practices emerging regularly.

Remember, effective monitoring is not just about collecting data points; it's about asking the right questions about your metrics. PromQL gives you the language to ask those questions, and with time, you'll find it becomes second nature.

As you continue your PromQL journey, keep exploring, keep learning, and most importantly, keep querying. The insights you uncover might just be the key to unlocking the next level of performance and reliability in your systems.

Happy querying, and may your systems always be observable!

We'd love to hear about your experiences with reliability, observability, or monitoring. Let’s share insights and chat about these topics in the SRE Discord community.

Authors
Gabriel Diaz

Software Engineer at Last9