Last9

Feb 28th, ‘25 / 9 min read

Prometheus Functions: How to Make the Most of Your Metrics

Dig into your Prometheus metrics with functions that help you filter, analyze, and spot trends—so you can make sense of your data faster.


Keeping track of your infrastructure is non-negotiable. Prometheus makes that easier by collecting metrics and alerting you when something’s off. It’s a powerful tool that helps you understand what’s happening under the hood, whether you’re running a small cluster or managing large-scale applications.

In this guide, we’ll break down Prometheus functions—what they do, how they work, and why they matter for better observability. Let’s get into it.

Understanding Prometheus Functions

Prometheus isn't just another monitoring tool—it's a complete ecosystem that shines brightest when you know how to query it properly.

At its core, Prometheus collects metrics from your targets, stores them as time series data, and lets you slice and dice this information using its query language.

Think of Prometheus as that friend who remembers everything about everyone. It's constantly watching your systems, recording what's happening, and waiting for you to ask the right questions.

What Makes Prometheus Different?

Unlike traditional monitoring systems that might rely on complex dashboards or pre-defined reports, Prometheus puts the power in your hands through its flexible query language. You get to decide what matters, when it matters, and how to visualize it.

The secret sauce? PromQL—Prometheus Query Language—which lets you transform, combine, and analyze your metrics in ways that would make a data scientist jealous.

💡
Check out our beginner’s guide on PromQL here to build a strong foundation before jumping into advanced functions.

Prometheus and Its Functions

Prometheus is an open-source monitoring and alerting toolkit originally built at SoundCloud. It's now a standalone project maintained by a vibrant community as part of the Cloud Native Computing Foundation.

Think of Prometheus as your infrastructure's nervous system—constantly collecting signals from various parts of your application ecosystem and helping you make sense of them.

The Core Components

  • Prometheus Server: Collects and stores time series data
  • Client Libraries: Help you instrument your code
  • Alertmanager: Handles alerts
  • Pushgateway: Supports short-lived jobs
  • Exporters: Get metrics from third-party systems

But the real magic happens when you start using PromQL to query all this data.

PromQL: Your Gateway to Metric Insights

PromQL is what sets Prometheus apart from other monitoring solutions. This domain-specific language was designed for one thing: letting you select and aggregate time series data in real time.

Here's a simple example:

http_requests_total{status="200"}

This query returns the current value of the http_requests_total counter for every series with a status code of 200. Simple, right? But we're just getting started.

Let's say you want to see the rate of these requests over the last 5 minutes:

rate(http_requests_total{status="200"}[5m])

Now you're not just looking at a static number, but how that number is changing over time—which is much more useful for understanding system behavior.
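Under the hood, rate() looks at the samples inside the window and computes a per-second average increase. Here's a rough stdlib-only Python sketch of that arithmetic (the real implementation also handles counter resets and extrapolates to the window boundaries; the sample data is made up):

```python
# Rough sketch of the arithmetic behind rate(): per-second average
# increase of a counter over a window. Real Prometheus additionally
# handles counter resets and extrapolation.

def simple_rate(samples):
    """samples: list of (unix_timestamp, counter_value), oldest first."""
    (t0, v0), (tn, vn) = samples[0], samples[-1]
    return (vn - v0) / (tn - t0)

# A counter scraped every 15s, growing by ~30 requests per scrape:
samples = [(0, 0), (15, 30), (30, 60), (45, 90), (60, 120)]
print(simple_rate(samples))  # 2.0 requests/second
```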

💡
If you’ve ever wondered how the rate() function really works, check out our guide here for a closer look.

Comparing Query Speeds: Prometheus vs. Grafana Loki

When you're troubleshooting a production issue at 3 AM, query speed matters. Let's look at how Prometheus stacks up against one of its cousins in the monitoring world: Grafana Loki.

The Fundamental Difference

Prometheus is built for metrics—numerical data collected at regular intervals. Loki, on the other hand, specializes in logs—text-based records of events.

This specialization gives Prometheus a significant edge when it comes to querying metrics. Here's why:

  1. Data Storage Model: Prometheus stores metrics in a highly optimized time-series database, while Loki is designed for log data
  2. Indexing Approach: Prometheus indexes every metric and its labels, making queries lightning fast
  3. Query Language Design: PromQL was built specifically for time series analysis

Practical Query Performance

In practice, this means Prometheus can execute complex aggregations across thousands of time series in milliseconds. For example, calculating the 95th percentile response time across all your API endpoints might look like this:

histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))

This query might take under 100ms in Prometheus, even with thousands of time series to process. The equivalent operation in Loki would require extracting numerical values from logs and performing calculations that it wasn't optimized for.
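If you're curious what histogram_quantile() does with those _bucket series, here's a simplified Python sketch of its bucket interpolation (bucket bounds and counts are invented for illustration; the real function also handles edge cases like values in the highest bucket):

```python
# Simplified sketch of histogram_quantile(): find the cumulative ("le")
# bucket containing the target rank, then interpolate linearly inside it.

def histogram_quantile(q, buckets):
    """buckets: sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            # Interpolate linearly within this bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

# Latency buckets: 60% of requests under 0.1s, 90% under 0.5s, all under 1s.
buckets = [(0.1, 60), (0.5, 90), (1.0, 100)]
print(histogram_quantile(0.95, buckets))  # 0.75
```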

The Impact of PromQL's Aggregation Operators

PromQL shines brightest when you need to aggregate data. Operators like sum, avg, min, max, and topk let you summarize metrics across your entire infrastructure with minimal code:

# Find your 3 busiest services by CPU usage
topk(3, sum(rate(process_cpu_seconds_total[5m])) by (service))

With just one line, you've identified the services you should focus on first when optimizing resource usage.
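Conceptually, topk is a heap-based top-N selection over the current value of each series. A quick Python sketch with made-up service names and CPU rates:

```python
import heapq

# topk(3, ...) keeps the 3 series with the highest current value.
# Service names and per-second CPU rates here are invented.
cpu_rate_by_service = {
    "checkout": 0.82, "search": 0.41, "auth": 0.95,
    "billing": 0.12, "catalog": 0.67,
}
top3 = heapq.nlargest(3, cpu_rate_by_service.items(), key=lambda kv: kv[1])
print(top3)  # [('auth', 0.95), ('checkout', 0.82), ('catalog', 0.67)]
```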

💡
If writing complex PromQL queries feels like a hassle, recording rules can help! Check out our guide here to see how they make queries faster and easier.

Exploring Time Series Metrics in Prometheus

In the Prometheus universe, everything revolves around time series data. But what exactly does that mean?

Anatomy of a Time Series Metric

A time series is simply a sequence of data points, each with a timestamp and a value. In Prometheus, each time series is uniquely identified by:

  1. A metric name (like http_requests_total)
  2. A set of labels (key-value pairs like {status="200", method="GET"})

Together, the metric name and its labels form a unique identifier for each time series. This model is incredibly flexible, allowing you to track virtually any numeric value over time.

Types of Metrics in Prometheus

Prometheus supports four core metric types:

  1. Counter: A cumulative metric that can only increase (or reset to zero). Perfect for tracking things like request counts, errors, or completed tasks.
  2. Gauge: A metric that can go up and down. Ideal for measurements like temperature, memory usage, or concurrent connections.
  3. Histogram: Samples observations and counts them in configurable buckets. Used for things like request duration or response size.
  4. Summary: Similar to histogram, but also calculates configurable quantiles over a sliding time window.

Practical Examples

Let's look at some common metrics you might track:

  • CPU Usage: rate(process_cpu_seconds_total[5m]) shows the CPU usage rate over 5 minutes
  • Memory Usage: process_resident_memory_bytes gives you the current memory consumption
  • Request Rate: rate(http_requests_total[5m]) shows how many requests you're handling per second
  • Error Rate: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) calculates the percentage of requests that are resulting in 5xx errors

These metrics give you real-time insights into your system's health and performance.

💡
If managing alerts in Prometheus feels overwhelming, Alertmanager can help! Learn how to streamline notifications here.

How to Use Range Vectors in Alerting Rules

Now that you understand the basics, let's talk about one of Prometheus's most powerful features: range vectors and how they transform your alerting strategy.

What Are Range Vectors?

A range vector selects a range of samples for each time series over a specified time period. You create range vectors by adding a time selector in square brackets:

http_requests_total[5m]

This gives you all samples for the http_requests_total metric over the last 5 minutes.
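Conceptually, a range vector is just "every sample within the last N seconds" for each series. A tiny Python sketch with invented timestamps and values:

```python
# Sketch of range vector selection: keep the samples whose timestamp
# falls inside the lookback window ending at the evaluation time.

def range_vector(samples, now, window):
    """samples: list of (timestamp, value); returns those in (now-window, now]."""
    return [(t, v) for t, v in samples if now - window < t <= now]

samples = [(0, 1), (100, 2), (200, 3), (290, 4), (300, 5)]
print(range_vector(samples, now=300, window=300))
# [(100, 2), (200, 3), (290, 4), (300, 5)]
```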

Range Vector Functions for Effective Alerting

Range vectors become particularly powerful when combined with functions like:

  1. rate(): Calculates per-second average rate of increase
  2. irate(): Calculates per-second instant rate of increase
  3. increase(): Calculates the increase in the counter value
  4. max_over_time(): Returns the maximum value over the specified time range
  5. min_over_time(): Returns the minimum value
  6. avg_over_time(): Returns the average value

Building Smarter Alerts

Instead of triggering alerts based on a single point-in-time measurement, range vectors let you create alerts based on trends.

For example, rather than alerting the moment CPU usage hits 90% (note that process_cpu_seconds_total is a counter, so we wrap it in rate() to turn it into utilization):

rate(process_cpu_seconds_total[1m]) > 0.9

You could alert when the average CPU usage over 5 minutes exceeds 90%, using a subquery to average the rate:

avg_over_time(rate(process_cpu_seconds_total[1m])[5m:]) > 0.9

Or even better, alert when the CPU usage is trending upward rapidly:

predict_linear(rate(process_cpu_seconds_total[1m])[15m:], 3600) > 0.95

This last example predicts the CPU utilization one hour into the future based on the trend over the last 15 minutes, alerting you before you actually hit a problem.
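predict_linear() fits a straight line through the samples in the window and extrapolates it forward. Here's a stdlib-only Python sketch of that idea using an ordinary least-squares fit (the real function extrapolates from the evaluation timestamp; the sample data is made up):

```python
# Sketch of predict_linear(): least-squares fit over the window,
# extrapolated `seconds_ahead` past the last sample.

def predict_linear(samples, seconds_ahead):
    """samples: list of (timestamp, value); returns the predicted value."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (sum((t - mean_t) * (v - mean_v) for t, v in samples)
             / sum((t - mean_t) ** 2 for t, _ in samples))
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept

# CPU utilization climbing 1% per minute over the last 15 minutes:
samples = [(60 * i, 0.50 + 0.01 * i) for i in range(16)]
print(predict_linear(samples, 3600))  # ~1.25, i.e. 125% one hour out
```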

💡
Scaling Prometheus can get tricky, but the right strategies make all the difference. Check out our tips and tricks here.

Advanced Techniques for Capturing Spikes with PromQL

Let's take your Prometheus skills to the next level with techniques to capture and analyze metric spikes.

Identifying Counter Resets

Counters in Prometheus can reset to zero when a service restarts. The resets() function helps you identify when this happens:

resets(http_requests_total[1h])

This returns the number of counter resets within the last hour, which can be a useful indicator of service stability.
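The logic behind resets() is simple: count every time a sample's value drops below its predecessor. A minimal Python sketch with invented counter values:

```python
# Sketch of resets(): a counter only goes up, so any drop between
# adjacent samples means the process restarted and the counter reset.

def resets(values):
    return sum(1 for prev, cur in zip(values, values[1:]) if cur < prev)

# Counter climbs, resets to zero on a restart, climbs, resets again:
print(resets([10, 25, 40, 0, 5, 30, 2, 9]))  # 2
```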

Detecting Rate Changes

To detect sudden changes in rates, you can compare the short-term rate to a longer-term average:

abs(
  rate(http_requests_total[5m]) 
  / 
  rate(http_requests_total[1h])
  - 1
) > 0.3

This alerts when the 5-minute rate differs from the 1-hour rate by more than 30%, indicating a significant change in traffic patterns.
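The comparison boils down to a ratio check. A minimal Python sketch of the same logic, with made-up request rates:

```python
# Sketch of the deviation check above: flag when the short-term rate
# differs from the long-term baseline by more than the threshold.

def rate_spike(short_rate, long_rate, threshold=0.3):
    return abs(short_rate / long_rate - 1) > threshold

print(rate_spike(140.0, 100.0))  # True: 40% above the hourly baseline
print(rate_spike(110.0, 100.0))  # False: within normal variation
```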

Using Regex and Label Matching for Precision

PromQL supports powerful label matching techniques that let you zero in on exactly the metrics you care about:

# Exact match
http_requests_total{status="500"}

# Regex match
http_requests_total{status=~"5.."}

# Negative regex match
http_requests_total{status!~"2.."}

You can combine these with aggregation operators for even more precision:

# Alert if any instance has more than 10% error rate
sum by(instance) (rate(http_requests_total{status=~"5.."}[5m])) 
/ 
sum by(instance) (rate(http_requests_total[5m])) 
> 0.1

Handling Bursty Traffic

For services with bursty traffic patterns, standard averaging might miss important spikes. In these cases, consider using max_over_time or quantile functions:

# Alert if the error rate, checked every 30 seconds, spiked in the last 5 minutes
max_over_time(
  (sum(rate(http_requests_total{status=~"5.."}[1m]))
   /
   sum(rate(http_requests_total[1m])))
  [5m:30s]
) > 0.05

This subquery re-evaluates the error rate every 30 seconds across the last 5 minutes and checks whether any of those evaluations exceeded 5%. (Note the [5m:30s] subquery syntax: the colon is what tells Prometheus to re-run the inner expression at each step.)
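The idea is easy to see in plain Python: evaluate the ratio at each step and keep the maximum instead of averaging the whole window. A sketch with invented error ratios:

```python
# Sketch of max-over-window alerting: a short burst that an average
# would smooth away still trips the threshold.

def burst_alert(ratios, threshold=0.05):
    return max(ratios) > threshold

# Error ratio sampled every 30s over 5 minutes; one 30s burst at 8%:
ratios = [0.01, 0.02, 0.01, 0.08, 0.01, 0.02, 0.01, 0.01, 0.02, 0.01]
print(burst_alert(ratios))  # True: the burst alone fires the alert
```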

💡
Not sure which Prometheus metric type to use? Break it down with our easy guide here.

Conclusion

Prometheus functions transform raw metrics into actionable insights, giving you unprecedented visibility into your systems. From basic queries to advanced alerting rules, PromQL offers a flexible language that grows with your monitoring needs.

💡
And if you want to chat more, our Discord community is open! We’ve got a dedicated channel where you can discuss your use case with other developers.

FAQs

What are Prometheus Functions?
Prometheus functions are operations within PromQL (Prometheus Query Language) that let you:

  • Transform, aggregate, and analyze metrics data
  • Perform simple math operations or complex time-series manipulations

They fall into categories such as:

  • Aggregation functions (sum, avg, max)
  • Rate functions (rate, irate)
  • Prediction functions (predict_linear)
  • Time-based functions (hour, day_of_week)

These functions turn raw metrics into actionable insights for monitoring and alerting.

How does the query speed compare to similar tools like Grafana Loki?

  • Prometheus significantly outperforms Grafana Loki for metric queries due to its time-series database.
  • Loki excels at log storage and text-based queries.
  • Prometheus can execute complex metric aggregations across thousands of time series in milliseconds.
  • This speed advantage comes from Prometheus’s specialized indexing and PromQL’s optimization for numerical time-series analysis.
  • For metric-based monitoring, Prometheus queries can be 10-100x faster than extracting and calculating metrics from logs in Loki.

What is a Time Series Metric?
A time series metric is a sequence of data points recorded at specific time intervals. In Prometheus, each metric consists of:

  • A name (e.g., http_requests_total)
  • A set of labels (key-value pairs like {status="200", method="GET"})
  • A series of timestamp-value pairs

The combination of metric name and unique label sets creates distinct time series that track specific measurements over time. These metrics form the foundation of monitoring and observability in Prometheus.

Can I use range vectors in alerting rules?
Yes, and you absolutely should! Range vectors are essential for creating meaningful, actionable alerts:

  • [5m] refers to the last 5 minutes of data.
  • Range vectors allow alerts based on trends, reducing noise from temporary spikes.
  • Functions like rate(), increase(), and avg_over_time() combined with range vectors help predict problems before they become critical.

Are there approaches for capturing spikes with PromQL?
Yes, Prometheus offers several methods:

  • max_over_time() identifies the highest value in a time range.
  • quantile_over_time() finds outliers.
  • Comparing short-term and long-term rates helps detect sudden changes:
    • Example: rate(metric[1m]) / rate(metric[1h]) > 2
  • deriv() tracks how quickly a gauge is changing.
  • resets() identifies when counters restart (useful for detecting service restarts).

What are other use cases for Prometheus functions?
Beyond basic monitoring, Prometheus functions enable advanced use cases such as:

  • Capacity planning – predict_linear() forecasts resource needs.
  • SLO monitoring – histogram_quantile() tracks percentile-based performance.
  • Anomaly detection – Compare metrics to historical norms.
  • Cross-service correlation – Use label matching and aggregation.
  • Dynamic thresholds – Adjust alerting thresholds based on time-based functions.
  • Cost optimization – Track resource efficiency.
  • A/B testing analysis – Compare metrics across different service variants.

How do I use Prometheus functions to calculate rate of change?
Use the following functions:

  • For counter metrics (only increase):
    • rate(counter[5m]) – Calculates per-second average increase over 5 minutes.
    • irate(counter[5m]) – Calculates instant rate using the last two data points.
    • increase(counter[5m]) – Gets absolute increases over 5 minutes.
  • For gauge metrics (values can go up or down):
    • deriv(gauge[5m]) – Calculates the per-second derivative.

How do I apply Prometheus functions to aggregate data?
Prometheus excels at data aggregation using functions like:

  • sum, avg, min, max, count – Aggregate metrics across different dimensions.
  • The by clause to preserve specific labels:
    • sum(http_requests_total) by (status_code) groups requests by status code.
  • Combining aggregation with rate functions:
    • sum(rate(http_requests_total[5m])) by (service) – Shows request rate for each service.
  • Outlier detection:
    • topk(5, metric) – Finds the top 5 highest values.
    • bottomk(5, metric) – Finds the bottom 5 lowest values.
  • Percentile tracking for performance:
    • histogram_quantile(0.95, rate(metric_bucket[5m])) – Tracks the 95th percentile for request latencies.

These functions help you create powerful monitoring and alerting strategies for any infrastructure!

Authors
Preeti Dewani


Technical Product Manager at Last9
