In monitoring setups, working with a single metric rarely tells the complete story. The real power of Prometheus lies in its ability to query multiple metrics simultaneously, creating connections between different data points that reveal the true state of your systems.
This guide will walk you through everything you need to know about crafting effective multi-metric queries in Prometheus – from basic concepts to advanced techniques that will help you monitor and troubleshoot your infrastructure.
The Business Value of Combining Multiple Prometheus Metrics
When you query multiple metrics in Prometheus, you unlock a new level of observability. Instead of looking at CPU usage, memory consumption, or request latency in isolation, you can correlate these metrics to understand how they affect each other.
For example, a spike in memory usage might coincide with increased latency. By querying these metrics together, you can quickly identify the relationship and take appropriate action.
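For instance, a quick correlation sketch (assuming node_exporter memory metrics and a request-duration histogram named http_request_duration_seconds — adjust the names to your setup) surfaces average latency only on instances that are also short on memory:
# hypothetical metric names; shows latency only where available memory is below 10%
(rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])) and on(instance) (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1)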
The benefits include:
- Faster root cause analysis
- More accurate alerting with fewer false positives
- Better capacity planning
- Clearer visualization of system behaviors
- Enhanced troubleshooting capabilities
Essential PromQL Syntax for Combining Multiple Metrics
Let's start with the fundamentals. Prometheus uses PromQL (Prometheus Query Language) to query metrics. When working with multiple metrics, you'll need to understand these basic operations:
Using Arithmetic Operators to Transform Metrics
You can use basic arithmetic operators (+, -, *, /, %, ^) to combine metrics and create meaningful relationships:
sum(http_requests_total{status="200"}) / sum(http_requests_total)
This query calculates the ratio of successful requests to total requests — an at-a-glance success rate. The sum() calls collapse each side to a single value so the division lines up; dividing the two raw selectors directly would only match series whose labels are identical on both sides.
Understanding Vector Matching Rules for Metric Correlation
When combining metrics, Prometheus needs to know how to match time series. There are two main types:
- One-to-one matching: Each time series from the left side matches with exactly one time series from the right side.
- Many-to-one/one-to-many matching: One side can match with multiple time series from the other side.
For example:
sum by (instance) (http_requests_total{job="api-server"}) / on(instance) sum by (instance) (node_cpu_seconds_total{mode="idle"})
This query divides each instance's total HTTP requests by its total idle CPU seconds, matching the two sides only on the instance label — a rough measure of how much work each instance does relative to its spare capacity.
Practical Multi-Metric Query Patterns for Daily Operations
Now that you understand the basics, let's look at common patterns you'll use daily:
Calculating Error Rates and Success Ratios for Service Reliability
Ratios often provide more meaningful insights than raw numbers:
sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
This shows the 5xx error rate as a share of total requests, giving you an immediate view of service health.
Measuring Application Resource Utilization Across System Boundaries
process_resident_memory_bytes{job="app"} / on(instance) node_memory_MemTotal_bytes{job="node"}
This calculates the percentage of total system memory used by your application, helping you identify resource hogs.
Building Composite Health Scores for Holistic System Monitoring
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 0.5 + on(instance) (sum by (instance) (node_filesystem_avail_bytes) / sum by (instance) (node_filesystem_size_bytes)) * 0.5
This creates a weighted health score based on available memory and disk space, providing a single metric for overall system health.
Advanced Correlation Techniques for Complex System Analysis
Ready to level up? These advanced techniques will help you get even more from your Prometheus queries:
Controlling Vector Matching with Binary Operators for Precise Metric Correlation
Control how time series are matched using modifiers:
http_requests_total{job="api", method="GET"} / ignoring(method) sum without (method) (http_requests_total{job="api"})
The `ignoring(method)` modifier tells Prometheus to leave the `method` label out when matching, so each GET series lines up with the method-agnostic total for the same handler and instance — here giving GET's share of overall traffic and allowing broader comparisons.
Similarly, the `on` modifier specifies which labels must match:
process_cpu_seconds_total{job="app"} / on(instance) sum by (instance) (node_cpu_seconds_total{mode="idle"})
This matches series only when the `instance` label is identical, ensuring you're comparing metrics from the same source.
Managing Many-to-One Relationships with Group Modifiers for Hierarchical Data
When working with many-to-one or one-to-many relationships, you need the `group_left` or `group_right` modifiers:
sum by (job, status) (rate(http_requests_total{status=~"5.."}[5m])) / on(job) group_left sum by (job) (rate(http_requests_total[5m]))
This query calculates the error rate for each 5xx status code within each job. `group_left` handles the many-to-one relationship: the left side has several series per job (one per status code), all dividing by the single per-job total on the right.
Here's another example with `group_right`:
sum by (instance) (node_filesystem_avail_bytes) / on(instance) group_right node_filesystem_size_bytes
This compares each node's total available space against the size of every individual filesystem on it; the per-filesystem series sit on the right-hand side, so `group_right` is required. The difference between `group_left` and `group_right` is simply which side of the operation has the "many" series.
Implementing Dynamic Thresholds with Cross-Metric Functions
clamp_max(rate(http_requests_total[5m]), scalar(http_requests_limit))
This caps the observed request rate at a threshold that is itself a metric. Because clamp_max() expects a scalar as its upper bound, scalar() converts the (single-series) http_requests_limit gauge, creating adaptive boundaries instead of hard-coded ones.
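If the limit is exposed per instance rather than as one global series, a comparison operator gives a per-series dynamic threshold instead — a sketch, still assuming the hypothetical http_requests_limit gauge, now labelled by instance:
# returns only the series that exceed their own instance's limit
rate(http_requests_total[5m]) > on(instance) group_left http_requests_limit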
Cross-Referencing System States with Label-Based Metric Joining
When you need to combine metrics that share common labels:
node_memory_Active_bytes{instance="host-01"} and on(instance) node_cpu_seconds_total{mode="user", instance="host-01"}
This returns data only when both memory and CPU metrics exist for the same instance, helping identify correlated resource usage.
Production-Ready Multi-Metric Queries for DevOps and SRE Teams
Theory is great, but let's see how this works in practice with real-world examples:
Measuring User Experience with Service Level Indicator (SLI) Calculations
sum(rate(http_requests_total{status=~"2.."}[5m])) / sum(rate(http_requests_total[5m]))
This calculates the success rate, a common SLI, giving you a direct measurement of service quality as experienced by users.
Cross-Environment Performance Comparison for Deployment Validation
sum by (environment) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (environment) (rate(http_requests_total[5m]))
This query compares error rates across different environments (dev, staging, production), helping identify issues before they reach production.
Container Resource Efficiency Analysis for Cost Optimization
sum by (pod) (rate(container_cpu_usage_seconds_total[5m])) / sum by (pod) (container_spec_cpu_quota / container_spec_cpu_period)
This shows how much of its CPU allocation each pod actually uses. The quota is expressed in units of the scheduler period, so dividing it by container_spec_cpu_period converts the limit into cores before comparing it with measured usage — a quick way to spot over-provisioned workloads and potential cost savings.
User Satisfaction Measurement with Apdex Score Implementation
(sum(rate(http_requests_bucket{le="0.3"}[5m])) by (service) + sum(rate(http_requests_bucket{le="1.2"}[5m])) by (service)) / 2 / sum(rate(http_requests_count[5m])) by (service)
This calculates an Apdex score: (satisfied + tolerating/2) / total. Because histogram buckets are cumulative, the le="1.2" bucket already contains the satisfied requests, so adding the two buckets and halving the sum yields satisfied plus half of the tolerating requests — a standardized measurement of user satisfaction.
Performance-Resource Correlation for Bottleneck Identification
rate(application_request_duration_seconds_sum[5m]) / rate(application_request_duration_seconds_count[5m]) / on(instance) group_left node_memory_utilization
This helps identify if memory usage affects request duration, pinpointing potential resource bottlenecks affecting application performance.
Performance Optimization Strategies for Complex Multi-Metric Queries
Working with multiple metrics can impact Prometheus performance. Here's how to keep your queries efficient:
Cardinality Management Techniques for Lower Resource Consumption
sum(rate(http_requests_total[5m])) by (service, endpoint)
Instead of keeping all labels, focus on what's important for your analysis, drastically reducing the number of time series processed and stored.
Implementing Recording Rules for Frequently Used Calculations
Recording rules pre-compute expensive queries:
- record: job:http_requests_total:rate5m
  expr: sum(rate(http_requests_total[5m])) by (job)
This makes it faster to use this calculation in multiple places, reducing computation overhead and improving dashboard performance.
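With the rule in place, dashboards and alerts can reference the pre-computed series by its new name instead of repeating the aggregation (the job value below is just an example):
job:http_requests_total:rate5m{job="api-server"}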
Time Window Optimization for Query Efficiency and Memory Usage
Shorter time windows mean less data to process:
sum(rate(http_requests_total{status="200"}[1m])) / sum(rate(http_requests_total[1m]))
Using 1m instead of 5m or 15m reduces the computational load by processing fewer data points while still providing meaningful results for most use cases — just make sure the window still covers at least a couple of scrape intervals, or rate() will return nothing.
Troubleshooting Guide for Common Multi-Metric Query Challenges
Even experienced SREs encounter issues with multi-metric queries. Here's how to solve them:
Diagnosing and Resolving Empty Query Results
If your query returns no data, check:
- Whether both metrics exist in the time range you're querying
- If label matching is preventing combinations
- Whether rate() functions have enough data points
Try this diagnostic approach — run a count() over each metric separately:
count(metric1)
count(metric2)
If one returns data and the other doesn't, you've found your issue and can focus on the missing metric.
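If both metrics exist, compare their label sets next: listing the values of the intended join label on each side quickly shows whether a mismatch is blocking the match. A sketch, using instance as the join label:
# replace instance with whatever label the two metrics are supposed to share
group by (instance) (metric1)
group by (instance) (metric2)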
Fixing Anomalies in Ratio and Division Operations
When dividing metrics, watch out for:
- Division by zero (use `or vector(0)` to handle this)
- Missing labels causing incorrect matching
- Different recording frequencies between metrics
Implement this safe division pattern, which drops series whose denominator is zero instead of returning Inf:
metric1 / (metric2 > 0 or vector(0))
Preventing Memory Exhaustion with High-Cardinality Metrics
Complex queries with high cardinality can cause OOM errors:
- Reduce the number of time series with aggregation
- Limit the time range queried
- Use recording rules for frequent queries
These strategies ensure your Prometheus instance remains stable even when processing complex correlation queries.
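For instance, a recording rule that pre-aggregates a busy metric down to the labels you actually chart keeps that cost out of query time — a sketch following the same pattern as the earlier rule (the rule name is just a naming convention):
# hypothetical rule name; aggregates away high-cardinality labels up front
- record: service:http_requests_total:rate5m
  expr: sum by (service, endpoint) (rate(http_requests_total[5m]))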
Specialized PromQL Functions for Advanced Multi-Metric Analysis
Prometheus offers powerful functions specifically designed for working with multiple metrics:
Boolean Logic Operations for Service Dependency Mapping
These logical operators help combine or filter metrics:
up{job="api"} and on(instance) up{job="database"}
This returns instances where both API and database services are running, helping identify critical infrastructure dependencies.
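The `unless` operator is the complement and is handy for spotting broken dependencies — for example, instances where the API is running but its database is not (same up metrics as above):
# instances where the API is up but no database is up on the same instance
up{job="api"} unless on(instance) up{job="database"}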
Service Availability Monitoring with Absence Detection Functions
absent(up{job="critical-service"})
This alerts when a critical metric disappears completely, providing immediate notification of service outages or monitoring gaps.
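In practice, this expression usually lives in an alerting rule so the notification fires on its own; a minimal sketch, with the alert name and duration as placeholders:
# placeholder alert name and for: duration — tune to your paging policy
- alert: CriticalServiceMissing
  expr: absent(up{job="critical-service"})
  for: 5m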
Latency-Resource Correlation for Performance Root Cause Analysis
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)) / on(service) group_left node_cpu_utilization
This correlates 95th percentile request durations with CPU utilization, helping identify whether resource constraints are causing performance issues.
How to Create Effective Multi-Metric Dashboards for Operational Visibility
Querying multiple metrics is most powerful when visualized together. Here's how to build an effective dashboard:
Service Health Visualization Through Metric Grouping
Put these panels together to create a complete service health view:
- Request rate (traffic volume)
- Error rate (service reliability)
- Latency (user experience)
- Resource utilization (infrastructure health)
This grouping provides a comprehensive view of service health from both user and system perspectives.
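For the latency panel in particular, a percentile taken from your duration histogram usually reflects user experience better than an average — a sketch assuming a http_request_duration_seconds histogram:
# p95 latency across all handlers; assumes a http_request_duration_seconds histogram
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))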
Ensuring Data Consistency with Synchronized Time Windows
Ensure all queries use the same time window for accurate correlation:
rate(http_requests_total[5m])
rate(errors_total[5m])
rate(duration_seconds_sum[5m]) / rate(duration_seconds_count[5m])
Consistent time windows prevent misleading correlations caused by temporal misalignment.
Displaying Key Performance Indicators with Ratio-Based Panels
Add panels that directly display relationships:
sum(rate(errors_total[5m])) / sum(rate(http_requests_total[5m]))
These ratio panels provide immediate insight into service health without requiring mental calculations.
Enhance Observability with Last9 - Unified Telemetry Platform
If you're juggling multiple Prometheus queries and finding it challenging to correlate metrics effectively, Last9 offers a streamlined solution. As a telemetry data platform built for high-cardinality observability, we make working with multiple metrics more intuitive.
Last9 brings together metrics, logs, and traces from your existing Prometheus setup, creating a unified view that makes correlation immediate and actionable. Our platform handles the heavy lifting of connecting data points, so you don't need to write complex PromQL for every analysis.
We've successfully monitored some of the largest live-streaming events in history, proving our capability to handle extreme observability demands without compromise.
Talk to us to learn more about the platform's capabilities, or get started for free.
FAQs
How do I combine metrics with different label structures in Prometheus?
Use the `ignoring()` or `on()` modifiers to specify which labels should be considered for matching:
metric1 / ignoring(label_to_ignore) metric2
This selective matching allows you to work with metrics that have partially overlapping label sets.
What's the difference between `group_left` and `group_right` modifiers for many-to-one relationships?
These modifiers indicate which side of the operation has multiple time series that can match with a single time series on the other side:
- `group_left`: The right-side vector has one series that matches multiple series on the left side
- `group_right`: The left-side vector has one series that matches multiple series on the right side
Example with `group_left`:
node_cpu_seconds_total{mode="user"} / on(instance) group_left node_num_cpus
Here, each instance has multiple CPU metrics (one per core/mode) but only one value for the total number of CPUs.
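If your exporter doesn't expose a convenient CPU-count series like node_num_cpus, you can derive one from the per-core idle series instead — a common workaround rather than a standard metric:
# count by (instance) of the per-core idle series gives the number of CPUs per instance
node_cpu_seconds_total{mode="user"} / on(instance) group_left count by (instance) (node_cpu_seconds_total{mode="idle"})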
How can I compare current metrics with historical values for trend analysis?
Use the offset modifier to query the same metric over an earlier time range:
sum(rate(http_requests_total[5m])) / sum(rate(http_requests_total[5m] offset 1d))
This compares current traffic with traffic from 1 day ago, allowing you to detect anomalies or validate improvements.
What techniques exist for joining metrics with completely different label schemas?
You need to transform the metrics to introduce common labels:
label_replace(metric1, "new_label", "$1", "old_label", "(.+)") / on(new_label) metric2
This extracts values from `old_label` in metric1, creates a new label called `new_label`, and then joins with metric2 on that label, allowing correlation between otherwise incompatible metrics.
How can I normalize metrics with different units and scales for valid comparison?
Use multiplication or division to normalize metrics to comparable scales:
(node_memory_used_bytes / 1024 / 1024) / (node_cpu_usage_percent)
This technique allows you to create meaningful ratios between metrics with different units, such as comparing memory usage to CPU utilization.
What's the most effective approach for quantifying correlation between metrics?
For simple correlation, plot both metrics on the same graph. For numerical correlation, you can use this approach:
count(
  (sum by (instance) (rate(http_requests_total[5m])) > 10)
  and
  (max by (instance) (node_cpu_usage_percent) > 80)
)
/ count(sum by (instance) (rate(http_requests_total[5m])) > 10)
This gives you the percentage of instances where high request rates coincide with high CPU usage, providing a statistical measure of correlation.
How do I implement a multi-dimensional health score that combines various system metrics?
Normalize each metric to a 0-1 scale and combine them with weights based on their importance:
(1 - (node_memory_used_bytes / node_memory_total_bytes)) * 0.5 +
(1 - (node_disk_used_bytes / node_disk_total_bytes)) * 0.3 +
(1 - max_over_time(node_cpu_usage_percent[5m])/100) * 0.2
This weighted approach creates a comprehensive health score that reflects your system's priorities.