Your Prometheus dashboard shows 847 CPU metrics. The alert fired—but is the problem in us-east or us-west? You're trying to rule out whether that new feature caused a latency spike, but the sheer number of time series isn’t helping.
Grouping can make this manageable. By organizing metrics by shared label values, you can quickly spot which service or region is behaving differently, without digging through every metric.
This blog walks through a few grouping patterns that make incident debugging faster and less chaotic.
Grouping and Aggregation in Prometheus
Here's the core problem: without grouping, you might have 1000+ individual time series. With proper grouping, you get 5-10 meaningful aggregations that help you make decisions.
Here's why this is important: 3 services × 10 instances × 2 regions = 60 individual time series to mentally process during an outage. Grouping reduces this to 3 service-level metrics you can act on.
Prometheus treats each unique combination of labels as a separate time series. When you group by specific labels, you're telling Prometheus which dimensions to preserve and which to collapse during aggregation.
# Without grouping - 50 individual time series
cpu_usage_percent{instance="web-1", service="api", region="us-east"}
cpu_usage_percent{instance="web-2", service="api", region="us-east"}
cpu_usage_percent{instance="web-3", service="api", region="us-west"}
# ... 47 more time series
# With grouping - 1 actionable metric per service
sum(cpu_usage_percent) by (service)
During an incident, you don't care about individual instance metrics. You need to know which service is having problems so you can route the alert to the right team.
Difference Between group by and sum by in Prometheus
In Prometheus, group by tells the system which labels to preserve when aggregating metrics. It doesn't perform aggregation itself; it just defines how data should be grouped.
On the other hand, sum by (...) combines the sum() aggregation function with group by. It adds up values across matching time series, keeping only the specified labels in the result.
The key difference:
- Aggregation answers: What happened?
- Grouping answers: Where did it happen?
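A minimal sketch of that difference, reusing the http_requests_total metric that appears in later examples:
# Aggregation only - collapses everything into a single number ("what happened?")
sum(rate(http_requests_total[5m]))
# Aggregation with grouping - one series per service ("where did it happen?")
sum(rate(http_requests_total[5m])) by (service)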
Service-Level Monitoring
One of the most practical uses of grouping is service-level monitoring. Most developers need answers to questions like: Is my service running? How much traffic is it getting? Where are errors showing up?
Here's how to move from basic checks to more detailed analysis.
Step 1: Check if your service is running
up{job="my-service"} == 1
This returns 1 if the service is up, and 0 if it's down.
Use this to track basic uptime or as part of a dependency check.
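If you'd rather see this as a percentage over a window instead of a point-in-time check, one option is a sketch like the following (assuming the same job label as above):
# Rough uptime percentage over the last hour
avg_over_time(up{job="my-service"}[1h]) * 100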
Step 2: Measure request volume per service
sum(rate(http_requests_total[5m])) by (service)
This shows how many requests per second each service is handling.
Useful for capacity planning, scaling decisions, or spotting uneven traffic distribution.
Step 3: Find where the errors are coming from
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, endpoint)
This helps identify which service and which endpoint is returning 5xx errors.
It’s often used during incident response or immediately after a deployment to spot regressions.
Keep an Eye on Time Series Count
Every label you add to a grouping increases the number of time series Prometheus has to store and process. For example:
- Grouping by service might return 3 time series
- Grouping by service, region might return 9
- Add environment, and you're up to 27
Start with just service, then add dimensions like region or environment only when you need that additional context.
For a reliable, low-noise overview of how your services are performing, this pattern works well:
sum(metric) by (service)
It gives you total usage or error counts by service without pulling in unnecessary detail.
Resource Aggregation
Here are a few practical queries that use sum by to answer operational questions:
Total CPU usage per service
sum(cpu_usage_percent) by (service)
Answers: Which service is consuming the most CPU?
Request rate per endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
Answers: Which endpoints are receiving the most traffic?
Memory usage per team
sum(memory_usage_bytes) by (team)
Answers: Which teams are using more than their allocated memory?
Each of these queries gives you high-level visibility. Without grouping, you'd have to manually look at or sum dozens of metrics.
How to Use group by in PromQL
The by clause in Prometheus comes right after an aggregation function. It tells Prometheus which labels to keep in the result, allowing you to group related time series, like by service, region, or environment.
This lets you move from individual metrics to more useful summaries that still carry enough context to debug or analyze patterns.
Start with the basic syntax
aggregation_function(metric_name) by (label1, label2)
This groups your data by the listed labels. For example, you might want total request counts grouped by service, or CPU usage grouped by region.
Add more labels when you need more context
sum(rate(http_requests_total[5m])) by (service, environment, region)
This query gives you per-service request rates, broken down by environment and region. It’s useful when the same service runs in multiple places, and you need to compare behavior across them.
Remove labels that add noise
sum(cpu_usage_percent) without (instance)
In some cases, labels like instance just create too many time series without adding much value. Using without helps simplify your charts by excluding labels you don't care about, while still aggregating the data meaningfully.
Control Query Complexity with Progressive Grouping
Start simple and add label dimensions only when the extra detail is necessary. This helps keep your queries fast and your dashboards readable.
Grouping by service (baseline)
sum(rate(http_requests_total[5m])) by (service)
Gives you a per-service traffic view—ideal as a default for most monitoring dashboards.
Grouping by service and environment
sum(rate(http_requests_total[5m])) by (service, environment)
Separates traffic by environment (e.g., prod, staging). Useful when services share names across environments.
Grouping by service, environment, and region
sum(rate(http_requests_total[5m])) by (service, environment, region)
Adds regional visibility. Helps identify geographic latency, error patterns, or uneven load distribution.
Understand the Cost of Extra Labels
Each added label multiplies your time series count. That impacts performance, storage, and readability.
Example:
3 services × 2 environments × 3 regions = 18 time series
Just grouping by service = 3 time series
If your queries are slow or your dashboards feel noisy, this is often the reason. Start with the coarsest view that answers your question, and refine only when needed.
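One way to sanity-check the cost before a query lands on a dashboard is to count the series the grouping produces. This is a sketch assuming the http_requests_total metric used above:
# How many series would this grouping produce?
count(sum(rate(http_requests_total[5m])) by (service, environment, region))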
If you want a refresher on rate(), increase(), and quantile(), this blog covers it well.
Grouping Patterns That Scale in Production
These patterns are commonly used in production setups with dozens or even hundreds of services. They help you spot issues faster and make better scaling decisions, without drowning in raw metrics.
Service-level error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
> 0.05
Highlights services with a 5xx error rate above 5% over the last 5 minutes. Useful for alerting and quick post-deploy checks.
Top CPU-consuming services
topk(10, sum(cpu_usage_percent) by (service))
Helps you identify which services are consistently using the most CPU, ideal for capacity planning or autoscaling decisions.
Regional latency breakdown (P95)
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (region, le)
)
Calculates the 95th percentile request latency per region. Useful when debugging slow user experiences or comparing performance across data centers.
Troubleshoot Group By Issues
When grouped queries return empty results or unexpected values, debug them step by step.
Step 1: Make sure the metric exists
{__name__=~"http_requests_total"}
Step 2: Verify label values
count by (service) ({__name__=~"http_requests_total"})
This tells you which service values are available and confirms whether your label is spelled correctly.
Step 3: Build the query incrementally
Start with the basic aggregation, then add grouping step by step:
sum(rate(http_requests_total[5m]))
sum(rate(http_requests_total[5m])) by (service)
In production setups, it's best practice to always include the environment label in your grouping. It keeps staging and production data separate, especially when multiple environments share service names.
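As a sketch of that practice (the environment label and the "production" value are assumptions about your setup), keep the label available and filter when you only care about one environment:
# Per-service traffic, production only (label value is an assumption)
sum(rate(http_requests_total{environment="production"}[5m])) by (service)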
group_left and group_right in Prometheus
Sometimes, you need to enrich your metrics by joining two sets of time series. For example, your core metrics may not include team ownership, version tags, or deployment info, but you have that context available in a separate metric.
This is where group_left and group_right help you.
When You Need Metric Joining
Imagine this situation:
http_requests_total{service="api", instance="web-1"}
This metric shows requests, but it doesn’t include useful context like the owning team or the deployed version.
You have that metadata elsewhere:
service_info{service="api", team="backend", version="v1.2.3"}
To combine these two, you need a Prometheus vector matching operation.
How group_left Works
Use group_left when the left-hand side of the query has more time series than the right.
For example:
- The left side has multiple instance metrics per service
- The right side has just one service_info entry per service
sum(rate(http_requests_total[5m])) by (service)
* on(service) group_left(team, version)
service_info
This joins the request rate with service_info, attaching team and version labels to the aggregated request data. The final result includes both performance data and metadata, useful for filtering or breaking down metrics by ownership.
When to Use group_right
Use group_right when the right-hand side has more time series.
For example:
- Left side: a single metadata record per service
- Right side: many metrics with instance-level detail
service_info{service="api", team="backend"}
* on(service) group_right(team)
sum(rate(http_requests_total[5m])) by (service, instance, pod)
This keeps the more detailed series on the right (with instance and pod labels), and adds in the team label from the metadata on the left.
Full Join Example:
Here’s a use case where you want to understand if recent deployments are causing increased error rates.
First, calculate 5xx error rates per service:
(
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
)
Then, add deployment metadata:
* on(service) group_left(deploy_time, version)
deployment_info
This final query gives you error rates annotated with version and deploy time, so you can correlate issues with recent releases.
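Put together, the full query looks roughly like this (assuming deployment_info is an info-style metric with a value of 1, so the multiplication leaves the error rate unchanged):
# Error rate per service, annotated with deployment metadata
(
  sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
  /
  sum(rate(http_requests_total[5m])) by (service)
)
* on(service) group_left(deploy_time, version)
deployment_info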
If you're deciding between a counter or a gauge for your metrics, this explanation of Prometheus gauges vs counters offers practical guidance.
Limitations of Prometheus Labels
In Prometheus, each unique combination of metric labels creates a separate time series. This is manageable in development, but can quickly become a problem at production scale.
Let's say you add a user_id label to track per-user metrics:
sum(rate(user_requests_total[5m])) by (user_id)
With 100,000 users, that’s 100,000 time series—just for one metric.
What this means in practice:
- The query might take 45 seconds to return
- It could use 8GB+ of memory
- It’s likely to fail or timeout during high-traffic incidents
Instead, aggregate by a lower-cardinality label that still gives useful insights:
sum(rate(user_requests_total[5m])) by (user_tier)
This reduces load dramatically while still giving you actionable data.
Label Constraints That Impact Performance
Naming Rules
- Label names must match [a-zA-Z_][a-zA-Z0-9_]*
- Labels starting with __ are reserved for internal Prometheus use
- Labels are case-sensitive (Service ≠ service)
Performance Guidelines
- Keep total active time series per Prometheus instance under 10 million
- Avoid long label values; they increase memory usage even if you group smartly
Watch for growth using:
prometheus_tsdb_symbol_table_size_bytes
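If Prometheus scrapes itself (the default in most setups), you can also watch the active series count directly:
# Active time series currently in the TSDB head
prometheus_tsdb_head_series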
Manage High-Cardinality Labels in Practice
Here are strategies used in production environments to reduce cardinality without losing too much visibility.
Use label_replace() to bucket values
Instead of tracking exact HTTP status codes (e.g., 200, 201, 404), group them into classes:
label_replace(
http_requests_total,
"status_class",
"${1}xx",
"status",
"([0-9])[0-9][0-9]"
)
Now you can group by status_class (2xx, 4xx, 5xx), which reduces the time series and still helps during debugging.
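As a sketch, the grouped view built on the snippet above might look like this:
# Request rate by status class (2xx, 4xx, 5xx)
sum by (status_class) (
  label_replace(
    rate(http_requests_total[5m]),
    "status_class",
    "${1}xx",
    "status",
    "([0-9])[0-9][0-9]"
  )
)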
Drop unnecessary labels at query time
Labels like request_id or trace_id change on every request and don't help at the aggregation level. Use without() to strip them:
sum(rate(http_requests_total[5m])) without (request_id, trace_id)
This keeps your queries efficient while focusing only on relevant dimensions.
Use topk() to limit results
When you do need high-cardinality insights (like per-user breakdowns), use topk() to avoid pulling in the full dataset:
topk(20, sum(rate(http_requests_total[5m])) by (user_id))
This gives you the top 20 users by traffic volume, without querying all 100k.
The Role of Job Label in Prometheus
The job label identifies the type of target being scraped. It's your first line of organization in Prometheus, automatically added based on your scrape configuration.
# prometheus.yml
scrape_configs:
- job_name: 'web-servers'
static_configs:
- targets: ['web-1:8080', 'web-2:8080']
- job_name: 'databases'
static_configs:
- targets: ['db-1:9100', 'db-2:9100']
Every metric gets a job label that tells you what type of service it came from.
Job-Based Monitoring Patterns
# Monitor scrape success by job type (answers: "Is my monitoring working?")
(sum(up) by (job) / count by (job) (up)) * 100
# Resource usage by service type (answers: "Which service type uses most resources?")
sum(rate(cpu_usage_total[5m])) by (job)
# Alert on critical service failures
up{job="critical-service"} == 0
Integration with Grafana
Job labels work perfectly for Grafana dashboard organization:
# Dashboard variable query
label_values(up, job)
# Dynamic dashboard based on job selection
sum(rate(http_requests_total{job="$job"}[5m])) by (instance)
Kubernetes Job Label Example
For Kubernetes environments, job labels typically map to service types:
# Pod CPU usage by service type
sum(rate(container_cpu_usage_seconds_total[5m])) by (job)
# Network traffic by service type
sum(rate(container_network_receive_bytes_total[5m])) by (job)
# Memory usage across all pods of a job
sum(container_memory_working_set_bytes) by (job)
In Kubernetes, use job names that match your service architecture. Common patterns: web-frontend, api-backend, data-processing.
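With job names like those, the earlier queries narrow down naturally; for example (web-frontend is just the hypothetical pattern above):
# CPU usage per pod for one job (hypothetical job name)
sum(rate(container_cpu_usage_seconds_total{job="web-frontend"}[5m])) by (pod)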
A Few Starter Queries
Essential Service Monitoring
# Service request rate (requests/second)
sum(rate(http_requests_total[5m])) by (service)
# Service error rate (percentage)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) /
sum(rate(http_requests_total[5m])) by (service) * 100
# Service uptime (percentage)
avg(up) by (service) * 100
# Top 10 services by CPU usage
topk(10, sum(cpu_usage_percent) by (service))
Resource Monitoring
# Memory usage by service
sum(memory_usage_bytes) by (service)
# Network traffic by service
sum(rate(network_bytes_total[5m])) by (service)
# Disk I/O by service
sum(rate(disk_io_bytes_total[5m])) by (service)
Performance Monitoring
# 95th percentile response time by service
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le)
)
# Request latency by endpoint
histogram_quantile(0.50,
sum(rate(http_request_duration_seconds_bucket[5m])) by (endpoint, le)
)
Wrapping Up
Grouping metrics is one of the most powerful tools Prometheus gives you, but as systems grow, it also becomes one of the trickiest to get right. High-cardinality labels, complex query logic, and noisy dashboards can easily slow you down just when you need answers the most.
If this feels familiar, you don't have to patch it all manually. Last9, an OTel- and Prometheus-compatible telemetry data platform, is built to handle these issues.
It offers a Cardinality Explorer to help you understand what's happening, Streaming Aggregation to keep metrics manageable, and PromQL macros to reduce repetitive query work. With built-in support for LogQL, TraceQL, and Prometheus remote write, you can bring everything together without reworking your setup.
If you want to understand how this could work in your environment, book some time with us – we'll be happy to walk you through it.
FAQs
What is the difference between group by and sum by in Prometheus?
group by defines which labels to keep when aggregating data. sum by combines the sum() aggregation function with group by to add up matching time series while retaining only the specified labels.
How to use group by in PromQL?
Place the by clause after an aggregation function like sum(), avg(), or count(). For example:
sum(http_requests_total) by (service, region)
This aggregates the metric and groups results by service and region.
What is the difference between group_left and group_right in Prometheus?
group_left is used when the left side of a vector match has more time series than the right. group_right is for the opposite case, when the right side has more time series. These are used to join metrics that have different cardinalities, such as pairing metrics with metadata.
What are the limitations of Prometheus labels?
Labels with high cardinality, like user_id or request_id, can generate too many time series, slowing down queries and increasing memory usage. Label names must follow specific naming rules, can't start with __, and are case-sensitive.
What is the job label in Prometheus?
The job label identifies the target being scraped. It's automatically assigned by Prometheus and helps group metrics by application, service, or source as defined in the scrape configuration.
How do I write a Prometheus query that returns the value of a label?
Use Grafana's label_values(metric_name, label_name) in dashboard variables, or aggregate in PromQL with count by (label). For example:
count by (service) (http_requests_total)
This returns the number of time series per service.
Does a metric with colons become a string?
No. Colons are allowed characters in metric names. They are often used for namespacing or naming conventions, but don’t change the type or behavior of the metric.
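For instance, a recording rule conventionally produces a colon-separated name, which you query like any other metric (the name below is hypothetical):
# Querying a recording-rule-style metric name
service:http_requests:rate5m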
What are the Prometheus queries to monitor Kubernetes pod CPU and memory?
For CPU:
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
For memory:
sum(container_memory_working_set_bytes) by (pod)
These track per-pod CPU usage rate and active memory consumption.
How can I group labels in a Prometheus query?
Use by (label1, label2) to retain specific labels or without (label) to exclude one or more. For example:
sum(cpu_usage) by (region)
sum(memory_usage) without (instance)
Why count unique label values?
Counting unique values helps measure cardinality, which is important for performance. Use queries like:
count by (label_name) (metric_name)
to see how many unique label values exist.
How do I use the "group by" function with labels in Prometheus queries?
Apply by (label) after an aggregation function. For example:
avg(cpu_usage_seconds_total) by (service)
This gives the average CPU usage grouped by service.
How do I use the "group by" function to aggregate metrics by label in Prometheus?
Combine aggregation and grouping like this:
sum(rate(http_requests_total[5m])) by (service)
This gives the request rate per service by summing over time.
How do I use label grouping to aggregate metrics in Prometheus?
Use sum(metric) by (label) to retain only those labels, or sum(metric) without (label) to drop noisy or irrelevant ones. This controls how time series are grouped in the final result.
How does grouping by labels affect data aggregation in Prometheus?
Grouping controls how time series are combined. Each unique combination of grouped label values results in a separate line in your aggregated output, so careful selection of labels is key to both performance and readability.