When developers monitor application performance, they pick one of two paths: traditional APM tools with distributed tracing and code profilers, or metrics-driven monitoring with Prometheus. The second approach — Prometheus APM — tracks the signals that matter most: request rates, error rates, latency, and resource utilization. No agents to install, no per-host pricing, just exporters and PromQL.
For most teams, Prometheus APM is where monitoring starts. You instrument your services, expose /metrics endpoints, and build dashboards around the RED method. It's lightweight, runs anywhere, and gives you performance visibility without the overhead of traditional APM.
In the beginning, that's enough — one cluster, a handful of exporters, and a couple of dashboards to watch traffic and errors.
As your systems grow, the experience changes. You add more services, more labels, and more alerts, and suddenly Prometheus feels heavier. Queries that used to return instantly now drag, storage balloons with high-cardinality metrics, and your alert channels fill with noise.
The challenge isn't getting metrics into Prometheus — it's keeping Prometheus fast, cost-efficient, and reliable as you scale.
In our previous post on Prometheus APM, we covered how to set up Prometheus for application performance monitoring — from installing exporters and writing your first PromQL queries to building dashboards and setting up basic alerts.
This post picks up where that left off. We'll cover the common issues of running Prometheus APM at scale, strategies you can use to keep it manageable in production, and how Last9 extends Prometheus and Grafana with service discovery, cost controls, and unified telemetry for dynamic environments.
Why Scaling Prometheus APM Matters
With Prometheus, you can start small — a single app, a few exporters, and a Grafana dashboard give you a clear picture of traffic, errors, and infrastructure health. It's lightweight, simple to set up, and helps you see results right away.
The setup that worked fine for a handful of services starts to feel stretched when services expand. More microservices mean more metrics and more labels, and that changes the workload Prometheus has to handle. You'll notice:
- Targets cycling in and out as containers spin up and down in clusters
- Queries taking longer as Grafana panels process millions of series
- Storage demands growing with high-cardinality labels
- Alerts piling up faster than teams can respond
The Challenges of Scaling Prometheus APM
Cardinality Growth
Adding labels increases the number of unique time series Prometheus must track. A label like user_id quickly multiplies into millions of series in a busy system. This drives up memory usage and makes queries slower, since every series must be evaluated.
Here's what happens with cardinality:
http_requests_total{method="GET", path="/api/users"} # 1 series
http_requests_total{method="GET", path="/api/users", user_id="123"} # Now thousands or millions
Each unique combination of labels creates a new time series. In a system with 100,000 active users and 50 endpoints, you're looking at 5 million series from a single metric.
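If a high-cardinality label has already crept into your instrumentation, you can strip it at scrape time before it reaches storage. Here's a minimal sketch using metric_relabel_configs (the job name is a placeholder):
scrape_configs:
  - job_name: 'api-service'          # placeholder job name
    metric_relabel_configs:
      # Drop the user_id label from every scraped series before ingestion
      - action: labeldrop
        regex: user_id
The durable fix is removing the label from the instrumentation itself, but labeldrop buys time while that change ships.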
Query Performance
PromQL queries that run quickly in small environments become expensive at scale. Aggregations across hundreds of thousands of series take time, and Grafana dashboards that once loaded instantly start to lag or time out.
A query like this:
rate(http_requests_total[5m])
It works fine with 1,000 series. At 100,000 series, it starts to slow. At 1 million series, you're waiting 10+ seconds for a dashboard to load.
Storage and Retention
Prometheus stores all samples on the local disk. With frequent scrapes and long retention, data grows fast — hundreds of gigabytes for large environments. Storage pressure affects more than disk space; it also slows down compactions and restarts.
A typical calculation:
- 1 million active series
- 15-second scrape interval (4 samples/minute)
- 2 bytes per sample
- 30-day retention
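Multiplying that out: 1,000,000 series × 4 samples/minute × 60 × 24 × 30 days is about 172.8 billion samples, at roughly 2 bytes each.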
That's roughly 345 GB of storage, and long-range queries still have to read and decompress those samples, which is when dashboards start to crawl.
Alert Fatigue
As you monitor more services, alert rules expand too. Separate alerts for every pod restart or target outage flood channels with noise. Teams spend time chasing symptoms instead of seeing the bigger issue, and critical signals risk being overlooked.
You might see 50 alerts fire when a single deployment goes wrong — each pod restarting triggers an alert, but the root cause is one bad config.
Dynamic Infrastructure
In cloud and Kubernetes setups, pods and hosts are constantly starting and stopping. Static scrape configs can't keep up, which leads to targets flipping between UP and DOWN, or missing entirely if they aren't updated in time.
Kubernetes service discovery helps, but you still need to manage scrape configs for each namespace, label selector, and service type.
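As a rough sketch, annotation-driven discovery keeps the scrape config compact even as pods churn. The annotation names below follow the common prometheus.io convention and are an assumption about your setup:
scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Scrape only pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        regex: 'true'
        action: keep
      # Carry the namespace through as a queryable label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace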
5 Strategies for Scaling Prometheus APM
Instead of a full rearchitecture, a few smart adjustments can help you keep performance steady and dashboards responsive.
1. Precompute with Recording Rules
Some queries — like percentile calculations or complex aggregations — are expensive to run repeatedly. Recording rules help you store the output of these queries as new metrics, so you're not recalculating every time Grafana loads a panel.
For example:
groups:
  - name: api_performance
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, status)
Instead of running histogram_quantile() on raw latency histograms for every query, you record pre-aggregated latency metrics. Create a rule that summarizes rate(http_requests_total[5m]) by service or region, so you can use it instantly in dashboards.
This approach reduces query load, speeds up visualizations, and keeps performance predictable even as your dataset expands.
Recording rules add storage overhead (you're creating new series), but they save far more in query time. Balance this by only precomputing queries you run frequently.
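One practical note: recording rules live in a separate file that Prometheus loads via rule_files in prometheus.yml. The path below is just an example:
# prometheus.yml
rule_files:
  - /etc/prometheus/rules/api_performance.yml   # file containing the groups above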
2. Balance Load with Federation
When a single Prometheus server starts to handle too many targets, federation helps you distribute the work.
# Central Prometheus scrapes from regional instances
scrape_configs:
  - job_name: 'federate-us-east'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="api-service"}'
        - '{job="web-frontend"}'
    static_configs:
      - targets:
          - 'prometheus-us-east:9090'

  - job_name: 'federate-eu-west'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="api-service"}'
        - '{job="web-frontend"}'
    static_configs:
      - targets:
          - 'prometheus-eu-west:9090'
Run separate Prometheus instances per team, environment, or cluster. Use a central Prometheus to pull summarized metrics from those instances.
This setup lets each team operate independently while maintaining a unified top-level view for leadership or SREs. It's a clean way to scale horizontally without overloading a single instance.
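One caveat: federating raw series simply moves the cardinality problem to the central server. A common refinement is to federate only pre-aggregated series, such as the recording-rule outputs shown earlier. The name prefixes below assume that naming scheme:
scrape_configs:
  - job_name: 'federate-us-east'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"service:.*|api:.*"}'   # only pre-aggregated recording-rule series
    static_configs:
      - targets: ['prometheus-us-east:9090']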
Federation makes sense when you have 5+ clusters, when separate teams manage their own services, or when you need regional isolation for compliance.
3. Use Remote Storage for Historical Data
Prometheus is great for recent metrics, but storing months of data locally can strain disk and query performance. Remote write solves this by offloading older samples to long-term storage backends like Thanos, Cortex, or Mimir.
remote_write:
  - url: "https://your-remote-storage/api/v1/push"
    queue_config:
      capacity: 10000
      max_shards: 50
      min_shards: 1
      max_samples_per_send: 5000
      batch_send_deadline: 5s
Keep 15-30 days of data in Prometheus for fast access. Push older data to a remote store for historical trends and SLA analysis.
This way, Prometheus remains lightweight and responsive, while your historical data stays available when needed.
Storage comparison:
- Local Prometheus: Fast queries, limited retention (15-30 days typical)
- Remote storage: Slower queries, unlimited retention, cross-cluster queries
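You can also trim what gets forwarded. write_relabel_configs runs before samples leave Prometheus, so series you never query long-term don't take up remote storage. A sketch that drops client-library internals:
remote_write:
  - url: "https://your-remote-storage/api/v1/push"
    write_relabel_configs:
      # Don't forward Go runtime and process metrics to long-term storage
      - source_labels: [__name__]
        regex: 'go_.*|process_.*'
        action: drop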
4. Scale Out with Sharding
If scrape targets continue to increase, you can shard Prometheus horizontally. Each Prometheus instance scrapes a subset of targets, and their data can later be queried or aggregated together.
# Shard 0 - scrapes odd-numbered pods
scrape_configs:
  - job_name: 'kubernetes-pods-shard-0'
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        regex: '.*[13579]$'
        action: keep

# Shard 1 - scrapes even-numbered pods
scrape_configs:
  - job_name: 'kubernetes-pods-shard-1'
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_name]
        regex: '.*[02468]$'
        action: keep
Assign jobs or namespaces to different Prometheus servers. Use consistent labeling so data can be correlated when aggregated.
Sharding helps distribute ingestion load evenly and ensures no single instance becomes a bottleneck.
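If your pod names don't end in a convenient digit, hash-based sharding spreads targets evenly regardless of naming. A sketch using the hashmod relabel action, assuming two shards:
# Each instance keeps only the targets whose address hashes to its shard number
scrape_configs:
  - job_name: 'kubernetes-pods'
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2
        target_label: __tmp_shard
        action: hashmod
      - source_labels: [__tmp_shard]
        regex: '0'            # set to '1' on the second Prometheus instance
        action: keep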
Sharding makes sense when you're scraping 10,000+ targets from a single Prometheus instance, or when memory usage consistently exceeds 80%.
5. Keep Dashboards Focused
Dashboards are often where scaling pain becomes visible first. Too many panels, metrics, or overlapping queries can make Grafana sluggish and hard to interpret.
To keep them clear and fast:
Design dashboards around the RED (Rate, Errors, Duration) framework for user-facing services:
# Rate - requests per second
sum(rate(http_requests_total[5m])) by (service)
# Errors - error rate as percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service) * 100
# Duration - 95th percentile latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
Use the USE (Utilization, Saturation, Errors) method for system resources like CPU or memory:
# Utilization - percentage of resource used
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Saturation - load average
node_load1 / count by (instance) (node_cpu_seconds_total{mode="system"})
# Errors - OOM kills, disk errors
rate(node_vmstat_oom_kill[5m])
Avoid redundant panels — focus on metrics that directly signal health or performance changes.
A focused dashboard not only loads faster but also helps teams spot issues faster, reducing alert fatigue and analysis time.
Design Alerts That Reflect Service Health
A good alert should describe a change in service behavior — not just a metric crossing an arbitrary threshold. The goal is to detect user-impacting issues early without drowning teams in repetitive noise.
Here's how you can make alerts in Prometheus APM more effective and scalable:
Group with labels: Add labels such as service, environment, or team in your alert rules. This lets Alertmanager group related alerts instead of firing one per pod or host.
groups:
  - name: api_alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service, environment)
            / sum(rate(http_requests_total[5m])) by (service, environment) > 0.05
        for: 3m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }} in {{ $labels.environment }}"
          description: "Error rate is {{ $value | humanizePercentage }} over the last 5 minutes"
An alert for api-service in production is easier to triage than hundreds of instance-level warnings.
Focus on user-facing signals: Track metrics that indicate real service impact — request rate, error rate, or latency.
- alert: HighRequestLatency
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    ) > 0.5
  for: 2m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "95th percentile latency > 0.5s on {{ $labels.service }}"
    description: "P95 latency is {{ $value }}s. Users are experiencing slow responses."
This alert evaluates the 95th percentile of request latency over a 5-minute window. If it stays above 0.5 seconds for 2 minutes, it triggers a warning.
Why 95th percentile? It captures what most users experience, filtering out the worst 5% of requests that might be outliers. Average latency can hide problems (one slow request averaging with 99 fast ones), and 99th percentile can be too noisy.
Why 2 minutes? The for clause prevents alerts from firing on brief spikes. A single slow request shouldn't page anyone. Two minutes of sustained latency indicates a real problem.
Use routing and silencing: Configure Alertmanager routes based on labels (e.g., team=backend, env=staging) so notifications reach the right people.
route:
  group_by: ['alertname', 'service', 'environment']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-backend'
  routes:
    - match:
        team: frontend
      receiver: 'team-frontend'
    - match:
        team: platform
      receiver: 'team-platform'
    - match:
        severity: critical
      receiver: 'pagerduty'
During planned deployments or maintenance, silence alerts to prevent unnecessary noise.
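Alertmanager can handle recurring windows natively through time intervals, so routine maintenance doesn't depend on someone remembering to create a silence. A sketch, assuming a weekly window and the receivers defined above (newer Alertmanager versions use the top-level time_intervals key; older ones call it mute_time_intervals):
time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ['saturday']
        times:
          - start_time: '02:00'
            end_time: '04:00'

route:
  routes:
    - match:
        severity: warning
      receiver: 'team-backend'
      mute_time_intervals: ['weekly-maintenance']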
Instead of separate alerts for every pod, this approach gives you a clear signal that latency is increasing for a service — the level of abstraction that aligns with how users perceive outages.
How Last9 Extends Prometheus APM
Prometheus is often the starting point for monitoring — open source, easy to deploy, and deeply integrated with Kubernetes. It works well for smaller setups. But as services multiply, exporters expand, and dashboards grow, keeping Prometheus efficient starts to take more effort than analyzing what it collects.
Last9 builds on top of Prometheus and Grafana to remove that friction. The platform doesn't replace your existing stack — it extends it, making monitoring reliable as your systems grow.
Automatic Service Discovery
Every time you deploy a new service, it's automatically detected through OpenTelemetry traces. No static configuration or catalog updates — your service map stays accurate with each deployment, so you always know what's running where.
When a new microservice appears in production, Last9:
- Detects it from trace data
- Maps its dependencies
- Starts collecting relevant metrics
- Updates dashboards automatically
You don't update scrape configs or service registries manually.
Track Background Jobs
Async and background workloads are treated as first-class components. You can track job execution times, throughput, and error rates alongside services. When a job fails, traces and logs give you the complete execution context, so debugging takes minutes, not hours.
# Track job execution time
job_duration_seconds{job_name="data-sync"}
# Monitor success/failure rates
rate(job_executions_total{status="failed"}[5m])
Infrastructure Metrics Without Extra Setup
Infrastructure metrics — CPU, memory, disk, and network — are collected across all nodes without extra setup. You can correlate system-level degradation with application performance to see whether the slowdown comes from the code or the underlying hardware.
If latency spikes at 3 AM, Last9 shows you:
- Application metrics: Request rate, error rate, P95 latency
- Infrastructure metrics: CPU at 95%, memory pressure
- The correlation: High CPU on specific nodes hosting the service
Handle Cardinality at Scale
High-cardinality telemetry offers valuable context but can overwhelm Prometheus storage. Last9 handles this with streaming aggregation and retention rules, keeping fine-grained detail where it's useful and trimming redundant series before they inflate costs.
Instead of storing every user_id label forever:
- Keep high-cardinality data for 7 days (for debugging recent issues)
- Aggregate older data by service, region, status code
- Drop labels that don't affect queries
You get the detail when you need it, without the storage cost.
Optimize Query Performance
Complex PromQL queries often slow down dashboards as datasets grow. Last9 optimizes queries at the storage layer, so even panels visualizing billions of data points remain quick and responsive.
The same dashboard that took 15 seconds to load in Prometheus renders in under 2 seconds with Last9's query optimization.
Group Alerts by Impact
Instead of triggering dozens of alerts for the same issue, Last9 groups related signals by service, job, or host. You can trace how a failure propagates through dependencies and focus directly on the component that needs attention.
When a database goes down:
- Don't get 50 alerts (one per service calling it)
- Get 1 grouped alert: "Database unavailable, affecting 12 services"
- See the dependency graph: which services are impacted, which are still healthy
Unified Telemetry
Metrics, logs, and traces live in one place. When latency spikes, you can jump straight from a metric to the corresponding trace and log line without switching between tools or dashboards.
How it works:
- Prometheus APM shows P95 latency increased
- Click the spike in the graph
- See traces for slow requests during that window
- Jump to logs from the failing service
- Find the root cause without context switching
This is where Prometheus APM (metrics) meets full observability (metrics + traces + logs).
Start for free today with Last9 or book some time with us to understand how it fits into your stack!
FAQs
What is Prometheus APM?
Prometheus APM is metrics-driven application performance monitoring that tracks request rates, error rates, latency, and resource utilization through time-series data. Unlike traditional APM tools that use agents and distributed tracing, Prometheus APM uses exporters and the /metrics endpoint to collect performance data. It's lightweight, open source, and works well in Kubernetes environments.
When should I start scaling my Prometheus setup?
You should consider scaling Prometheus when you notice queries taking longer than 5-10 seconds to complete, storage growing beyond 100GB with standard retention, memory usage consistently above 80%, or more than 5,000-10,000 scrape targets per instance. These are signals that your current setup needs optimization through recording rules, federation, or sharding.
What's the difference between Prometheus federation and sharding?
Federation pulls aggregated metrics from multiple Prometheus instances into a central server, useful for multi-cluster or multi-team setups where you want a unified view. Sharding distributes scrape targets across multiple Prometheus instances based on labels or hash functions, useful when a single instance can't handle the ingestion load. Federation is about aggregation; sharding is about distribution.
How do recording rules help with Prometheus performance?
Recording rules precompute expensive queries (like histogram_quantile() or complex aggregations) and store the results as new time series. Instead of calculating the 95th percentile latency every time a dashboard loads, Prometheus calculates it once every 30 seconds and stores the result. This trades a small amount of additional storage for significantly faster query performance.
What causes high cardinality in Prometheus?
High cardinality happens when labels create too many unique time series combinations. Common causes include: user IDs or request IDs as labels, timestamps in label values, unbounded label values like full URLs, or IP addresses. A metric with 10 labels, where each has 10 possible values, creates 10 billion potential series. Use labels for dimensions you actually query, not for unique identifiers.
How much storage does Prometheus need?
Storage depends on: number of active time series, scrape interval, sample size (typically 1-2 bytes), and retention period. A rough calculation: 1 million series × 4 samples per minute (15s interval) × 2 bytes × 60 minutes × 24 hours × 30 days = approximately 345GB for 30-day retention. Add 50% overhead for indexes and compaction.
What's the best retention period for Prometheus?
For local storage, 15-30 days is typical. This keeps recent data fast and accessible while preventing excessive disk usage. For longer retention (90+ days), use remote write to send data to long-term storage like Thanos, Cortex, or Mimir. Keep frequently-queried recent data local, push historical data to remote storage.
Should I use Prometheus or a commercial APM tool?
Use Prometheus APM when you need metrics-driven monitoring, have Kubernetes infrastructure, want no per-host pricing, and can instrument services with exporters. Use commercial APM when you need distributed tracing out of the box, code-level profiling, automatic instrumentation, or vendor support. Many teams start with Prometheus and add distributed tracing later through OpenTelemetry.
How do I reduce alert fatigue in Prometheus?
Group alerts by service or environment labels instead of firing per-pod alerts. Use the for clause to require sustained conditions (e.g., for: 5m) before alerting. Focus alerts on user-impacting signals like error rates and latency, not internal metrics like CPU spikes that self-recover. Configure Alertmanager routing to send alerts to the right teams based on labels.
What's the difference between remote write and federation?
Remote write pushes all scraped metrics to external storage in real-time, creating a complete copy of your data elsewhere. Federation pulls specific metrics from child Prometheus instances on a schedule, typically aggregated data. Use remote write for long-term storage and disaster recovery. Use federation for hierarchical monitoring across teams or regions.
Can Prometheus handle millions of time series?
A single Prometheus instance can handle 1-2 million active time series with 16-32GB of RAM, but performance degrades with complex queries. Beyond that, use sharding to distribute the load across multiple instances, implement recording rules for expensive queries, or consider managed Prometheus solutions that handle horizontal scaling automatically.
How does Last9 improve Prometheus performance?
Last9 optimizes Prometheus through streaming aggregation that reduces cardinality at ingestion, query optimization at the storage layer, and automatic retention policies that keep high-cardinality data short-term while aggregating long-term storage. It also adds automatic service discovery, unified metrics/logs/traces, and intelligent alert grouping without requiring you to rewrite your existing Prometheus setup.