Apr 10th, ‘25 / 12 min read

How to Use Prometheus for APM

Learn how to turn Prometheus into a powerful APM tool—track app performance, reduce guesswork, and get real visibility into your systems.

Monitoring applications shouldn’t be a guessing game. But too often, DevOps engineers end up buried under a pile of metrics that don’t help when things go wrong.

That’s where Prometheus APM comes in. It offers a straightforward way to make sense of your systems—especially when you're working with modern, distributed setups like microservices.

What Is Prometheus APM and How Does It Transform Monitoring?

Prometheus APM combines the power of Prometheus, an open-source monitoring system, with Application Performance Monitoring capabilities.

Unlike traditional monitoring tools that just throw numbers at you, Prometheus APM connects infrastructure metrics with application performance data, giving you the full picture of your system's health and performance.

At its core, Prometheus APM helps you track, measure, and improve how your applications perform in real time. It's like having x-ray vision into your entire stack – from server CPU usage to how long that pesky database query is taking. The system works on a pull-based model, where the Prometheus server scrapes metrics from instrumented applications at regular intervals, storing them in a time-series database for analysis.

Key components of Prometheus APM include:

  • Prometheus Server: The central component that scrapes and stores time series data
  • Client Libraries: For instrumenting application code to expose metrics
  • Pushgateway: For supporting short-lived jobs
  • Alertmanager: Handles alerts sent by the Prometheus server
  • Exporters: For services that don't natively expose Prometheus metrics
  • Visualization Layer: Often Grafana, which connects to Prometheus as a data source

What sets Prometheus APM apart is its dimensional data model, where metrics are identified by metric name and key-value pairs, enabling powerful querying capabilities through PromQL (Prometheus Query Language).
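
For example, the same request counter can be sliced by any of its labels at query time. Here's a quick sketch of an error-rate query—http_requests_total and its status/service labels are illustrative, not tied to any particular exporter:

# Error rate per service, computed from the status label
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
  /
sum by (service) (rate(http_requests_total[5m]))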

💡
If you're writing PromQL for your APM dashboards, these PromQL tricks can save you time and help surface what actually matters.

Why DevOps Teams Are Rapidly Adopting Prometheus APM for Modern Infrastructure

DevOps engineers aren't jumping on the Prometheus APM train just because it's trendy. There are solid reasons behind this shift, rooted in technical advantages and operational improvements:

| Traditional Monitoring | Prometheus APM |
|---|---|
| Siloed metrics with separate tools for infrastructure and applications | End-to-end visibility across the entire stack |
| Manual correlation between different monitoring systems | Automated context linking infrastructure issues with application impacts |
| Fixed dashboards with limited customization | Dynamic visualization with complex query support |
| Reactive troubleshooting after issues occur | Proactive alerts based on predictive thresholds |
| Limited scalability for high-cardinality data | Designed for cloud-native, high-scale environments |
| Complex setup with heavy agents | Lightweight exporters and client libraries |
| Vendor lock-in with proprietary systems | Open-source ecosystem with flexible integration options |

Adoption of Prometheus APM has grown steadily for several key reasons:

  1. Cloud-Native Design: Built from the ground up for dynamic environments like Kubernetes
  2. Pull-Based Architecture: More reliable in unstable networks and easier to control
  3. Service Discovery Integration: Automatically identifies new targets in dynamic infrastructures
  4. Resource Efficiency: Lightweight and high performance compared to traditional APM solutions
  5. Active Community: Constant improvements and a wide ecosystem of exporters and integrations

The tool brings together infrastructure monitoring and application performance in one unified system – no more tab-hopping between different tools trying to piece together what went wrong during an incident.

💡
If you’re working with Prometheus programmatically, this Prometheus API guide breaks down the essentials without the noise.

Getting Started With Prometheus APM: A Comprehensive Setup Guide

Setting up Prometheus APM involves several components working together to create a complete monitoring solution. Let's break down the process into manageable steps:

Step 1: Installing the Prometheus Server

You can deploy Prometheus via binary download, Docker, or through Kubernetes operators. Here's the traditional installation method:

# Download the latest Prometheus release
wget https://github.com/prometheus/prometheus/releases/download/v2.37.0/prometheus-2.37.0.linux-amd64.tar.gz

# Extract the archive
tar xvfz prometheus-*.tar.gz
cd prometheus-*

# Optionally move binaries to a permanent location
sudo mv prometheus promtool /usr/local/bin/

For Docker users, this simplified approach works well:

docker run -d --name prometheus -p 9090:9090 -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml prom/prometheus

Step 2: Creating a Comprehensive Configuration

Your prometheus.yml file is the heart of your monitoring setup. A more detailed configuration might look like this:

global:
  scrape_interval: 15s  # Set how frequently to scrape targets
  evaluation_interval: 15s  # How frequently to evaluate rules
  scrape_timeout: 10s  # How long until a scrape times out

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

# Load rules once and periodically evaluate them
rule_files:
  - "alert_rules.yml"
  - "recording_rules.yml"

scrape_configs:
  # Self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
      
  # Example node exporter for server metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
      
  # Example application
  - job_name: 'my-application'
    metrics_path: '/metrics'
    static_configs:
      - targets: ['app-server:8080']
    
  # Dynamic target discovery (e.g., for Kubernetes)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
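
Before starting the server, it's worth validating the file with promtool, which ships alongside the Prometheus binary:

# Check the configuration and any referenced rule files for syntax errors
promtool check config prometheus.yml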

Step 3: Starting and Securing Prometheus

Launch the Prometheus server:

./prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d --web.enable-lifecycle

The additional flags configure:

  • --storage.tsdb.retention.time: How long to retain metrics data (here, 15 days)
  • --web.enable-lifecycle: Allows reloading the config via an HTTP endpoint
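
With the lifecycle endpoint enabled, you can apply config changes without restarting the server:

# Reload prometheus.yml via the lifecycle endpoint (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload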

For production environments, consider setting up:

  • Authentication using a reverse proxy
  • TLS encryption
  • Storage retention policies appropriate for your needs
  • Remote write endpoints for long-term storage
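
For that last point, a remote_write block in prometheus.yml might look something like this—the endpoint URL and credentials are placeholders for whatever your long-term storage backend expects:

remote_write:
  - url: "https://metrics-backend.example.com/api/v1/write"
    basic_auth:
      username: "prometheus"
      password_file: "/etc/prometheus/remote_write.password"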
💡
Setting up alerts? This Prometheus Alertmanager guide covers how to do it without creating alert fatigue.

Step 4: Setting Up Alertmanager for Notifications

Create an Alertmanager configuration (alertmanager.yml):

route:
  group_by: ['alertname', 'job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-emails'

receivers:
- name: 'team-emails'
  email_configs:
  - to: 'devops-team@example.com'
    
- name: 'slack-notifications'
  slack_configs:
  - channel: '#alerts'
    api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXX'
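
Note that this config defines a Slack receiver but never routes to it. One way to wire it up is a child route that matches on a severity label—assuming your alert rules set one:

route:
  receiver: 'team-emails'
  routes:
    # Send only critical alerts to Slack; everything else falls through to email
    - matchers:
        - severity="critical"
      receiver: 'slack-notifications'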

Launch Alertmanager:

./alertmanager --config.file=alertmanager.yml

Step 5: Installing Exporters for Common Services

For database monitoring:

# MySQL exporter
docker run -d --name mysql_exporter -p 9104:9104 -e DATA_SOURCE_NAME="user:password@(hostname:3306)/" prom/mysqld-exporter

# PostgreSQL exporter
docker run -d --name postgres_exporter -p 9187:9187 -e DATA_SOURCE_NAME="postgresql://user:password@hostname:5432/database?sslmode=disable" wrouesnel/postgres_exporter

For Node metrics:

# Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
cd node_exporter-*
./node_exporter

Once everything is running, you can access the Prometheus UI at http://localhost:9090 and begin exploring your metrics. For a more powerful visualization experience, install Grafana and connect it to your Prometheus server as a data source.
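
If you manage Grafana through provisioning, a minimal data source file (the path and URL below are placeholders for your setup) could look like:

# e.g. /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true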

💡
Running into limits with Prometheus is common—these scaling tips and strategies come from real-world experience and can help.

Instrumenting Your Applications for Effective APM Monitoring

The real value of Prometheus APM shows up when your applications start speaking its language. Instrumentation is what makes that happen. Without it, you’re monitoring in the dark—limited to what the infrastructure can tell you. To get the full picture, you need your applications to surface their metrics.

Let’s break down how to instrument different types of applications the right way.

For Java Applications with Spring Boot

The Spring Boot Actuator makes Prometheus integration straightforward:

<!-- Add to your pom.xml -->
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

In your application.properties:

management.endpoints.web.exposure.include=prometheus,health,info
management.metrics.tags.application=${spring.application.name}
management.metrics.distribution.percentiles-histogram.http.server.requests=true

For manual instrumentation in Java applications:

<!-- Core library -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient</artifactId>
  <version>0.16.0</version>
</dependency>
<!-- Hotspot JVM metrics -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_hotspot</artifactId>
  <version>0.16.0</version>
</dependency>
<!-- Exposition servlet -->
<dependency>
  <groupId>io.prometheus</groupId>
  <artifactId>simpleclient_servlet</artifactId>
  <version>0.16.0</version>
</dependency>

Java implementation example:

import io.prometheus.client.Counter;
import io.prometheus.client.Histogram;

class YourClass {
    // Define metrics
    static final Counter requestsTotal = Counter.build()
        .name("requests_total")
        .help("Total requests.")
        .labelNames("path", "status")
        .register();
    
    static final Histogram requestLatency = Histogram.build()
        .name("request_latency_seconds")
        .help("Request latency in seconds.")
        .buckets(0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10)
        .register();
    
    void handleRequest(Request request) {
        Histogram.Timer timer = requestLatency.startTimer();
        try {
            // Your business logic
            processRequest(request);
            // Record success
            requestsTotal.labels(request.getPath(), "success").inc();
        } catch (Exception e) {
            // Record failure
            requestsTotal.labels(request.getPath(), "error").inc();
            throw e;
        } finally {
            // Record latency
            timer.observeDuration();
        }
    }
}

For Python Applications and Microservices

Install the client library:

pip install prometheus-client

Basic Flask integration:

from flask import Flask, request
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total', 
    'Total HTTP Requests', 
    ['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds', 
    'HTTP Request Latency', 
    ['method', 'endpoint']
)

@app.before_request
def before_request():
    request.start_time = time.time()

@app.after_request
def after_request(response):
    request_latency = time.time() - request.start_time
    REQUEST_LATENCY.labels(request.method, request.path).observe(request_latency)
    REQUEST_COUNT.labels(request.method, request.path, response.status_code).inc()
    return response

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

@app.route('/')
def homepage():
    return "Hello World"

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

For Node.js Applications

Install the Prometheus client:

npm install prom-client

Express.js implementation:

const express = require('express');
const promClient = require('prom-client');

// Create a Registry to register the metrics
const register = new promClient.Registry();
promClient.collectDefaultMetrics({ register });

// Create custom metrics
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register]
});

const httpRequestDurationMicroseconds = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.1, 0.3, 0.5, 0.7, 1, 3, 5, 7, 10],
  registers: [register]
});

const app = express();

// Middleware to collect metrics
app.use((req, res, next) => {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = Date.now() - start;
    httpRequestDurationMicroseconds
      .labels(req.method, req.route?.path || req.path)
      .observe(duration / 1000); // Convert to seconds
    
    httpRequestsTotal
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .inc();
  });
  
  next();
});

// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.get('/', (req, res) => {
  res.send('Hello World!');
});

app.listen(3000, () => {
  console.log('Server listening on port 3000');
});
💡
Understanding the different Prometheus metric types is key—this guide breaks them down with clarity.

The Four Key Metric Types to Monitor

When instrumenting your applications, focus on these four metric types:

| Metric Type | Purpose | Example |
|---|---|---|
| Counters | Cumulative values that only increase | Total requests, errors, completed tasks |
| Gauges | Values that can go up and down | Memory usage, active connections, queue size |
| Histograms | Distribution of values in buckets | Request duration, response size |
| Summaries | Similar to histograms but calculate quantiles client-side | Request duration with calculated percentiles |

Remember to add meaningful labels to your metrics, but be careful not to create too many unique combinations (high cardinality), as this can impact Prometheus' performance.
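
Here's a quick sketch of what that means in practice, using the Python client from earlier—the route values are hypothetical:

from prometheus_client import Counter

REQUESTS = Counter('http_requests_total', 'Total HTTP requests', ['method', 'route', 'status'])

# Good: label with the route template, which has a small, fixed set of values
REQUESTS.labels('GET', '/users/<id>', '200').inc()

# Risky: labeling with the raw path creates one time series per user ID
# REQUESTS.labels('GET', '/users/12345', '200').inc()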

Common Prometheus APM Problems and Their Strategic Solutions

Deploying Prometheus APM isn't without challenges. Here are the most common issues DevOps teams face and how to overcome them:

Problem 1: Metrics Storage and Data Overload

When collecting metrics at scale, you can quickly accumulate terabytes of time-series data, overwhelming your storage and query performance.

Symptoms:

  • Slow query responses in the Prometheus UI
  • High disk I/O on the Prometheus server
  • Constant disk space alerts
  • Out-of-memory errors

Strategic Solutions:

  1. Implement Retention Policies: Configure --storage.tsdb.retention.time to keep data only as long as needed (default is 15 days)
  2. Focus on the Four Golden Signals as defined by Google SRE:
    • Latency: Time taken to serve a request
    • Traffic: Demand on your system
    • Errors: Rate of failed requests
    • Saturation: How "full" your system is
  3. Implement Federation: Set up hierarchical Prometheus servers with filtered metrics
  4. Configure Remote Storage: Use long-term storage solutions like Thanos, Cortex, or VictoriaMetrics

Use Recording Rules: Pre-compute frequently used queries to improve performance

# recording_rules.yml
groups:
- name: example
  rules:
  - record: job:http_inprogress_requests:sum
    expr: sum(http_inprogress_requests) by (job)
💡
Last9 helps you set up reliable, noise-free alerts with Prometheus and OpenTelemetry support. More signal, less chaos: Last9 Alerting.

Problem 2: Alert Fatigue and Noise Management

Poorly configured alerts can lead to constant notifications, causing teams to ignore or disable alerts altogether.

Symptoms:

  • DevOps team ignoring alerts
  • Too many non-actionable notifications
  • Difficulty identifying critical issues during incidents

Strategic Solutions:

  1. Set Appropriate Thresholds: Base thresholds on historical data, not guesses
  2. Create Multi-level Severity: Use different notification channels based on alert severity
  3. Set Up Maintenance Windows: Silence alerts during planned maintenance
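
For maintenance windows (point 3), silences can be created ahead of time with amtool; the matchers and duration below are placeholders:

# Silence matching alerts for two hours during planned maintenance
amtool silence add alertname="HighErrorRate" service="checkout" \
  --duration=2h --author="devops" --comment="Planned database maintenance" \
  --alertmanager.url=http://localhost:9093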

Implement Alert Grouping: Configure Alertmanager to group related alerts

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m

Use Rate Functions and Duration Conditions: Alert on trends rather than instantaneous spikes

# Alert only when error rate exceeds 5% for 5 minutes
- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate detected"
    description: "Error rate is above 5% for 5 minutes (current value: {{ $value }})"

Problem 3: Missing Context in Distributed Systems

Metrics alone can't provide complete visibility into complex distributed systems.

Symptoms:

  • Difficulty tracing requests across microservices
  • Unknown dependencies between services
  • Incomplete root cause analysis during outages

Strategic Solutions:

  1. Use Red/Black Dashboards: Organize metrics to show service health at a glance
  2. Implement Synthetic Monitoring: Create end-to-end tests that verify user journeys
  3. Add Service Dependency Mapping: Use tools like Grafana Service Graph or Prometheus Blackbox Exporter
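
For the synthetic checks and Blackbox Exporter mentioned above, the scrape job probes URLs through the exporter rather than scraping them directly. A minimal sketch, with placeholder exporter address and target:

scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://my-service.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # where the exporter actually runs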

Add Service Discovery: Automatically detect and monitor new services

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod

Implement Distributed Tracing: Integrate with OpenTelemetry, Jaeger, or Zipkin

# In Jaeger, correlate with Prometheus using exemplars
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="my-service"}[5m])) by (le)) # exemplar={trace_id="abc123"}
💡
Fix production Prometheus APM issues instantly—right from your IDE, with AI and Last9 MCP.

Problem 4: Performance Overhead and Resource Constraints

Instrumenting applications can add overhead, particularly in resource-constrained environments.

Symptoms:

  • Increased latency after adding instrumentation
  • Higher CPU and memory usage
  • Application crashes when under heavy load

Strategic Solutions:

  1. Use Client-side Batching: Aggregate metrics before exposing them
  2. Implement Load Shedding: Configure your exporters to drop less critical metrics under load
  3. Use the Push Gateway for batch jobs and resource-constrained environments
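
For that last point, a batch job can push its metrics once at the end of a run instead of exposing a /metrics endpoint. A sketch using the Python client—the gateway address and job name are placeholders:

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

registry = CollectorRegistry()
duration = Gauge('batch_job_duration_seconds', 'Duration of the nightly batch job', registry=registry)
duration.set(42.3)

# Push once when the job finishes; Prometheus then scrapes the Pushgateway
push_to_gateway('pushgateway:9091', job='nightly_backup', registry=registry)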

Monitor Your Monitoring: Apply Prometheus to monitor itself

rate(prometheus_target_scrapes_exceeded_sample_limit_total[5m])

Optimize Scrape Intervals: Adjust based on metric volatility (less frequent for stable metrics)

scrape_configs:
  - job_name: 'stable-metrics'
    scrape_interval: 60s  # Less frequent for stable metrics
  - job_name: 'volatile-metrics'
    scrape_interval: 10s  # More frequent for volatile ones

Building Comprehensive Prometheus APM Dashboards for Maximum Visibility

A powerful Prometheus APM setup deserves equally powerful visualization. Effective dashboards transform raw metrics into actionable insights that both engineers and business stakeholders can understand.

The Art and Science of Effective APM Dashboards

Great dashboards don't just display data—they tell a story about your system's performance. Follow these principles when designing your Prometheus APM dashboards:

  1. Layer Information: Start with high-level overviews and allow drill-down to details
  2. Use Visual Hierarchy: Important metrics should stand out visually
  3. Include Context: Add thresholds and historical comparisons
  4. Design for the Audience: Technical dashboards for engineers, simplified views for management
  5. Optimize for Quick Understanding: Use consistent colors and layouts

Essential Dashboard Components for Complete Visibility

| Dashboard Panel | Metrics to Include | PromQL Examples |
|---|---|---|
| Application Health | Error rates, response times, request volume | sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) |
| Resource Usage | CPU, memory, disk I/O, network traffic | avg by(instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) |
| User Experience | Page load times, API latency, client errors | histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) |
| Business Impact | Conversions, transactions, user signups | sum(increase(business_transactions_total[24h])) |
| Service Dependencies | Upstream/downstream service health | sum by(service) (rate(service_requests_total{status="failed"}[5m])) |
| Database Performance | Query times, connection pools, cache hit rates | rate(mysql_global_status_questions[5m]) |
| Queue Metrics | Queue length, processing time, age of oldest item | rabbitmq_queue_messages_ready |

Building a RED Method Dashboard

The RED Method (Rate, Error, Duration) provides a concise view of service health:

# Request Rate
sum(rate(http_requests_total[5m])) by (service)

# Error Rate
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)

# Duration (95th percentile)
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

Creating a USE Method Dashboard

The USE Method (Utilization, Saturation, Errors) works well for resources:

# CPU Utilization
avg by (instance) (irate(node_cpu_seconds_total{mode!="idle"}[5m]))

# Memory Saturation
node_memory_Active_bytes / node_memory_MemTotal_bytes

# Disk Errors
rate(node_disk_io_time_weighted_seconds_total[5m])

Pro tip: Use Grafana with Prometheus APM for dashboards that stand out. The combination gives you drag-and-drop simplicity with deep customization options. Start with the official Grafana dashboards for common exporters (like Node Exporter Dashboard #1860) and customize them for your specific needs.

💡
Misconfigured ports can break monitoring silently—this guide walks through setting up Prometheus ports the right way.

Advanced Dashboard Techniques

Service Level Objective (SLO) Tracking: Monitor compliance with service level objectives

# Availability SLO (99.9%)
1 - (sum(increase(http_requests_total{status=~"5.."}[7d])) / sum(increase(http_requests_total[7d])))

Heatmaps for Latency: Visualize latency distributions over time

sum(increase(http_request_duration_seconds_bucket[5m])) by (le)

Multi-Variable Dashboards: Use Grafana template variables to create dynamic dashboards

services=label_values(http_requests_total, service)

Scaling Prometheus APM for Enterprise Environments

Prometheus APM shines in certain scenarios but isn't always the perfect fit:

  • Perfect for: Kubernetes environments, microservices architectures, and dynamic cloud infrastructures
  • Consider alternatives for: Very large-scale enterprises (10,000+ servers), environments requiring 100% uptime guarantees
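
If you do stay with Prometheus at that scale, federation (mentioned earlier) is the usual pattern: a global server scrapes pre-aggregated series from per-team or per-cluster servers. A minimal sketch with placeholder hostnames:

scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # pull only aggregated recording-rule series
    static_configs:
      - targets:
          - 'prometheus-team-a:9090'
          - 'prometheus-team-b:9090'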

Conclusion

Whether you're tracking a handful of services or managing a sprawling microservices setup, Prometheus APM gives you the visibility needed to keep systems steady and reliable.

But if you’d rather not manage it all yourself, and you're looking for a Prometheus-compatible observability platform, Last9 is worth a look. It’s built to handle high-cardinality workloads at scale—powering observability for teams at Disney+ Hotstar, CleverTap, and Replit.

As a telemetry data platform, Last9 has monitored 11 of the 20 largest live-streaming events in history. It plugs right into OpenTelemetry and Prometheus, bringing metrics, logs, and traces together in one place.

If observability costs or performance are becoming a concern, let’s talk. We’ve helped teams get both under control—without losing visibility where it counts.

Authors

Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.