This guide shows you how to set up a comprehensive monitoring solution with Prometheus, Grafana, Node Exporter, cAdvisor, and Alertmanager using Docker Compose. Follow the step-by-step instructions to deploy a powerful monitoring stack for your containerized environment.
Why Use Docker Compose for Prometheus?
Docker Compose offers significant advantages for deploying monitoring tools:
- Simplified deployment: Define your entire monitoring stack in a single configuration file
- Consistent environments: Deploy the same setup across development, testing, and production
- Easy updates: Upgrade components individually with minimal disruption
- Infrastructure as code: Version control your monitoring configuration
These benefits make Docker Compose ideal for setting up and maintaining Prometheus-based monitoring systems.
Prerequisites for Installation
Before you begin, ensure you have:
- Docker Engine (version 19.03.0+)
- Docker Compose (version 1.27.0+)
- 2+ CPU cores and 4GB+ RAM
- At least 20GB free disk space
- Basic understanding of Docker concepts
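You can quickly confirm these prerequisites from a terminal. The second command assumes the Compose v2 plugin; fall back to docker-compose --version on older installs:

# Check Docker Engine and Compose versions
docker --version
docker compose version   # or: docker-compose --version on Compose v1

# Check available CPU cores, memory, and free disk space
nproc
free -h
df -h .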
Step-by-Step Guide to Monitoring Prometheus with Docker
Step 1: Creating the Project Structure
Start by setting up a well-organized directory structure:
# Create project directory
mkdir prometheus-monitoring
cd prometheus-monitoring
# Create subdirectories for configurations
mkdir -p prometheus/rules alertmanager grafana/provisioning/{datasources,dashboards}
This directory structure helps keep things clean and manageable as your monitoring setup grows.
- prometheus/rules/ is where you'll store custom alerting and recording rules.
- alertmanager/ will hold the Alertmanager config file, including routing and notification settings.
- grafana/provisioning/ is split into datasources/ and dashboards/ to support automated Grafana setup, so your dashboards and data sources load automatically on startup.
Organizing your files this way makes it easier to version-control, update configs independently, and troubleshoot issues faster.
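Once the config files from the later steps are in place, the layout should look roughly like this (file names match the ones created below):

prometheus-monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       ├── node_alerts.yml
│       ├── container_alerts.yml
│       └── recording_rules.yml
├── alertmanager/
│   └── config.yml
└── grafana/
    ├── dashboards/
    │   └── system-overview.json
    └── provisioning/
        ├── datasources/
        │   └── datasource.yml
        └── dashboards/
            └── dashboards.yml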
Step 2: Defining the Docker Compose Configuration
Create a docker-compose.yml file in the project root:
version: '3.8'

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      # Newer node_exporter releases renamed --collector.filesystem.ignored-mount-points
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($|/)'
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: unless-stopped
This Docker Compose setup wires together all the key components for a solid monitoring stack:
- Prometheus handles time-series data collection and storage. It pulls metrics from exporters and other endpoints based on your configuration. The --web.enable-lifecycle flag lets you trigger config reloads without restarting the container.
- Node Exporter collects low-level system metrics from the host, like CPU usage, memory, and disk stats. We're mounting /proc and /sys read-only so Prometheus can scrape accurate host metrics without affecting the system.
- cAdvisor focuses on container-level metrics, offering insights into resource usage per container, which is handy when you're running multiple services on the same host.
- Grafana sits on top of Prometheus and provides a user-friendly interface to visualize your data. The provisioning folders (datasources and dashboards) ensure everything is set up automatically on first run.
- Alertmanager receives alerts from Prometheus and routes them to the right place: Slack, PagerDuty, email, etc. Mounting the config from your local folder keeps it easy to tweak as your alerting needs evolve.
The volumes ensure data persists across restarts, and the shared monitoring network lets all services communicate internally. This setup gives you full control and visibility over your Docker environment with minimal manual steps.
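Because --web.enable-lifecycle is set, you can apply Prometheus configuration changes later without restarting the container. Assuming the default port mapping above, a reload looks like this:

# Ask the running Prometheus to reload its configuration
curl -X POST http://localhost:9090/-/reload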
Step 3: Configuring Prometheus
Create a prometheus.yml file in the prometheus directory:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules once and periodically evaluate them
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
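It's worth validating this file before starting the stack. One way to do that without installing anything locally is to reuse promtool from the Prometheus image (paths assume the directory layout above):

docker run --rm -v "$(pwd)/prometheus:/etc/prometheus" \
  --entrypoint promtool prom/prometheus:latest \
  check config /etc/prometheus/prometheus.yml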
Optimize Performance with Target-Specific Scrape Intervals
Most tutorials use the same scrape interval for all targets, but this is inefficient. Instead, customize intervals based on how frequently metrics change:
scrape_configs:
  # System metrics change frequently, scrape more often
  - job_name: 'node-exporter'
    scrape_interval: 10s
    static_configs:
      - targets: ['node-exporter:9100']

  # Container metrics are also volatile
  - job_name: 'cadvisor'
    scrape_interval: 10s
    static_configs:
      - targets: ['cadvisor:8080']

  # Prometheus itself changes slowly, scrape less frequently
  - job_name: 'prometheus'
    scrape_interval: 30s
    static_configs:
      - targets: ['localhost:9090']
This approach reduces unnecessary scrapes while ensuring critical metrics are captured with appropriate frequency.
Step 4: Setting Up Alert Rules
Create an alert rules file at prometheus/rules/node_alerts.yml:
groups:
  - name: node_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"

      - alert: HighMemoryLoad
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory load (instance {{ $labels.instance }})"
          description: "Memory load is > 80%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"

      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_free_bytes{fstype=~"ext4|xfs"}) / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage (instance {{ $labels.instance }})"
          description: "Disk usage is > 85%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"
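You can lint the rule file the same way, again borrowing promtool from the Prometheus image:

docker run --rm -v "$(pwd)/prometheus:/etc/prometheus" \
  --entrypoint promtool prom/prometheus:latest \
  check rules /etc/prometheus/rules/node_alerts.yml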
Creating Predictive Alerts to Detect Unusual Metric Patterns
Most monitoring setups only alert on threshold violations. Here's a unique alert that detects abnormal patterns:
# Add this under the rules: list in your alert rules file
      - alert: UnusualMemoryGrowth
        expr: deriv(node_memory_MemAvailable_bytes[30m]) < -10 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Unusual memory consumption rate (instance {{ $labels.instance }})"
          description: "Memory is being consumed at a rate of more than 10 MB per second\n VALUE = {{ $value | humanize }}B/s"
This alert detects unusual memory consumption patterns even before critical thresholds are reached, providing an earlier warning of potential issues.
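If you want to sanity-check the expression against live data before relying on the alert, you can evaluate it through Prometheus's query API once the stack is running (Step 8), assuming the default port mapping:

curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=deriv(node_memory_MemAvailable_bytes[30m])'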
Practical Container Alert Examples
Create a file at prometheus/rules/container_alerts.yml with practical container-specific alerts:
groups:
  - name: container_alerts
    rules:
      - alert: ContainerRestarting
        expr: delta(container_start_time_seconds{name!=""}[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container restarting ({{ $labels.name }})"
          description: "Container {{ $labels.name }} has restarted in the last 15 minutes"

      - alert: ContainerHighMemoryUsage
        expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container high memory usage ({{ $labels.name }})"
          description: "Container {{ $labels.name }} memory usage is {{ $value }}%"

      - alert: ContainerCPUThrottling
        expr: rate(container_cpu_cfs_throttled_periods_total{name!=""}[5m]) / rate(container_cpu_cfs_periods_total{name!=""}[5m]) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU throttling ({{ $labels.name }})"
          description: "Container {{ $labels.name }} is being throttled {{ $value | humanizePercentage }}"
These alerts catch real-world container issues that often go unnoticed:
- Container restarts that might indicate application crashes or configuration issues
- Memory pressure that can lead to OOM kills
- CPU throttling that affects application performance
Step 5: Configuring Alertmanager
Create a basic Alertmanager configuration in alertmanager/config.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'email-notifications'

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'your-email@example.com'
        from: 'alertmanager@example.com'
        smarthost: smtp.example.com:587
        auth_username: 'your-username'
        auth_password: 'your-password'
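As with the Prometheus configs, you can validate this file before (re)starting the stack by borrowing amtool from the Alertmanager image:

docker run --rm -v "$(pwd)/alertmanager:/etc/alertmanager" \
  --entrypoint amtool prom/alertmanager:latest \
  check-config /etc/alertmanager/config.yml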
Implementing Team-Based Alert Routing for Efficient Incident Response
Unlike basic setups, you can route alerts to different teams based on service and severity:
route:
  # Default receiver
  receiver: 'operations-team'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  # Specific routing rules
  routes:
    - match:
        severity: critical
      receiver: 'pager-duty'
      repeat_interval: 1h
      continue: true
    - match_re:
        service: database|redis|elasticsearch
      receiver: 'database-team'
    - match_re:
        service: frontend|api
      receiver: 'application-team'

receivers:
  - name: 'operations-team'
    email_configs:
      - to: 'ops@example.com'
  - name: 'pager-duty'
    pagerduty_configs:
      - service_key: 'your-pagerduty-key'
  - name: 'database-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR_KEY'
        channel: '#db-alerts'
  - name: 'application-team'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR_KEY'
        channel: '#app-alerts'
This configuration ensures alerts reach the right teams with appropriate urgency.
Step 6: Setting Up Grafana Dashboards
Configure Grafana to connect to Prometheus automatically by creating grafana/provisioning/datasources/datasource.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
Set up dashboard provisioning in grafana/provisioning/dashboards/dashboards.yml:
apiVersion: 1

providers:
  - name: 'Default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
Create a grafana/dashboards directory to store dashboard JSON files:
mkdir -p grafana/dashboards
Grafana supports automatic setup through provisioning.
- In datasource.yml, we define Prometheus as the default data source using the internal Docker URL.
- In dashboards.yml, we tell Grafana to load dashboards from a specific folder.
- The grafana/dashboards directory is where you'll store those dashboard JSON files.
With this setup, Grafana connects to Prometheus and loads dashboards automatically—no manual steps needed.
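Once the stack is up (Step 8), you can confirm the provisioned data source was actually registered via Grafana's HTTP API, using whatever admin credentials you configured:

# List configured data sources; expect the Prometheus entry from datasource.yml
curl -s -u admin:admin http://localhost:3000/api/datasources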
How to Create a Consolidated Dashboard for Complete System Visibility
Most setups require you to import separate dashboards for different components. Create a unified system dashboard at grafana/dashboards/system-overview.json that shows key metrics from all sources on a single screen:
{
  "title": "System Overview",
  "uid": "system-overview",
  "version": 1,
  "panels": [
    {
      "title": "CPU Usage",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
      "targets": [{"expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"}]
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
      "targets": [{"expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100"}]
    },
    {
      "title": "Disk Usage",
      "type": "gauge",
      "gridPos": {"h": 8, "w": 6, "x": 12, "y": 0},
      "targets": [{"expr": "(node_filesystem_size_bytes{mountpoint=\"/\"} - node_filesystem_free_bytes{mountpoint=\"/\"}) / node_filesystem_size_bytes{mountpoint=\"/\"} * 100"}]
    },
    {
      "title": "Container CPU Usage",
      "type": "graph",
      "gridPos": {"h": 8, "w": 12, "x": 0, "y": 8},
      "targets": [{"expr": "sum by(name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[5m])) * 100"}]
    },
    {
      "title": "Container Memory Usage",
      "type": "graph",
      "gridPos": {"h": 8, "w": 12, "x": 12, "y": 8},
      "targets": [{"expr": "sum by(name) (container_memory_usage_bytes{name!=\"\"})"}]
    }
  ]
}
Update your Docker Compose file to mount this directory:
  grafana:
    # ... existing configuration ...
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards  # Add this line
This approach automates dashboard creation and provides a unified view of system and container metrics.
Step 7: Performance Optimization for Production
For production deployments, optimize Prometheus for better performance:
# Update in docker-compose.yml
  prometheus:
    # ... existing configuration ...
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'  # Adjust retention period
      - '--storage.tsdb.wal-compression'     # Compress write-ahead log
      - '--web.enable-lifecycle'             # Enable runtime reloading
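If disk space is the tighter constraint, Prometheus can also cap storage by size rather than (or in addition to) age. The 10GB value below is just an example; pick a limit that fits your volume:

      # Optional addition to the same command: list
      - '--storage.tsdb.retention.size=10GB'  # Oldest blocks are dropped once the TSDB exceeds this size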
Implement Pre-Computed Metrics to Accelerate Dashboard Performance
Most setups calculate metrics on the fly, causing dashboard slowdowns. Pre-compute frequently used metrics with recording rules in prometheus/rules/recording_rules.yml:
groups:
  - name: recording_rules
    interval: 1m
    rules:
      - record: node:cpu_usage:avg5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
      - record: node:memory_usage:percent
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
      - record: container:cpu_usage:avg5m
        expr: sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100
Then update your dashboards to use these pre-computed metrics:
# Original query
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Optimized query using recording rule
node:cpu_usage:avg5m
This significantly improves dashboard performance, especially with many concurrent users.
Best Practices for Production Deployments
Version Pinning for Stability
Always use specific versions instead of "latest" tags for production:
  prometheus:
    image: prom/prometheus:v2.40.0  # Specific version, not 'latest'
    # ...

  grafana:
    image: grafana/grafana:9.3.2    # Specific version
    # ...
This ensures consistent behavior and prevents unexpected changes during updates.
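A convenient way to manage pinned versions is Compose's variable substitution: keep the tags in a .env file next to docker-compose.yml and reference them in the image fields. The variable names here are just examples:

# .env
PROMETHEUS_VERSION=v2.40.0
GRAFANA_VERSION=9.3.2

# docker-compose.yml
  prometheus:
    image: prom/prometheus:${PROMETHEUS_VERSION}
  grafana:
    image: grafana/grafana:${GRAFANA_VERSION}

Bumping a component then becomes a one-line change that is easy to review in version control.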
Security Considerations
- Use Strong Passwords: Never use default credentials in production:
  grafana:
    # ...
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}  # Use env variable
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=false
- Network Restriction: Never expose monitoring services directly to the internet. Use a reverse proxy with authentication (a minimal proxy config sketch follows this list):
  nginx:
    image: nginx:latest
    ports:
      - "443:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf
      - ./nginx/certs:/etc/nginx/certs
      - ./nginx/.htpasswd:/etc/nginx/.htpasswd
- Use Non-Root Users: Run containers with non-root users when possible:
  grafana:
    # ...
    user: "472"  # Grafana's built-in user
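The proxy config mounted in the nginx service above isn't shown in the compose file, so here is a minimal sketch of what ./nginx/nginx.conf could contain: TLS plus basic auth in front of Grafana. The hostname, certificate paths, and htpasswd file are placeholders you'd replace with your own, and the nginx service needs to join the monitoring network for the grafana:3000 upstream to resolve.

# ./nginx/nginx.conf -- illustrative sketch only
server {
    listen 443 ssl;
    server_name monitoring.example.com;                    # placeholder hostname

    ssl_certificate     /etc/nginx/certs/fullchain.pem;    # placeholder cert
    ssl_certificate_key /etc/nginx/certs/privkey.pem;      # placeholder key

    auth_basic           "Monitoring";
    auth_basic_user_file /etc/nginx/.htpasswd;             # create with: htpasswd -c ./nginx/.htpasswd admin

    location / {
        proxy_pass       http://grafana:3000;              # service name from docker-compose.yml
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}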
Step 8: Launching the Monitoring Stack
Start your monitoring stack:
docker-compose up -d
Verify all containers are running:
docker-compose ps
Access the monitoring interfaces:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (login with admin/admin)
- cAdvisor: http://localhost:8080
- Alertmanager: http://localhost:9093
Verify Your Setup
In Prometheus, go to Status > Targets to confirm all targets show "UP" status.
In Grafana, navigate to your dashboards to see system metrics.
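You can also script a quick health check against each component's built-in endpoints; adjust the ports if you changed the mappings:

curl -s http://localhost:9090/-/healthy            # Prometheus
curl -s http://localhost:9093/-/healthy            # Alertmanager
curl -s http://localhost:3000/api/health           # Grafana
curl -s http://localhost:8080/healthz              # cAdvisor
curl -s http://localhost:9100/metrics | head -n 5  # Node Exporter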
Step 9: Monitoring Docker Containers
cAdvisor automatically collects container metrics. View these metrics in Prometheus with queries like:
- Container CPU: sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[1m]))
- Container Memory: sum by (name) (container_memory_usage_bytes{name!=""})
- Container Disk I/O: sum by (name) (rate(container_fs_reads_bytes_total{name!=""}[1m]))
Identifying Unhealthy Containers with Targeted Monitoring Queries
Add these queries to your monitoring to identify unhealthy containers:
- Restarting Containers: sum by(name) (delta(container_start_time_seconds{name!=""}[15m])) > 0
- High Container CPU: rate(container_cpu_usage_seconds_total{name!=""}[1m]) * 100 > 80
- OOM Risk: container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""} * 100 > 80
Troubleshooting Prometheus and Grafana in Docker Compose
If you encounter problems:
No data in Grafana
- Check Prometheus targets in Status > Targets
- Verify Prometheus data source is working in Grafana
- Test queries directly in Prometheus UI
- Verify the time range in Grafana includes data collection period
Container not showing in metrics
- Ensure cAdvisor has access to Docker socket
- Check the container is running with docker ps
- Verify metrics in Prometheus with container_cpu_usage_seconds_total
Alerts not firing
- Check alert rules in Prometheus UI
- Verify Alertmanager configuration
- Test alert notifications manually (see the example below)
- Check that conditions persist long enough to trigger alerts (the for duration)
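One way to test the notification path end to end is to push a synthetic alert straight into Alertmanager's v2 API; the alert name and labels below are arbitrary test values:

curl -s -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{
        "labels": {"alertname": "ManualTestAlert", "severity": "warning"},
        "annotations": {"summary": "Manual test alert from curl"}
      }]'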
Performance issues
- Reduce scrape frequency for less important targets
- Use recording rules for frequently queried expressions
- Adjust the retention period based on available disk space
- Consider using remote storage for longer retention
Next Steps
You’ve now got a solid monitoring setup for your Docker environment. From here, it’s easy to extend—add service-specific exporters, refine dashboards, and connect alerting tools your team already uses.
If you’re starting to feel the limits of managing this stack yourself—or just want something that’s easier to scale and maintain—this is where Last9 can help.
We work with teams at Probo, CleverTap, Replit, and others to handle high-cardinality observability at scale. With native support for OpenTelemetry and Prometheus, Last9 brings metrics, logs, and traces into one place—optimized for performance, cost, and real-time debugging. We’ve even monitored 11 of the 20 largest live-streaming events in history.
Let us handle the complexity, so you can focus on building. Book some time with us today!