This guide shows you how to set up a comprehensive monitoring solution with Prometheus, Grafana, Node Exporter, cAdvisor, and Alertmanager using Docker Compose. Follow the step-by-step instructions to deploy a powerful monitoring stack for your containerized environment.
Why Use Docker Compose for Prometheus?
Docker Compose offers significant advantages for deploying monitoring tools:
- Simplified deployment: Define your entire monitoring stack in a single configuration file
- Consistent environments: Deploy the same setup across development, testing, and production
- Easy updates: Upgrade components individually with minimal disruption
- Infrastructure as code: Version control your monitoring configuration
These benefits make Docker Compose ideal for setting up and maintaining Prometheus-based monitoring systems.
For a closer look at container-level metrics and what they tell you about performance, check out our guide on Docker container performance metrics.
Prerequisites for Installation
Before you begin, ensure you have:
- Docker Engine (version 19.03.0+)
- Docker Compose (version 1.27.0+)
- 2+ CPU cores and 4GB+ RAM
- At least 20GB free disk space
- Basic understanding of Docker concepts
Step-by-Step Guide to Running Prometheus with Docker Compose
Step 1: Creating the Project Structure
Start by setting up a well-organized directory structure:
```bash
# Create project directory
mkdir prometheus-monitoring
cd prometheus-monitoring

# Create subdirectories for configurations
mkdir -p prometheus/rules alertmanager grafana/provisioning/{datasources,dashboards}
```

This directory structure helps keep things clean and manageable as your monitoring setup grows.
- `prometheus/rules/` is where you'll store custom alerting and recording rules.
- `alertmanager/` will hold the Alertmanager config file, including routing and notification settings.
- `grafana/provisioning/` is split into `datasources/` and `dashboards/` to support automated Grafana setup, so your dashboards and data sources load automatically on startup.
Organizing your files this way makes it easier to version-control, update configs independently, and troubleshoot issues faster.
Step 2: Defining the Docker Compose Configuration
Create a docker-compose.yml file in the project root:
```yaml
version: "3.8"

volumes:
  prometheus_data: {}
  grafana_data: {}

networks:
  monitoring:
    driver: bridge

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--web.console.libraries=/usr/share/prometheus/console_libraries"
      - "--web.console.templates=/usr/share/prometheus/consoles"
      - "--web.enable-lifecycle"
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      # "$$" escapes "$" so Compose doesn't treat it as variable interpolation
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    networks:
      - monitoring
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
    command:
      - "--config.file=/etc/alertmanager/config.yml"
      - "--storage.path=/alertmanager"
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: unless-stopped
```

This Docker Compose setup wires together all the key components for a solid monitoring stack:
- Prometheus handles time-series data collection and storage. It pulls metrics from exporters and other endpoints based on your configuration. The `--web.enable-lifecycle` flag lets you trigger config reloads without restarting the container.
- Node Exporter collects low-level system metrics from the host, like CPU usage, memory, and disk stats. We mount `/proc`, `/sys`, and `/` read-only (pointed to by the `--path.*` flags) so it can read accurate host metrics without affecting the system.
- cAdvisor focuses on container-level metrics, offering insights into resource usage per container, which is handy when you're running multiple services on the same host.
- Grafana sits on top of Prometheus and provides a user-friendly interface to visualize your data. The provisioning folders (`datasources` and `dashboards`) ensure everything is set up automatically on first run.
- Alertmanager receives alerts from Prometheus and routes them to the right place: Slack, PagerDuty, email, and so on. Mounting the config from your local folder keeps it easy to tweak as your alerting needs evolve.
The volumes ensure data persists across restarts, and the shared monitoring network lets all services communicate internally. This setup gives you full control and visibility over your Docker environment—with minimal manual steps.
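Because `--web.enable-lifecycle` is set, you can apply edits to prometheus.yml or the rule files by sending an HTTP POST to Prometheus's `/-/reload` endpoint instead of restarting the container. The sketch below assumes Prometheus is published on `localhost:9090` as in the Compose file; the helper names are hypothetical.

```python
import urllib.request

# Assumption: Prometheus is published on localhost:9090, as in docker-compose.yml.
PROMETHEUS_URL = "http://localhost:9090"


def build_reload_request(base_url: str = PROMETHEUS_URL) -> urllib.request.Request:
    """Build a POST to /-/reload; the endpoint is only enabled by --web.enable-lifecycle."""
    return urllib.request.Request(f"{base_url}/-/reload", data=b"", method="POST")


def reload_prometheus(base_url: str = PROMETHEUS_URL, timeout: float = 5.0) -> int:
    """Ask Prometheus to re-read its config and rules; returns the HTTP status code."""
    # urlopen raises urllib.error.HTTPError if the reload is rejected (e.g. invalid config).
    with urllib.request.urlopen(build_reload_request(base_url), timeout=timeout) as resp:
        return resp.status
```

Calling `reload_prometheus()` after an edit should return 200. If the new configuration is invalid, Prometheus rejects the reload and keeps running with the previous one.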
If you’re running into issues with service restarts, this guide on Docker Compose restart can help clarify what’s actually happening behind the scenes.
Step 3: Configuring Prometheus
Create a prometheus.yml file in the prometheus directory:
```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules once and periodically evaluate them
rule_files:
  - "rules/*.yml"

# Scrape configurations
scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
```

Optimize Performance with Target-Specific Scrape Intervals
A single global interval is simple, but not every target needs the same resolution. You can override `scrape_interval` per job, based on how quickly each target's metrics change, by replacing the `scrape_configs` section with:

```yaml
scrape_configs:
  # System metrics change frequently, scrape more often
  - job_name: "node-exporter"
    scrape_interval: 10s
    static_configs:
      - targets: ["node-exporter:9100"]

  # Container metrics are also volatile
  - job_name: "cadvisor"
    scrape_interval: 10s
    static_configs:
      - targets: ["cadvisor:8080"]

  # Prometheus itself changes slowly, scrape less frequently
  - job_name: "prometheus"
    scrape_interval: 30s
    static_configs:
      - targets: ["localhost:9090"]
```

This approach puts your scrape budget where it matters: fast-moving host and container metrics get finer resolution, while Prometheus's own metrics are scraped less often than the 15-second global default.
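To see what the per-job intervals actually change, it helps to count scrapes per target per day. This is plain arithmetic, assuming one target per job as in the configs above; the real sample volume also depends on how many series each target exposes.

```python
SECONDS_PER_DAY = 86_400


def scrapes_per_day(interval_seconds: int) -> int:
    """How many times one target is scraped per day at a fixed interval."""
    return SECONDS_PER_DAY // interval_seconds


# Per-job intervals in seconds (assumption: one target per job, as in the configs above).
uniform = {"node-exporter": 15, "cadvisor": 15, "prometheus": 15}  # global default only
tuned = {"node-exporter": 10, "cadvisor": 10, "prometheus": 30}    # per-job overrides

for label, config in (("uniform", uniform), ("tuned", tuned)):
    per_job = {job: scrapes_per_day(seconds) for job, seconds in config.items()}
    print(f"{label}: {per_job} total={sum(per_job.values())}")
```

With these numbers the tuned configuration performs more scrapes in total (20,160 vs 17,280 per day). The gain is resolution for the volatile host and container metrics, paid for partly by scraping Prometheus itself less often.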
Step 4: Setting Up Alert Rules
Create an alert rules file at prometheus/rules/node_alerts.yml:
```yaml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPULoad
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU load (instance {{ $labels.instance }})"
          description: "CPU load is > 80%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"

      - alert: HighMemoryLoad
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory load (instance {{ $labels.instance }})"
          description: "Memory load is > 80%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"

      - alert: HighDiskUsage
        expr: (node_filesystem_size_bytes{fstype=~"ext4|xfs"} - node_filesystem_free_bytes{fstype=~"ext4|xfs"}) / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High disk usage (instance {{ $labels.instance }})"
          description: "Disk usage is > 85%\n VALUE = {{ $value }}%\n LABELS: {{ $labels }}"
```

Creating Predictive Alerts to Detect Unusual Metric Patterns
Most monitoring setups only alert on threshold violations. The following rule instead watches how fast available memory is shrinking:

```yaml
# Add this under the rules: list in prometheus/rules/node_alerts.yml
- alert: UnusualMemoryGrowth
  # deriv() returns bytes per second, so 10 MiB/min is divided by 60
  expr: deriv(node_memory_MemAvailable_bytes[30m]) < -10 * 1024 * 1024 / 60
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Unusual memory consumption rate (instance {{ $labels.instance }})"
    description: "Available memory is shrinking by more than 10 MiB/min\n VALUE = {{ $value | humanize }}B/s"
```

Because it looks at the trend rather than the current level, this alert can fire well before memory usage crosses a fixed threshold such as the 80% in HighMemoryLoad, giving you an earlier warning of potential issues.
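PromQL's `deriv()` fits a simple linear regression over the samples in the range and returns the slope per second, so a "per minute" budget has to be divided by 60 before the comparison. The sketch below reproduces that calculation on made-up samples (a hypothetical host losing 200 kB of available memory every second) to show how the slope compares with the threshold.

```python
def deriv(samples):
    """Per-second slope of (timestamp_seconds, value) pairs via least-squares linear regression."""
    n = len(samples)
    if n < 2:
        raise ValueError("deriv needs at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    return covariance / variance


# "More than 10 MiB per minute" expressed in bytes per second, the unit deriv() returns.
THRESHOLD_BYTES_PER_SECOND = -10 * 1024 * 1024 / 60

# Hypothetical data: a 30m window of 15s samples with MemAvailable falling 200 kB per second.
window = [(t, 8_000_000_000 - 200_000 * t) for t in range(0, 1800, 15)]
slope = deriv(window)
print(f"slope={slope:.0f} B/s, fires={slope < THRESHOLD_BYTES_PER_SECOND}")
```

A slope of about -200,000 B/s is steeper than the roughly -174,763 B/s threshold, so this host would trip the rule. Comparing the raw `-10 * 1024 * 1024` instead would demand a loss of 10 MiB every second.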
Last9 offers full monitoring support, including alerts and notifications. But no matter which tool you use, alerting still comes with familiar challenges—gaps in coverage, alert fatigue, and cleanup overhead. These aren’t problems with simple fixes.
Practical Container Alert Examples
Create a file at prometheus/rules/container_alerts.yml with practical container-specific alerts:
```yaml
groups:
  - name: container_alerts
    rules:
      - alert: ContainerRestarting
        expr: delta(container_start_time_seconds{name!=""}[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container restarting ({{ $labels.name }})"
          description: "Container {{ $labels.name }} has restarted in the last 15 minutes"

      - alert: ContainerHighMemoryUsage
        # "> 0" drops containers without a memory limit (a limit of 0 would make the ratio +Inf)
        expr: (container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container high memory usage ({{ $labels.name }})"
          description: "Container {{ $labels.name }} memory usage is {{ $value }}% of its limit"

      - alert: ContainerCPUThrottling
        expr: rate(container_cpu_cfs_throttled_periods_total{name!=""}[5m]) / rate(container_cpu_cfs_periods_total{name!=""}[5m]) > 0.25
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU throttling ({{ $labels.name }})"
          description: "Container {{ $labels.name }} is throttled in {{ $value | humanizePercentage }} of CPU periods"
```

These alerts catch real-world container issues that often go unnoticed:
- Container restarts that might indicate application crashes or configuration issues
- Memory pressure that can lead to OOM kills
- CPU throttling that affects application performance
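The memory and throttling alerts are both ratios, and the denominator matters. A minimal sketch of the two calculations follows, assuming (as is typical for cAdvisor) that a container started without a memory limit reports `container_spec_memory_limit_bytes` as 0; the function names are illustrative only.

```python
from typing import Optional


def memory_usage_percent(usage_bytes: float, limit_bytes: float) -> Optional[float]:
    """Share of the memory limit in use, or None when the container has no limit.

    Assumption: cAdvisor reports a limit of 0 for unlimited containers, so a plain
    division would produce +Inf and trip a "> 80" alert for every such container.
    """
    if limit_bytes <= 0:
        return None
    return usage_bytes / limit_bytes * 100


def throttled_fraction(throttled_periods: float, total_periods: float) -> float:
    """Fraction of CFS scheduling periods in which the container hit its CPU quota."""
    if total_periods <= 0:
        return 0.0  # typically no CPU quota is configured, so no CFS periods are counted
    return throttled_periods / total_periods
```

In PromQL the equivalent guard is to filter the denominator, for example `/ (container_spec_memory_limit_bytes{name!=""} > 0)`, so containers without a limit drop out of the result instead of evaluating to `+Inf`.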
Step 5: Configuring Alertmanager
Create a basic Alertmanager configuration in alertmanager/config.yml:
```yaml
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname"]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: "email-notifications"

receivers:
  - name: "email-notifications"
    email_configs:
      - to: "your-email@example.com"
        from: "alertmanager@example.com"
        smarthost: smtp.example.com:587
        auth_username: "your-username"
        auth_password: "your-password"
```

Implementing Team-Based Alert Routing for Efficient Incident Response
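Beyond a single receiver, you can route alerts to different teams based on their `severity` and `service` labels. The `service` routes only match if your alert rules attach a `service` label (the rules above set only `severity`), so add one under `labels:` where it applies:

```yaml
global:
  # The operations-team email receiver relies on these SMTP defaults
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  smtp_auth_username: "your-username"
  smtp_auth_password: "your-password"

route:
  # Default receiver
  receiver: "operations-team"
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  # Specific routing rules
  routes:
    - match:
        severity: critical
      receiver: "pager-duty"
      repeat_interval: 1h
      continue: true

    - match_re:
        service: database|redis|elasticsearch
      receiver: "database-team"

    - match_re:
        service: frontend|api
      receiver: "application-team"

receivers:
  - name: "operations-team"
    email_configs:
      - to: "ops@example.com"

  - name: "pager-duty"
    pagerduty_configs:
      - service_key: "your-pagerduty-key"

  - name: "database-team"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR_KEY"
        channel: "#db-alerts"

  - name: "application-team"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR_KEY"
        channel: "#app-alerts"
```

Critical alerts go to PagerDuty and, because of `continue: true`, keep being evaluated against the routes below, so a critical Redis alert also reaches the database team. Alerts that match no child route fall back to `operations-team`.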
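Alertmanager walks child routes in order: the first matching route handles the alert unless it sets `continue: true`, and an alert that matches no child falls back to the parent's receiver. `match_re` patterns are anchored, so `api` does not match `apiserver`. The simplified Python model below is not Alertmanager's own code (it ignores grouping, timing, and deeper nesting options), but it lets you check which receivers a given label set would reach under the routing tree above.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


# Simplified model of Alertmanager routing, for reasoning about a config only.
@dataclass
class Route:
    receiver: str
    match: Dict[str, str] = field(default_factory=dict)     # exact label values
    match_re: Dict[str, str] = field(default_factory=dict)  # regexes, anchored at both ends
    continue_: bool = False                                 # mirrors `continue: true`
    routes: List["Route"] = field(default_factory=list)

    def matches(self, labels: Dict[str, str]) -> bool:
        if any(labels.get(name) != value for name, value in self.match.items()):
            return False
        # Missing labels are treated as empty strings, which these patterns never match.
        return all(re.fullmatch(pattern, labels.get(name, "")) for name, pattern in self.match_re.items())


def receivers_for(route: Route, labels: Dict[str, str]) -> List[str]:
    """Return the receivers an alert reaches, walking child routes in order."""
    found: List[str] = []
    for child in route.routes:
        if child.matches(labels):
            found.extend(receivers_for(child, labels))
            if not child.continue_:
                break  # first match wins unless the route sets continue
    return found or [route.receiver]  # no child matched: the parent handles it


# The routing tree from the config above.
tree = Route(
    receiver="operations-team",
    routes=[
        Route("pager-duty", match={"severity": "critical"}, continue_=True),
        Route("database-team", match_re={"service": "database|redis|elasticsearch"}),
        Route("application-team", match_re={"service": "frontend|api"}),
    ],
)

print(receivers_for(tree, {"severity": "critical", "service": "redis"}))
```

A critical Redis alert lands on both `pager-duty` and `database-team`, while a warning for an unlisted service only reaches `operations-team`.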
Step 6: Setting Up Grafana Dashboards
Configure Grafana to connect to Prometheus automatically by creating grafana/provisioning/datasources/datasource.yml:
```yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

Set up dashboard provisioning in grafana/provisioning/dashboards/dashboards.yml:

```yaml
apiVersion: 1

providers:
  - name: "Default"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
```

Create a grafana/dashboards directory to store dashboard JSON files:

```bash
mkdir -p grafana/dashboards
```

Grafana supports automatic setup through provisioning.
- In `datasource.yml`, we define Prometheus as the default data source using the internal Docker URL.
- In `dashboards.yml`, we tell Grafana to load dashboards from a specific folder.
- The `grafana/dashboards` directory is where you'll store those dashboard JSON files.
With this setup, Grafana connects to Prometheus and loads dashboards automatically—no manual steps needed.
If you want a deeper look at running Grafana in Docker containers, take a peek at our detailed guide on Grafana and Docker.
How to Create a Consolidated Dashboard for Complete System Visibility
Most setups require you to import separate dashboards for different components. Create a unified system dashboard at grafana/dashboards/system-overview.json that shows key metrics from all sources on a single screen:
```json
{
  "title": "System Overview",
  "uid": "system-overview",
  "version": 1,
  "panels": [
    {
      "title": "CPU Usage",
      "type": "gauge",
      "gridPos": { "h": 8, "w": 6, "x": 0, "y": 0 },
      "targets": [
        { "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)" }
      ]
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "gridPos": { "h": 8, "w": 6, "x": 6, "y": 0 },
      "targets": [
        { "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100" }
      ]
    },
    {
      "title": "Disk Usage",
      "type": "gauge",
      "gridPos": { "h": 8, "w": 6, "x": 12, "y": 0 },
      "targets": [
        { "expr": "(node_filesystem_size_bytes{mountpoint=\"/\"} - node_filesystem_free_bytes{mountpoint=\"/\"}) / node_filesystem_size_bytes{mountpoint=\"/\"} * 100" }
      ]
    },
    {
      "title": "Container CPU Usage",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
      "targets": [
        { "expr": "sum by(name) (rate(container_cpu_usage_seconds_total{name!=\"\"}[5m])) * 100" }
      ]
    },
    {
      "title": "Container Memory Usage",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 },
      "targets": [
        { "expr": "sum by(name) (container_memory_usage_bytes{name!=\"\"})" }
      ]
    }
  ]
}
```

The line charts use the `timeseries` panel type, which replaced the older `graph` panel in current Grafana releases. Update your Docker Compose file to mount this directory:

```yaml
grafana:
  # ... existing configuration ...
  volumes:
    - grafana_data:/var/lib/grafana
    - ./grafana/provisioning:/etc/grafana/provisioning
    - ./grafana/dashboards:/var/lib/grafana/dashboards # Add this line
```

This approach automates dashboard creation and provides a unified view of system and container metrics.
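Hand-editing `gridPos` values gets tedious as a dashboard grows. Grafana lays panels out on a 24-column grid, so a small script can compute positions for you. This is a hypothetical helper, not a Grafana tool; it emits the same five panels, using `timeseries` for the line charts.

```python
import json
from typing import List, Tuple

GRID_COLUMNS = 24  # Grafana dashboards are 24 columns wide


def layout_panels(specs: List[Tuple[str, str, str, int]], height: int = 8) -> List[dict]:
    """Place (title, type, expr, width) panels left to right, wrapping when a row is full."""
    panels, x, y = [], 0, 0
    for title, panel_type, expr, width in specs:
        if x + width > GRID_COLUMNS:
            x, y = 0, y + height  # start a new row
        panels.append({
            "title": title,
            "type": panel_type,
            "gridPos": {"h": height, "w": width, "x": x, "y": y},
            "targets": [{"expr": expr}],
        })
        x += width
    return panels


specs = [
    ("CPU Usage", "gauge",
     '100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)', 6),
    ("Memory Usage", "gauge",
     "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100", 6),
    ("Disk Usage", "gauge",
     '(node_filesystem_size_bytes{mountpoint="/"} - node_filesystem_free_bytes{mountpoint="/"})'
     ' / node_filesystem_size_bytes{mountpoint="/"} * 100', 6),
    ("Container CPU Usage", "timeseries",
     'sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100', 12),
    ("Container Memory Usage", "timeseries",
     'sum by(name) (container_memory_usage_bytes{name!=""})', 12),
]

dashboard = {"title": "System Overview", "uid": "system-overview", "version": 1,
             "panels": layout_panels(specs)}
print(json.dumps(dashboard, indent=2))
```

Redirect the output into grafana/dashboards/system-overview.json (for example `python make_dashboard.py > grafana/dashboards/system-overview.json`, with a script name of your choosing), and the file provider picks it up within `updateIntervalSeconds`.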
Step 7: Performance Optimization for Production
For production deployments, optimize Prometheus for better performance:
```yaml
# Update in docker-compose.yml
prometheus:
  # ... existing configuration ...
  command:
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--storage.tsdb.path=/prometheus"
    - "--storage.tsdb.retention.time=15d" # Retention period (15d is the default; adjust to your disk budget)
    - "--storage.tsdb.wal-compression" # Compress write-ahead log
    - "--web.enable-lifecycle" # Enable runtime reloading
```

Implement Pre-Computed Metrics to Accelerate Dashboard Performance
Expensive expressions are recomputed from raw series every time a dashboard refreshes, which slows dashboards down as the number of series and viewers grows. Recording rules pre-compute frequently used expressions on a fixed schedule. Add them in prometheus/rules/recording_rules.yml, which the existing `rules/*.yml` glob already loads:

```yaml
groups:
  - name: recording_rules
    interval: 1m
    rules:
      - record: node:cpu_usage:avg5m
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      - record: node:memory_usage:percent
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100

      - record: container:cpu_usage:avg5m
        expr: sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[5m])) * 100
```

Then update your dashboards to use these pre-computed metrics:

```promql
# Original query
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Optimized query using recording rule
node:cpu_usage:avg5m
```

This can noticeably speed up dashboards, especially with many concurrent users, because Prometheus evaluates each expression once per rule interval (1m here) instead of on every panel refresh.
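To make the recorded value concrete, here is the arithmetic behind `node:cpu_usage:avg5m` for a single instance. `irate()` uses only the last two samples of each per-CPU idle counter; averaging the resulting idle fractions and subtracting from 100 gives the busy percentage. The sample values are invented for illustration, and counter resets are ignored.

```python
from typing import Dict, List, Tuple


def cpu_busy_percent(idle_counters: Dict[str, List[Tuple[float, float]]]) -> float:
    """100 - avg(irate(idle)) * 100 for one instance.

    idle_counters maps each CPU to (timestamp_seconds, node_cpu_seconds_total{mode="idle"})
    samples. Like irate(), only the last two samples per CPU are used; counter resets are
    ignored in this sketch.
    """
    idle_fractions = []
    for samples in idle_counters.values():
        (t0, v0), (t1, v1) = samples[-2], samples[-1]
        idle_fractions.append((v1 - v0) / (t1 - t0))  # idle seconds per wall-clock second
    return 100 - sum(idle_fractions) / len(idle_fractions) * 100


# Invented samples for a two-CPU host, scraped every 15 seconds.
samples = {
    "0": [(0.0, 1000.0), (15.0, 1012.0)],  # 12s idle out of 15s: 80% idle
    "1": [(0.0, 2000.0), (15.0, 2013.5)],  # 13.5s idle out of 15s: 90% idle
}
print(round(cpu_busy_percent(samples), 2))
```

With those samples the two CPUs are 80% and 90% idle, so the instance reports 15% CPU usage, which is the value the recording rule would store for it.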
Best Practices for Production Deployments
Version Pinning for Stability
Always use specific versions instead of “latest” tags for production:
```yaml
prometheus:
  image: prom/prometheus:v2.40.0 # Specific version, not 'latest'
  # ...

grafana:
  image: grafana/grafana:9.3.2 # Specific version
  # ...
```

This ensures consistent behavior and prevents unexpected changes during updates.
Security Considerations
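- Use Strong Passwords: Never use default credentials in production. Read the password from an environment variable, for example one defined in a `.env` file next to docker-compose.yml:

```yaml
grafana:
  # ...
  environment:
    - GF_SECURITY_ADMIN_USER=admin
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD} # Use env variable
    - GF_USERS_ALLOW_SIGN_UP=false
    - GF_AUTH_ANONYMOUS_ENABLED=false
```

- Network Restriction: Never expose monitoring services directly to the internet. Put a reverse proxy with authentication in front of them, attach it to the `monitoring` network so it can reach the services, and stop publishing the other services' ports (or bind them to localhost, e.g. `"127.0.0.1:9090:9090"`):

```yaml
nginx:
  image: nginx:latest
  ports:
    - "443:443"
  volumes:
    - ./nginx/nginx.conf:/etc/nginx/conf.d/default.conf
    - ./nginx/certs:/etc/nginx/certs
    - ./nginx/.htpasswd:/etc/nginx/.htpasswd
  networks:
    - monitoring
```

- Use Non-Root Users: Run containers with non-root users when possible. The official Grafana image already runs as UID 472; setting it explicitly documents the intent:

```yaml
grafana:
  # ...
  user: "472" # Grafana's built-in user
```

Step 8: Launching the Monitoring Stack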
Start your monitoring stack:
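```bash
docker-compose up -d
```

Verify all containers are running:

```bash
docker-compose ps
```

With the Compose V2 plugin, the equivalent commands are `docker compose up -d` and `docker compose ps`.

Access the monitoring interfaces: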
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (log in with admin/admin, or with the password from `GRAFANA_PASSWORD` if you applied the security settings above)
- cAdvisor: http://localhost:8080
- Alertmanager: http://localhost:9093
Verify Your Setup
In Prometheus, go to Status > Targets to confirm all targets show “UP” status.
In Grafana, navigate to your dashboards to see system metrics.
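The Status > Targets page is backed by Prometheus's HTTP API at `/api/v1/targets`, which makes the same check scriptable, for example in CI after `docker-compose up`. A minimal sketch, assuming Prometheus is on `localhost:9090`; the function names are hypothetical.

```python
import json
import urllib.request
from typing import List

# Assumption: Prometheus is published on localhost:9090, as in docker-compose.yml.
TARGETS_URL = "http://localhost:9090/api/v1/targets"


def fetch_targets(url: str = TARGETS_URL, timeout: float = 5.0) -> dict:
    """Fetch the JSON document behind Status > Targets."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_targets(payload: dict) -> List[str]:
    """List 'job/instance: last error' for every active target whose health isn't 'up'."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            labels = target["labels"]
            problems.append(f"{labels.get('job')}/{labels.get('instance')}: {target.get('lastError', '')}")
    return problems
```

Running `print(unhealthy_targets(fetch_targets()) or "all targets up")` lists any target that isn't `up` together with its last scrape error.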
If you’re weighing orchestration options, take a look at our comparison of Docker vs. Docker Swarm key differences to see which fits your setup best.
Step 9: Monitoring Docker Containers
cAdvisor automatically collects container metrics. View these metrics in Prometheus with queries like:
- Container CPU: `sum by (name) (rate(container_cpu_usage_seconds_total{name!=""}[1m]))`
- Container Memory: `sum by (name) (container_memory_usage_bytes{name!=""})`
- Container Disk I/O: `sum by (name) (rate(container_fs_reads_bytes_total{name!=""}[1m]))`
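`container_cpu_usage_seconds_total` and `container_fs_reads_bytes_total` are counters, which is why these queries wrap them in `rate()`. Counters start over from zero when a container restarts, and `rate()` treats any decrease as a reset rather than a negative change. The simplified sketch below shows that reset handling on invented samples; unlike Prometheus, it does not extrapolate to the edges of the window.

```python
from typing import List, Tuple


def simple_rate(samples: List[Tuple[float, float]]) -> float:
    """Per-second increase of a counter, treating any decrease as a reset to zero.

    Simplified: Prometheus's rate() additionally extrapolates to the window boundaries.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # After a reset the counter restarted from 0, so the whole current value is new.
        increase += current - previous if current >= previous else current
    return increase / (samples[-1][0] - samples[0][0])


# Invented CPU-seconds counter for a container that restarts between 15s and 30s.
cpu_seconds = [(0.0, 100.0), (15.0, 130.0), (30.0, 10.0), (45.0, 40.0)]
print(simple_rate(cpu_seconds))
```

Without the reset handling, the drop from 130 to 10 would count as -120 and drag the result negative.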
Identifying Unhealthy Containers with Targeted Monitoring Queries
Add these queries to your monitoring to identify unhealthy containers:
- Restarting Containers: `sum by(name) (delta(container_start_time_seconds{name!=""}[15m])) > 0`
- High Container CPU (above 80% of one core): `sum by(name) (rate(container_cpu_usage_seconds_total{name!=""}[1m])) * 100 > 80`
- OOM Risk (containers with a memory limit only): `container_memory_usage_bytes{name!=""} / (container_spec_memory_limit_bytes{name!=""} > 0) * 100 > 80`
Troubleshooting Prometheus and Grafana in Docker Compose
If you encounter problems:
No data in Grafana
- Check Prometheus targets in Status > Targets
- Verify Prometheus data source is working in Grafana
- Test queries directly in Prometheus UI
- Verify the time range in Grafana includes data collection period
Container not showing in metrics
- Ensure cAdvisor has access to the Docker socket
- Check the container is running with `docker ps`
- Verify metrics in Prometheus with `container_cpu_usage_seconds_total`
Alerts not firing
- Check alert rules in Prometheus UI
- Verify Alertmanager configuration
- Test alert notifications manually
- Check that conditions persist long enough to trigger alerts (the `for` duration)
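One way to test notifications manually is to push a synthetic alert straight into Alertmanager's v2 API (`POST /api/v2/alerts`), bypassing Prometheus. The sketch assumes Alertmanager on `localhost:9093`; the alert name and labels are made up, so pick labels that exercise the route you want to test.

```python
import json
import urllib.request

# Assumption: Alertmanager is published on localhost:9093, as in docker-compose.yml.
ALERTMANAGER_URL = "http://localhost:9093"


def build_test_alert(labels: dict, summary: str) -> urllib.request.Request:
    """Build a POST to Alertmanager's v2 API carrying one synthetic alert."""
    alerts = [{
        "labels": {"alertname": "ManualTestAlert", **labels},  # hypothetical alert name
        "annotations": {"summary": summary},
    }]
    return urllib.request.Request(
        f"{ALERTMANAGER_URL}/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.status


# Labels chosen to exercise a severity-based route; adjust them to the route under test.
req = build_test_alert({"severity": "critical"}, "Manual notification test")
```

Send it with `send(req)`; Alertmanager should answer 200, and since no `endsAt` is given, the alert resolves on its own after `resolve_timeout`.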
Performance issues
- Reduce scrape frequency for less important targets
- Use recording rules for frequently queried expressions
- Adjust the retention period based on available disk space
- Consider using remote storage for longer retention
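When adjusting retention, the Prometheus storage documentation gives a rule of thumb: needed disk space is roughly retention time in seconds × ingested samples per second × bytes per sample, where a sample typically takes 1 to 2 bytes. You can read the real ingestion rate from `rate(prometheus_tsdb_head_samples_appended_total[1h])`. The inputs below (3 targets, about 5,000 series each) are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def ingested_samples_per_second(targets: int, series_per_target: int, scrape_interval_s: float) -> float:
    """Every scrape appends one sample per exposed series."""
    return targets * series_per_target / scrape_interval_s


def estimated_disk_bytes(retention_days: float, samples_per_second: float,
                         bytes_per_sample: float = 2.0) -> float:
    """Rule of thumb from the Prometheus docs: retention seconds * samples/s * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Hypothetical stack: 3 targets exposing ~5,000 series each, scraped every 15s.
rate = ingested_samples_per_second(targets=3, series_per_target=5_000, scrape_interval_s=15)
needed = estimated_disk_bytes(retention_days=15, samples_per_second=rate)
print(f"{rate:.0f} samples/s -> {needed / 1e9:.1f} GB for 15 days")
```

That works out to roughly 2.6 GB for 15 days at the pessimistic 2 bytes per sample; check it against your measured ingestion rate before sizing volumes.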
Next Steps
You’ve now got a solid monitoring setup for your Docker environment. From here, it’s easy to extend—add service-specific exporters, refine dashboards, and connect alerting tools your team already uses.
If you’re starting to feel the limits of managing this stack yourself—or just want something that’s easier to scale and maintain—this is where Last9 can help.
We work with teams at Probo, CleverTap, Replit, and others to handle high-cardinality observability at scale. With native support for OpenTelemetry and Prometheus, Last9 brings metrics, logs, and traces into one place—optimized for performance, cost, and real-time debugging. We’ve even monitored 11 of the 20 largest live-streaming events in history.
Let us handle the complexity, so you can focus on building. Book some time with us today!
