Everything looks fine—dashboards are green, logs are quiet. But users start reporting slow response times. No errors, no traffic spikes. Just a general slowdown.
It’s a common situation. Not all problems show up as crashes or clear failures. Sometimes, performance degrades quietly, and standard metrics don’t catch it early.
But that's where Prometheus alerting can help, if you're monitoring the right signals.
In this guide, we’ll walk through Prometheus alert examples you can use in your setup—from basic infrastructure checks to custom alerts based on application behavior.
The Prometheus + Alertmanager Flow
Before we jump into examples, let’s quickly see how Prometheus handles alerting behind the scenes.
It works in two parts:
- Prometheus checks whether your alert conditions are true using PromQL (its query language).
- Alertmanager decides what to do with those alerts, like sending a message to Slack, email, or PagerDuty.
Your alert rules live in YAML files. Each rule says:
“If this thing is true for this long, send an alert.”
Here’s a simple example:
groups:
  - name: example-alerts
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percentage > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"
Breaking it down:
- expr: The PromQL query that checks your condition (Is CPU usage above 80%?)
- for: How long the condition needs to stay true before firing (5 minutes here)
- labels: Extra info, like how serious this alert is
- annotations: Messages that help people understand what’s going on
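That covers the Prometheus half. On the Alertmanager side, a separate YAML config decides where alerts go. Here is a minimal sketch that sends everything to one Slack channel; the channel name and webhook URL are placeholders you would replace with your own:

route:
  receiver: slack-default      # default destination when no sub-route matches
  group_by: ['alertname']      # batch alerts with the same name into one notification
  group_wait: 30s              # short wait so related alerts arrive together

receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'                              # placeholder channel
        api_url: 'https://hooks.slack.com/services/XXX' # placeholder webhook URL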
Key Alerts for Tracking Infrastructure Performance
Infrastructure alerts warn you about issues with your servers before they get worse.
CPU and Memory Monitoring
High CPU usage usually means your system is under heavy load or a process is consuming more resources than expected. Here’s a simple alert to monitor CPU usage:
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
  for: 10m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
What’s happening here?
- The query calculates the percentage of CPU in use (not idle).
- If usage exceeds 85% for 10 minutes, the alert triggers.
- Labels and annotations provide context to the team receiving the alert.
This alert can help identify situations where CPU resources are heavily consumed, allowing you to take action before performance degrades.
Memory alerts are equally important. When memory usage is too high, applications may slow down or crash. Here’s an example alert for memory usage:
- alert: HighMemoryUsage
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
  for: 5m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
This alert fires if memory usage remains above 90% for 5 minutes. It helps identify issues like memory leaks or unexpected spikes before they cause application failures.
Early Warnings for Low Disk Space Conditions
Running out of disk space can quickly bring your services down. This alert helps you catch low disk space early, so you have time to clean up or add more storage:
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Low disk space on {{ $labels.instance }}"
    description: "Only {{ $value }}% of disk space is left on {{ $labels.instance }} at {{ $labels.mountpoint }}"
What this does:
- It checks if available disk space is below 10% for at least 5 minutes.
- It ignores temporary file systems (like tmpfs) to avoid false alerts.
- Labels and annotations help your team quickly see which server and mount point are affected.
Application-Level Alerts That Impact User Experience
Here are some key alerts you should have in place:
HTTP Error Rate Monitoring
When your API starts returning too many errors (like 500s), that’s a red flag. You want to know quickly if error rates rise above a certain threshold.
For example, an alert can monitor the percentage of 5xx errors over a short time window and notify you if it goes above 5%. This helps you identify issues like a failing service or bad deployments early.
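Here is one way to express that as a rule, a sketch assuming your service exposes the common http_requests_total counter with a status label:

- alert: HighHTTPErrorRate
  # 5xx responses as a share of all requests over the last 5 minutes
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      / sum(rate(http_requests_total[5m])) by (job) > 0.05
  for: 5m
  labels:
    severity: critical
    team: backend
  annotations:
    summary: "High 5xx error rate on {{ $labels.job }}"
    description: "{{ $value | humanizePercentage }} of requests to {{ $labels.job }} are returning 5xx errors"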
Response Time Alerts
Slow responses frustrate users. Monitoring your app’s response time can reveal when things start to slow down.
A common alert watches the 95th percentile of response times—basically, the slowest 5% of requests—and triggers if they go beyond a set limit (like half a second). This gives you a heads-up to investigate potential bottlenecks before users complain.
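A sketch of such a rule, assuming your app records latency in a Prometheus histogram named http_request_duration_seconds (swap in whatever metric your instrumentation actually exposes):

- alert: SlowResponseTime
  # 95th percentile latency derived from histogram buckets
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
    ) > 0.5
  for: 10m
  labels:
    severity: warning
    team: backend
  annotations:
    summary: "Slow responses from {{ $labels.job }}"
    description: "95th percentile latency for {{ $labels.job }} is {{ $value }}s, above the 0.5s threshold"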
Database Connection Pool Monitoring
Your database connection pool controls how many simultaneous connections your app can use. If this pool gets maxed out, new requests can’t connect and might fail.
An alert here monitors how much of the pool is in use. If usage exceeds 90%, it warns you so you can take action—maybe increase the pool size or optimize queries.
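The exact metric depends on your pool library or exporter; the names below (db_connection_pool_active, db_connection_pool_max) are placeholders to show the shape of the rule:

- alert: ConnectionPoolNearlyExhausted
  # in-use connections as a fraction of the configured pool size (placeholder metric names)
  expr: db_connection_pool_active / db_connection_pool_max > 0.9
  for: 5m
  labels:
    severity: warning
    team: backend
  annotations:
    summary: "Connection pool almost full on {{ $labels.instance }}"
    description: "{{ $value | humanizePercentage }} of database connections are in use"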
Why are they important?
Infrastructure alerts show if servers are healthy. Application alerts reveal if your users are having a bad experience. Both are essential to keeping your system reliable and user-friendly.
How to Set Alerts for Business-Critical Metrics
Sometimes, you need alerts that are tailored to your specific business needs. These Prometheus alert examples focus on application scenarios unique to your use case.
Queue Length Monitoring
If your system uses message queues, monitoring their length is crucial. A growing queue might mean your application is falling behind in processing tasks.
For example, you could set an alert to warn if a queue has more than 1,000 messages waiting for 5 minutes or longer. This helps you spot processing bottlenecks early.
- alert: HighQueueLength
  expr: queue_messages_total > 1000
  for: 5m
  labels:
    severity: warning
    team: backend
  annotations:
    summary: "Queue length is high"
    description: "Queue {{ $labels.queue_name }} has {{ $value }} messages"
User Registration Rate
Tracking unusual spikes in user registrations can help detect either rapid growth or suspicious activity like bot attacks.
Here’s an example alert that triggers if the registration rate exceeds 100 users per hour for 10 minutes. This gives your product team a heads-up to investigate further.
- alert: UnusualUserRegistrationRate
  expr: increase(user_registrations_total[1h]) > 100
  for: 10m
  labels:
    severity: info
    team: product
  annotations:
    summary: "Unusual user registration rate"
    description: "{{ $value }} user registrations in the last hour"
Alerts to Track Container and Kubernetes Issues
If you run containers or Kubernetes, you also need alerts for failure modes that won't show up in regular host metrics.
Pod Restart Monitoring
Pods restarting often usually means a problem, like an app crash or a resource limit being hit. This alert fires if a pod has restarted within the last 15 minutes, so crash loops get flagged early.
- alert: PodRestartingTooMuch
  expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Pod {{ $labels.pod }} is restarting frequently"
    description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} restarted {{ $value }} times in the last 15 minutes"
Container Resource Limits
Containers have limits on memory and CPU. When memory use gets close to the limit, it can cause problems. This alert warns if a container’s memory usage is above 90% of its limit for 5 minutes.
- alert: ContainerMemoryNearLimit
  expr: (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Container memory usage near limit"
    description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of its memory limit"
Kubernetes Node Monitoring
If a Kubernetes node is not ready for more than 10 minutes, it could cause failures. This alert lets you know if a node isn’t ready.
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 10m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Kubernetes node not ready"
    description: "Node {{ $labels.node }} has been not ready for over 10 minutes"
Network and Connectivity Alerts
Network interface errors can quietly degrade throughput and cause retries. This alert tracks receive errors on network interfaces and fires if the error rate stays high over 5 minutes.
- alert: HighNetworkReceiveErrors
  expr: rate(node_network_receive_errs_total[5m]) > 10
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "High network receive errors"
    description: "Interface {{ $labels.device }} on {{ $labels.instance }} has {{ $value }} receive errors per second"
SSL Certificate Expiration
Expired SSL certificates cause outages. This alert warns 30 days before a certificate expires.
- alert: SSLCertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "SSL certificate expiring soon"
    description: "SSL certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
Monitoring Service Dependencies and Predictive Alerts
Sometimes, your system depends on several services working together. It’s important to know when any critical service goes down.
Multi-Service Dependency Alerts
This alert checks if key services like your API, database, or cache are down. It waits for 1 minute before alerting, so you don’t get false alarms from brief outages.
- alert: CriticalServicesDown
  expr: up{job=~"api|database|cache"} == 0
  for: 1m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Critical service is down"
    description: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
Predictive Alerts
Instead of waiting for disk space to run out, you can predict when it will happen based on recent trends. This alert warns you if your disk is expected to be full within the next 4 hours.
- alert: DiskSpaceRunningOut
  expr: predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Disk space will run out soon"
    description: "Disk on {{ $labels.instance }} is expected to be full in about 4 hours"
Context Switching and Performance
High rates of context switching on a node can signal performance problems. This alert notifies you if the number of context switches per second is unusually high over 10 minutes.
- alert: HighContextSwitching
  expr: rate(node_context_switches_total[5m]) > 10000
  for: 10m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "High context switching detected"
    description: "Node {{ $labels.instance }} is experiencing {{ $value }} context switches per second"
Common Challenges and How to Solve Them
Prometheus alerting is powerful, but you’ll often run into some common issues that can cause noisy alerts, missed problems, or confusing signals.
Here’s how to handle the most frequent ones.
Noisy Metrics — How to Avoid False Alarms
Some metrics naturally fluctuate a lot, like CPU usage during brief spikes or network traffic bursts. These jumps can cause alerts to trigger too often, making it hard to spot real issues. To smooth out this noise, use functions like avg_over_time(), which take an average over a time window. This helps your alerts focus on sustained problems instead of temporary blips.
- alert: CPUUsageHighButSmoothed
  expr: avg_over_time(node_cpu_usage_percentage[10m]) > 80
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Sustained high CPU usage"
    description: "CPU usage has averaged {{ $value }}% over the last 10 minutes"
In this example, the alert only fires if the 10-minute average CPU usage stays above 80% for 15 minutes, cutting down on noise from short spikes.
Missing Metrics — Identifying Silent Failures
When a service stops running or crashes, it often stops sending metrics altogether. This can leave you blind to the problem. Prometheus has the absent() function to detect when expected metrics disappear. Use it to trigger alerts when data goes missing, so you know if a service is down or unreachable.
- alert: ServiceMetricsMissing
  expr: absent(up{job="my-service"}) or up{job="my-service"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Service metrics missing"
    description: "Haven't received metrics from {{ $labels.job }} for 5 minutes"
This alert tells you if Prometheus hasn’t received a heartbeat from your service for 5 minutes, helping you catch outages early.
Managing Alert Dependencies — Avoiding Confusing Alerts
Some alerts only make sense if the underlying service is running. For example, monitoring slow queries in a database is pointless if the database is down. By combining conditions, you can avoid alerts that don’t apply, keeping your alerts relevant and easier to act on.
- alert: DatabaseSlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) * 60 > 10 and on(instance) up{job="mysql"} == 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Database experiencing slow queries"
    description: "MySQL on {{ $labels.instance }} is logging {{ $value }} slow queries per minute"
Here, the alert only fires if MySQL is up and the slow-query rate stays above 10 per minute for 10 minutes. Since mysql_global_status_slow_queries is a cumulative counter, the expression compares its per-minute rate rather than the raw value.
Best Practices for Writing Actionable and Sustainable Alerts
Setting up alerts is about making alerts meaningful, minimizing noise, and ensuring the right people take action at the right time.
Define Alert Severity Levels with Clear Response Expectations
Categorize alerts by severity to set expectations for how quickly teams should respond. This helps avoid alert fatigue and ensures critical issues get attention fast.
Severity | Expected Response Time | Typical Use Case
---|---|---
critical | Immediately (interrupts sleep) | Full service outage
warning | During business hours | High CPU/memory usage
info | Passive monitoring | Unusual but not harmful behavior
Use Labels to Route Alerts and Add System Context
Labels aren’t just metadata — they’re essential for organizing alerts, filtering them in dashboards, and routing them to the right on-call team. Use a consistent set of labels across alerts for easier management.
labels:
  severity: warning
  team: backend
  service: user-api
  environment: production
  runbook: "https://wiki.company.com/runbooks/user-api"
This structure helps:
- Route alerts to the correct Slack or PagerDuty team
- Enable alert deduplication and grouping
- Quickly locate relevant runbooks or Grafana dashboards
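With a consistent team label in place, Alertmanager can route on it. Here is a minimal routing sketch; the receiver names are placeholders, and their Slack/PagerDuty settings are omitted for brevity:

route:
  receiver: default-slack
  group_by: ['alertname', 'team']
  routes:
    - matchers:
        - team="backend"
      receiver: backend-pagerduty
    - matchers:
        - team="platform"
      receiver: platform-slack

receivers:
  - name: default-slack        # slack_configs omitted
  - name: backend-pagerduty    # pagerduty_configs omitted
  - name: platform-slack       # slack_configs omitted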
Write Annotations That Tell You Why the Alert Matters
The best alerts include annotations that explain what’s wrong and where to look next. Think of annotations as the "what now?" guide for whoever gets paged.
annotations:
  summary: "{{ $labels.service }} is experiencing high latency"
  description: "The 95th percentile latency for {{ $labels.service }} is {{ $value }}ms, which is above the 500ms threshold"
  runbook_url: "https://wiki.company.com/runbooks/{{ $labels.service }}"
  dashboard_url: "https://grafana.company.com/d/service-{{ $labels.service }}"
Make sure annotations:
- Include a short, human-readable summary of the issue
- Explain the metric threshold and its impact
- Link directly to runbooks and dashboards
Advanced PromQL for Alerting
Some alerting scenarios need more than basic threshold checks. These examples show how to handle gaps in data and avoid false positives during quiet traffic periods.
Detecting When a Service Stops Reporting Metrics
Sometimes a service goes down silently—no logs, no metrics. This alert checks for both explicit failures and missing data:
- alert: ServiceDown
  expr: up{job="my-service"} == 0 or absent(up{job="my-service"})
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is down"
    description: "No metrics received from {{ $labels.job }}. The service may be offline."
This pattern is useful when your alerting system depends on metric presence; absent() helps catch silent failures.
Rate-Based Alerts With Traffic Thresholds
Error rates can spike when there’s barely any traffic, leading to noisy alerts. This query filters out those edge cases by adding a minimum traffic requirement:
- alert: HighErrorRateWithMinTraffic
  expr: |
    (sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) > 0.1)
    and
    (sum by (job) (rate(http_requests_total[5m])) > 1)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate with sufficient traffic"
    description: "The service has a high rate of 5xx errors while handling more than 1 request/sec."
By adding a traffic floor (more than 1 request per second), you reduce noise from low-traffic services that occasionally throw a 500.

Getting Started with Alerting with Last9
You don’t need to build alerting from scratch. Modern observability platforms simplify how you configure, manage, and scale Prometheus alerts.
If you're already using Prometheus, Last9 makes things easier. It’s fully PromQL-compatible and adds features like a real-time alert monitor, historical alert health, and the ability to correlate alerts with system events.
When evaluating an observability platform, look for:
- Native support for Prometheus and OpenTelemetry
- Support for high-cardinality data without performance hits
- Low-latency queries, even at scale
- Built-in tools to reduce false positives and alert fatigue
Start monitoring with Last9 today!
FAQs
How often should Prometheus evaluate alert rules?
Prometheus evaluates alert rules based on your global evaluation interval, typically every 15-30 seconds. You can adjust this in your Prometheus configuration, but shorter intervals increase resource usage.
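For reference, the interval lives in the global block of prometheus.yml; a common setup looks something like this (values are illustrative):

global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 30s    # how often alerting and recording rules are evaluated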
What's the difference between the for duration and the evaluation interval?
The for duration specifies how long a condition must stay true before the alert fires, while the evaluation interval determines how often Prometheus checks the condition. An alert with for: 5m must remain true at every evaluation across a 5-minute window; with a 15-second evaluation interval, that works out to roughly 20 consecutive evaluations.
How do I prevent alert spam during outages?
Use Alertmanager's grouping and inhibition features. Group related alerts together and set up inhibition rules so that higher-severity alerts silence related lower-severity ones. Also, consider using longer evaluation periods during known maintenance windows.
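As a sketch, an inhibition rule that lets a critical alert mute warnings from the same instance, alongside basic grouping, might look like this in alertmanager.yml:

route:
  receiver: default                 # receiver definition omitted for brevity
  group_by: ['alertname', 'team']   # related alerts are batched into one notification
  group_wait: 30s
  repeat_interval: 4h

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']             # only inhibit when both alerts share the same instance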
What's the best way to organize alert rules in files?
Group related alerts together and use descriptive file names like infrastructure.yml, application.yml, and business-logic.yml. Keep each file focused on a specific domain or service. Use consistent naming conventions for alert names that include the component and condition, like DatabaseConnectionPoolHigh or APIResponseTimeSlow.
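Then point Prometheus at those files from prometheus.yml; the rules/ directory here is just an assumption about your layout:

rule_files:
  - "rules/infrastructure.yml"
  - "rules/application.yml"
  - "rules/business-logic.yml"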
How do I prevent alerts from firing during maintenance windows?
Use Alertmanager's silencing feature to temporarily suppress alerts. You can create silences manually through the web UI or programmatically via the API. For planned maintenance, consider creating automation that sets up silences before maintenance begins and removes them afterward. Alternatively, use external labels to distinguish between environments and route maintenance alerts differently.
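Silences themselves are created at runtime (via the UI or API), not in rule files. For recurring windows, recent Alertmanager versions also support mute time intervals directly in the config; here is a sketch for a weekly Saturday-night window on the platform team's route, with names and times purely illustrative:

route:
  receiver: platform-oncall
  routes:
    - matchers:
        - team="platform"
      receiver: platform-oncall
      mute_time_intervals:
        - weekly-maintenance       # notifications suppressed during the window below

time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ['saturday']
        times:
          - start_time: '02:00'
            end_time: '04:00'

Alerts still evaluate and fire in Prometheus during the window; only the notifications are held back.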