Everything looks fine—dashboards are green, logs are quiet. But users start reporting slow response times. No errors, no traffic spikes. Just a general slowdown.
It’s a common situation. Not all problems show up as crashes or clear failures. Sometimes, performance degrades quietly, and standard metrics don’t catch it early.
But that's where Prometheus alerting can help, if you're monitoring the right signals.
In this guide, we’ll walk through Prometheus alert examples you can use in your setup—from basic infrastructure checks to custom alerts based on application behavior.
The Prometheus + Alertmanager Flow
Before we jump into examples, let’s quickly see how Prometheus handles alerting behind the scenes.
It works in two parts:
- Prometheus checks whether your alert conditions are true using PromQL (its query language).
- Alertmanager decides what to do with those alerts, like sending a message to Slack, email, or PagerDuty.
Your alert rules live in YAML files. Each rule says:
“If this thing is true for this long, send an alert.”
Here’s a simple example:
groups:
  - name: example-alerts
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percentage > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes"
Breaking it down:
- expr: The PromQL query that checks your condition (Is CPU usage above 80%?)
- for: How long the condition needs to stay true before firing (5 minutes here)
- labels: Extra info, like how serious this alert is
- annotations: Messages that help people understand what’s going on
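That covers the Prometheus half. On the Alertmanager side, a separate YAML config decides where alerts go. Here is a minimal sketch that sends everything to one Slack channel; the channel name and webhook URL are placeholders you would replace with your own:

route:
  receiver: slack-default      # default destination when no sub-route matches
  group_by: ['alertname']      # batch alerts with the same name into one notification
  group_wait: 30s              # short wait so related alerts arrive together

receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'                              # placeholder channel
        api_url: 'https://hooks.slack.com/services/XXX' # placeholder webhook URL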
Key Alerts for Tracking Infrastructure Performance
Infrastructure alerts warn you about issues with your servers before they get worse.
CPU and Memory Monitoring
High CPU usage usually means your system is under heavy load or a process is consuming more resources than expected. Here’s a simple alert to monitor CPU usage:
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
  for: 10m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "High CPU usage on {{ $labels.instance }}"
    description: "CPU usage is {{ $value }}% on {{ $labels.instance }}"
What’s happening here?
- The query calculates the percentage of CPU in use (not idle).
- If usage exceeds 85% for 10 minutes, the alert triggers.
- Labels and annotations provide context to the team receiving the alert.
This alert can help identify situations where CPU resources are heavily consumed, allowing you to take action before performance degrades.
Memory alerts are equally important. When memory usage is too high, applications may slow down or crash. Here’s an example alert for memory usage:
- alert: HighMemoryUsage
  expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
  for: 5m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "High memory usage on {{ $labels.instance }}"
    description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
This alert fires if memory usage remains above 90% for 5 minutes. It helps identify issues like memory leaks or unexpected spikes before they cause application failures.
Early Warnings for Low Disk Space Conditions
Running out of disk space can quickly bring your services down. This alert helps you catch low disk space early, so you have time to clean up or add more storage:
- alert: DiskSpaceLow
  expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Low disk space on {{ $labels.instance }}"
    description: "Only {{ $value }}% of disk space is left on {{ $labels.instance }} at {{ $labels.mountpoint }}"
What this does:
- It checks if available disk space is below 10% for at least 5 minutes.
- It ignores temporary file systems (like tmpfs) to avoid false alerts.
- Labels and annotations help your team quickly see which server and mount point are affected.
Application-Level Alerts That Impact User Experience
Here are some key alerts you should have in place:
HTTP Error Rate Monitoring
When your API starts returning too many errors (like 500s), that’s a red flag. You want to know quickly if error rates rise above a certain threshold.
For example, an alert can monitor the percentage of 5xx errors over a short time window and notify you if it goes above 5%. This helps you identify issues like a failing service or bad deployments early.
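Here is one way to express that as a rule, a sketch assuming your service exposes the common http_requests_total counter with a status label:

- alert: HighHTTPErrorRate
  # 5xx responses as a share of all requests over the last 5 minutes
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
      / sum(rate(http_requests_total[5m])) by (job) > 0.05
  for: 5m
  labels:
    severity: critical
    team: backend
  annotations:
    summary: "High 5xx error rate on {{ $labels.job }}"
    description: "{{ $value | humanizePercentage }} of requests to {{ $labels.job }} are returning 5xx errors"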
Response Time Alerts
Slow responses frustrate users. Monitoring your app’s response time can reveal when things start to slow down.
A common alert watches the 95th percentile of response times—basically, the slowest 5% of requests—and triggers if they go beyond a set limit (like half a second). This gives you a heads-up to investigate potential bottlenecks before users complain.
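A sketch of such a rule, assuming your app records latency in a Prometheus histogram named http_request_duration_seconds (swap in whatever metric your instrumentation actually exposes):

- alert: SlowResponseTime
  # 95th percentile latency derived from histogram buckets
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
    ) > 0.5
  for: 10m
  labels:
    severity: warning
    team: backend
  annotations:
    summary: "Slow responses from {{ $labels.job }}"
    description: "95th percentile latency for {{ $labels.job }} is {{ $value }}s, above the 0.5s threshold"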
Database Connection Pool Monitoring
Your database connection pool controls how many simultaneous connections your app can use. If this pool gets maxed out, new requests can’t connect and might fail.
An alert here monitors how much of the pool is in use. If usage exceeds 90%, it warns you so you can take action—maybe increase the pool size or optimize queries.
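The exact metric depends on your pool library or exporter; the names below (db_connection_pool_active, db_connection_pool_max) are placeholders to show the shape of the rule:

- alert: ConnectionPoolNearlyExhausted
  # in-use connections as a fraction of the configured pool size (placeholder metric names)
  expr: db_connection_pool_active / db_connection_pool_max > 0.9
  for: 5m
  labels:
    severity: warning
    team: backend
  annotations:
    summary: "Connection pool almost full on {{ $labels.instance }}"
    description: "{{ $value | humanizePercentage }} of database connections are in use"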
Why are they important?
Infrastructure alerts show if servers are healthy. Application alerts reveal if your users are having a bad experience. Both are essential to keeping your system reliable and user-friendly.
How to Set Alerts for Business-Critical Metrics
Sometimes, you need alerts that are tailored to your specific business needs. These Prometheus alert examples focus on application scenarios unique to your use case.
Queue Length Monitoring
If your system uses message queues, monitoring their length is crucial. A growing queue might mean your application is falling behind in processing tasks.
For example, you could set an alert to warn if a queue has more than 1,000 messages waiting for 5 minutes or longer. This helps you spot processing bottlenecks early.
- alert: HighQueueLength
  expr: queue_messages_total > 1000
  for: 5m
  labels:
    severity: warning
    team: backend
  annotations:
    summary: "Queue length is high"
    description: "Queue {{ $labels.queue_name }} has {{ $value }} messages"
User Registration Rate
Tracking unusual spikes in user registrations can help detect either rapid growth or suspicious activity like bot attacks.
Here’s an example alert that triggers if the registration rate exceeds 100 users per hour for 10 minutes. This gives your product team a heads-up to investigate further.
- alert: UnusualUserRegistrationRate
  expr: increase(user_registrations_total[1h]) > 100
  for: 10m
  labels:
    severity: info
    team: product
  annotations:
    summary: "Unusual user registration rate"
    description: "{{ $value }} user registrations in the last hour"
Alerts to Track Container and Kubernetes Issues
If you run containers or Kubernetes, you also need alerts for failure modes that won't show up in regular host metrics.
Pod Restart Monitoring
Pods restarting often usually means a problem, like an app crash or a resource limit being hit. This alert fires if a pod has restarted within the last 15 minutes, so crash loops get flagged early.
- alert: PodRestartingTooMuch
  expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 0
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Pod {{ $labels.pod }} is restarting frequently"
    description: "Pod {{ $labels.pod }} in namespace {{ $labels.namespace }} restarted {{ $value }} times in the last 15 minutes"
Container Resource Limits
Containers have limits on memory and CPU. When memory use gets close to the limit, it can cause problems. This alert warns if a container’s memory usage is above 90% of its limit for 5 minutes.
- alert: ContainerMemoryNearLimit
  expr: (container_memory_working_set_bytes / container_spec_memory_limit_bytes) > 0.9
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Container memory usage near limit"
    description: "Container {{ $labels.container }} in pod {{ $labels.pod }} is using {{ $value | humanizePercentage }} of its memory limit"
Kubernetes Node Monitoring
If a Kubernetes node is not ready for more than 10 minutes, it could cause failures. This alert lets you know if a node isn’t ready.
- alert: KubernetesNodeNotReady
  expr: kube_node_status_condition{condition="Ready",status="true"} == 0
  for: 10m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Kubernetes node not ready"
    description: "Node {{ $labels.node }} has been not ready for over 10 minutes"
Network and Connectivity Alerts
Network interface errors can quietly degrade throughput and cause retries. This alert tracks receive errors on network interfaces and fires if the error rate stays high over 5 minutes.
- alert: HighNetworkReceiveErrors
  expr: rate(node_network_receive_errs_total[5m]) > 10
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "High network receive errors"
    description: "Interface {{ $labels.device }} on {{ $labels.instance }} has {{ $value }} receive errors per second"
SSL Certificate Expiration
Expired SSL certificates cause outages. This alert warns 30 days before a certificate expires.
- alert: SSLCertificateExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "SSL certificate expiring soon"
    description: "SSL certificate for {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
Monitoring Service Dependencies and Predictive Alerts
Sometimes, your system depends on several services working together. It’s important to know when any critical service goes down.
Multi-Service Dependency Alerts
This alert checks if key services like your API, database, or cache are down. It waits for 1 minute before alerting, so you don’t get false alarms from brief outages.
- alert: CriticalServicesDown
  expr: up{job=~"api|database|cache"} == 0
  for: 1m
  labels:
    severity: critical
    team: platform
  annotations:
    summary: "Critical service is down"
    description: "Service {{ $labels.job }} on {{ $labels.instance }} is down"
Predictive Alerts
Instead of waiting for disk space to run out, you can predict when it will happen based on recent trends. This alert warns you if your disk is expected to be full within the next 4 hours.
- alert: DiskSpaceRunningOut
  expr: predict_linear(node_filesystem_free_bytes[1h], 4*3600) < 0
  for: 5m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "Disk space will run out soon"
    description: "Disk on {{ $labels.instance }} is expected to be full in about 4 hours"
Context Switching and Performance
High rates of context switching on a node can signal performance problems. This alert notifies you if the number of context switches per second is unusually high over 10 minutes.
- alert: HighContextSwitching
  expr: rate(node_context_switches_total[5m]) > 10000
  for: 10m
  labels:
    severity: warning
    team: platform
  annotations:
    summary: "High context switching detected"
    description: "Node {{ $labels.instance }} is experiencing {{ $value }} context switches per second"
Common Challenges and How to Solve Them
Prometheus alerting is powerful, but you’ll often run into some common issues that can cause noisy alerts, missed problems, or confusing signals.
Here’s how to handle the most frequent ones.
Noisy Metrics — How to Avoid False Alarms
Some metrics naturally fluctuate a lot, like CPU usage during brief spikes or network traffic bursts. These jumps can cause alerts to trigger too often, making it hard to spot real issues. To smooth out this noise, use functions like avg_over_time(), which take an average over a time window. This helps your alerts focus on sustained problems instead of temporary blips.
- alert: CPUUsageHighButSmoothed
  expr: avg_over_time(node_cpu_usage_percentage[10m]) > 80
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "Sustained high CPU usage"
    description: "CPU usage has averaged {{ $value }}% over the last 10 minutes"
In this example, the alert only fires if the 10-minute average CPU usage stays above 80% for 15 minutes, cutting down on noise from short spikes.
Missing Metrics — Identifying Silent Failures
When a service stops running or crashes, it often stops sending metrics altogether. This can leave you blind to the problem. Prometheus has the absent() function to detect when expected metrics disappear. Use it to trigger alerts when data goes missing, so you know if a service is down or unreachable.
- alert: ServiceMetricsMissing
  expr: absent(up{job="my-service"}) or up{job="my-service"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "Service metrics missing"
    description: "Haven't received metrics from {{ $labels.job }} for 5 minutes"
This alert tells you if Prometheus hasn’t received a heartbeat from your service for 5 minutes, helping you catch outages early.
Managing Alert Dependencies — Avoiding Confusing Alerts
Some alerts only make sense if the underlying service is running. For example, monitoring slow queries in a database is pointless if the database is down. By combining conditions, you can avoid alerts that don’t apply, keeping your alerts relevant and easier to act on.
- alert: DatabaseSlowQueries
  expr: rate(mysql_global_status_slow_queries[5m]) * 60 > 10 and on(instance) up{job="mysql"} == 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "Database experiencing slow queries"
    description: "MySQL on {{ $labels.instance }} is logging {{ $value }} slow queries per minute"
Here, the alert only fires if MySQL is up and the slow-query rate stays above 10 per minute for 10 minutes. Since mysql_global_status_slow_queries is a cumulative counter, the expression compares its per-minute rate rather than the raw value.
Best Practices for Writing Actionable and Sustainable Alerts
Setting up alerts is about making alerts meaningful, minimizing noise, and ensuring the right people take action at the right time.
Define Alert Severity Levels with Clear Response Expectations
Categorize alerts by severity to set expectations for how quickly teams should respond. This helps avoid alert fatigue and ensures critical issues get attention fast.
Severity | Expected Response Time | Typical Use Case
---|---|---
critical | Immediately (interrupts sleep) | Full service outage
warning | During business hours | High CPU/memory usage
info | Passive monitoring | Unusual but not harmful behavior
Use Labels to Route Alerts and Add System Context
Labels aren’t just metadata — they’re essential for organizing alerts, filtering them in dashboards, and routing them to the right on-call team. Use a consistent set of labels across alerts for easier management.
labels:
  severity: warning
  team: backend
  service: user-api
  environment: production
  runbook: "https://wiki.company.com/runbooks/user-api"
This structure helps:
- Route alerts to the correct Slack or PagerDuty team
- Enable alert deduplication and grouping
- Quickly locate relevant runbooks or Grafana dashboards
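With a consistent team label in place, Alertmanager can route on it. Here is a minimal routing sketch; the receiver names are placeholders, and their Slack/PagerDuty settings are omitted for brevity:

route:
  receiver: default-slack
  group_by: ['alertname', 'team']
  routes:
    - matchers:
        - team="backend"
      receiver: backend-pagerduty
    - matchers:
        - team="platform"
      receiver: platform-slack

receivers:
  - name: default-slack        # slack_configs omitted
  - name: backend-pagerduty    # pagerduty_configs omitted
  - name: platform-slack       # slack_configs omitted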
Write Annotations That Tell You Why the Alert Matters
The best alerts include annotations that explain what’s wrong and where to look next. Think of annotations as the "what now?" guide for whoever gets paged.
annotations:
  summary: "{{ $labels.service }} is experiencing high latency"
  description: "The 95th percentile latency for {{ $labels.service }} is {{ $value }}ms, which is above the 500ms threshold"
  runbook_url: "https://wiki.company.com/runbooks/{{ $labels.service }}"
  dashboard_url: "https://grafana.company.com/d/service-{{ $labels.service }}"
Make sure annotations:
- Include a short, human-readable summary of the issue
- Explain the metric threshold and its impact
- Link directly to runbooks and dashboards
Advanced PromQL for Alerting
Some alerting scenarios need more than basic threshold checks. These examples show how to handle gaps in data and avoid false positives during quiet traffic periods.
Detecting When a Service Stops Reporting Metrics
Sometimes a service goes down silently—no logs, no metrics. This alert checks for both explicit failures and missing data:
- alert: ServiceDown
  expr: up{job="my-service"} == 0 or absent(up{job="my-service"})
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Service {{ $labels.job }} is down"
    description: "No metrics received from {{ $labels.job }}. The service may be offline."
This pattern is useful when your alerting system depends on metric presence; absent() helps catch silent failures.
Rate-Based Alerts With Traffic Thresholds
Error rates can spike when there’s barely any traffic, leading to noisy alerts. This query filters out those edge cases by adding a minimum traffic requirement:
- alert: HighErrorRateWithMinTraffic
  expr: |
    (sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) > 0.1)
    and
    (sum by (job) (rate(http_requests_total[5m])) > 1)
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High error rate with sufficient traffic"
    description: "The service has a high rate of 5xx errors while handling more than 1 request/sec."
By adding a traffic floor (more than 1 request per second), you reduce noise from low-traffic services that occasionally throw a 500.

Getting Started with Alerting with Last9
You don’t need to build alerting from scratch. Modern observability platforms simplify how you configure, manage, and scale Prometheus alerts.
If you're already using Prometheus, Last9 makes things easier. It’s fully PromQL-compatible and adds features like a real-time alert monitor, historical alert health, and the ability to correlate alerts with system events.
When evaluating an observability platform, look for:
- Native support for Prometheus and OpenTelemetry
- Support for high-cardinality data without performance hits
- Low-latency queries, even at scale
- Built-in tools to reduce false positives and alert fatigue
Start monitoring with Last9 today!
FAQs
How often should Prometheus evaluate alert rules?
Prometheus evaluates alert rules based on your global evaluation interval, typically every 15-30 seconds. You can adjust this in your Prometheus configuration, but shorter intervals increase resource usage.
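For reference, the interval lives in the global block of prometheus.yml; a common setup looks something like this (values are illustrative):

global:
  scrape_interval: 15s        # how often targets are scraped
  evaluation_interval: 30s    # how often alerting and recording rules are evaluated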
What's the difference between the for duration and the evaluation interval?
The for duration specifies how long a condition must stay true before the alert fires, while the evaluation interval determines how often Prometheus checks the condition. An alert with for: 5m must remain true at every evaluation across a 5-minute window; with a 15-second evaluation interval, that works out to roughly 20 consecutive evaluations.
How do I prevent alert spam during outages?
Use Alertmanager's grouping and inhibition features. Group related alerts together and set up inhibition rules so that higher-severity alerts silence related lower-severity ones. Also, consider using longer evaluation periods during known maintenance windows.
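As a sketch, an inhibition rule that lets a critical alert mute warnings from the same instance, alongside basic grouping, might look like this in alertmanager.yml:

route:
  receiver: default                 # receiver definition omitted for brevity
  group_by: ['alertname', 'team']   # related alerts are batched into one notification
  group_wait: 30s
  repeat_interval: 4h

inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['instance']             # only inhibit when both alerts share the same instance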
What's the best way to organize alert rules in files?
Group related alerts together and use descriptive file names like infrastructure.yml, application.yml, and business-logic.yml. Keep each file focused on a specific domain or service. Use consistent naming conventions for alert names that include the component and condition, like DatabaseConnectionPoolHigh or APIResponseTimeSlow.
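Then point Prometheus at those files from prometheus.yml; the rules/ directory here is just an assumption about your layout:

rule_files:
  - "rules/infrastructure.yml"
  - "rules/application.yml"
  - "rules/business-logic.yml"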
How do I prevent alerts from firing during maintenance windows?
Use Alertmanager's silencing feature to temporarily suppress alerts. You can create silences manually through the web UI or programmatically via the API. For planned maintenance, consider creating automation that sets up silences before maintenance begins and removes them afterward. Alternatively, use external labels to distinguish between environments and route maintenance alerts differently.
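Silences themselves are created at runtime (via the UI or API), not in rule files. For recurring windows, recent Alertmanager versions also support mute time intervals directly in the config; here is a sketch for a weekly Saturday-night window on the platform team's route, with names and times purely illustrative:

route:
  receiver: platform-oncall
  routes:
    - matchers:
        - team="platform"
      receiver: platform-oncall
      mute_time_intervals:
        - weekly-maintenance       # notifications suppressed during the window below

time_intervals:
  - name: weekly-maintenance
    time_intervals:
      - weekdays: ['saturday']
        times:
          - start_time: '02:00'
            end_time: '04:00'

Alerts still evaluate and fire in Prometheus during the window; only the notifications are held back.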