
May 7th, ‘25 / 8 min read

Kubernetes Alerting That Won’t Burn You Out

A practical guide to Kubernetes alerting—cut the noise, catch what matters, and avoid those unnecessary 3AM wake-up calls.


Kubernetes production environments require robust alerting to catch problems before they impact users. While monitoring shows system state, proper alerting tells you when something needs attention.

This guide outlines 15 key Kubernetes alerts that help DevOps teams avoid outages and minimize downtime. For each alert, we provide implementation guidance and troubleshooting steps to resolve common issues quickly.

Whether you manage a single cluster or a complex multi-cluster environment, these alerts form the foundation of a reliable Kubernetes monitoring strategy.

Basic Elements of Effective Kubernetes Alerting

Kubernetes alerting is your early warning system for cluster issues. It works by continuously checking metrics against thresholds you define. When something crosses that line, you get notified.

The typical Kubernetes alerting flow looks like this:

  1. Collect metrics from your cluster
  2. Define alert rules with specific thresholds
  3. Send notifications when thresholds are violated
  4. Investigate and resolve the issue

Simple enough in theory, but the real skill lies in knowing which metrics matter and what thresholds make sense. That's where most teams stumble.
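
To make that flow concrete, here's roughly what a single Prometheus alerting rule looks like, using the classic "target unreachable" check as a minimal sketch (the rule name and severity label are placeholders):

# prometheus-rules.yaml
groups:
  - name: example-alerts
    rules:
      - alert: TargetDown
        expr: up == 0          # Prometheus could not scrape the target
        for: 5m                # must stay true for 5 minutes before firing
        labels:
          severity: critical   # placeholder; match your own routing labels
        annotations:
          summary: "{{ $labels.job }} on {{ $labels.instance }} is unreachable"

AlertManager then routes the firing alert to whichever channel your severity label maps to.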

💡
When resolving pod issues detected by your alerting system, you might need to restart Kubernetes pods using kubectl commands to restore normal service operation. Our guide covers it all!

Setting Up Your Alerting Tools

Before we get into the specific alerts, let’s make sure your toolbox is ready. You’ll need:

  • Metrics collection – Prometheus is the go-to here. Think of it as your cluster’s flight recorder.
  • Alerting rules engine – AlertManager is a solid choice, but if you’re dealing with high-cardinality data or want more control, Last9’s Alert Studio makes tuning alerts a lot easier.
  • Notification channels – Slack, email, PagerDuty… whatever keeps your team in the loop.
  • Observability platform – Tools like Last9 bring everything together—metrics, logs, and traces—so you’re not jumping between dashboards.

Last9 supports Prometheus and OpenTelemetry, scales well with traffic, and is built for teams who care about fast detection without alert fatigue.

Now, let's get to the meat of it – the alerts themselves.

15 Must-Have Kubernetes Alerts for Proactive Monitoring

1. Node CPU Overload Detection and Prevention

Alert when node CPU consistently exceeds 80% for 5+ minutes. This is your first line of defense against resource starvation.

Alert Rule Example:

sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance) / sum(rate(node_cpu_seconds_total[5m])) by (instance) > 0.8

Troubleshooting: Check which pods are the CPU hogs using kubectl top pods --all-namespaces. Consider scaling out your deployment or implementing pod resource limits.

2. Memory Saturation Warning System

Trigger when node memory usage passes 85% for 5+ minutes. Memory issues can cause OOM kills, which are never fun.
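
Alert Rule Example (a sketch assuming node_exporter metrics; pair it with a for: 5m clause so brief spikes don't page anyone):

# Fires when less than 15% of node memory is available
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.85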

Troubleshooting: Run kubectl describe node <nodename> to see memory allocation. Look for memory leaks in your applications or increase memory limits.

3. Pod Restart Loop Detection

Alert when pods enter the CrashLoopBackOff state. This means your pods are repeatedly failing and restarting – a clear sign that something's wrong.
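
Alert Rule Example (a sketch assuming kube-state-metrics is installed):

# One or more containers are stuck in CrashLoopBackOff
kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1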

Troubleshooting: Check the logs with kubectl logs <pod-name> (add --previous to see output from the last crashed container) to identify the failure reason. Common culprits include configuration errors, resource limits, or application bugs.

4. Pod Pending Too Long

Notify when pods remain Pending for 10+ minutes. Pods shouldn't be homeless for long.
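
Alert Rule Example (a sketch assuming kube-state-metrics; add for: 10m to the rule so normal scheduling delays don't fire it):

# Pod is still reported as Pending by the API server
kube_pod_status_phase{phase="Pending"} == 1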

Troubleshooting: Run kubectl describe pod <pod-name> to see why scheduling failed. Often, this indicates resource constraints or node selector issues.

💡
Last9's Alert Studio identifies patterns across infrastructure, application, and business metrics while connecting configuration and environment changes to system health issues. Check our docs for setup!

5. Container Restart Spike

Alert when container restarts exceed normal baseline by 300% over 15 minutes. Flapping containers point to unstable applications.
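
Alert Rule Example (a simplified sketch: comparing against a true baseline needs a recording rule for your historical restart rate, so this version fires on an absolute count instead; the threshold is an assumption to tune):

# More than 3 restarts for the same container within 15 minutes
increase(kube_pod_container_status_restarts_total[15m]) > 3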

Troubleshooting: Check logs before and after restarts to find patterns. Look for network timeouts, resource constraints, or dependent service failures.

6. Node Disk Usage High

Trigger when node disk usage exceeds 85% for 15+ minutes. Running out of disk space is a quick way to crash your node.
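
Alert Rule Example (a sketch assuming node_exporter; the fstype filter is illustrative, exclude whatever pseudo-filesystems you don't care about):

# Less than 15% of filesystem space left
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
       / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 0.85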

Troubleshooting: Use kubectl describe node <nodename> and check which pods might be writing large files. Consider adding storage or cleaning up logs and unused images.

7. PersistentVolume Fill Rate Abnormal

Alert when PV usage growth rate exceeds historical norms by 200%. Sudden storage growth often indicates problems.
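
Alert Rule Example (a related, simpler formulation: comparing growth against historical norms needs a recording rule, so this sketch uses kubelet volume metrics to flag volumes that will fill up soon):

# Projected to run out of space within 4 days at the current growth rate
predict_linear(kubelet_volume_stats_available_bytes[6h], 4 * 24 * 3600) < 0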

Troubleshooting: Check which applications are writing to the volume and investigate any unusual activity or logs.

8. Deployment Replicas Mismatch

Notify when the desired replica count doesn't match available replicas for 15+ minutes. This means your scaling isn't working as expected.
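
Alert Rule Example (a sketch assuming kube-state-metrics; pair it with for: 15m so routine rollouts don't trigger it):

# Desired replicas don't match what's actually available
kube_deployment_spec_replicas != kube_deployment_status_replicas_available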

Troubleshooting: Run kubectl describe deployment <name> to see why pods aren't scaling properly. Check for resource constraints or failed health checks.

9. Pod OOM Killed

Alert when pods are terminated due to out-of-memory errors. These kills happen without warning and can cause unexpected downtime.
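
Alert Rule Example (a sketch assuming kube-state-metrics; this metric reflects the last termination reason, so consider combining it with a recent-restart check):

# The container's most recent termination was an OOM kill
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1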

Troubleshooting: Review memory limits and actual usage patterns. Consider increasing limits or optimizing your application's memory consumption.

10. Kubernetes API Server Latency High

Trigger when API server latency exceeds 500ms for 5+ minutes. A sluggish control plane can affect your entire cluster.
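
Alert Rule Example (a sketch against the API server's request-duration histogram; WATCH and CONNECT are excluded because they're long-lived by design):

# p99 API server request latency above 500ms
histogram_quantile(0.99,
  sum by (le, verb) (rate(apiserver_request_duration_seconds_bucket{verb!~"WATCH|CONNECT"}[5m]))
) > 0.5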

Troubleshooting: Check API server logs and consider scaling your control plane if you're on a managed service, or investigate etcd performance issues.

11. etcd High Leader Change Rate

Alert when etcd sees more than 3 leader changes within 10 minutes. Frequent leader elections indicate network or node stability issues.
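
Alert Rule Example (a sketch assuming your etcd members are scraped by Prometheus):

# More than 3 leader changes observed within 10 minutes
increase(etcd_server_leader_changes_seen_total[10m]) > 3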

Troubleshooting: Look for network problems between control plane nodes or resource constraints on etcd pods.

12. Job Completion Time Anomaly

Notify when critical jobs take 50% longer than their historical average. Slow jobs often indicate deeper problems.
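
Alert Rule Example (a sketch only: the recording rule name job:duration_seconds:avg_7d is hypothetical and would hold each job's historical average duration with matching namespace and job_name labels):

# Job took more than 1.5x its historical average duration
(kube_job_status_completion_time - kube_job_status_start_time)
  > on (namespace, job_name) (1.5 * job:duration_seconds:avg_7d)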

Troubleshooting: Check logs for the specific job run and compare with previous successful runs to spot differences.

13. HorizontalPodAutoscaler Not Scaling

Alert when HPA hasn't scaled despite metrics exceeding thresholds for 15+ minutes. Your auto-scaling might be broken.
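
Alert Rule Example (a sketch assuming kube-state-metrics; metric names vary slightly between versions, so check what yours exposes, and pair it with for: 15m):

# HPA wants more replicas than it currently has
kube_horizontalpodautoscaler_status_desired_replicas
  > kube_horizontalpodautoscaler_status_current_replicas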

Troubleshooting: Verify HPA configuration and check if metrics are being collected correctly with kubectl describe hpa <name>.

14. Ingress Latency Spike

Trigger when ingress request latency jumps 200% above baseline for 5+ minutes. Slow responses hurt user experience.
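
Alert Rule Example (the metric depends on your ingress controller; this is a simplified fixed-threshold sketch for ingress-nginx, since a baseline-relative version needs a recording rule of your normal latency):

# p95 ingress request latency above 1 second (threshold is illustrative)
histogram_quantile(0.95,
  sum by (le, ingress) (rate(nginx_ingress_controller_request_duration_seconds_bucket[5m]))
) > 1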

Troubleshooting: Check backend service performance, network issues, or ingress controller resource constraints.

15. Certificate Expiration Approaching

Alert when TLS certificates are within 14 days of expiration. Nobody wants SSL errors greeting their users.
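
Alert Rule Example (a sketch assuming you probe endpoints with the blackbox exporter; if you run cert-manager, its certmanager_certificate_expiration_timestamp_seconds metric works the same way):

# Certificate expires in less than 14 days
(probe_ssl_earliest_cert_expiry - time()) < 14 * 24 * 3600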

Troubleshooting: Renew certificates immediately or verify your cert-manager is working correctly if using automatic renewal.

💡
Now, fix production Kubernetes alert issues instantly—right from your IDE, with AI and Last9 MCP. Bring real-time context—logs, metrics, and traces—into your local environment to troubleshoot and resolve faster.

How to Organize Alerts by Priority

Not all alerts are created equal. Here's how to categorize them:

Priority | Response Time | Notification | Example Alerts
P0 (Critical) | Immediate (24/7) | Call + SMS + Slack | Node down, Ingress down, API server unavailable
P1 (High) | < 30 mins | SMS + Slack | High CPU/Memory, Pod crash loops, OOM kills
P2 (Medium) | < 2 hours | Slack | Deployment replica mismatch, Job anomalies
P3 (Low) | Next business day | Email | Certificate expiration (early warning), Disk usage trending up

Fine-Tuning Alerts for Your Environment

The "right" thresholds depend on your specific workloads. Here's how to fine-tune:

  1. Establish baselines: Monitor normal patterns for at least two weeks
  2. Start conservative: Begin with broader thresholds, then tighten them
  3. Review alert history: Adjust thresholds to reduce noise without missing important events
  4. Consider time patterns: Some alerts might need different thresholds during different times of day

Remember – the goal isn't just to know when things break, but to catch issues before they impact users.

Using SLOs for More Meaningful Alerts

SLO (Service Level Objective) based alerting is next-level Kubernetes monitoring. Instead of alerting on symptoms, you alert on what matters – user experience.

Here's how to implement it:

  1. Define SLOs: Set realistic targets like "API requests complete in under 300ms for 99.9% of requests."
  2. Create error budgets: Calculate how much downtime/latency you can tolerate
  3. Alert on budget burn: Trigger alerts when you're consuming your error budget too quickly

This approach reduces alert noise dramatically. You're not paged because CPU hit 82% – you're paged because actual user experience is degrading faster than acceptable.

A simple SLO alert might look like:

# Alert when the error budget burn rate is too high (99.5% SLO example)
sum(rate(http_request_duration_seconds_count{status=~"5.."}[1h]))
  / sum(rate(http_request_duration_seconds_count[1h]))
  > (1 - 0.995) * 24   # burning the budget 24x faster than allowed

SLO-based alerting works best for user-facing services where you can clearly define what "good" looks like.

💡
When implementing SLO-based alerts, consider using OpenTelemetry metrics for Kubernetes autoscaling to improve resource management based on actual service performance.

How to Prevent Alert Fatigue

Alert fatigue is real. Too many notifications, and people start ignoring them all. Here's how to keep things sane:

  1. Group related alerts: Cluster similar problems into one notification instead of sending them separately
  2. Implement alert hierarchies: Have parent/child relationships so you don't get 50 alerts for one root cause
  3. Add context: Include troubleshooting links in your alerts
  4. Use runbooks: Attach standard procedures to common alerts
  5. Schedule regular reviews: Audit your alerts quarterly to remove outdated ones

False positives burn out teams faster than actual incidents. Stay vigilant about tuning your system.

Setting Up Cost Monitoring Alerts

Your finance team will thank you for this one. Cost-related alerts can save serious cash before things spiral out of control.

Consider setting up alerts for:

  1. Resource quota approaching: Alert when namespace resource usage exceeds 80% of quota
  2. Idle resource detection: Flag when expensive resources (GPUs, high-memory nodes) are underutilized
  3. Sudden cost spikes: Alert when overall cluster cost increases by >20% day-over-day
  4. Abandoned resources: Identify PVCs, load balancers, or other billable resources no longer in use

A simple Prometheus query for idle pod detection might look like:

# Find pods using less than 10% of requested CPU consistently
sum by (pod) (rate(container_cpu_usage_seconds_total[24h]))
  / sum by (pod) (kube_pod_container_resource_requests{resource="cpu"})
  < 0.1

These alerts won't wake you at 3 AM, but they'll make for much happier budget meetings.

💡
Understanding the differences between pod types is essential for creating appropriate alerting rules - learn more about various types of pods in Kubernetes and how they affect your monitoring strategy.

Managing Alerts Across Multiple Clusters

As your Kubernetes footprint grows, you'll likely end up with multiple clusters. Here's how to handle alerting across them:

  1. Centralized Monitoring: Use a single Prometheus instance to scrape metrics from all clusters
  2. Federated Approach: Deploy Prometheus in each cluster and have a central instance aggregate
  3. Cluster Labels: Tag every alert with cluster metadata for clear identification (see the config sketch after this list)
  4. Consistent Naming: Use identical alert names across environments for easier correlation
  5. Hierarchical View: Group alerts by cluster, namespace, and application
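
For the cluster labels point, the simplest approach is Prometheus external_labels, which get attached to everything that leaves that Prometheus instance: federation, remote write, and alerts sent to AlertManager. The values below are placeholders:

# prometheus.yml in each cluster
global:
  external_labels:
    cluster: prod-us-east-1    # placeholder; one unique value per cluster
    environment: production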

Tools like Last9 can help here – they're designed to handle complexity without the headache. Their platform correlates incidents across your infrastructure, not just within a single cluster.

The key is striking the right balance: enough isolation to manage clusters independently, but enough centralization to avoid alert overload.

Conclusion

Good alerting isn’t about catching everything—it’s about catching the right things before they turn into real problems. These 15 alerts are a solid starting point. Tweak them to fit your setup, and you’ll spend fewer nights chasing false alarms.

If your alerts are noisy or breaking under high-cardinality data, Last9 can help. We’re built for environments where labels explode and traditional setups struggle.

With native support for OpenTelemetry and Prometheus, we bring together metrics, logs, and traces, so you get real-time, cost-effective, and context-rich alerts you can act on.

Talk to us to know more about Last9 or get started for free today!

FAQs

What's the difference between monitoring and alerting in Kubernetes?

Monitoring is about collecting and visualizing data, while alerting is about notifying humans when specific conditions occur. Think of monitoring as your security camera footage and alerting as the alarm that wakes you up when someone's breaking in.

How do I handle alert storms during major outages?

Implement alert grouping and correlation to reduce noise. Define parent/child relationships between alerts so you only get notified about the root cause, not every symptom. Tools like Last9 help by correlating related alerts automatically.

Should I alert on everything I monitor?

Alert on what requires human attention and action. If it's just informational or doesn't need immediate response, keep it in dashboards but out of your notification channels.

How do I know if my alert thresholds are appropriate?

Start by baselining normal behavior for at least two weeks. Set initial thresholds with a buffer above normal peaks. Then review your alert history regularly – too many false alarms mean thresholds are too tight; missed incidents mean they're too loose.

Can I use the same alerting setup across all environments?

While your alerting structure should be consistent, thresholds often need adjustment between production and non-production environments. Production typically needs tighter thresholds and faster response times.

How do I implement GitOps for alert management?

Store your Prometheus alert rules and AlertManager configurations in Git. Use tools like Prometheus Operator with custom resources that can be applied via your CD pipeline. This approach ensures alert rules are versioned, peer-reviewed, and consistently applied across all clusters.
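
As a rough sketch, with the Prometheus Operator each alert lives in a PrometheusRule custom resource that you commit to Git and let your CD pipeline apply. The release label below is illustrative; it has to match whatever ruleSelector your Prometheus instance is configured with:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts
  labels:
    release: prometheus        # must match your Prometheus ruleSelector
spec:
  groups:
    - name: node.rules
      rules:
        - alert: NodeHighCPU
          expr: |
            sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
              / sum(rate(node_cpu_seconds_total[5m])) by (instance) > 0.8
          for: 5m
          labels:
            severity: warning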


Authors
Anjali Udasi

Helping to make the tech a little less intimidating. I love breaking down complex concepts into easy-to-understand terms.