Look, we need to talk about your alerting system. You know, that thing that wakes you up at 3 AM because a CPU somewhere is running at 82% instead of the usual 80%. Yeah, that one.
💡 Quick Win: Before reading further, check your alert history. How many alerts last month actually required immediate action?
After years of being on-call (and losing countless hours of sleep), I've learned a hard truth: most alerts are worse than useless. They're actively harmful because they train you to ignore the important stuff.
You might be thinking, "But all my alerts are important!" I thought the same thing. Then I noticed our team was ignoring alerts because there were simply too many of them. Let me show you a better way...
2. Core Principles
Your users care about exactly four things:
The site is up
It's fast
Their data is accurate and fresh
All features work as expected
⚠️ Warning: If your alert doesn't tie directly to one of these four things, it's probably noise.
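To make those four concerns concrete, here is a minimal Python sketch of checking them against metrics you likely already collect. Every field name and threshold below is an assumption to adapt, not a prescription.

```python
# A minimal sketch: evaluate the four user-facing concerns from metrics you
# presumably already collect. All names and thresholds here are illustrative.
from dataclasses import dataclass

@dataclass
class UserFacingSLIs:
    availability: float          # fraction of requests that succeeded (0.0-1.0)
    p95_latency_ms: float        # 95th percentile request latency
    data_freshness_s: float      # age of the freshest data users can see, in seconds
    feature_success_rate: float  # fraction of key user flows that completed

def user_impacting_breaches(slis: UserFacingSLIs) -> list[str]:
    """Return the user-facing concerns that are currently violated."""
    breaches = []
    if slis.availability < 0.999:
        breaches.append("site availability below target")
    if slis.p95_latency_ms > 500:
        breaches.append("site is slow for users")
    if slis.data_freshness_s > 300:
        breaches.append("user data is stale")
    if slis.feature_success_rate < 0.99:
        breaches.append("key features are failing")
    return breaches

if __name__ == "__main__":
    current = UserFacingSLIs(availability=0.9995, p95_latency_ms=320,
                             data_freshness_s=45, feature_success_rate=0.997)
    print(user_impacting_breaches(current) or "no user-impacting breaches")
```

If a proposed alert can't be expressed as one of these four checks (or something close to them), that's a strong hint it belongs on a dashboard, not a pager.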
The Three Laws of Alerting
Every alert must be actionable. If your response is "huh, that's weird" and you go back to sleep, it shouldn't be an alert.
Monitor symptoms, not causes. When your house is on fire, you want the smoke alarm to go off, not a notification about unusual living room temperatures.
Pages must be urgent AND important. If it can wait until morning, it's not a page.
🎯 Pro Tip: Print these three laws. Put them next to your monitoring dashboards. Reference them every time you create a new alert.
3. Case Studies
Theory is great, but let's look at how this plays out in the real world. I've helped transform alerting systems at several companies, and these examples will show you exactly what changed and why it worked.
But wait, it gets better. While the e-commerce example is impressive, here's what happened when we applied the same principles to a completely different type of service...
Case Study 2: The Media Streaming Service
Before (15 alerts/day):
- CDN error rate > 1%
- Video encoder CPU > 85%
- Storage latency > 50ms
- Transcoder queue > 500
After (2 alerts/day):
- Video start success rate in San Francisco < 99.5%
- Playback buffer ratio in Seattle > 0.1%
Results:
- User complaints: down 60%
- Time to resolution: down 45%
- Engineer burnout: Significantly reduced
📝 Key Learning: Notice how each "After" alert directly maps to user experience?
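As an illustration of how an "After" alert like the video start success rate could be evaluated, here is a hedged sketch. The counter names, the 99.5% SLO, and the minimum-traffic guard are assumptions for illustration, not the streaming service's actual implementation.

```python
# Hypothetical sketch: turn raw attempt/success counters into the kind of
# user-experience alert shown in the "After" column above.
def video_start_success_rate(successes: int, attempts: int) -> float:
    """Fraction of playback attempts that actually started."""
    return successes / attempts if attempts else 1.0

def should_page(successes: int, attempts: int,
                slo: float = 0.995, min_attempts: int = 200) -> bool:
    """Page only when enough real users are affected and the SLO is breached."""
    if attempts < min_attempts:   # too little traffic to be statistically meaningful
        return False
    return video_start_success_rate(successes, attempts) < slo

# Example: 1,000 playback attempts in the last 5 minutes, 990 successful starts.
print(should_page(successes=990, attempts=1000))   # True: 99.0% < 99.5% SLO
```

Evaluate something like this over a sliding window per region and you get exactly the kind of alert in the "After" list: two signals that only fire when viewers are actually hurting.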
Now that you've seen what's possible, you're probably wondering, "How do I get there?" Don't worry, I've got you covered. Here's your week-by-week playbook for alert transformation...
4. Implementation Guide
Week 1: The Alert Transformation Plan
Day 1-2: Audit & Analysis (see the scripted audit sketch after this plan)
[ ] List all alerts that fired last month
[ ] Categorize by user impact (None, Low, Medium, High)
[ ] Calculate false positive rates
[ ] Identify redundant alerts
Day 3-4: Alert Redesign
[ ] Map causes to symptoms
[ ] Design new symptom-based alerts
[ ] Create decision trees for escalation
[ ] Document expected actions
Day 5-7: Implementation & Validation
[ ] Deploy new alerts in parallel
[ ] Validate coverage (use the last9.io coverage tool to confirm it's discovering new elements)
[ ] Train team on new approach
[ ] Remove old alerts
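If your alerting tool can export history, most of the Day 1-2 audit can be scripted. Here is a rough sketch that assumes a CSV export with alert_name, was_actionable, and user_impact columns; both the file name and the columns are assumptions to adapt to whatever your tooling actually produces.

```python
# A minimal audit sketch: count how often each alert fired last month, how often
# it was actionable, and which user-impact categories it was tagged with.
import csv
from collections import defaultdict

def audit(history_csv: str) -> None:
    fired = defaultdict(int)
    actionable = defaultdict(int)
    impact = defaultdict(set)

    with open(history_csv, newline="") as f:
        for row in csv.DictReader(f):
            name = row["alert_name"]
            fired[name] += 1
            if row["was_actionable"].strip().lower() == "true":
                actionable[name] += 1
            impact[name].add(row["user_impact"])

    print(f"{'alert':40} {'fired':>6} {'false+%':>8}  impact")
    for name in sorted(fired, key=fired.get, reverse=True):
        false_positive_rate = 100 * (1 - actionable[name] / fired[name])
        print(f"{name:40} {fired[name]:>6} {false_positive_rate:>7.1f}%  "
              f"{', '.join(sorted(impact[name]))}")

if __name__ == "__main__":
    audit("alert_history_last_month.csv")   # hypothetical export file
```

Sorting by firing count puts your noisiest alerts at the top, which is exactly where the Quick Win below tells you to start.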
💡 Quick Win: Start with your noisiest alerts first. Quick victories build momentum.
Alert Decision Tree
Adapted from the SRE Book, route each signal by the response it needs:

| Category | Response | Delivered to | Based on |
| --- | --- | --- | --- |
| Alerts | Immediate | Pager | Symptoms |
| Tickets | Act eventually | Issue Tracker / Email | Symptoms or Causes |
| Logs | None (diagnostics only) | Dashboards | Causes |
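Here is one way the routing in that table could be expressed in code, as a sketch; the Signal fields are assumptions used purely for illustration.

```python
# A sketch of the routing above: symptoms that need an immediate response page a
# human, work that can wait becomes a ticket, everything else stays in logs and
# dashboards for diagnosis.
from dataclasses import dataclass

@dataclass
class Signal:
    name: str
    is_symptom: bool               # does it describe user-visible pain?
    needs_immediate_action: bool
    needs_action_eventually: bool

def route(signal: Signal) -> str:
    if signal.is_symptom and signal.needs_immediate_action:
        return "page"        # Alerts -> Pager
    if signal.needs_action_eventually:
        return "ticket"      # Tickets -> Issue Tracker / Email
    return "log"             # Logs -> Dashboards, diagnostics only

print(route(Signal("checkout error rate > 1%", True, True, True)))             # page
print(route(Signal("disk 70% full, weeks of headroom", False, False, True)))   # ticket
print(route(Signal("GC pause histogram", False, False, False)))                # log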
Even with the best intentions, certain alert patterns keep cropping up like weeds in a garden. Let's look at the most common ones I've encountered and how to root them out permanently...
5. Troubleshooting Common Pitfalls
Infrastructure Obsession Patterns
1. The CPU Alert Trap
Before: Alert on CPU > 80%
Wakes you up when batch jobs run
Triggers during normal traffic spikes
Doesn't catch real problems
Problem:
Modern CPUs are designed to run at high utilization
Auto-scaling systems make this metric less relevant (provided auto-scaling is set up correctly)
No direct correlation with user experience
Different services have different CPU patterns
Better Approach:
Alert on service latency increase
Monitor error rates
Track request success rates
Set up Golden Signals monitoring
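As a sketch of that symptom-based alternative, the check below looks only at what requests actually experienced, so an 80% CPU spike during a batch job never pages on its own. The 1% error budget, 300ms p99 budget, and sample format are illustrative assumptions.

```python
# A sketch: alert on what requests experience, not on what the CPU is doing.
from statistics import quantiles

def request_symptoms(latencies_ms: list[float], errors: int, total: int,
                     p99_budget_ms: float = 300, error_budget: float = 0.01):
    """Return the user-visible symptoms worth alerting on, if any."""
    symptoms = []
    if total and errors / total > error_budget:
        symptoms.append(f"error rate {errors / total:.2%} over budget")
    if latencies_ms:
        p99 = quantiles(latencies_ms, n=100)[98]   # approximate 99th percentile
        if p99 > p99_budget_ms:
            symptoms.append(f"p99 latency {p99:.0f}ms over {p99_budget_ms:.0f}ms budget")
    return symptoms

# 80% CPU during a batch job, but requests are healthy -> nothing fires.
print(request_symptoms(latencies_ms=[120, 95, 180, 240, 110] * 50, errors=2, total=2500))
```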
2. The Memory Usage Maze
Before: Alert on memory > 90%
JVM applications look like they're always about to crash
Triggers during normal garbage collection
Causes panic over normal behavior
Problem:
Modern systems handle memory dynamically
Garbage collection makes usage patterns complex
Different applications have different memory profiles
High memory usage often means efficient caching
Better Approach:
Alert on OOM events
Monitor application errors
Track swap usage trends
Watch for memory leak patterns
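Below is a hedged sketch of "alert on OOM events, not usage": it scans the kernel log for oom-killer activity instead of paging on memory percentages. The log path and message patterns vary by distribution, so treat them as assumptions to adapt.

```python
# A sketch: page on actual out-of-memory kills, not on a 90% usage threshold.
import re
from pathlib import Path

OOM_PATTERN = re.compile(r"Out of memory|oom-kill", re.IGNORECASE)

def oom_events(kernel_log: str = "/var/log/kern.log") -> list[str]:
    path = Path(kernel_log)
    if not path.exists():
        return []
    return [line.rstrip()
            for line in path.read_text(errors="replace").splitlines()
            if OOM_PATTERN.search(line)]

if __name__ == "__main__":
    events = oom_events()
    if events:
        print(f"{len(events)} OOM events found; this is worth an alert:")
        print(events[-1])
    else:
        print("no OOM events; high memory usage alone is not a page")
```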
📊 One team reduced their memory-related alerts by 95% by focusing on application errors instead of usage percentages.
⚠️ Warning Signs You're Doing It Wrong:
- Your alerts trigger at the same time every day
- Nobody responds to alerts immediately anymore
- You have more alerts than services
- Your runbooks are longer than your code
3. The Latency Layer Cake
Before: Alert on latency > 100ms at every layer
Database latency alerts
API latency alerts
Cache latency alerts
Network latency alerts
Problem:
Creates alert storms
Ignores system interactions
Misses real user impact
Causes duplicate investigations
Better Approach:
Monitor end-to-end latency
Track latency changes
Alert on user experience
Use latency budgets
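To show the latency-budget idea, here is a small sketch in which the per-layer breakdown feeds dashboards only, and the pager cares about a single end-to-end budget. The layer names, sample numbers, and 400ms budget are illustrative assumptions.

```python
# A sketch of latency budgets: one alert on the user-facing total, instead of
# four alerts on the individual layers.
END_TO_END_BUDGET_MS = 400

layer_breakdown_ms = {"cdn": 40, "api": 120, "cache": 5, "database": 95}  # dashboards only
end_to_end_p99_ms = 520   # measured at the edge: what the user actually experienced

def page_on_latency(end_to_end_ms: float, budget_ms: float = END_TO_END_BUDGET_MS) -> bool:
    # The breakdown is for diagnosis after the page, never a trigger itself.
    return end_to_end_ms > budget_ms

print("layer breakdown (for diagnosis, not paging):", layer_breakdown_ms)
print("page:", page_on_latency(end_to_end_p99_ms))   # True: 520ms > 400ms budget
```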
How to Fix These Anti-patterns:
Audit Your Current Alerts
For each alert, ask:
- Has this ever caught a real issue?
- How many false positives last month?
- What action was taken when it fired?
- Could we catch this another way?
If you're not going to wake up for it, it's not an alert
Dashboards are for patterns, alerts are for problems
The best alert is the one you never need to send
🎯 Pro Tip: Keep a "graveyard dashboard" of deleted alerts with metrics showing how many false positives you avoided. It helps justify the changes to everyone.
6. Maintenance and Evolution
Getting your alerts under control is just the beginning. Here's how to keep them that way...
Monthly Checklist
[ ] Review alert frequency
[ ] Check false positive rates
[ ] Update thresholds based on business changes
[ ] Remove unused alerts
[ ] In operational reviews, include 'deleted alerts' just like 'deleted code'
Quarterly Review
[ ] Full alert audit
[ ] Update runbooks
[ ] Team feedback session
[ ] Adjust based on post-mortems
Key Metrics to Track
Alert-to-incident ratio
Mean time to acknowledge
False positive rate
Alert frequency per service
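These metrics are easy to compute once each alert record notes which service it came from, whether it mapped to a real incident, and how long it took to acknowledge. The records below are hypothetical, purely as a sketch of the calculation.

```python
# A sketch of the tracking metrics, over a hypothetical month of alert records.
from collections import Counter

alerts = [
    {"service": "checkout", "real_incident": True,  "ack_seconds": 180},
    {"service": "checkout", "real_incident": False, "ack_seconds": 1500},
    {"service": "search",   "real_incident": True,  "ack_seconds": 240},
    {"service": "search",   "real_incident": False, "ack_seconds": 3600},
]

incidents = sum(a["real_incident"] for a in alerts)
print("alert-to-incident ratio:", round(len(alerts) / max(incidents, 1), 2))
print("mean time to acknowledge (s):",
      round(sum(a["ack_seconds"] for a in alerts) / len(alerts)))
print("false positive rate:", f"{(len(alerts) - incidents) / len(alerts):.0%}")
print("alerts per service:", dict(Counter(a["service"] for a in alerts)))
```

Watch the trend month over month: a shrinking alert-to-incident ratio and falling acknowledgement time are the clearest signs the transformation is sticking.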
🎯 Pro Tip: Keep a "graveyard" dashboard of deleted alerts. It's motivating to see how far you've come.
The Final Word
Your monitoring system should be like a good assistant: it only interrupts you for things that truly matter. Everything else can wait for office hours.
Action Items For Tomorrow
Count your current alerts
Pick your noisiest alert
Map it to user impact
Transform or delete it
Measure the results
Remember:
Users don't care about your infrastructure
They care about their experience
Monitor what they care about
Use causes for diagnosis, not pages
May your pager be quiet, and your sleep be deep.
Written by someone who's deleted more alerts than they've created, and sleeps better for it.
We regularly celebrate when code is deleted; we should celebrate when alerts are deleted, too.