Table of Contents
- Introduction: The Alert Problem
- Core Principles
- Real-world Transformations
- Implementation Guide
- Troubleshooting Common Pitfalls
- Maintenance and Evolution
1. Introduction: The Alert Problem
Look, we need to talk about your alerting system. You know, that thing that wakes you up at 3 AM because a CPU somewhere is running at 82% instead of the usual 80%. Yeah, that one.
💡 Quick Win: Before reading further, check your alert history. How many alerts last month actually required immediate action?
After years of being on-call (and losing countless hours of sleep), I've learned a hard truth: most alerts are worse than useless. They're actively harmful because they train you to ignore the important stuff.
You might be thinking, "But all my alerts are important!" I thought the same thing too. Then I noticed our team was ignoring alerts because there were just too many. Let me show you a better way...
2. Core Principles
Your users care about exactly four things:
- The site is up
- It's fast
- Their data is accurate and fresh
- All features work as expected
⚠️ Warning: If your alert doesn't tie directly to one of these four things, it's probably noise.
The Three Laws of Alerting
- Every alert must be actionable. If your response is "huh, that's weird" and you go back to sleep – it shouldn't be an alert.
- Monitor symptoms, not causes. When your house is on fire, you want the smoke alarm to go off, not a notification about unusual living room temperatures.
- Pages must be urgent AND important. If it can wait until morning, it's not a page.
🎯 Pro Tip: Print these three laws. Put them next to your monitoring dashboards. Reference them every time you create a new alert.
Theory is great, but let's look at how this plays out in the real world. I've helped transform alerting systems at several companies, and these examples will show you exactly what changed and why it worked.
3. Real-world Transformations
Case Study 1: The E-commerce Platform
Before (20 alerts/day):
- Database connections > 80%
- Payment service CPU > 90%
- Cache hit rate < 70%
- Queue length > 1000
- Worker latency > 200ms
After (3 alerts/day):
- Checkout success rate < 98%
- Payment completion time > 3s
- Cart abandonment rate spike > 15%
Results:
- Alert volume: ⬇️ 85%
- Incident detection: ⬆️ 40% faster
- False positives: ⬇️ 90%
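To make the first "After" alert concrete, here is a minimal sketch of a symptom-based check on checkout success rate, using the 98% threshold from the list above. The function name, the `min_attempts` guard, and the counts are hypothetical; in a real setup the numbers would come from your metrics store.

```python
def checkout_alert(successes: int, attempts: int,
                   threshold: float = 0.98, min_attempts: int = 50) -> bool:
    """Return True when the checkout success rate breaches the threshold.

    min_attempts is an assumed guard so that one or two failures during
    a very quiet period do not page anyone.
    """
    if attempts < min_attempts:
        return False
    return successes / attempts < threshold


# 970 of 1000 checkouts succeeded: 97% is below 98%, so this fires.
fires = checkout_alert(successes=970, attempts=1000)
# 985 of 1000 succeeded: 98.5% is healthy, so this stays quiet.
quiet = checkout_alert(successes=985, attempts=1000)
```

Notice that nothing here mentions CPU, connections, or queues. Those can all be fine or all be on fire; the alert only cares whether customers can pay.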
But wait, it gets better. While the e-commerce example is impressive, here's what happened when we applied the same principles to a completely different type of service...
Case Study 2: The Media Streaming Service
Before (15 alerts/day):
- CDN error rate > 1%
- Video encoder CPU > 85%
- Storage latency > 50ms
- Transcoder queue > 500
After (2 alerts/day):
- Video start success rate in San Francisco < 99.5%
- Playback buffer ratio in Seattle > 0.1%
Results:
- User complaints: ⬇️ 60%
- Time to resolution: ⬇️ 45%
- Engineer burnout: Significantly reduced
🔍 Key Learning: Notice how each "After" alert directly maps to user experience?
Now that you've seen what's possible, you're probably wondering, "How do I get there?" Don't worry, I've got you covered. Here's your day-by-day playbook for the first week of alert transformation...
4. Implementation Guide
Week 1: The Alert Transformation Plan
Day 1-2: Audit & Analysis
- [ ] List all alerts that fired last month
- [ ] Categorize by user impact (None, Low, Medium, High)
- [ ] Calculate false positive rates
- [ ] Identify redundant alerts
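The audit steps above can be roughed out in a few lines. This is a sketch only: it assumes you can export last month's alert history as `(alert_name, was_actionable)` pairs, which is a made-up format, and the alert names are illustrative.

```python
from collections import defaultdict


def false_positive_rates(history):
    """Summarize alert history as (name, times_fired, false_positive_rate).

    history: iterable of (alert_name, was_actionable) tuples.
    Results are sorted with the noisiest alert first.
    """
    fired = defaultdict(int)
    noise = defaultdict(int)
    for name, was_actionable in history:
        fired[name] += 1
        if not was_actionable:
            noise[name] += 1
    rows = [(name, fired[name], noise[name] / fired[name]) for name in fired]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Hypothetical export of last month's alerts.
history = [
    ("cpu_high", False), ("cpu_high", False),
    ("cpu_high", False), ("cpu_high", True),
    ("checkout_success_low", True), ("checkout_success_low", True),
]
report = false_positive_rates(history)
```

The alert at the top of `report` is your first candidate for redesign or deletion.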
Day 3-4: Alert Redesign
- [ ] Map causes to symptoms
- [ ] Design new symptom-based alerts
- [ ] Create decision trees for escalation
- [ ] Document expected actions
Day 5-7: Implementation & Validation
- [ ] Deploy new alerts in parallel
- [ ] Validate coverage (use the last9.io coverage tool to ensure it's discovering new elements)
- [ ] Train team on new approach
- [ ] Remove old alerts
💡 Quick Win: Start with your noisiest alerts first. Quick victories build momentum.
Alert Decision Tree
Alerts need a response
| Response | SRE Book | Delivered to | Based on |
| --- | --- | --- | --- |
| Immediate | Alerts | Pager | Symptoms |
| Act eventually | Tickets | Issue Tracker / Email | Symptoms or Causes |
| None (Diagnostics Only) | Logs | Dashboards | Causes |
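The table above is small enough to express as a routing function. The sketch below is one possible encoding; the function name and the channel strings are hypothetical placeholders for whatever your paging and ticketing tools expect.

```python
def route(needs_response: bool, urgent: bool) -> str:
    """Map a signal to a delivery channel, following the table above."""
    if not needs_response:
        return "dashboard"  # diagnostics only: keep it in logs and dashboards
    if urgent:
        return "pager"      # immediate response: should be symptom-based
    return "ticket"         # act eventually: issue tracker or email


# A cause-level signal nobody needs to act on never reaches a human.
channel = route(needs_response=False, urgent=False)
```

If a new alert cannot be placed in exactly one of these three branches, that is usually a sign it is not well defined yet.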
Even with the best intentions, certain alert patterns keep cropping up like weeds in a garden. Let's look at the most common ones I've encountered and how to root them out permanently...
5. Troubleshooting Common Pitfalls
Infrastructure Obsession Patterns
1. The CPU Alert Trap
Before: Alert on CPU > 80%
- Wakes you up when batch jobs run
- Triggers during normal traffic spikes
- Doesn't catch real problems
Problem:
- Modern CPUs are designed to run at high utilization
- Auto-scaling systems make this metric less relevant (as long as auto-scaling is set up correctly)
- No direct correlation with user experience
- Different services have different CPU patterns
Better Approach:
- Alert on service latency increase
- Monitor error rates
- Track request success rates
- Set up Golden Signals monitoring
2. The Memory Usage Maze
Before: Alert on memory > 90%
- JVM applications look like they're always about to crash
- Triggers during normal garbage collection
- Causes panic over normal behavior
Problem:
- Modern systems handle memory dynamically
- Garbage collection makes usage patterns complex
- Different applications have different memory profiles
- High memory usage often means efficient caching
Better Approach:
- Alert on OOM events
- Monitor application errors
- Track swap usage trends
- Watch for memory leak patterns
📊 One team reduced their memory-related alerts by 95% by focusing on application errors instead of usage percentages.
3. The Disk Space Dilemma
Before: Alert on disk > 85%
- Constant alerts on logging servers
- Weekly alerts on backup systems
- Never catches real problems in time
Problem:
- Different volumes have different growth patterns
- Percentage-based alerts don't consider volume size
- Some services need high utilization
- Doesn't account for cleanup jobs
Better Approach:
- Alert on projected fill time < 4 hours
- Monitor write failures
- Track growth rate changes
- Set up automated cleanup processes
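Here is a rough sketch of the "projected fill time" idea. It assumes you have a few `(hour, used_bytes)` samples for a volume and uses the average growth between the first and last sample, which is a deliberately crude stand-in for a proper regression. The function names and numbers are illustrative.

```python
def hours_until_full(samples, capacity_bytes):
    """Estimate hours until a volume fills from (hour, used_bytes) samples.

    Returns None when usage is flat or shrinking, or when the samples
    do not span any time.
    """
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (used1 - used0) / (t1 - t0)  # bytes per hour
    if rate <= 0:
        return None
    return (capacity_bytes - used1) / rate


def disk_should_page(samples, capacity_bytes, limit_hours=4):
    """Page only when the projected fill time is under limit_hours."""
    eta = hours_until_full(samples, capacity_bytes)
    return eta is not None and eta < limit_hours


# 90% full and growing fast: about 2 hours left, so this pages.
fast = disk_should_page([(0, 800), (1, 850), (2, 900)], capacity_bytes=1000)
# 81% full but barely growing: about 38 hours left, so it does not.
slow = disk_should_page([(0, 800), (2, 810)], capacity_bytes=1000)
```

A volume sitting at 95% with no growth never pages under this rule, while a volume at 60% that is filling quickly does, which is the behavior a static percentage cannot give you.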
4. The Connection Pool Panic
Before: Alert on DB connections > 80%
- Triggers during normal peak hours
- Misses actual connection problems
- Causes unnecessary scaling
Problem:
- Connection pools are meant to be utilized
- Percentage-based thresholds ignore pool size
- Different services have different usage patterns
- Misses real connection issues
Better Approach:
- Monitor connection timeouts
- Track query latency
- Alert on connection errors
- Watch for connection leaks
5. The Queue Length Obsession
Before: Alert on queue length > 1000
- Ignores processing rate
- Triggers during normal spikes
- Misses actual processing problems
Problem:
- Queue length alone is meaningless
- Different queues have different patterns
- Normal backlog isn't always bad
- Doesn't catch stuck processors
Better Approach:
- Monitor processing rate changes
- Alert on time-in-queue
- Track completion rates
- Watch for stuck messages
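To show why time-in-queue beats raw length, here is a small sketch that estimates how long a newly enqueued message will wait, given the current backlog and processing rate. The function name and the figures are hypothetical.

```python
import math


def estimated_wait_seconds(queue_length: int, processed_per_second: float) -> float:
    """Rough time-in-queue for a message enqueued right now.

    Returns infinity when there is a backlog but nothing is being
    processed, which is the stuck-processor case that a length
    threshold cannot see.
    """
    if queue_length == 0:
        return 0.0
    if processed_per_second <= 0:
        return math.inf
    return queue_length / processed_per_second


# A long queue draining quickly: 5 seconds of wait, nothing to worry about.
busy = estimated_wait_seconds(queue_length=5000, processed_per_second=1000)
# A short queue with dead workers: well under the old 1000 threshold, but stuck.
stuck = estimated_wait_seconds(queue_length=200, processed_per_second=0)
```

The old "queue length > 1000" alert would have fired on the first case and stayed silent on the second, which is exactly backwards.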
6. The Error Rate Rabbit Hole
Before: Alert on error rate > 1%
- Triggers during normal retries
- Ignores error impact
- Causes alert fatigue
Problem:
- Not all errors are equal
- Some errors are expected
- Percentage doesn't consider volume
- Misses critical errors in low-traffic periods
Better Approach:
- Alert on critical error patterns
- Monitor error impact on users
- Track error rate changes
- Focus on non-recoverable errors
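One way to handle both the volume problem and the low-traffic problem is to check a rate when there is enough traffic and an absolute count when there is not. The sketch below assumes errors that were retried successfully have already been filtered out; the function name and every threshold are placeholders to tune for your service.

```python
def error_alert(requests: int, fatal_errors: int,
                max_rate: float = 0.01, min_requests: int = 100,
                max_fatal: int = 5) -> bool:
    """Decide whether non-recoverable errors warrant an alert.

    With enough traffic, compare the fatal error rate to max_rate.
    With little traffic, a rate is too jumpy, so fall back to an
    absolute count of fatal errors instead.
    """
    if requests >= min_requests:
        return fatal_errors / requests > max_rate
    return fatal_errors >= max_fatal


# Peak traffic: 5 fatal errors in 10,000 requests is 0.05%, so no alert.
peak = error_alert(requests=10_000, fatal_errors=5)
# 3 AM: the same 5 fatal errors in only 50 requests does alert.
night = error_alert(requests=50, fatal_errors=5)
```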
⚠️ Warning Signs You're Doing It Wrong:
- Your alerts trigger at the same time every day
- Nobody responds to alerts immediately anymore
- You have more alerts than services
- Your runbooks are longer than your code
7. The Latency Layer Cake
Before: Alert on latency > 100ms at every layer
- Database latency alerts
- API latency alerts
- Cache latency alerts
- Network latency alerts
Problem:
- Creates alert storms
- Ignores system interactions
- Misses real user impact
- Causes duplicate investigations
Better Approach:
- Monitor end-to-end latency
- Track latency changes
- Alert on user experience
- Use latency budgets
How to Fix These Anti-patterns:
- Audit Your Current Alerts
  For each alert, ask:
  - Has this ever caught a real issue?
  - How many false positives last month?
  - What action was taken when it fired?
  - Could we catch this another way?
- Transform Your Alerts
  Move from:
  - Resource usage alerts
  - Individual component alerts
  - Static thresholds

  To:
  - User impact alerts
  - System behavior alerts
  - Dynamic baselines
- Set Up Proper Monitoring
  Implement:
  - Golden Signals monitoring
  - SLO-based alerts
  - Trend analysis
  - Capacity planning
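SLO-based alerts are often built on an error budget "burn rate": the observed error ratio divided by the ratio your SLO allows. The sketch below is one common pattern, not the only one. It pages only when both a long and a short window are burning fast, so brief blips are ignored and pages clear quickly after recovery. The 99.9% target and the 14.4 threshold are assumptions for illustration, and the function names are made up.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    return error_ratio / budget


def should_page_on_burn(long_window_errors: float, short_window_errors: float,
                        slo_target: float = 0.999,
                        threshold: float = 14.4) -> bool:
    """Page only if both windows exceed the burn-rate threshold."""
    return (burn_rate(long_window_errors, slo_target) > threshold
            and burn_rate(short_window_errors, slo_target) > threshold)


# 2% errors over the last hour and 3% over the last five minutes: page.
sustained = should_page_on_burn(long_window_errors=0.02, short_window_errors=0.03)
# The hour still looks bad, but the last five minutes recovered: stay quiet.
recovered = should_page_on_burn(long_window_errors=0.02, short_window_errors=0.001)
```

The threshold is a dial, not a fact: a lower value catches slow burns sooner at the cost of more pages.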
Remember:
- Every alert should have a clear action
- If you're not going to wake up for it, it's not an alert
- Dashboards are for patterns, alerts are for problems
- The best alert is the one you never need to send
🎯 Pro Tip: Keep a "graveyard dashboard" of deleted alerts with metrics showing how many false positives you avoided. It helps justify the changes to everyone.
6. Maintenance and Evolution
Getting your alerts under control is just the beginning. Here's how to keep them that way...
Monthly Checklist
- [ ] Review alert frequency
- [ ] Check false positive rates
- [ ] Update thresholds based on business changes
- [ ] Remove unused alerts
- [ ] Include 'deleted alerts' in operational reviews, just like 'deleted code'
Quarterly Review
- [ ] Full alert audit
- [ ] Update runbooks
- [ ] Team feedback session
- [ ] Adjust based on post-mortems
Key Metrics to Track
- Alert-to-incident ratio
- Mean time to acknowledge
- False positive rate
- Alert frequency per service
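Two of these metrics are simple to compute once you can export alert timestamps. The sketch below assumes you have counts of alerts and incidents plus `(fired_at, acknowledged_at)` pairs; the function names and sample data are hypothetical.

```python
from datetime import datetime, timedelta


def alert_to_incident_ratio(alerts_fired: int, incidents: int) -> float:
    """Alerts per real incident; the closer to 1.0, the less noise."""
    if incidents == 0:
        return float("inf") if alerts_fired else 0.0
    return alerts_fired / incidents


def mean_time_to_acknowledge(pairs) -> timedelta:
    """pairs: non-empty list of (fired_at, acknowledged_at) datetimes."""
    waits = [acked - fired for fired, acked in pairs]
    return sum(waits, timedelta()) / len(waits)


fired = datetime(2024, 1, 1, 3, 0)
pairs = [
    (fired, fired + timedelta(minutes=2)),
    (fired, fired + timedelta(minutes=4)),
]
ratio = alert_to_incident_ratio(alerts_fired=40, incidents=8)
mtta = mean_time_to_acknowledge(pairs)
```

Track both month over month. A falling ratio together with a falling time to acknowledge suggests people trust the pager again.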
🎯 Pro Tip: Keep a "graveyard" dashboard of deleted alerts. It's motivating to see how far you've come.
The Final Word
Your monitoring system should be like a good assistant: it only interrupts you for things that truly matter. Everything else can wait for office hours.
Action Items For Tomorrow
- Count your current alerts
- Pick your noisiest alert
- Map it to user impact
- Transform or delete it
- Measure the results
Remember:
- Users don't care about your infrastructure
- They care about their experience
- Monitor what they care about
- Use causes for diagnosis, not pages
May your pager be quiet, and your sleep be deep.
Written by someone who's deleted more alerts than they've created, and sleeps better for it.
We regularly celebrate when code is deleted; we should celebrate when alerts are deleted too.