
The Practical Guide to Alert Sanity: From Chaos to Calm

Your pager just went off. Is it the CPU again? Memory? Disk space? Wrong question. Ask: Can users do their thing? That's your real alert.


Table of Contents

  1. Introduction: The Alert Problem
  2. Core Principles
  3. Real-world Transformations
  4. Implementation Guide
  5. Troubleshooting Common Pitfalls
  6. Maintenance and Evolution

1. Introduction: The Alert Problem

Look, we need to talk about your alerting system. You know, that thing that wakes you up at 3 AM because a CPU somewhere is running at 82% instead of the usual 80%. Yeah, that one.

💡 Quick Win: Before reading further, check your alert history. How many alerts last month actually required immediate action?

After years of being on-call (and losing countless hours of sleep), I've learned a hard truth: most alerts are worse than useless. They're actively harmful because they train you to ignore the important stuff.

You might be thinking, "But all my alerts are important!" I thought the same thing. Then I noticed our team was ignoring alerts because there were simply too many. Let me show you a better way...

2. Core Principles

Your users care about exactly four things:

  1. The site is up
  2. It's fast
  3. Their data is accurate and fresh
  4. All features work as expected
⚠️ Warning: If your alert doesn't tie directly to one of these four things, it's probably noise.

The Three Laws of Alerting

  1. Every alert must be actionable. If your response is "huh, that's weird" and you go back to sleep – it shouldn't be an alert.
  2. Monitor symptoms, not causes. When your house is on fire, you want the smoke alarm to go off, not a notification about unusual living room temperatures.
  3. Pages must be urgent AND important. If it can wait until morning, it's not a page.
🎯 Pro Tip: Print these three laws. Put them next to your monitoring dashboards. Reference them every time you create a new alert.
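
To make laws #1 and #2 concrete, here's a minimal sketch (hypothetical metric names and thresholds, purely for illustration) contrasting a cause-based check with a symptom-based one:

```python
# Hypothetical thresholds and metric values, purely for illustration.

def cause_based_alert(cpu_percent: float) -> bool:
    # Fires on infrastructure state -- says nothing about users.
    return cpu_percent > 80.0

def symptom_based_alert(checkout_success_rate: float) -> bool:
    # Fires on user-visible behaviour -- answers "can users do their thing?"
    return checkout_success_rate < 0.98

print(cause_based_alert(85.0))       # True -- but is anyone actually affected?
print(symptom_based_alert(0.999))    # False -- users are fine, keep sleeping
```

The first check pages you about a possible cause; the second only fires when users are demonstrably hurting, and the action (go look at checkout) is obvious.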

Theory is great, but let's look at how this plays out in the real world. I've helped transform alerting systems at several companies, and these examples will show you exactly what changed and why it worked.

3. Real-world Transformations

Case Study 1: The E-commerce Platform

Before (20 alerts/day):
- Database connections > 80%
- Payment service CPU > 90%
- Cache hit rate < 70%
- Queue length > 1000
- Worker latency > 200ms

After (3 alerts/day):
- Checkout success rate < 98%
- Payment completion time > 3s
- Cart abandonment rate spike > 15%

Results:
- Alert volume: ⬇️ 85%
- Incident detection: ⬆️ 40% faster
- False positives: ⬇️ 90%

But wait, it gets better. While the e-commerce example is impressive, here's what happened when we applied the same principles to a completely different type of service...

Case Study 2: The Media Streaming Service

Before (15 alerts/day):
- CDN error rate > 1%
- Video encoder CPU > 85%
- Storage latency > 50ms
- Transcoder queue > 500

After (2 alerts/day):
- Video start success rate in San Francisco < 99.5%
- Playback buffer ratio in Seattle > 0.1%

Results:
- User complaints: ⬇️ 60%
- Time to resolution: ⬇️ 45%
- Engineer burnout: Significantly reduced
πŸ” Key Learning: Notice how each "After" alert directly maps to user experience?

Now that you've seen what's possible, you're probably wondering, "How do I get there?" Don't worry, I've got you covered. Here's your week-by-week playbook for alert transformation...

4. Implementation Guide

Week 1: The Alert Transformation Plan

Day 1-2: Audit & Analysis

  • [ ] List all alerts that fired last month
  • [ ] Categorize by user impact (None, Low, Medium, High)
  • [ ] Calculate false positive rates
  • [ ] Identify redundant alerts
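
If your paging tool can export last month's alert history to a CSV, a short script covers most of the Day 1-2 checklist. This is a sketch with hypothetical columns (alert_name, user_impact, action_taken); rename them to match whatever your tool actually exports:

```python
import csv
from collections import Counter

def audit(path: str = "alert_history.csv") -> None:
    fired, no_action, impact = Counter(), Counter(), Counter()
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            name = row["alert_name"]
            fired[name] += 1
            impact[row["user_impact"]] += 1            # None / Low / Medium / High
            if row["action_taken"].strip().lower() in ("", "none"):
                no_action[name] += 1                   # fired, nobody had to do anything
    for name, count in fired.most_common():
        print(f"{name}: fired {count}x, {no_action[name] / count:.0%} needed no action")
    print("user impact breakdown:", dict(impact))

# audit()  # point it at your exported history
```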

Day 3-4: Alert Redesign

  • [ ] Map causes to symptoms
  • [ ] Design new symptom-based alerts
  • [ ] Create decision trees for escalation
  • [ ] Document expected actions

Day 5-7: Implementation & Validation

  • [ ] Deploy new alerts in parallel
  • [ ] Validate coverage (use the last9.io coverage tool to confirm it's discovering new elements)
  • [ ] Train team on new approach
  • [ ] Remove old alerts
💡 Quick Win: Start with your noisiest alerts first. Quick victories build momentum.

Alert Decision Tree

Is it user-impacting?
├── No → Not an alert
└── Yes → Continue

Requires immediate action?
├── No → Dashboard or ticket
└── Yes → Continue

Can wait until morning?
├── Yes → Create ticket
└── No → Create alert
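
If you want to bake this tree into an alert-review script, it's a dozen lines. A minimal sketch (the route names are illustrative):

```python
from enum import Enum

class Route(Enum):
    DROP = "not an alert"
    TICKET = "dashboard or ticket"
    PAGE = "page someone"

def route_signal(user_impacting: bool,
                 needs_immediate_action: bool,
                 can_wait_until_morning: bool) -> Route:
    if not user_impacting:
        return Route.DROP
    if not needs_immediate_action:
        return Route.TICKET
    if can_wait_until_morning:
        return Route.TICKET
    return Route.PAGE

print(route_signal(True, True, False))   # Route.PAGE
print(route_signal(True, False, True))   # Route.TICKET
```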

Alerting system flow chart

Alerts need a response:

  Response                 | SRE Book | Delivered to           | Based on
  -------------------------|----------|------------------------|--------------------
  Immediate                | Alerts   | Pager                  | Symptoms
  Act eventually           | Tickets  | Issue Tracker / Email  | Symptoms or Causes
  None (Diagnostics Only)  | Logs     | Dashboards             | Causes

Even with the best intentions, certain alert patterns keep cropping up like weeds in a garden. Let's look at the most common ones I've encountered and how to root them out permanently...

5. Troubleshooting Common Pitfalls

Infrastructure Obsession Patterns

1. The CPU Alert Trap

Before: Alert on CPU > 80%

  • Wakes you up when batch jobs run
  • Triggers during normal traffic spikes
  • Doesn't catch real problems

Problem:

  • Modern CPUs are designed to run at high utilization
  • Auto-scaling systems make this metric less relevant (assuming auto-scaling is set up correctly)
  • No direct correlation with user experience
  • Different services have different CPU patterns

Better Approach:

  • Alert on service latency increase
  • Monitor error rates
  • Track request success rates
  • Set up Golden Signals monitoring
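
Here's a minimal Golden Signals sketch along those lines, with illustrative thresholds: saturation stays on the dashboard, and only user-facing signals can page.

```python
from dataclasses import dataclass

@dataclass
class GoldenSignals:
    p99_latency_ms: float
    requests_per_s: float
    error_rate: float        # fraction of requests failing
    cpu_utilization: float   # saturation: context for capacity planning, not a page

def page_on_golden_signals(sig: GoldenSignals) -> bool:
    # Page on what users feel; leave saturation to dashboards.
    return sig.p99_latency_ms > 500 or sig.error_rate > 0.02

sig = GoldenSignals(p99_latency_ms=120, requests_per_s=800,
                    error_rate=0.001, cpu_utilization=0.92)
print(page_on_golden_signals(sig))   # False: CPU is hot, users are fine
```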

2. The Memory Usage Maze

Before: Alert on memory > 90%

  • JVM applications look like they're always about to crash
  • Triggers during normal garbage collection
  • Causes panic over normal behavior

Problem:

  • Modern systems handle memory dynamically
  • Garbage collection makes usage patterns complex
  • Different applications have different memory profiles
  • High memory usage often means efficient caching

Better Approach:

  • Alert on OOM events
  • Monitor application errors
  • Track swap usage trends
  • Watch for memory leak patterns
📊 One team reduced their memory-related alerts by 95% by focusing on application errors instead of usage percentages.
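
One way to alert on OOM events is to count the OOM killer's messages in the kernel log. A rough sketch, assuming a Linux host where those messages land in /var/log/kern.log (the path and exact wording vary by distro and logging setup):

```python
from pathlib import Path

def count_oom_kills(log_path: str = "/var/log/kern.log") -> int:
    # The OOM killer logs "Out of memory" lines when it kills a process.
    text = Path(log_path).read_text(errors="ignore")
    return sum("Out of memory" in line for line in text.splitlines())

# if count_oom_kills() > 0: that's worth waking up for -- high memory usage alone isn't.
```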

3. The Disk Space Dilemma

Before: Alert on disk > 85%

  • Constant alerts on logging servers
  • Weekly alerts on backup systems
  • Never catches real problems in time

Problem:

  • Different volumes have different growth patterns
  • Percentage-based alerts don't consider volume size
  • Some services need high utilization
  • Doesn't account for cleanup jobs

Better Approach:

  • Alert on projected fill time < 4 hours
  • Monitor write failures
  • Track growth rate changes
  • Set up automated cleanup processes
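
Projected fill time is just free space divided by the recent write rate. A minimal sketch with made-up numbers:

```python
def hours_until_full(free_bytes: float, bytes_written_last_hour: float) -> float:
    if bytes_written_last_hour <= 0:
        return float("inf")    # not growing, nothing to page about
    return free_bytes / bytes_written_last_hour

# 120 GB free, filling at 40 GB/hour: roughly 3 hours left, so this pages
# even though the volume might only be 60% full.
remaining = hours_until_full(free_bytes=120e9, bytes_written_last_hour=40e9)
print(f"{remaining:.1f}h until full -> page: {remaining < 4}")
```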

4. The Connection Pool Panic

Before: Alert on DB connections > 80%

  • Triggers during normal peak hours
  • Misses actual connection problems
  • Causes unnecessary scaling

Problem:

  • Connection pools are meant to be utilized
  • Percentage-based thresholds ignore pool size
  • Different services have different usage patterns
  • Misses real connection issues

Better Approach:

  • Monitor connection timeouts
  • Track query latency
  • Alert on connection errors
  • Watch for connection leaks
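
A minimal sketch of that shift, with illustrative thresholds: pool utilization stays on the dashboard, while timeouts and query latency decide whether anyone gets paged.

```python
def page_on_db_health(timeouts_last_5m: int, p95_query_ms: float) -> bool:
    # Pool utilization belongs on a dashboard; only real failures page.
    return timeouts_last_5m > 0 or p95_query_ms > 250

# Busy but healthy: 95% of the pool in use, no timeouts, fast queries -> no page.
print(page_on_db_health(timeouts_last_5m=0, p95_query_ms=40))    # False
# Half-idle pool but queries timing out -> page.
print(page_on_db_health(timeouts_last_5m=3, p95_query_ms=800))   # True
```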

5. The Queue Length Obsession

Before: Alert on queue length > 1000

  • Ignores processing rate
  • Triggers during normal spikes
  • Misses actual processing problems

Problem:

  • Queue length alone is meaningless
  • Different queues have different patterns
  • Normal backlog isn't always bad
  • Doesn't catch stuck processors

Better Approach:

  • Monitor processing rate changes
  • Alert on time-in-queue
  • Track completion rates
  • Watch for stuck messages
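
Time-in-queue is backlog divided by processing rate, which also catches the stuck-processor case that raw length alerts miss. A sketch with illustrative numbers:

```python
def minutes_to_drain(queue_length: int, processed_per_minute: float) -> float:
    if processed_per_minute <= 0:
        return float("inf")    # stuck processor -- the case length thresholds miss
    return queue_length / processed_per_minute

print(minutes_to_drain(5000, 2000))   # 2.5 -- scary-looking backlog, drains in minutes
print(minutes_to_drain(800, 0))       # inf -- small queue, real problem
```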

6. The Error Rate Rabbit Hole

Before: Alert on error rate > 1%

  • Triggers during normal retries
  • Ignores error impact
  • Causes alert fatigue

Problem:

  • Not all errors are equal
  • Some errors are expected
  • Percentage doesn't consider volume
  • Misses critical errors in low-traffic periods

Better Approach:

  • Alert on critical error patterns
  • Monitor error impact on users
  • Track error rate changes
  • Focus on non-recoverable errors
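
A sketch of "not all errors are equal", using hypothetical paths and status codes: page on non-retryable failures on critical paths rather than on a flat percentage.

```python
RETRYABLE = {429, 503}                     # typically absorbed by client retries
CRITICAL_PATHS = {"/checkout", "/login"}   # where an error really hurts

def page_on_errors(recent_errors: list[tuple[str, int]]) -> bool:
    """recent_errors: (request_path, http_status) pairs from the last few minutes."""
    critical = [e for e in recent_errors
                if e[1] >= 500 and e[1] not in RETRYABLE and e[0] in CRITICAL_PATHS]
    return len(critical) > 0   # a handful of these beats 1% of noise elsewhere

print(page_on_errors([("/healthz", 503), ("/search", 500)]))   # False
print(page_on_errors([("/checkout", 500)]))                    # True
```
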
⚠️ Warning Signs You're Doing It Wrong:
  • Your alerts trigger at the same time every day
  • Nobody responds to alerts immediately anymore
  • You have more alerts than services
  • Your runbooks are longer than your code

7. The Latency Layer Cake

Before: Alert on latency > 100ms at every layer

  • Database latency alerts
  • API latency alerts
  • Cache latency alerts
  • Network latency alerts

Problem:

  • Creates alert storms
  • Ignores system interactions
  • Misses real user impact
  • Causes duplicate investigations

Better Approach:

  • Monitor end-to-end latency
  • Track latency changes
  • Alert on user experience
  • Use latency budgets
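
A latency-budget sketch with illustrative numbers: each layer gets a slice of the end-to-end target, the per-layer numbers feed dashboards, and only the end-to-end number can page.

```python
END_TO_END_BUDGET_MS = 300
budget_ms = {"edge": 30, "api": 80, "db": 120, "cache": 20, "headroom": 50}
assert sum(budget_ms.values()) == END_TO_END_BUDGET_MS

def page_on_latency(end_to_end_p99_ms: float) -> bool:
    # Only the end-to-end number can page.
    return end_to_end_p99_ms > END_TO_END_BUDGET_MS

def budget_overruns(observed_ms: dict[str, float]) -> dict[str, float]:
    # Dashboard material: which layer is eating the budget, not who to page.
    return {layer: observed_ms.get(layer, 0.0) - allowance
            for layer, allowance in budget_ms.items() if layer != "headroom"}

print(page_on_latency(280))   # False: within budget, even if one layer is slow
print(budget_overruns({"edge": 25, "api": 150, "db": 90, "cache": 10}))
# {'edge': -5, 'api': 70, 'db': -30, 'cache': -10} -> the API layer is eating the budget
```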

How to Fix These Anti-patterns:

  1. Audit Your Current Alerts

For each alert, ask:
- Has this ever caught a real issue?
- How many false positives last month?
- What action was taken when it fired?
- Could we catch this another way?

  2. Transform Your Alerts

Move from:
- Resource usage alerts
- Individual component alerts
- Static thresholds

To:
- User impact alerts
- System behavior alerts
- Dynamic baselines

  3. Set Up Proper Monitoring

Implement:
- Golden Signals monitoring
- SLO-based alerts
- Trend analysis
- Capacity planning
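
For SLO-based alerts, the usual building block is burn rate: the observed error rate divided by the error budget, checked over a long and a short window so brief blips don't page. A sketch using a commonly cited threshold pair; tune the numbers to your own SLO and paging policy:

```python
SLO_TARGET = 0.999                 # 99.9% of requests succeed
ERROR_BUDGET = 1 - SLO_TARGET      # 0.1% may fail

def burn_rate(error_rate: float) -> float:
    return error_rate / ERROR_BUDGET

def page_on_burn_rate(error_rate_1h: float, error_rate_5m: float) -> bool:
    # Both windows must be burning fast: the long one proves it's sustained,
    # the short one proves it's still happening.
    return burn_rate(error_rate_1h) > 14.4 and burn_rate(error_rate_5m) > 14.4

print(page_on_burn_rate(error_rate_1h=0.0002, error_rate_5m=0.01))   # False: brief blip
print(page_on_burn_rate(error_rate_1h=0.02, error_rate_5m=0.03))     # True: budget on fire
```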

Remember:

  • Every alert should have a clear action
  • If you're not going to wake up for it, it's not an alert
  • Dashboards are for patterns, alerts are for problems
  • The best alert is the one you never need to send
🎯 Pro Tip: Keep a "graveyard dashboard" of deleted alerts with metrics showing how many false positives you avoided. It helps justify the changes to everyone.

6. Maintenance and Evolution

Getting your alerts under control is just the beginning. Here's how to keep them that way...

Monthly Checklist

  • [ ] Review alert frequency
  • [ ] Check false positive rates
  • [ ] Update thresholds based on business changes
  • [ ] Remove unused alerts
  • [ ] Include 'deleted alerts' in the operational review, just like 'deleted code'

Quarterly Review

  • [ ] Full alert audit
  • [ ] Update runbooks
  • [ ] Team feedback session
  • [ ] Adjust based on post-mortems

Key Metrics to Track

  1. Alert-to-incident ratio
  2. Mean time to acknowledge
  3. False positive rate
  4. Alert frequency per service
🎯 Pro Tip: Keep a "graveyard" dashboard of deleted alerts. It's motivating to see how far you've come.
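
All four are cheap to compute from the same alert-history export you used in the Week 1 audit. A sketch with hypothetical records:

```python
from statistics import mean

alerts = [
    # (service, was_real_incident, minutes_to_acknowledge)
    ("checkout", True, 4), ("checkout", False, 30),
    ("search", False, 55), ("search", False, 120),
]

incidents = [a for a in alerts if a[1]]
print("alert-to-incident ratio:", len(alerts) / max(len(incidents), 1))    # 4.0
print("mean time to acknowledge:", mean(a[2] for a in alerts), "minutes")  # 52.25
print("false positive rate:", 1 - len(incidents) / len(alerts))            # 0.75

per_service: dict[str, int] = {}
for service, _, _ in alerts:
    per_service[service] = per_service.get(service, 0) + 1
print("alerts per service:", per_service)   # {'checkout': 2, 'search': 2}
```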

The Final Word

Your monitoring system should be like a good assistant: it only interrupts you for things that truly matter. Everything else can wait for office hours.

Action Items For Tomorrow

  1. Count your current alerts
  2. Pick your noisiest alert
  3. Map it to user impact
  4. Transform or delete it
  5. Measure the results

Remember:

  • Users don't care about your infrastructure
  • They care about their experience
  • Monitor what they care about
  • Use causes for diagnosis, not pages

May your pager be quiet, and your sleep be deep.


Written by someone who's deleted more alerts than they've created, and sleeps better for it.

We celebrate when code is deleted; we should celebrate when alerts are deleted, too.
