Battling Alert Fatigue

Alert fatigue occurs when your team is exhausted by the sheer volume of alerts it receives. Team members either become indifferent to alerts or burn out trying to resolve them all. Either way, it's a warning sign that your monitoring and alerting strategy is failing.

Still, your team needs to know when technical issues affect your users so that it can correct them quickly. This is especially important for SRE and DevOps teams, as they are the front line in most organizations. Problems that aren’t addressed can lead to poor user experience (UX) and damage your company’s reputation. Proper alert management means finding a balance between offering the best possible service to your users and keeping your developers focused on growing your business with the least interruption.

In this article, you’ll learn what alert fatigue is and which techniques reduce it, so you can protect your team from the mental strain of frequent context switching and the stress of being in constant firefighting mode.

What Is Alert Fatigue?

Alert fatigue, also called alarm fatigue, occurs when an excessive volume of alerts causes desensitization among individuals responsible for addressing them. This desensitization leads to disregarded or delayed responses to alerts, ultimately resulting in crucial alerts being overlooked.

The primary issue lies in the overwhelming quantity of alerts. A single alert can be promptly addressed, even if it disrupts an on-call employee's regular workflow or leisure time. However, the task becomes more challenging when a dozen alerts follow in quick succession. As the number of alerts increases, employees will likely miss something important.

The prevalence of false alarms further exacerbates this problem. In the medical field, research indicates that anywhere from 72% to 99% of all clinical alarms are false. Similarly, in security, a survey revealed that 52% of alerts were false, and 64% were redundant.

The sheer volume of false alerts trains workers to assume that most alerts are false and to act accordingly. In a hospital, for example, a doctor or pharmacist may dismiss an overdose alert from the system, assuming it is just another insignificant alarm.

11 Techniques to Reduce Alert Fatigue

The following are tips and best practices to keep your team from dealing with alert fatigue.

Use an On-Call Rotation

Instead of bombarding the entire team, balance the load between team members. Team members should take turns on call, typically for a week at a time, or on whatever schedule suits the team best. When the person on call receives an alert, they’re responsible for all actions needed to resolve it, including opening an incident and involving other team members.

The advantage of this approach is that it takes the pressure off the rest of the team, ensuring that they can focus on deeper tasks.

This change in the team’s schedule often requires some logistical adjustments. For example, it can be disruptive for the whole team to receive notifications via a Slack channel or group email, so using text messages or push notifications might be a better solution.
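As a rough illustration of a weekly rotation, here is a minimal sketch that picks the person on call for a given date. The roster names, the start date, and the `on_call_for` helper are all hypothetical; a real paging tool would also handle overrides, holidays, and time zones.

```python
from datetime import date


def on_call_for(day: date, roster: list[str], rotation_start: date) -> str:
    """Return who is on call on `day`, rotating weekly from `rotation_start`."""
    if not roster:
        raise ValueError("roster must not be empty")
    weeks_elapsed = (day - rotation_start).days // 7
    return roster[weeks_elapsed % len(roster)]


roster = ["alice", "bob", "carol"]  # hypothetical team members
start = date(2024, 1, 1)            # a Monday, chosen only for illustration

print(on_call_for(date(2024, 1, 10), roster, start))  # second week -> "bob"
```

Only the returned person is paged by text message or push notification; everyone else stays out of the loop until they are explicitly pulled in.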

Create Escalation Policies

What if the person on call isn’t available or an alert is accidentally missed? It’s inevitable that some alerts won’t get addressed, even if you apply all other best practices. As a result, you need to account for those what-if scenarios and have a backup solution.

Escalation tells the alerting system what to do if the alert isn’t resolved after a set time. Combining an on-call rotation with escalation policies helps limit alert fatigue at every level of the chain.

Be careful when configuring your escalation policies. If escalations fire too quickly, alerts reach people higher up the chain before they are supposed to receive them, which adds to their stress. Escalation is meant to deal with stalled alerts; it should not aggravate alert fatigue.

It can also be tempting to put your engineering manager or VPs at the last level of an escalation policy. They most likely don't want to be bothered unless an alert is truly urgent. This is another reason to plan your escalation policy carefully, so that they’re not pulled into the incident thread sooner than needed.
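To make the idea concrete, here is a minimal sketch of an escalation policy expressed as data. The level names and the delays (15 and 45 minutes) are assumptions for illustration only; choose delays that give each level a fair chance to respond before the next one is paged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EscalationStep:
    notify: str         # who gets paged at this level
    after_minutes: int  # minutes without acknowledgment before paging them


# Hypothetical policy; the delays are placeholders, not recommendations.
POLICY = [
    EscalationStep("primary-on-call", 0),
    EscalationStep("secondary-on-call", 15),
    EscalationStep("engineering-manager", 45),
]


def who_to_notify(minutes_unacknowledged: int, policy=POLICY) -> list[str]:
    """Return every level that should have been paged by now."""
    return [
        step.notify
        for step in policy
        if minutes_unacknowledged >= step.after_minutes
    ]


print(who_to_notify(20))  # ['primary-on-call', 'secondary-on-call']
```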

Prioritize Alerts

Not all alerts are equal. Create different alert levels so that you can prioritize the most important alerts and postpone the others. For instance, a notification that 80 percent of the database disk is being used is less crucial than one reporting 90 percent usage. The second alert requires immediate attention, while the first is a friendly reminder to take action later in the day.

Differentiating between emergency and non-emergency alerts is essential. You can set up additional priority levels, but doing so often creates a classification problem. To keep things clear, try sorting alerts into just those two categories. You could also consider delaying less important alerts when an important one occurs so that the person on call can focus on one alert at a time.
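The disk example above can be sketched as a two-level classifier. The `Priority` names and the `classify_disk_usage` helper are hypothetical; the 80 and 90 percent thresholds come from the example and should be tuned to your own systems.

```python
from enum import Enum


class Priority(Enum):
    NONE = "none"
    NON_EMERGENCY = "non-emergency"  # handle later in the day
    EMERGENCY = "emergency"          # page the person on call now


def classify_disk_usage(
    used_percent: float, warn_at: float = 80.0, page_at: float = 90.0
) -> Priority:
    """Map a disk usage percentage to one of two alert levels (or none)."""
    if used_percent >= page_at:
        return Priority.EMERGENCY
    if used_percent >= warn_at:
        return Priority.NON_EMERGENCY
    return Priority.NONE


print(classify_disk_usage(85.0))  # Priority.NON_EMERGENCY
```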

Use a Well-Calculated Threshold

In essence, alert fatigue comes from receiving a large number of alerts frequently. If you find yourself deluged with alerts, a simple way to reduce the volume is to configure your thresholds better.

Not every error needs immediate intervention. Short bursts of errors often resolve themselves, while errors that are always present feel like false positives and discourage the person on call from investigating further. To reduce alert fatigue, account for both of these factors in your threshold calculation.

For instance, you can use time buckets. A time bucket is an interval (say, one second) over which you perform a mathematical operation such as a sum or average. Instead of relying on a flat threshold that fires on occasional spikes, build alert thresholds on time buckets so that they focus on errors sustained over a longer period of time.
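Here is a minimal sketch of a bucket-based threshold. It counts errors per bucket and only alerts when several consecutive buckets exceed the limit. The bucket size, the per-bucket limit, and the number of consecutive buckets are assumed values, and `should_alert` is a hypothetical helper rather than any monitoring tool's API.

```python
from collections import Counter


def should_alert(
    error_timestamps: list[float],
    bucket_seconds: int = 60,
    max_errors_per_bucket: int = 5,
    min_consecutive: int = 3,
) -> bool:
    """Alert only if `min_consecutive` adjacent buckets exceed the limit."""
    counts = Counter(int(ts // bucket_seconds) for ts in error_timestamps)
    streak = 0
    previous = None
    for bucket in sorted(counts):
        if counts[bucket] > max_errors_per_bucket:
            # Extend the streak only if this bucket directly follows the last noisy one.
            adjacent = previous is not None and bucket == previous + 1
            streak = streak + 1 if adjacent else 1
            previous = bucket
            if streak >= min_consecutive:
                return True
        else:
            streak = 0
            previous = None
    return False


spike = [float(i) for i in range(20)]                              # 20 errors in one minute
sustained = [m * 60.0 + i for m in range(3) for i in range(6)]     # 6 errors/min for 3 minutes
print(should_alert(spike), should_alert(sustained))  # False True
```

The short spike is ignored even though it contains more errors in total, because it does not persist across buckets.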

Finally, continuously review errors and exclude known errors from your calculation, whether or not you plan to fix them.

Use Statistical Analysis

You can’t have truly smart alerts with simple thresholds. To make the best use of your data and reduce flaky alerts, you need to analyze historical data. This approach is more complex, but it’s worth your while. Platforms are increasingly offering smart thresholds based on this kind of analysis.

The first step is to identify your system's typical behavior and develop a model that defines how it responds. You need to figure out the frequency of your errors at a given time and use this data in your calculations for thresholds.

For instance, you could take the rate of HTTP 500 errors (as a percentage of requests) at a given time and check whether it falls outside the mean value of the previous days, plus or minus three times the standard deviation. This approach is part of Statistical Process Monitoring (SPM), or checking the stability of the process, and it’s a powerful and underutilized tool to eliminate alert fatigue.
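The three-standard-deviation check can be sketched with Python's standard library. The sample history and the `is_anomalous` helper are illustrative assumptions; in practice you would compare against values from the same time of day on previous days.

```python
from statistics import mean, stdev


def is_anomalous(
    current_rate: float, historical_rates: list[float], sigmas: float = 3.0
) -> bool:
    """True if `current_rate` lies outside mean +/- `sigmas` standard deviations."""
    if len(historical_rates) < 2:
        raise ValueError("need at least two historical data points")
    mu = mean(historical_rates)
    sigma = stdev(historical_rates)  # sample standard deviation
    return abs(current_rate - mu) > sigmas * sigma


# Hypothetical HTTP 500 rates (percent of requests) at the same hour on previous days.
history = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1.0]

print(is_anomalous(1.2, history))  # False: within normal variation
print(is_anomalous(2.5, history))  # True: far outside the usual range
```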

Center Alerts around User Impact

The whole point of alerts is to support your users. You want to ensure they’re not dealing with bugs and infrastructure issues. Conversely, nobody should be paged unless a meaningful proportion of users is affected by a degraded metric.

This is the philosophy introduced by Google in its site reliability engineering (SRE) handbook. To follow its approach to monitoring and alerting, you need to define service level objectives (SLOs). An example SLO might be that 95 percent of user login requests must be processed in under 200 ms.

SLOs are composed of two elements: an indicator (here, login latency, which represents slowness in the system) and a target percentage, which in turn defines an error budget (here, the 5 percent of requests that are allowed to be slower). In the context of SLOs, the indicators are also called service level indicators (SLIs). Unless the budget is consumed, the teams responsible for the login shouldn’t get an alert because most users are not impacted. However, monitoring the degradation of an SLO (but not a violation) and alerting product teams separately from the typical on-call rotation can help you proactively avoid a deterioration in the quality of service.
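As a rough sketch of the login example, the code below computes how much of the error budget has been consumed and decides who, if anyone, should hear about it. The 75 percent warning level, the action names, and both helper functions are assumptions for illustration; the budget here is the share of requests allowed to miss the 200 ms target.

```python
def error_budget_consumed(
    total_requests: int, slow_requests: int, slo_target: float = 0.95
) -> float:
    """Fraction of the error budget used (1.0 means fully consumed)."""
    if total_requests == 0:
        return 0.0
    allowed_slow = (1.0 - slo_target) * total_requests
    return slow_requests / allowed_slow


def slo_alert_action(consumed: float, warn_at: float = 0.75) -> str:
    """Page on-call only on a violation; warn the product team on degradation."""
    if consumed >= 1.0:
        return "page-on-call"
    if consumed >= warn_at:
        return "notify-product-team"
    return "no-alert"


# 10,000 logins, 300 slower than 200 ms: about 60% of the budget is used.
consumed = error_budget_consumed(10_000, 300)
print(slo_alert_action(consumed))  # no-alert
```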

Implement Custom Metrics

The data you can get by default from your application is often limited (CPU, RAM, error logs, etc.) and tells you very little about the inner workings of your application. This leads to two problems: you don’t have much information about the root cause of a problem, and the alert threshold is hard to define because it doesn’t represent tangible quantities.

Implementing custom metrics in your application will give you more granular data, which can help pinpoint the root cause of a problem more quickly than when you’re relying on generic metrics such as response time or requests per second. A custom metric could be the latency of process steps (such as image processing) or the conversion rate of a given page. If no one is using a feature like a call-to-action button, that could mean you have a problem.

With custom metrics in place, when an issue arises you know which subsystem is affected and have more detailed context about the problem's urgency.
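Here is a minimal sketch of recording the latency of a single process step. The in-memory `step_latencies` dictionary is a stand-in for whatever metrics backend you use, and `timed_step` is a hypothetical helper, not part of any library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# In-memory stand-in for a real metrics backend (assumption for illustration).
step_latencies: dict[str, list[float]] = defaultdict(list)


@contextmanager
def timed_step(name: str):
    """Record how long the wrapped block takes, in seconds, under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        step_latencies[name].append(time.perf_counter() - start)


with timed_step("image_processing"):
    sum(range(100_000))  # placeholder for the real work

print(len(step_latencies["image_processing"]))  # 1
```

Because the measurement is tied to a named step, an alert built on it points directly at the subsystem that slowed down.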

Keep Alerts Actionable

There is nothing more stressful than seeing an alert pop up and not knowing what to do. Vague and non-actionable alerts inevitably lead to alert fatigue because the person on call can’t resolve them, gives up, and starts ignoring them.

When adding alerts, create an action plan to address them so that everyone on your team knows what to do when that alert pops up. The easiest plan of action is to add links in the alert pointing to all the relevant resources (dashboard, GitHub repository, etc.) for quick action. You can also make it easier to resolve alerts if you create runbooks, which are step-by-step instructions including scripts to troubleshoot various issues.
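One way to enforce this is to refuse alert definitions that lack the links a responder needs. The sketch below assumes a simple `Alert` structure and a required set of two links; the field names and the `example.com` URLs are placeholders.

```python
from dataclasses import dataclass, field

REQUIRED_LINKS = ("dashboard", "runbook")  # assumed minimum for an actionable alert


@dataclass
class Alert:
    title: str
    summary: str
    links: dict[str, str] = field(default_factory=dict)


def missing_links(alert: Alert) -> list[str]:
    """Return the names of required links that the alert does not provide."""
    return [name for name in REQUIRED_LINKS if not alert.links.get(name)]


alert = Alert(
    title="Login latency above target",
    summary="Slow logins are eating into the error budget.",
    links={"dashboard": "https://example.com/dashboards/login"},  # placeholder URL
)

print(missing_links(alert))  # ['runbook']
```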

Reduce Duplicate Alerts

Eliminating alert fatigue means reducing the frequency of alerts, especially identical alerts. Multiple alerts raised by the same rule (for example, the same metrics) should be combined into one alert. If your team receives an identical alert and the first alert is resolved, you might want to set a delay before it can retrigger, perhaps putting the alert in sleep mode for ten to fifteen minutes. That way you reduce your team’s frustration over getting an alert too frequently; it shouldn’t require immediate attention if they just dealt with the issue.
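The sleep-mode idea can be sketched as a small cooldown filter keyed by rule. The class name, the rule identifiers, and the ten-minute default are assumptions for illustration.

```python
class AlertDeduplicator:
    """Suppress repeats of the same rule during a cooldown window."""

    def __init__(self, cooldown_seconds: float = 600.0):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: dict[str, float] = {}

    def should_send(self, rule_id: str, now: float) -> bool:
        last = self._last_sent.get(rule_id)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # same rule fired recently: stay quiet
        self._last_sent[rule_id] = now
        return True


dedup = AlertDeduplicator(cooldown_seconds=600)
print(dedup.should_send("disk-usage", now=0))    # True: first occurrence
print(dedup.should_send("disk-usage", now=300))  # False: within the cooldown
print(dedup.should_send("disk-usage", now=700))  # True: cooldown has elapsed
```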

When a service has issues, many dependent systems can be affected and raise their own alerts quickly, leading to a chaotic situation where too many people are paged simultaneously. Avoiding duplicate alerts involves understanding these dependencies and grouping alerts at the service level rather than for every component involved.

Share your duplicate-alert figures with your team (the number of alerts per day or week, for instance) so they can review the alerts and adjust the threshold if necessary. Reducing alert duplication is a preventative measure, but the underlying problem can usually be solved with one of the tips listed above, such as using SLOs, SPM, or custom metrics.
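Service-level grouping can be sketched as a lookup from component to owning service. The component names and the mapping are hypothetical; in a real system this information would come from your service catalog or dependency graph.

```python
from collections import defaultdict

# Hypothetical mapping from component to the service it belongs to.
COMPONENT_TO_SERVICE = {
    "login-db": "login",
    "login-api": "login",
    "login-cache": "login",
    "thumbnailer": "images",
}


def group_by_service(alerts: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Collapse (component, message) alerts into one entry per service."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for component, message in alerts:
        service = COMPONENT_TO_SERVICE.get(component, component)
        grouped[service].append(f"{component}: {message}")
    return dict(grouped)


incoming = [
    ("login-db", "connection pool exhausted"),
    ("login-api", "5xx rate elevated"),
    ("thumbnailer", "queue backlog"),
]
print(sorted(group_by_service(incoming)))  # ['images', 'login']
```

Three component alerts become two notifications, one per service, so only the owners of the affected services are paged.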

Create Alert Lifecycles

Alerts need to have a story and a purpose—they shouldn’t just be annoying noise that pops up throughout the day. Alerts also shouldn’t accumulate indefinitely in your system.

You can define a workflow that links your bug tickets to your alerts. When an alert is linked to a bug, mute that alert until the date by which the bug is supposed to be resolved.

In other words, you should define how long an alert lives and when you can mute it, and you should create cleanup tasks to delete older alerts that are no longer relevant. Plan reviews with your team, either monthly or at each on-call handover, to go over the errors that occurred the most and caused the most fatigue.
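A minimal sketch of these two lifecycle rules follows: mute an alert while its linked bug is still within its due date, and flag alerts that have not fired for a long time as cleanup candidates. The helper names and the 90-day limit are assumptions.

```python
from datetime import date
from typing import Optional


def is_muted(today: date, linked_bug_due: Optional[date]) -> bool:
    """Mute while a linked bug exists and its due date has not passed."""
    return linked_bug_due is not None and today <= linked_bug_due


def is_stale(today: date, last_fired: date, max_age_days: int = 90) -> bool:
    """Flag alerts that have not fired for longer than `max_age_days`."""
    return (today - last_fired).days > max_age_days


today = date(2024, 6, 1)
print(is_muted(today, date(2024, 6, 10)))  # True: bug fix is still pending
print(is_muted(today, date(2024, 5, 20)))  # False: due date passed, alert is live again
print(is_stale(today, date(2024, 1, 1)))   # True: candidate for cleanup
```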

Create Runbooks and Postmortems

If your team doesn’t know what to do when a problem occurs, this will increase their stress level and lead to inefficient responses. To reduce the mental strain on your team, create a runbook procedure that tells them what to do in the event of an incident.

Whenever you create a new alert, you should also document what is expected from the person on call. Provide all relevant information, such as system diagrams (what components are involved), links to dashboards and logs, steps to resolve the problem, and who to call if the resolution procedure doesn’t work.

Ultimately, you can’t manage alerts effectively unless you draw on knowledge gained from previous incidents. This means you should also document all incidents in a postmortem so your team knows what work has been done in the past, and you can identify what needs to be updated in your runbooks.

Postmortems and runbooks work together, helping you and your team feel confident that you are constantly improving your system and its reliability.

Conclusion

DevOps engineers have options for reducing alert fatigue and helping their teams feel less burdened throughout the day. Battling alert fatigue comes down to reducing the quantity of alerts (how many) and their frequency (how often).

To reduce the quantity, you should apply tips that improve the quality of your alerts, such as reducing duplicate alerts, using a well-calculated threshold, applying SPM, creating user-centered alerts with SLOs, and implementing custom metrics.

To reduce the frequency, prioritize alerts, create an on-call rotation and escalation policy, set up an alert lifecycle, and use frequent reviews to improve the process.

Combining these different practices will make your alert systems more efficient and give you a happier, more productive team.
