Rethinking Anomaly Detection: Focus on business outcomes
From the trenches at Games24x7 — Sanjay, on how Reliability engineering should drive core business metrics
Feb 16th, ‘23 / 4 min read
A customer does not care if your database is down.
A customer does not care if your EMEA server is down.
A customer does not care if your 3rd party authentication is down.
Yet a disproportionate amount of alerting is built around servers and services, and this construct has not meaningfully shifted toward a top-down approach: reliability engineering that drives core business metrics.
We need to rethink how we frame alerting and focus on business outcomes. This is hard to instrument in large organizations, but for folks starting up, I have some advice, and hopefully this plea will convince you to rethink an important cog in your engineering organization's observability journey.
More alerts = less Observability, infinite pain
One of our initiatives at Games24x7 was to build the ability to detect anomalies. We wanted to move away from conventional static alerting, because the statistical methods it relies on can, by themselves, produce false negatives and false positives that misdirect the organization's efforts.
There are three kinds of metrics one needs to track closely: infra, services, and business. Most monitoring tools cover infra and service metrics, using static thresholds to generate alerts. This does not help detect anomalies in business metrics, because the KPIs influencing those metrics are complex and unique to each organization, which makes standard monitoring tools ineffective at spotting them. This is what we intended to change.
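To make the contrast with static thresholds concrete, here is a minimal sketch. The post does not describe the model Games24x7 actually uses, so a simple z-score against recent history stands in for it; the metric ("games started per minute"), the sample values, and the function names are all assumptions for illustration.

```python
from statistics import mean, stdev

def static_alert(value: float, threshold: float) -> bool:
    """Conventional static alert: fire only when the value falls below a fixed floor."""
    return value < threshold

def zscore_anomaly(history: list, value: float, z_limit: float = 3.0) -> bool:
    """Flag a value that deviates from the recent baseline by more than z_limit standard deviations."""
    if len(history) < 2:
        return False  # not enough data to form a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return value != mu  # flat baseline: any change is unusual
    return abs(value - mu) / sigma > z_limit

# Hypothetical "games started per minute" for the same hour on previous days.
history = [980, 1010, 995, 1020, 1005, 990, 1000]
current = 820

print("static alert fired:", static_alert(current, threshold=500))
print("anomaly detected:", zscore_anomaly(history, current))
```

With these made-up numbers, a static floor of 500 stays silent while the baseline comparison flags the drop, which is the kind of business-metric deviation a fixed threshold tends to miss.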
We started talking to the business about the things that mattered: What are the most critical things we can give you insights on? Which core business metrics can make or break the customer experience? How much do we stand to lose if this particular service goes down? Once we had those answers, we instrumented our anomaly-based alerting accordingly. With this approach, our focus shifted to improving the accuracy and effectiveness of real-time anomaly detection in order to achieve business outcomes.
Every incident is caused by a change; the challenge is tracking and responding to that change.
Initially, ~90% of our alerts were false alarms. However, the legitimate 10% gave us the ability to detect an anomaly or issue very quickly. Despite the low signal-to-noise ratio, this mattered because of how critical these alerts were and how much customer-experience degradation they let us catch. This is one example of how our entire process starts and ends with the business.
Every service onboarded to this new observability platform defines two levels for alerting: Warning, Critical.
Warning: a Slack message, giving us the ability to detect a potential issue at an early stage.
Critical: an automated phone call with escalation, in case we miss taking action at the warning level.
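The two-level scheme above can be sketched roughly as follows. The percentage cut-offs and the channel names are hypothetical placeholders, not values from the platform described here, and no real Slack or telephony API is called.

```python
from enum import Enum
from typing import Optional

class Severity(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"

# Hypothetical channel names standing in for real Slack and paging integrations.
CHANNELS = {
    Severity.WARNING: "slack",
    Severity.CRITICAL: "phone_call_with_escalation",
}

def classify(deviation_pct: float, warn_pct: float = 10.0, crit_pct: float = 25.0) -> Severity:
    """Map a metric's deviation from its baseline (in percent) to a severity level."""
    magnitude = abs(deviation_pct)
    if magnitude >= crit_pct:
        return Severity.CRITICAL
    if magnitude >= warn_pct:
        return Severity.WARNING
    return Severity.OK

def route(severity: Severity) -> Optional[str]:
    """Return the notification channel for a severity, or None when nothing should fire."""
    return CHANNELS.get(severity)

for deviation in (4.0, -12.5, 31.0):
    level = classify(deviation)
    print(deviation, level.value, route(level))
```

Keeping the levels and their channels in one small table is one way to make every onboarded service declare both thresholds up front.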
The standardization of services and metrics followed naturally after the initial hard work of enforcing these stringent policies. We moved from a bottom-up approach of defining metrics to a top-down process of understanding business impact and then creating correlated, actionable alerts. We now have clear policy templates for onboarding new services and for defining each service's blast radius so that events can be correlated.
Accountability with alerting
On the first critical alert, the system auto-creates a Jira ticket (target state) and appends any additional alerts on the same issue to that existing ticket, while the repeat alerts are muted. Our objective here was simple: why should we get alerted about the same issue multiple times? Either we haven’t framed the alert well, or it needs a fix. Auto-creating a ticket brings in accountability. You can miss an alert, but not an assigned ticket. Overall, this improved accountability and killed alert fatigue in the org.
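The create-once, append-after behaviour described above might look something like this. It is a sketch only: an in-memory dictionary stands in for Jira, and the `TicketTracker` class, the `OPS-` key prefix, and the issue fingerprint format are all invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Ticket:
    key: str
    fingerprint: str
    alerts: list = field(default_factory=list)

class TicketTracker:
    """In-memory stand-in for a ticketing system; no real Jira API is used."""

    def __init__(self) -> None:
        self._open = {}    # fingerprint -> open Ticket
        self._counter = 0

    def handle_critical(self, fingerprint: str, message: str):
        """Return (ticket, notify). notify is True only for the first alert on an issue."""
        ticket = self._open.get(fingerprint)
        notify = ticket is None
        if ticket is None:
            # First critical alert for this issue: open a new ticket.
            self._counter += 1
            ticket = Ticket(key=f"OPS-{self._counter}", fingerprint=fingerprint)
            self._open[fingerprint] = ticket
        # Every alert is recorded on the ticket, but repeats do not notify again.
        ticket.alerts.append(message)
        return ticket, notify

tracker = TicketTracker()
first, notify_first = tracker.handle_critical("wallet:error-rate", "error rate above baseline")
repeat, notify_repeat = tracker.handle_critical("wallet:error-rate", "error rate still above baseline")
print(first.key, notify_first, notify_repeat, len(first.alerts))
```

The fingerprint is what decides whether two alerts are "the same issue", so how it is built (service, metric, region, and so on) is the main design choice in a scheme like this.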
We train the system during the initial stage to build the data model to define the benchmark. A system can only learn if it’s being trained, and fed with ample data. As we spent the initial days modeling the right alerting, our implicit goal was to make it business-oriented, rather than just data-oriented.
We also streamlined our incident response. For example, since we defined the blast radius for each service, we were able to pinpoint the exact upstream and downstream degradations for a service by correlating events (target state). This helps us reduce mean time to detect (MTTD) and mean time to recover (MTTR) significantly.
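One simple way to compute a blast radius is to walk a service dependency graph. The sketch below assumes a hand-written dependency map with made-up service names; the post does not say how the blast radius is actually modelled.

```python
from collections import deque

# Hypothetical dependency map: each service lists the services it calls.
DEPENDS_ON = {
    "game-lobby": ["auth", "wallet"],
    "wallet": ["payments-db"],
    "auth": ["user-db"],
    "leaderboard": ["game-lobby"],
}

def blast_radius(failed: str, depends_on: dict) -> set:
    """Return every service that directly or transitively depends on the failed one."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    affected = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected

print(sorted(blast_radius("payments-db", DEPENDS_ON)))
```

During an incident, alerts from services inside the same blast radius can then be grouped under the failing dependency instead of being triaged one by one.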
Instrumenting change – Drive outcomes
A good system helps improve top-line (customer retention).
A good system recovers in seconds/minutes with auto-remediation ability.
A system should be correlated to its monetary efficacy. If my EMEA server goes down, how much money do I stand to lose in a day? How about 90 minutes a month? What are my monetary dependencies on a service?
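The questions above reduce to simple arithmetic once you have a revenue figure. The sketch below uses an invented hourly revenue and an invented traffic share for the affected region; only the "90 minutes a month" comes from the text.

```python
def downtime_cost(revenue_per_hour: float, downtime_minutes: float, traffic_share: float = 1.0) -> float:
    """Estimate revenue lost while a component serving `traffic_share` of traffic is down."""
    return revenue_per_hour * (downtime_minutes / 60.0) * traffic_share

def availability(downtime_minutes: float, period_days: int = 30) -> float:
    """Fraction of the period the service was up."""
    total_minutes = period_days * 24 * 60
    return 1.0 - downtime_minutes / total_minutes

# Hypothetical inputs: 10,000 in revenue per hour, 20% of traffic served from the affected region.
monthly_loss = downtime_cost(revenue_per_hour=10_000, downtime_minutes=90, traffic_share=0.2)
daily_loss = downtime_cost(revenue_per_hour=10_000, downtime_minutes=24 * 60, traffic_share=0.2)

print(f"90 minutes down in a month: ~{monthly_loss:,.0f} lost")
print(f"a full day down: ~{daily_loss:,.0f} lost")
print(f"availability with 90 minutes down in 30 days: {availability(90):.4%}")
```

A linear model like this ignores peak hours and the longer-term retention cost of a bad experience, so treat it as a lower-bound conversation starter with the business rather than a precise figure.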
Engineering orgs rarely frame value in such monetary terms. This is also one of my personal grouses with SREs today. In fact, we should call them Site Revenue Engineers. 😜 (Funnily, Nishant came up with the term, and I concur.) Framing the work this way ensures the rest of the organization and management value SREs, and reliability, as first-class contributors to how customers experience the brand.
Truth is, we struggle to showcase our own relevance in a growing org. By inculcating a habit of doing so, you're speaking directly to the business, and translating engineering value.
If my system spots an unusual surge in traffic at 9 pm at a particular location, say Ranchi, am I alerting the business about this change in pattern? What is the anomaly telling me about how my customers are using the app at this time?
Anomaly detection and correlations such as these help spot patterns and identify potential issues more proactively. These insights can help improve revenue, guide targeted marketing budget decisions, and more.
We want to get to a point where every alert in the org has an obvious potential cost correlation to the business. This helps our teams understand the value of the work being done. It also lets us highlight that value across the org and showcase the advantages of maintaining a robust infrastructure.
Our ability to highlight impact ensures our teams are solving first-class problems in the org. Quantifying this impact shall not only contribute to the business’ success but also help in laying a strong foundation. A culture of growth and innovation shall ensure the business thrives in the long run.
Good reliability tooling is not just about keeping the ship afloat; it’s also about pointing out the right course so the ship can be steered in the most efficient manner.
Sanjay Singh is the Head - DevSecOps at Games24x7. You can reach out to him on LinkedIn.