Modern monitoring systems depend heavily on ‘Alerting’ to reduce the Mean Time to Detect (MTTD) faulty systems. But, alerting hasn’t evolved to meet the demands of modern architectures. We’re changing that with Alert Studio.
At Last9, we read and talk a lot about system failures. These failures give us a glimpse of human thinking. We’ve spoken a lot about the Three Mile Island nuclear accident in 1979. It has innumerable learnings around designing safe systems.
For example: During an investigation of the incident, it was shown that there were 14 different meanings for a red light and 11 for green. Each light worked in isolation, without any context of how it fits into the system as a whole. Human cognition goes for a toss, and you have rampant tribal knowledge festering with select individuals in such scenarios.
To us, Alerting is in a similar vein. It’s chaotic, stuffed with unnecessary information, and not entirely actionable. We want to change this.
What’s not working with legacy Alerting
Three problems plague today's alerting journey: Coverage, Fatigue, and Reliability. And there are no easy answers to any of them.
Coverage: How many alerts do you set up for a microservice or a Kubernetes (k8s) cluster? What’s deemed necessary to make your system foolproof? How do you ensure the same issues don’t repeat themselves?
Fatigue: Incessant alerts that lack relevant context. Engineers eventually ignore alerts and miss the needle in the haystack.
Reliability: We have witnessed many incorrect configurations and missed evaluations at high loads which led to important alerts never being delivered when they matter. And alerting systems by themselves don’t offer accountability.
These three problems are rampant across the ecosystem. They make Alerting a hassle, not an enabler. Just as the smart engineers at Three Mile Island ignored many of their warnings, engineers today end up ignoring their alerts. But, above all, we think this is also very much a User Experience (UX) problem that needs to be tackled with design and breadth of information.
Without further ado, introducing…
💡 Alert Studio.
What is Last9’s Alert Studio?
Alert Studio is Last9’s end-to-end alerting tool conceptualized to tackle High Cardinality use cases. It comes with advanced capabilities such as Change Intelligence and Anomalous Pattern Detection, and is War-Room Ready. It’s specially designed to reduce alert fatigue and improve the Mean Time to Detect (MTTD) issues.
Here’s the kicker: Alert Studio is fully PromQL-compatible. It’s compatible with Prometheus Alertmanager, and you can migrate Prometheus-compatible alert rules with a single click.
Alert Studio is freely available with Levitate — our managed Time Series Data Warehouse. We’ve used years of experience dealing with system failures at scale in building Alert Studio. This has been a labor of love, and we’ve patiently addressed the 3 problems that plague the system: Coverage, Fatigue, and Reliability.
But these three problems only scratch the surface. These 3 issues are basic hygiene that was supposed to be there, regardless of the tool you select.
What makes Alert Studio different
There are about a dozen things I want to talk about. But, I will restrict these to 3 key reasons for brevity. I will address the more detailed nuances of Alert Studio in another post.
Reliable under high load: Alert Studio leverages Levitate’s cardinality optimization workflows to achieve low latencies at high cardinality. This ensures complex, multi-dimensional alert evaluations do not time out. These evaluation metrics are also readily available to the user, which guards against missed alerts and false negatives. Even if there are delays or failures, users will be notified of these inconsistencies.
Above all, it ensures accountability in the system by monitoring the performance of Alert Studio itself for every customer. This reinforces trust in alerting.
Change Intelligence: Alert Studio forces accountability on teams to have a better grip on disparate data across different verticals, thereby reducing costs. It empowers folks from engineering, product, and support to dissect historical data, thereby helping increase, and/or unlock new sources of revenue.
These come under 3 key features:
Changeboard.
Change events.
War-room view.
The Changeboard gives a correlated timeline of health across components of a distributed system. It allows the interpretation of alerts and alarms as part of a larger system context and helps triage alerts into urgent vs important. A Changeboard also groups the health status of the components by relevant dimensions like service (or geo-region, datacenter, user platform, etc.).
Change events capture changes such as deployments, configuration changes, and the start and stop of special calendar events, and correlate them with alerts. This helps eliminate tribal knowledge from the triaging, isolation, and recovery process.
The war-room view is a built-in view for situations where high MTTR impacts business. This is an important feature for cross-functional correlations across teams to spot the big picture. Having a live bird’s-eye view of the whole system helps teams expedite the analysis and decisions needed to recover from failures.
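To make the idea of correlating change events with alerts concrete, here is a minimal Python sketch. It is not how Alert Studio implements it; the `ChangeEvent` and `Alert` types, the `recent_changes` helper, and the 30-minute lookback window are all assumptions made up for illustration. It simply lists the changes on the same service that happened shortly before an alert fired.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:  # hypothetical type, for illustration only
    service: str
    kind: str  # e.g. "deployment" or "config_change"
    at: datetime


@dataclass(frozen=True)
class Alert:  # hypothetical type, for illustration only
    service: str
    name: str
    fired_at: datetime


def recent_changes(alert, events, window=timedelta(minutes=30)):
    """Return change events on the alert's service that happened within
    `window` before the alert fired, most recent first."""
    candidates = [
        e for e in events
        if e.service == alert.service
        and timedelta(0) <= alert.fired_at - e.at <= window
    ]
    return sorted(candidates, key=lambda e: e.at, reverse=True)


events = [
    ChangeEvent("checkout", "deployment", datetime(2023, 1, 10, 12, 0)),
    ChangeEvent("checkout", "config_change", datetime(2023, 1, 10, 12, 20)),
    ChangeEvent("search", "deployment", datetime(2023, 1, 10, 12, 25)),
    ChangeEvent("checkout", "deployment", datetime(2023, 1, 10, 9, 0)),
]
alert = Alert("checkout", "high_error_rate", datetime(2023, 1, 10, 12, 27))

# Changes that are plausible suspects for this alert.
suspects = recent_changes(alert, events)
for e in suspects:
    print(e.kind, e.at.isoformat())
```

Even a simple time-window join like this puts "what changed recently?" next to the alert, instead of leaving it in someone's head.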
Anomalous Pattern Detection: A good alerting tool should be able to spot repetitions and patterns. Tools these days are either limited to static thresholds for setting alert rules or use complex models that are opaque and make troubleshooting impossible. Neither is feasible in today’s cloud-native, complex environments.
Alert Studio can detect spike changes, loss of signal, baseline deviation, and show noisy neighbors. There’s a lot to unearth and talk about here. Read our documentation to get a proper glimpse of what we’re building.
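As a rough illustration of what checks like these look at, here is a small Python sketch that classifies the latest sample of a metric against its recent history. It is an assumption-laden toy, not Alert Studio's algorithm: the `classify` function, the `spike_ratio` and `z_limit` parameters, and their default values are all hypothetical.

```python
from statistics import mean, stdev


def classify(history, latest, spike_ratio=3.0, z_limit=3.0):
    """Classify the latest sample against recent history.

    history: list of recent numeric samples, oldest first.
    latest:  the newest sample, or None if no data point arrived.
    """
    if latest is None:
        return "loss_of_signal"
    if not history:
        return "ok"  # nothing to compare against yet
    previous = history[-1]
    # Spike: a sudden jump relative to the immediately preceding sample.
    if previous > 0 and latest / previous >= spike_ratio:
        return "spike"
    # Baseline deviation: latest is far from the recent mean (z-score).
    if len(history) >= 2:
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and abs(latest - mu) / sigma > z_limit:
            return "baseline_deviation"
    return "ok"


history = [100, 102, 98, 101, 99]
print(classify(history, None))  # no data point arrived
print(classify(history, 400))   # sudden jump
print(classify(history, 110))   # drifts away from the recent mean
print(classify(history, 101))   # within normal variation
```

Unlike a static threshold, every branch here is relative to the metric's own recent behavior, and unlike an opaque model, each verdict can be traced back to a single comparison.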
Over the coming months, I’m hoping this becomes a significant game-changer for folks pestered by incessant alerts.
The true beauty of Alert Studio is that it gives a unified view of metrics and other visualization dashboards, along with the corresponding status of components. All the important information is at your fingertips to make decisions. I’m excited to roll this out and improve legacy alerting systems that have plagued modern monitoring systems.
Alert Studio is also available as part of Levitate’s 1-month free trial. Check it out! Level up your alerting game, give your engineers the peace of mind they need to focus on your primary objectives 😉
We are hard at work bringing more useful capabilities to Alert Studio. Feel free to chat with us on our Discord or click here to stay updated on the latest and greatest across our monitoring stack.
You can also book a demo with us, or send us your feedback and suggestions. ✌️