
Back to the Future: The R-C-A of alerting

Dissecting the RCA of Alerting - Reliability, Correlations, Actionability

The last decade has witnessed a profound transformation in cloud technology and software monitoring frameworks. 

The evolution from rudimentary virtual machines to advanced orchestration with Docker and Kubernetes symbolizes a significant leap towards agile, DevOps-centric methodologies. This shift has expanded the scope of monitoring, incorporating an exhaustive array of tools covering metrics, logs, and traces. 

However, this broadening has introduced its own set of challenges. Chief among them is High Cardinality—the high-granularity data that effective monitoring now demands.

Engineering teams now face a deluge of data that overwhelms not just human cognition, but also the systems that attempt to process it.

The north-star metric of a monitoring system is Time to Recover. However, extracting effective operational insights from this data is a huge challenge, and existing tools are plagued with problems that keep teams from reducing their Mean Time to Detect (MTTD).

False positives, false negatives, and alert fatigue

Existing alerting tools are not built for the complexity and dynamism inherent in today’s cloud infrastructures. This frequently results in a barrage of irrelevant or false alerts.

Tools also often operate in isolated domains—be it infrastructure, application, or business monitoring—hindering a unified view, and leading to significant alert noise. 

This disjointed approach fails to provide a comprehensive picture. Teams inadvertently elevate every alert to high-priority status. The ensuing lack of contextual clarity and effective correlation forces DevOps teams into time-consuming, cross-team triage. Eventually, this results in alert fatigue and, ultimately, bloated MTTDs.

While working with some of our customers, we have also come across inaccuracies in popular alerting software and products, especially at high loads. 

Ironically, high-load situations are the precise scenario where you need to lean heavily on your alerting systems. 😣 

💡Alerts that did not fire are silent killers. 

Your alerting stack has to be the most reliable part of your infrastructure.

Based on the above observations across the industry, we decided to go back to the drawing board to build an alerting engine from the ground up. 

Our RCA gave rise to the core tenets on which our alerting tool is built. Funnily enough, the tenets that emerged are what we call the R-C-A tenets.

🚨 Introducing, Alert Studio 🚨

Alert Studio is Last9’s end-to-end alerting tool built to tackle High Cardinality use cases. It comes with advanced monitoring capabilities such as ‘Change Intelligence’ and ‘Anomalous Pattern Detection’, and is ‘war-room ready’. It’s specially designed to reduce alert fatigue and bring down the Mean Time to Detect (MTTD) issues. 

Here’s the kicker: Alert Studio is fully PromQL-compatible. It also works with Prometheus Alertmanager, so you can choose to migrate your existing Alertmanager configurations or your Grafana alerting setup.
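
To make this concrete, here is a minimal sketch of how you might sanity-check an existing PromQL alert expression against any Prometheus-compatible read endpoint before migrating it. It uses only the standard Prometheus HTTP API (not an Alert Studio API); the endpoint URL and the expression below are placeholders.

```python
# A minimal sketch: test an existing PromQL alert expression against a
# Prometheus-compatible read endpoint. The URL and expression are placeholders.
import requests

PROM_URL = "https://your-prometheus-compatible-endpoint/api/v1/query"  # placeholder
ALERT_EXPR = (
    'sum(rate(http_requests_total{status=~"5.."}[5m])) '
    '/ sum(rate(http_requests_total[5m])) > 0.05'
)

def expression_fires(expr: str) -> bool:
    """Run a PromQL expression via the standard Prometheus HTTP API and report
    whether it currently returns any series (i.e. the alert would fire now)."""
    resp = requests.get(PROM_URL, params={"query": expr}, timeout=10)
    resp.raise_for_status()
    payload = resp.json()
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return len(payload["data"]["result"]) > 0

if __name__ == "__main__":
    print("alert would fire:", expression_fires(ALERT_EXPR))
```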

Alert Studio combines these 3 key capabilities into one tool. It will be freely available with Levitate — our managed Time Series Data Warehouse.

I've written a post launching Alert Studio and our thinking behind it:

Launching Alert Studio | Last9
Modern monitoring systems depend heavily on ‘Alerting’ to reduce the Mean Time to Detect (MTTD) faulty systems. But, alerting hasn’t evolved to meet the demands of modern architectures. We’re changing that with Alert Studio.

Introducing the R-C-A tenets and Levitate Alert Studio

Any alerting system must have the “R-C-A” properties: Reliability, Correlation, and Actionability.

  • Reliability: An alerting system must be built to function reliably at scale. Only at scale do you encounter problems that truly impact your revenue. It’s also when most alerting tools falter and fail.
    • How we do this: With Alert Studio, the proof is always in front of your eyes. Alert Studio exposes the runtime metrics of its alert evaluation engine, giving you a real-time view of the volumes, errors, and delays of all your alert evaluations. You have absolute surety that a green dashboard really means your systems are fine.
  • Correlations: Alerts delivered and consumed in silos often cause more problems than they solve. For effective triaging, it’s important to see the alerts as part of a larger picture.
    • How we do this: Alert Studio ships with first-class support for correlating the alerts and health of components, grouped by the dimensions relevant to your teams (service, geo-region, and so on). The ability to quickly place alerts in the urgent/important matrix is crucial for fast mitigation of problems. In addition, every alert carries an implicit question — “What changed?”. Having correlated change-event information, like deployments or config changes, is a game changer when it comes to reducing MTTD.
  • Actionability: The industry is stuck in a conundrum. Static thresholds do not adapt well to varying traffic. Machine learning-based alerts are complex and lack accuracy. The middle ground that does work is anomalous pattern matching built into the query layer, which can be easily tested and tuned before deployment.
    • How we do this: Alert Studio configures your alerting system to detect sudden level changes or spikes in the signal. It’s as straightforward as setting a static threshold, and validating and troubleshooting these rules is equally simple (a sketch of what query-layer pattern detection can look like follows this list). In addition, Alert Studio has deep integrations with PagerDuty and Opsgenie, exposing rich metadata like degradation values and percentages to make your notifications and incident workflows more actionable.
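
To give a flavor of what “pattern detection in the query layer” can look like, here is an illustrative sketch in plain PromQL, expressed as Python strings. These expressions approximate spike detection with a rolling z-score; they are not Alert Studio’s actual detectors, and the metric name is a placeholder.

```python
# Illustrative only: approximating "pattern detection in the query layer" with
# plain PromQL. The metric name is a placeholder, not an Alert Studio detector.

# Static threshold: simple, but blind to traffic patterns.
static_threshold = 'sum(rate(checkout_errors_total[5m])) > 50'

# Spike detection via a rolling z-score (PromQL subqueries): flags sudden level
# changes relative to the last hour's behaviour, so the rule adapts to traffic.
spike_detection = (
    '( sum(rate(checkout_errors_total[5m]))'
    '  - avg_over_time(sum(rate(checkout_errors_total[5m]))[1h:5m]) )'
    ' / stddev_over_time(sum(rate(checkout_errors_total[5m]))[1h:5m]) > 3'
)

# Both are ordinary PromQL strings, so they can be validated against any
# Prometheus-compatible endpoint exactly like the migration check shown earlier.
```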

Features & Capabilities That Matter

We’ve carefully crafted capabilities and features geared toward making your monitoring trustworthy. Alert Studio is built around the R-C-A (Reliability, Correlations, Actionability) framework to give users a comprehensive monitoring suite. 

Here’s a quick glimpse of what that looks like:

  1. Reliability — Tame high load
  • Low read latency under High Cardinality
    • Alert evaluations should not time out.
  • Evaluation metrics available to the user:
    • The user is instantly made aware of problems in evaluation, like incorrect configuration
    • The user is instantly made aware if there are delays or failures in the evaluation of alert rules.
    • Users can have a high level of assurance that there are no missed alerts or false negatives.
  • Internal monitoring of alert evaluation performance
    • Last9 places a high importance on the reliability and correctness of the alerting system itself
  2. Correlation — Change Intelligence
  • Changeboards
    • Correlated timeline of health across all components of a distributed system
      • Allows interpretation of alerts and alarms as part of the larger system context
      • Helps triage alerts into urgent vs. important
      • Gives teams the insights needed to act fast on the mitigation of problems and reduce MTTR to single-digit minutes.
    • Group the health status of the components by relevant dimensions like service (or geo-region, data center, user platform, etc)
      • Organize insights by service
      • Get information on all components across application, infrastructure, and product KPIs for that service in a single view
      • Model information views as per your teams' boundaries of responsibility (Conway's law)
  • Change events
    • Capture change events like deployments, configuration changes, and the start/stop of special calendar events, and correlate them with alerts
    • Eliminate tribal knowledge from the triaging, isolation, and recovery process
    • Access to relevant change information in crunch situations drastically helps reduce MTTR
  • Unified dashboard view
    • Built-in view for war rooms and NOC teams for situations where high MTTR has a very high impact on business
    • A unified view of metrics, alerts, and change events gives users all the necessary information at their fingertips to make decisions fast
    • Act fast. Prevent component failures from cascading into system failures and downtime
  3. Actionability — Reduce alert fatigue
  • Pattern-matching algorithms instead of static thresholds
    • Detect spikes, dynamic level changes, noisy neighbors
    • Integrated into the query layer for easy testing, validation, and troubleshooting
  • Deep integrations with incident management tools like PagerDuty and Opsgenie
    • Rich metadata like degradation values, percentages, and semantic mapping labels
    • Enable complex escalation workflows
    • More relevant data to route the right alert, with the necessary context, to the right team members
  • Python CDK and Infrastructure as Code
    • Standardize alert configurations across services and teams
    • Programmatically configure alerts and notifications with a flexible CDK. Use standard programming constructs like inheritance and composition to bring standardization to the configuration (a hypothetical sketch follows below)
    • Leverage peer-review and GitOps processes to enforce hygiene, easily roll back, and manage alert configurations
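
To illustrate the alerts-as-code idea, here is a hypothetical sketch. It is not the actual Last9 Python CDK API; every class and field name is invented purely to show how inheritance and composition can standardize alert configuration across services.

```python
# Hypothetical sketch of alerts-as-code. This is NOT the Last9 Python CDK API;
# the classes and fields are made up to show inheritance-based standardization.
from dataclasses import dataclass, field

@dataclass
class AlertRule:
    """A generic, Prometheus-style alert rule."""
    name: str = ""
    expr: str = ""
    for_duration: str = "5m"
    labels: dict = field(default_factory=dict)

    def to_rule(self) -> dict:
        # Render into the familiar Prometheus rule shape, ready for review in Git.
        return {"alert": self.name, "expr": self.expr,
                "for": self.for_duration, "labels": self.labels}

@dataclass
class ErrorRateAlert(AlertRule):
    """A team-standard error-rate alert: the expression shape and severity live
    here once; each service only supplies its own name and threshold."""
    service: str = ""
    threshold: float = 0.05

    def __post_init__(self):
        self.name = f"{self.service}-high-error-rate"
        self.expr = (
            f'sum(rate(http_requests_total{{service="{self.service}",status=~"5.."}}[5m]))'
            f' / sum(rate(http_requests_total{{service="{self.service}"}}[5m]))'
            f' > {self.threshold}'
        )
        self.labels = {"severity": "page", "service": self.service}

# Each service is one declaration; the standard is defined in exactly one place.
rules = [ErrorRateAlert(service=s).to_rule() for s in ("checkout", "billing")]
```

Because each rule renders to a plain, reviewable structure, the usual pull-request and GitOps workflow applies: changes to alert thresholds are diffed, reviewed, and easy to roll back.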

In subsequent posts, I will flesh out the advantages of each of these tenets individually.

Alert Studio has a 1-month free trial. Check it out! Level up your alerting game and give your engineers the peace of mind they need to focus on your primary objectives 😉


We are hard at work bringing more useful capabilities to Alert Studio. Feel free to chat with us on our Discord or click here to stay updated on the latest and greatest across our monitoring stack.

You can also book a demo with us, or simply send us feedback and suggestions. ✌️
