Mar 19th, ‘24/3 min read

Everything in software monitoring is dead, apparently

Chasing shiny new toys, as always ;)

In the DevOps/SRE world, everything is dead, depending on who you ask and where you pick up your morning tea. 😛 Every week, there’s another tombstone piece announcing the death of DevOps, or SRE, or o11y, or monitoring, or ‘Platform’ Engineering, or whatever new thing crops up. It’s all dead. Then it gets resurrected. Then it gets a new spin, and a shiny new definition.

💡
2018-2021: Site Reliability is dead, embrace DevOps.
2021-2022: DevOps is dead, embrace Platform Engineering.
2023-TBD: Platform Engineering is dead, embrace AI-enabled, Reliability-focused two-pizza, 1 hamburger and 2 shakes teams.

What has really changed for the practitioners though?

You know, the ones in the trenches trying to figure out how to instrument their VM agent, or how to scale their Prometheus?

Zilch.

Buzzwords, definitions and nomenclature are plentiful in this industry. Practical help for folks struggling to get rid of their massive Datadog bills and tame their insane cardinality: well, let’s not talk about that, aye.

The marketing deluge around Monitoring

Don’t even get me started on research and advisory firms, and what they’re peddling. I read a report published in 2019 about what the future would hold in 2021; it was publicised as a 2022 phenomenon, and sold in 2023. We’re now in 2024, and none of what they envisaged is even remotely true.

Then, there are these vague ones that question my sanity. Favourite among them: ‘Applied’ Observability. It’s basically Observability, but it’s being ‘applied’; as in, used in practice. Because o11y by itself was not. Apparently. Errr… Ok then.

Of course, once the phrase was being shilled, everyone got into the act. You had the SEO folks jumping onto it, and then the mini-army of marketers calling it the nirvana that would suddenly mean your Wendy’s order is delivered to you when you merely dream of nuggets. Promises galore of the magic pill that unchained your nemesis.

Reams of this are being written and shoved at us, yet the reality of the Site Reliability Engineer is fundamental, and often mundane: ‘How do I instrument this TSDB’, ‘Why does this alert keep going off’, ‘Why does this query crash my dashboard’, et al. The simple things need fixes, not ‘applied’ o11y, whatever in the world that means.

Monitoring Systems is hard

Monitoring the performance of complex distributed systems is hard. It requires time, patience and effort. It requires an engineering mindset that’s not focused on glory. Business outcomes take a long time to show, and gratification is delayed.

💡 You’re most likely not going to be noticed if nothing breaks. And that’s what good Site Reliability Engineers do: stay invisible to the rest of the org.

We’re missing the basics of what it takes to be good at the job. Instrumenting metrics, improving coverage of monitoring systems, automating the usual repetitive tasks, fixing alert fatigue etc… A lot of this is boring, and frankly… annoying to do.

It’s also why I passionately advocate that orgs outsource the headaches of managing monitoring. Scaling a TSDB is a hassle in itself, let alone alerting, visualization and all the other pieces that need to come together.

Monitoring should be a vendor’s headache, so as an engineering org, one can focus on building the product.

It’s 2024, and if you’re reading this in 2025, I’m almost certain these problems will still persist. Because the marketing hype around monitoring has gobbled up much of the mindspace that should go to what it actually takes to build monitoring systems.

But, above all, I can’t wait to see what 2025 throws at us. My bet: “Artificial Intelligence enabled Reliability Engineering” 💀


If you want to understand how Levitate (our managed Time Series DataWarehouse) differs from the rest, and how it can manage what we call “Cricket Scale”, please feel free to DM me. You can also schedule a demo.

Authors

Aniket Rao

http://1x.engineer @last9io 💻 Programmer | 🌌 Astrophile | 🎮 FIFA Player |🏌️‍♂️Amateur Golfer
