Aug 23rd, 2023 · 4 min read

A case for Observability outside engineering teams

Observability is being built by engineers for engineers. In reality, o11y is for all.


"???"

That's precisely what the computer screen flashed as the temperature in one of the reactor cores kept rising.

Three question marks. ???

The system designed to give feedback on the temperature could not accommodate the actual temperature in the reactor.

The actual temperature in the reactor was 4300 degrees.

The system was designed to only show a maximum of 700 degrees. That was the limit set by engineers.
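A minimal sketch (purely illustrative, not the actual reactor control software) of how a display designed around an assumed maximum range ends up rendering exactly the readings that matter most as question marks:

```python
# Illustrative sketch, not the real TMI control code: a display routine
# with a hard-coded upper bound. The constant and function names are
# assumptions for illustration.
DISPLAY_MAX = 700  # engineers assumed readings would never exceed this

def render_temperature(actual_temp: float) -> str:
    """Format a reading for the operator console."""
    if actual_temp > DISPLAY_MAX:
        # Out of the designed range: the operator learns nothing.
        return "???"
    return f"{actual_temp:.0f} degrees"

print(render_temperature(650))   # a normal reading
print(render_temperature(4300))  # the reading that mattered: "???"
```

The failure mode is that the system degrades silently at the edge of its designed range, replacing feedback with noise precisely when feedback is most critical.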

But as folks looked at the screen that day, nobody had a clue what the actual readings were.

If it had crossed 5000 degrees, the uranium core would’ve melted, and we would’ve witnessed a full-blown nuclear crisis that would’ve affected millions.

It’s 1979, and the lessons of the Three Mile Island accident resonate to this day. It’s a lesson in engineering, design, cognitive psychology, and above all… Observability (o11y).

Feedback is everything

Feedback tells us what’s wrong, what’s right, what to fix, and what to double up on.

At Last9, we obsess a lot over Sci-Fi, food, and… well… critical accidents. We’ve written about the Three Mile Island accident of 1979 before, discussing it through the prism of ‘Normal Accident’ theory. Here’s another take on the story, one that tells us about the importance of feedback.

Before that, a quick recap if you’re not so familiar with the Three Mile Island accident 👇

The incident happened in 1979. It was a partial meltdown of a nuclear reactor, and it’s considered the worst accident in U.S. commercial nuclear power plant history. The cleanup cost about $1 billion (equivalent to $2 billion in 2022) and is still ongoing: the decommissioning of one of the reactors is expected to be completed in 2079, at an estimated cost of $1.2 billion. Above all, it was a major setback to the growth of nuclear power, shaking the public’s confidence.

Yeah, this was a big deal.

For me, one of the biggest failures of the incident was the utter lack of feedback. In the SRE world, we call this “flying blind”. Without any feedback, the people in the reactor had little to no idea how to go about their job. And once panic set in, things only got worse.

Feedback is critical to building great products. We all take feedback routinely: from peers on a particular task, from friends while buying products, from e-commerce reviews and ratings, from restaurant reviews, and so on. Without adequate feedback, you’re flying blind.

If feedback is critical to successful teams, how are those teams being empowered within organizations?

The dependence on engineering = Flying blind

When an application fails, loads slowly, or hits any of the myriad problems applications face, the first team everyone turns to is engineering. This happens routinely in all organizations. Engineering teams face the heat of a system ‘downtime’ first-hand.

But engineers aren’t necessarily the first ones to become aware of a problem. Sometimes awareness comes from a customer calling out an issue on social media. Sometimes Customer Care representatives get calls from customers complaining that payments aren’t working or a screen isn’t loading. Awareness doesn’t always sit with engineering. But debugging always does. This need not be the case.

Business and Customer Support teams should be empowered with tools to:

  1. Identify system incidents
  2. Explore data

You know the age-old o11y adage: “I know 50% of my data is rubbish; I just don’t know which 50%.” This shouldn’t just be an engineering problem. Business teams should also have a stake in the rising costs of data storage.

We need more accountability for engineering, and less dependence on engineering from other teams.

Let’s understand these use cases for teams.

If Customer Support teams have better visibility into critical infra, they’re empowered to troubleshoot customer problems better. When an infuriated customer calls and asks why her payment is not going through, CS teams typically respond with, “We will get back to you in 24 hours.” That’s because they have to rely on engineers to ultimately answer the question. What if CS teams had tools to understand infra themselves?

There’s a cost to querying data and learning what users are doing. From patterns to behaviours, mining data is expensive and time-consuming. This dissuades teams from really exploring their data, learning how consumers use an application, and figuring out what can be done to better that experience.

💡 Observability has to move from the fiefdom of engineering to business, product, and customer teams - to explore data, understand consumers better, and pre-empt incidents. It’s not about latency and downtimes alone.

Typically, when a metric goes haywire (say your food delivery orders are down 7% in three core areas of SF), multiple teams are questioned about what’s going on. Everyone is flying blind. It takes days to get to the bottom of the metric, and the work usually sits with engineering. It should not. Even worse: only a handful of people with legacy knowledge are empowered to explore this data, from slow, high-latency dashboards. This should not be the case.
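A hedged sketch of the kind of check a business team could own directly: compare today’s order counts per area against a trailing baseline and flag large drops. The function, data shapes, and 7% threshold here are illustrative assumptions, not a real Last9 API.

```python
# Hypothetical drop-detection check a non-engineering team could own.
# All names and numbers are illustrative assumptions.
def flag_order_drops(today: dict, baseline: dict, threshold: float = 0.07) -> list:
    """Return areas whose order count dropped more than `threshold` vs. baseline."""
    flagged = []
    for area, count in today.items():
        base = baseline.get(area)
        if base and (base - count) / base > threshold:
            flagged.append(area)
    return flagged

today = {"SoMa": 840, "Mission": 910, "Marina": 400}
baseline = {"SoMa": 1000, "Mission": 930, "Marina": 430}
print(flag_order_drops(today, baseline))  # areas down more than 7%
```

The point isn’t the code: it’s that a rule this simple requires no legacy knowledge, so the team that owns the metric can own the alert.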

Observability is being built by engineers for engineers. In reality, o11y is for all. ✌️

End.


💡
The Last9 promise — We will reduce your TCO by about 50%. Our managed time series data warehouse, Levitate, comes with streaming aggregation, data tiering, and the ability to manage high cardinality. If this sounds interesting, talk to us.


Authors

Aniket Rao
