Jul 11th, ‘23/4 min read

What Site Reliability Engineering Needs: A Swarm of Bees

If all companies are software companies, all companies need better Observability to understand how performative their software is

What Site Reliability Engineering Needs: A Swarm of Bees

Every company is a software company. That has to be one of the most boring and overused phrases. We say this often as practitioners in the industry, but I truly felt what this means during a recent visit to Singapore and Jakarta for SRECon APAC. I met a whole bunch of folks there and realized how reliant companies are on software.

I wrote about interesting talks from SRECon APAC.

I met a leader from a beverage company who was obsessed over his reliance on supply chain software to determine inventory spikes at select locations. I met a food entrepreneur trying to assemble his own fleet of driver partners to deliver food and wanted an app that geo-fences folks within a 4km radius. And the stories go on.

What struck me was the need for software, yet the little know-how on its boundaries and implications. But these are stories for another day.

One of my takeaways from these conversations was the increasing reliance from leaders on software and the importance of making meaning out of their data. To do this, you need data exploration. Without being able to query your data and be able to do it fast and cheaply, you’re merely collecting data and paying exorbitant storage costs.

What good is data if you’re not funneling it into business, empowering Data Science and Research teams to make decisions on features, and better shaping the product? All of this is impossible without understanding the boundaries under which software operates currently and how we can improve this.

With dependence on software to be performative, you need Observability to keep track of rogue behaviour. Good, or bad software is determined only by its measurement.

Without measuring how your software stack is performing, it’s impossible to gauge your work's efficacy. Ipso facto, all software companies need better Observability.

But how do you get better at building Observability?

As someone who has dabbled in this space for half a decade now, working with large companies and tiny startups, I have some strong opinions. Chief among them is how orgs are structured.

Here’s an observation related to this industry — companies with a technical founder at the helm tend to obsess over the little details on infra. And by virtue, costs become important. When you don’t have technical founders unless there’s a champion employee with high ownership bias, the tech stack is like throwing money at the problem instead of focusing on root issues.

Bear with me as I try to flesh out this thinking.

💡 Observability is about costs, and if you can’t guarantee a reduction in Total Costs of Ownership (TCO), you’re missing the biggest piece in the Reliability puzzle

Remember the adage, “No one ever got fired for buying IBM”? You’ll find that doctrine rampant in the Site Reliability Engineering space.

People work with what they’re familiar with.
Engineering leaders generally don’t tend to make bold, big tech stack decisions. Engineering leaders prefer the norm and seldom stray away from the status quo.
Engineering leaders don’t make critical cost decisions because these have a price to pay in the form of job security. (I’m wholly cognisant this is a large generalization, but over time, these patterns are becoming clearer in the little bubble of companies I’ve spoken with.) This is very different from how Technical founders work.

If this seems like a vast generalization, hold on, this is where there are enough companies who have what I call “Rogue Bees”. And my point of contention: We need more rogue bees. But what are these rogue bees?

The Criticality of Rogue Bees

Rogue bees are what keep our ecosystem alive in so many ways. What are they? A select bunch of bees tend to stray from the main hive, searching for new food and space.

This anomalous “rogue” behavior keeps existing bees alive and gives the rogue ones the opportunity to explore new territory in search of honey. Two things happen when rogue bees rebel against the norm:

  1. It saves the existing swarm; otherwise, they will only stick to one area and, in the process, eventually run out of food and die
  2. It helps pollinate crops in different areas and gives us flowering plants crucial to the planet's survival.

Without these rogue bees, we’d all be doomed. They are critical to the very survival of the human race.

Ok, but what does this mean for Site Reliability Engineering?

The SRE space is crying out for more rogue bees. Without rogue engineers in an organization, getting better at Observability would be hard.

If you’re still stuck with “my software is crashing” problems, you don’t have enough rogue engineers in the org. The industry is moving on to more nuanced problems around scale, latency, throughputs, and 5xx errors. App crashes should be a 2015 problem.

Cloud-native companies need to start dabbling in Artificial Intelligence and its applicability in solving some of these more nuanced problems.

I’ve been critical of AIOps and these other loosely banished terms. But given the strides made in this space in the last 6-months alone, there’s enough scope to get better at helping customers with minimal latency, among other issues.

But, without enough rogue engineers in an organization, it’s going to be a tall order to compete with peers who’re tackling exciting challenges for their customers.

I’ll wrap this rant with a conversation I had with my taxi driver in Singapore. I asked him why he prefers using x app over y. His answer, ‘The y app is always slow, and never picks-up the right location for a customer’s pick-up’.

Does your org have rogue engineers who want to tackle interesting challenges and obsess over costs? There’s never been a more critical time to ask that question. ✌️

💡
The Last9 promise — We will reduce your TCO by about 50%. Our managed time series database data warehouse, Levitate, comes with streaming aggregation, data tiering, and the ability to manage high cardinality. If this sounds interesting, talk to us.

Contents


Newsletter

Stay updated on the latest from Last9.

Authors

Aniket Rao

http://1x.engineer @last9io 💻 Programmer | 🌌 Astrophile | 🎮 FIFA Player |🏌️‍♂️Amateur Golfer

Handcrafted Related Posts