A version of this story was first published on MoneyControl.
Engineering leaders have little to no visibility into the surplus costs of internal technology infrastructure and tooling. The lack of awareness around bloated infrastructure and its associated costs is staggering, and it underscores how much of the ‘technology’ under the hood is a source of avoidable expenditure.
Rolling this back, identifying blind spots, and slashing costs is a herculean task, albeit an inevitable one given the climate we find ourselves in. But where does one start? Or rather, how much does an engineering leader actually know about their costs, and how would they go about revisiting them? That’s harder to gauge than one might assume.
A typical unicorn startup will spend a minimum of $5,000,000 every year on tooling. (In conversations I’ve had, that’s a conservative number.) From cloud storage to monitoring and project management tools, the cost of running a startup at that scale grows disproportionately over time. A large chunk of this is what I think of as ‘Arctic’ tooling.
Arctic tooling is the software your business essentials are built on, but gaining visibility into its inherent costs is painful and complex. It’s not as simple as your Google Suite costs or internal Slack messaging costs. This is about AWS storage costs, the bread and butter of engineering. The thermal insulation for your arctic winter. Costs you simply can’t do away with (but can definitely reduce with engineering effort).
Understanding these costs and bringing them down takes painstaking effort. Couple that with a lack of awareness, and we have a pandemic of poor infrastructure and massive bills at the end of every month. You’ll see this chatter online: CTOs bemoaning their large AWS bills and how they simply have no clue what’s biting them. There’s a reason for this, and it’s poor ‘Observability’ practices.
What is Observability?
Observability is the measurement and attribution of performance in a complex software environment. ‘Performance’ comes in various shapes and sizes; latency is one example. In simple words, latency is the time a system takes to respond, such as the time taken for a mobile app to load.
The immediate one that comes to mind — One Time Password. Those agonizing 30+ seconds you have to wait for an OTP to arrive in your messages so you can complete a transaction — that’s a form of latency. The 2 seconds it takes for a food delivery app to populate available dishes in a restaurant — latency.
I know what you’re thinking: a two-second latency isn’t that bad! Well, think of gamers. Two seconds is an eternity when you’re playing a shooting game; even 100 milliseconds can make it unplayable. How does one fix this? Well, one first has to ‘observe’ that the phenomenon is happening to users. Then you have to dissect it: is it all users? Only users in a certain region? Maybe users with a certain kind of phone? And the list goes on…
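That ‘observe, then dissect’ loop can be sketched in a few lines. This is a toy illustration; the request records, segments, and latency numbers below are all hypothetical.

```python
from collections import defaultdict

# Hypothetical request samples: latency in milliseconds, tagged by segment.
requests = [
    {"region": "ap-south-1", "device": "android", "latency_ms": 180},
    {"region": "ap-south-1", "device": "ios", "latency_ms": 95},
    {"region": "us-east-1", "device": "android", "latency_ms": 2100},
    {"region": "us-east-1", "device": "android", "latency_ms": 1900},
    {"region": "us-east-1", "device": "ios", "latency_ms": 110},
]

# Slice latencies by region to see whether slowness is global or localized.
by_region = defaultdict(list)
for r in requests:
    by_region[r["region"]].append(r["latency_ms"])

for region, samples in by_region.items():
    print(region, "worst latency:", max(samples), "ms")
```

The same grouping, applied to device model, app version, or network type, is how you narrow down which users are actually hurting.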
Understanding the performance of software is complicated. Beyond latencies, we have concurrent users (think 30+ million people wanting to watch an India vs Pakistan cricket match), 5xx errors (when a system can’t fulfill a request), and so on. Again, the list is endless.
If we categorize these into RED metrics (Rate, Errors, Duration), we can gauge the key ones that impact a business. And mind you, these have a direct impact on revenue. A frustrated user might choose to book a cab or order food from another provider. In fact, I’d contend that latencies are the new downtimes. If it’s slow, a consumer will bounce and simply choose an alternative.
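As a sketch of what RED boils down to in practice, here is a toy computation over a minute’s worth of hypothetical request records (the record shape and numbers are made up for illustration):

```python
# RED method: Rate, Errors, Duration over a time window.
window_seconds = 60
requests = [
    {"status": 200, "duration_ms": 120},
    {"status": 200, "duration_ms": 340},
    {"status": 503, "duration_ms": 45},
    {"status": 200, "duration_ms": 2100},
]

rate = len(requests) / window_seconds                    # requests per second
errors = sum(1 for r in requests if r["status"] >= 500)  # 5xx count
durations = sorted(r["duration_ms"] for r in requests)
p50 = durations[len(durations) // 2]                     # rough median latency

print(f"rate={rate:.2f} rps, errors={errors}, p50={p50} ms")
```

Real systems compute these with a metrics backend rather than in application code, but the three numbers are the same.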
Fun fact: more than a decade ago, Amazon found that 100ms of extra latency cost them 1% in sales. And you know why Google is so fast? Because they found that an extra 500ms in showing search results cost them a 20% drop in traffic. Latencies are… hard problems.
The hidden C.O.S.T. of Instrumentation
Tech companies are losing staggering amounts of money to poor instrumentation. It’s an untold story because no one knows how deep the rabbit hole goes. Leaders are grappling with this phenomenon now as funding dries up and pressure mounts to control expenses.
These instrumentation costs fall under four key heads:
- Cardinality: An explosion of data; data you may not need but still pay for. The funny thing is, you don’t just pay extra for what you don’t use; you pay extra because of what you don’t use.
- Operations: Managing all this tech means talking to vendors, tracking their uptime, ensuring security and compliance, and, above all, making sure they have fallback mechanisms.
- Scale: As you acquire more users and start witnessing massive growth, you have to attend to the necessary distractions of scale. This is when your tech infrastructure has to grow to support more users and the time they spend on your application.
- Toil: All the effort to manage your Cardinality, Operations, and Scale manifests as Toil. This is the engineering overhead and the time allocated to instrumenting your overall infrastructure. It’s a never-ending hassle.
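To make the Cardinality point concrete: in a metrics system, the number of distinct time series you store (and pay for) is roughly the product of the distinct values of each label on a metric. A back-of-the-envelope sketch, with hypothetical label names and counts:

```python
from math import prod

# Hypothetical label cardinalities on a single request metric.
labels = {
    "endpoint": 50,         # API routes
    "status_code": 10,      # distinct HTTP statuses seen
    "region": 8,
    "customer_id": 10_000,  # a high-cardinality label sneaks in
}

# Worst-case distinct time series = product of label value counts.
series = prod(labels.values())
print(f"{series:,} distinct time series")  # 40,000,000
```

One careless label (here, the hypothetical `customer_id`) multiplies the bill by four orders of magnitude, which is exactly the ‘paying because of what you don’t use’ trap.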
Modern applications are like living organisms: they grow over time, sometimes at alarming rates. Their multiplication is hard to track, and while you’re focusing on the core business, you’re ignoring the underbelly that supports that business.
This is where I believe there are massive improvements to be made. Organizations need to go through the strenuous exercise of questioning their data storage costs, and purging data they don’t need. So much data is unnecessary, and it piles on costs for a business.
The long winter is here, and if you don’t want the inevitable frostbite, audit your entire internal SaaS environment.
Want to know more about Last9 and our products? Check out last9.io; we're building Reliability tools to make running systems at scale fun and embarrassingly easy. 🟢