When you’re a 5-person engineering team, observability tooling is the least of your concerns. But putting it on the back burner only prolongs the inevitable. Here’s a question I get asked often:
“At which point in the maturity of an engineering org should you focus on reliability tooling?”
“We’re flying blind for now, but diagnosing incidents is becoming harder — should I start considering tooling right away?”
I thought I’d write a whitepaper on how engineering organizations should build reliability tooling and what factors to consider. And here we are.
As an engineering organization scales, its metrics grow along three dimensions:
Retention
Unique metric types
Cardinality, or instances
Modern TSDBs have to scale along all three axes at once, driven by the rate of ingestion and the rate of exploration (querying). 💥
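A rough way to see why these axes compound rather than add is a back-of-the-envelope model. This is my own simplification, not a formula from the whitepaper: it assumes every combination of label values actually occurs (an upper bound), a fixed scrape interval, and ignores compression. The metric names and label counts below are hypothetical.

```python
from math import prod


def active_series(metrics: dict[str, list[int]]) -> int:
    """Upper-bound count of active series.

    `metrics` maps each metric name (unique metric types) to the number of
    distinct values for each of its labels (cardinality). Every combination
    is assumed to exist, so a metric contributes the product of its label
    counts; the total is the sum across metrics.
    """
    return sum(prod(label_counts) for label_counts in metrics.values())


def stored_samples(series: int, retention_days: float, scrape_interval_s: float) -> int:
    """Samples held on disk: series x samples per series over the retention window."""
    samples_per_series = retention_days * 86_400 / scrape_interval_s
    return int(series * samples_per_series)


if __name__ == "__main__":
    # Hypothetical small org: 20 services.
    small = {
        "http_requests_total": [20, 5, 4],  # services x methods x status classes
        "cpu_seconds_total": [20, 10],      # services x pods
    }
    # Same metrics after the org doubles its services.
    grown = {
        "http_requests_total": [40, 5, 4],
        "cpu_seconds_total": [40, 10],
    }
    s_small = active_series(small)
    s_grown = active_series(grown)
    print(s_small, stored_samples(s_small, retention_days=15, scrape_interval_s=15))
    # Doubling services AND retention quadruples stored samples.
    print(s_grown, stored_samples(s_grown, retention_days=30, scrape_interval_s=15))
```

With these made-up numbers the small setup has 600 series and about 52 million retained samples; doubling the number of services and the retention window together yields 1,200 series and roughly 207 million samples, four times as many. Adding a new metric type or a new label multiplies in the same way.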
So, how do you build reliability tooling across org tiers? We’ve classified organizations into three tiers based on the following factors:
👩‍💻 Engineers
😍 Customers
📀 Services
Based on this classification, it’s easy to understand needs across teams. One point stood out from my conversations: the need to query data spreads across more and more teams as organizations mature. So this is not only an engineering endeavor. Business, Finance, Product, and Customer Success teams have what it takes to bring your Prometheus down, and that has broad ramifications for any organization.
Without further ado, here’s a link to the whitepaper: 👇