Observability is a practice, not a job

An excellent way to understand an organization’s engineering quality is to read its Observability (o11y) stack. Because if you have to deploy consistently, and ship new features, you need better o11y.

I use this as something of a framework while evaluating companies, advising folks who are taking up new jobs, and generally understanding the quality of engineering in an organization. I’m obviously biased here, but this is a short, quick post summarising how I think about o11y’s relevance and its standing in any organization.

The entropy of o11y

By definition, o11y gets stale fast in organizations that are moving fast. Let’s probe.

The Key Responsibilities of an engineer are measured by how fast they’re able to ship what’s on their plate. Of course, engineers need to own the uptime of their code.

Engineers are constantly tweaking the Product, taking feedback from Business, and shipping new features. The cycle never ends, and usually, startups are a great example to understand this.

When engineers are operating in these harsh software environments, entropy is a given. Systems change every week as the velocity of engineering picks steam. There’s only one way to mitigate this entropy; knowing when to revert the system to its previous state when things break.

The ‘knowing when to revert’ is what makes engineering agile. The more aware you are of a system and its underlying components, it gets easier to revert to a stable state. It’s how foundational engineering principles are built.

o11y as a mindset

The randomness in a system’s behavior means an organization must have o11y tools. Only when you have o11y as part of the Software Development Lifecycle, the org can move fast. Because without failback options, you’re hampering stability.

You have to view all o11y as a mindset change. It’s not a job you understand and know; it’s constant practice, learning, experience, and assimilation of context over extended periods of time.

The ability to ship fast has a cost. This cost is what o11y negates. Which puts all o11y in a simple container;

💡Engineering organizations that ship fast have Observability as part of their core DNA.

Reliability is a function of failures. The more 9s you achieve in your uptimes, the more reliable you get. o11y tools are the only way to make your system resilient. In a cultural context, it also forces engineers to write clean code, and own the uptime of their code. Because if you’re constantly not able to deploy, then as an engineer, you’re failing the org’s scale.

You’ll notice this pattern in orgs clearly; ones that don’t deploy fast enough will have poor o11y tooling. Good Site Reliability Engineers help teams move fast. It’s a simple telltale: if you have strong SREs, you’ll notice a better propensity for engineers to move fast.

💡

The Last9 promise — We will reduce your TCO by about 50%. Our managed time series ~~database~~ data warehouse, Last9, comes with streaming aggregation, data tiering, and the ability to manage high cardinality. If this sounds interesting, talk to us.

Observability is a practice, not a job

Contents

The entropy of o11y

o11y as a mindset

Contents

Start observing for free. No lock-in.

OpenTelemetry · Prometheus

Datadog · New Relic · Others

Built on Open Standards