Blog
Stories, guides, and lessons from the world of observability
Understanding “Cricket Scale”
How does a DevOps/Site Reliability Engineer plan for "Cricket scale"? How do you warm up systems that are about to see 30+ million concurrent users?


What is MTBI?
Everything you need to know about Mean Time Between Incidents (MTBI) and how it can help Site Reliability Engineers.


Reliability Engineering for Dummies: ELI5
Explaining Reliability Engineering to a 5-year-old.


SLA vs SLO vs SLI - What's the difference
SLAs, SLOs, and SLIs—what’s the difference? For DevOps folks, understanding these nuances is key. Here's a quick guide to each term.


Rethinking Anomaly Detection: Focus on business outcomes
From the trenches at Games24x7: Sanjay on how reliability engineering should drive core business metrics.


Interesting talks on Observability from FOSDEM 2023
A recap of the talks from the Observability and Monitoring devroom at FOSDEM 2023.


Comparing Popular Service Mesh Offerings
An in-depth look at several service mesh offerings, compared on features, licensing and pricing, architecture, and user experience.


Prometheus Monitoring
Prometheus is a popular open-source monitoring system. In this blog, we'll cover the basics of Prometheus monitoring, including its architecture, key features, and alternatives.


Observability is dead, long live observability
No tool can magically offer you 99.999%. Observability is largely about the basics. And basics are boring. But boring is hard. Boring is battle-tested.
