Blog
Stories, guides, and lessons from the world of observability
Follow us on X
Recap of SRECon Americas 2023
SRECon is a conference hosted by USENIX and is focused on site reliability, distributed systems, and systems engineering at scale. A Recap of SRECon Americas 2023.


Understanding “Cricket Scale”
How does a DevOps/Site Reliability Engineer plan for "Cricket scale"? How do you warm systems' about to witness 30+ million concurrent users?


What is MTBI?
Everything you need to know about Mean Time Between Incidents (MTBI) and how it can help Site Reliability Engineers


Reliability Engineering for Dummies: ELI5
Explaining Reliability Engineering to a 5-year-old.


SLA vs SLO vs SLI - What's the difference
SLAs, SLOs, and SLIs—what’s the difference? For DevOps folks, understanding these nuances is key. Here's a quick guide to each term.


Rethinking Anomaly Detection: Focus on business outcomes
From the trenches at Games24x7 — Sanjay, on how Reliability engineering should drive core business metrics


Interesting talks on Observability from Fosdem 2023
A recap of the talks from the Observability and Monitoring dev room at Fosdem 2023.


Comparing Popular Service Mesh Offerings
An in-depth look at several service mesh offerings and comparison based on their features, licensing and pricing, architecture, and user experience.


Prometheus Monitoring
Prometheus is a popular open-source monitoring system. In this blog, we'll cover the basics of Prometheus monitoring, including its architecture, key features, and alternatives.
