Reliability

Software Observability from the Lens of Radar and a Black Box
Observability is often a misunderstood and misused term. It has come to mean nothing and everything at this point. Read more on how Observability can be viewed from the lens of a Radar and a Black Box.
Nishant Modak

A case for Observability outside engineering teams
Observability is being built by engineers for engineers. In reality, o11y is for all.
Aniket Rao

How we tame High Cardinality by Sharding a stream
Using 'Sharding' to tame High Cardinality data for Levitate - Our Time Series Data Warehouse
Piyush Verma

High Cardinality for Dummies: ELI5
High Cardinality woes are frequent in today's cloud-native environments. What does it mean, and why is it such a pressing problem?
Mohan Dutt Parashar

MTTF vs MTBF vs MTTD vs MTTR
This article covers questions such as what are MTTF, MTBF, MTTD, and MTTR, their differences, how to adopt them, and their use cases.
Last9

The neglected tech arctic winter — Internal SaaS expenses
The current tech winter has a number of glaring stories. Cyclical as they may be, one truth has been glossed over more than the rest: the money spent on internal software tools to support tech infrastructure is bloated. And there’s nothing cyclical about this infrastructure spending.
Nishant Modak

Introducing Levitate: ‘uplifting’ your metrics woes because self-management sucks like gravity
Managing your own time series database is painful. We’ve moved from servers to services, and yet, monitoring metrics data is primitive. Our managed time series database powers mission-critical workloads for monitoring, at a fraction of the cost.
Nishant Modak

The difference between DevOps, SRE, and Platform Engineering
In reliability engineering, three concepts keep getting talked about - DevOps, SRE and Platform Engineering. How do they differ?
Prathamesh Sonpatki

India vs Pakistan, Site Reliability Engineering, and Shannon Limit
How does one ‘detect change’ in a complex infrastructure, so you don’t lose out on critical revenues — A short SRE story
Satyajeet Jadhav

Battling Alert Fatigue
What alert fatigue is, and techniques to reduce it
Last9

Why MTTR should be a ‘business’ metric
One of the many pitfalls of friction between engineering and business is the lack of fundamental measurements on the health of engineering. But how does business measure engineering efficacy, and how does engineering posit its standing to business?
Sidu Ponnappa

Sample vs Metrics vs Cardinality
When dealing with Time Series databases, I always got confused with Sample vs Metrics vs Cardinality. Here’s an explanation as I have understood it.
Piyush Verma

Reliability Tools
A guide through the most popular DevOps and SRE tools for building your reliability stack.
Abhi Puranam

Best Practices for Postmortems: A guide
The ins and outs of conducting an effective postmortem. Ready templates and examples from leading organizations around the world!
Prathamesh Sonpatki

Choosing Effective SLIs
Practical advice to choose an effective SLI.
Akshay Chugh

Deployment Readiness Checklists
A ready-made, comprehensive checklist of the steps and activities involved in deploying your application.
Prathamesh Sonpatki

The most interesting talks from SRECon 2021!
SRECon is a conference hosted by USENIX and is focused on site reliability, distributed systems, and systems engineering at scale. Learn about some of the most interesting talks from SRECon 2021.
Akshay Chugh

Doing SRE the Right Way!
A well-thought-out approach to SRE, which will help site reliability engineers and software engineers develop and maintain a useful, consistent, and effective SRE strategy for their products!
Piyush Verma

Getting the big picture with Log Analysis
How to get the most out of your logs!
Jayesh Bapu Ahire

Microservices - Tracking Dependencies
A quick primer on microservices architecture and the importance of tracking dependencies
Akshay Chugh, Jayesh Bapu Ahire

Components in Designing Effective SLOs
A primer on how to design and implement effective Service Level Objectives (SLOs)
Akshat Goyal

Sleep Friendly Alerting
We've all been woken up with that dreaded Slack notification at ungodly hours only to realise that the alert was all smoke and no fire. The perfect recipe for dread and alert fatigue.
Akshat Goyal