SRE

All articles tagged 'SRE'

What is AI SRE? The Complete Guide to AI-Assisted Site Reliability Engineering

It's 2:47 AM. PagerDuty fires. You open a Slack alert and see: p99 latency spike on checkout-service. You SSH into the host, check dashboards in four tabs, grep logs for the last 20 minutes, and eventually find a slow query introduced in a deploy six hours ago. It took 34 minutes. You resolved it, w

Prathamesh Sonpatki

Apr 26, 2026

systemctl logs: A Guide to Managing Logs in Linux

Learn how to manage and view systemctl logs in Linux with this guide, covering essential commands and best practices for troubleshooting.

Faiz Shaikh

Dec 10, 2024

A Guide to Database Optimization for High Traffic

Learn how to optimize your database for high traffic, ensuring performance, scalability, and reliability under heavy load.

Prathamesh Sonpatki

Dec 9, 2024

SRECon EMEA 2024 - Day 1

Here’s a quick rundown of the standout talks, big ideas, and memorable moments that kicked things off in SRECon EMEA Dublin 2024!

Prathamesh Sonpatki

Oct 30, 2024

How We Cut Monitoring Costs and Deprecated Thanos at Replit

Winning Replit over by taming High Cardinality data and deprecating Thanos

Prathamesh Sonpatki

Jun 7, 2024

Cricket Scale e01 — Ashutosh Agrawal

Unpacking "Cricket Scale" with the person behind the scenes at JioCinema

Prathamesh Sonpatki

Mar 5, 2024

MTTF vs MTBF vs MTTD vs MTTR

This article covers questions such as what are MTTF, MTBF, MTTD, and MTTR, their differences, how to adopt them, and their use cases.

Last9

Apr 6, 2023

Recap of SRECon Americas 2023

SRECon is a conference hosted by USENIX and is focused on site reliability, distributed systems, and systems engineering at scale. A Recap of SRECon Americas 2023.

Last9

Mar 29, 2023

Introducing Levitate: Uplift Your Metrics Management

Managing time series databases is hard. We've evolved to services, yet monitoring lags. Our solution powers critical workloads at a lower cost.

Nishant Modak

Jan 11, 2023

The importance of structured communication in the world of SRE

How you communicate helps build your 9s. In the world of Site Reliability Engineering, this is crucial. How do you do it?

Saurabh Hirani

Dec 27, 2022

Thanos vs Cortex

In-depth comparison of Cortex and Thanos, what specifically they help teams do, challenges in implementing both, and how to think about what’s right for your team.

Sahil Khan

Dec 16, 2022

Static Threshold vs. Dynamic Threshold Alerting

What's the difference between Static Threshold and Dynamic Threshold Alerting? Do you really know when and how to use each threshold type?

Last9

Oct 18, 2022

Sample vs Metrics vs Cardinality

When dealing with Time Series databases, I always got confused with Sample vs Metrics vs Cardinality. Here’s an explanation as I have understood it.

Piyush Verma

Aug 22, 2022

Why Service Level Objectives?

Understanding how to measure the health of your service, the benefits of using SLOs, how to set compliances, and much more...

Piyush Verma

Mar 16, 2022

Best Practices for Postmortems: A guide

The ins and outs of conducting an effective postmortem. Ready templates and examples from leading organizations around the world!

Prathamesh Sonpatki

Mar 1, 2022

Choosing Effective SLIs

Practical advice to choose an effective SLI.

Akshay Chugh

Feb 25, 2022

The origin of Service Level Objectives

Service Level Objectives (SLOs) dominate the software industry, but where did they come from?

Akshay Chugh

Piyush Verma

Feb 21, 2022

Running a Database on EC2 is Slowing It Down

Learn everything about the advantages of EC2, its use cases, and how to optimize EC2 further.

Jayesh Bapu Ahire

Akshay Chugh

Feb 20, 2022

Deployment Readiness Checklists

A ready-made, comprehensive checklist of the steps and activities involved in deploying your application.

Prathamesh Sonpatki

Feb 19, 2022

The most interesting talks from SRECon 2021!

SRECon, hosted by USENIX, focuses on site reliability and systems engineering at scale. Discover highlights from the most interesting talks at SRECon 2021.

Akshay Chugh

Feb 15, 2022

Doing SRE the Right Way!

A well-thought-out approach to SRE, which will help site reliability engineers and software engineers develop and maintain a useful, consistent, and effective SRE strategy for their products!

Piyush Verma

Feb 11, 2022

Microservices - Tracking Dependencies

A quick primer on microservices architecture and the importance of tracking dependencies

Akshay Chugh

Jayesh Bapu Ahire

Feb 1, 2022

SLOs eased

You can either love running or hate running, but you will definitely love this analogy - take a fresh look at SLOs!

Piyush Verma

Saurabh Hirani

Jan 28, 2022

AWS security groups: canned answers and exploratory questions

While using a Terraform lifecycle rule, what do you do when you get a canned response from a security group?

Saurabh Hirani

Jul 1, 2021

If it ain't broke...

A Terraform lifecycle rule in the right place can help prevent a deadlock. But the same lifecycle rule in the wrong place?

Saurabh Hirani

Jul 1, 2021

mv aws-security-group shoot-foot

How can you run into unplanned downtime while making a seemingly harmless change such as renaming an AWS security group through Terraform?

Saurabh Hirani

Jul 1, 2021

Much That We Have Gotten Wrong About SRE

An illustrated summary of Developers ➡ DevOps ➡ SRE

Piyush Verma

Nov 18, 2020