Last9 Engineering
Posts about code, practices and experience of building Last9

Real-Time Canary Deployment Tracking with Argo CD & Levitate
Use Levitate's powerful domain events to track the success of canary rollouts via Argo CD
Preeti Dewani

Monitor Google Cloud Functions using Pushgateway and Levitate
How to monitor serverless async jobs from Google Cloud Functions with Prometheus Pushgateway and Levitate using the push model
Aniket Rao

Golang Concurrency Masterclass by Swati Modi at Gophercon 2023
Talk on Golang Concurrency Masterclass by Swati Modi at Gophercon 2023
Last9

Do more with your metrics by Piyush Verma at GopherConIndia 2022
Piyush Verma's talk at GopherCon India 2022 on Do More with Your Metrics with Last9 and Levitate
Last9

Unwiring High Cardinality - SRE Day 2023
Report from SRE Day 2023, where Piyush Verma, CTO of Last9, gave a talk on Unwiring High Cardinality
Last9

Levitate - Last9’s managed TSDB is now available on the AWS Marketplace
Levitate - Last9's managed Prometheus Compatible TSDB is available on AWS Marketplace
Prathamesh Sonpatki

PromQL Macros in Levitate
Define PromQL Macros to standardize complex PromQL queries in Levitate
Prathamesh Sonpatki

GCP Managed Service For Prometheus vs. Levitate
A detailed comparison of Levitate and Google Managed Prometheus - Cost, Scale and Ease of Use
Prathamesh Sonpatki

A case for Observability outside engineering teams
Observability is being built by engineers for engineers. In reality, o11y is for all.
Aniket Rao

Understanding the Rasmussen model for failures
What does the Rasmussen model teach us about Site Reliability Engineering?
Nishant Modak

How we tame High Cardinality by Sharding a stream
Using 'Sharding' to tame High Cardinality data for Levitate - Our Time Series Data Warehouse
Piyush Verma

1979, a nuclear accident and SRE
Deep diving into the 'Normal accident' theory by Charles Perrow, and what it means for SREs
Aniket Rao

How we tame high cardinality in time series databases
Engineering innovation to solve high cardinality with Levitate - a multi-part series
Piyush Verma, Swati Modi

OpenTelemetry for dummies: ELI5
What is OpenTelemetry? Why is it important? Do SREs need to adopt OTel? An Explain It Like I'm 5.
Mohan Dutt Parashar

What Site Reliability Engineering needs — A swarm of rogue bees
If all companies are software companies, all companies need better Observability to understand how performant their software is
Aniket Rao

Take back control of your Monitoring
Take back control of your Monitoring with Levitate - a managed time series data warehouse
Nishant Modak

Observability is a practice, not a job
Engineering organizations that ship fast have Observability as part of their core DNA.
Aniket Rao

Streaming Aggregation vs Recording Rules
Streaming Aggregation and Recording Rules are two ways to tame High Cardinality. What are they? Why do we need them? How are they different?
Last9

Using a Golang package in Python using Gopy
Using Golang package in Python using Gopy: A simple way to leverage the power of Golang packages in Python applications.
Arjun Mahishi

High Cardinality for Dummies: ELI5
High Cardinality woes are far & frequent in today's modern cloud-native environment. What does it mean, & why is it such a pressing problem?
Mohan Dutt Parashar

Who should define Reliability — Engineering, or Product?
Whoever owns Reliability should define its parameters. But who owns the Reliability of a Product? Engineering? Product Management? Or the Customer success team?
Piyush Verma

Observability—OSS vs Paid vs Managed OSS
The Reliability industry needs a managed, non-vendor lock-in answer to spiraling costs, high cardinality and the toil of managing a tsdb
Satyajeet Jadhav

Learnings integrating jmxtrans
JMX metrics give solid insights into the workings of your application. Integrating them with Levitate (our time series data warehouse) required us to jump through some hoops with vmagent.
Saurabh Hirani

The neglected tech arctic winter — Internal SaaS expenses
The current tech winter has a number of glaring stories — cyclical as they may be, there’s one truth that’s been gleaned over more than the rest; the money spent on internal software tools to support tech infrastructure is bloated. And there’s nothing cyclical about this infrastructure spending.
Nishant Modak

Understanding “Cricket Scale”
How does a DevOps/Site Reliability Engineer plan for "Cricket scale"? How do you warm up systems about to witness 30+ million concurrent users?
Aniket Rao

What is MTBI?
Everything you need to know about Mean Time Between Incidents (MTBI) and how it can help Site Reliability Engineers
Last9

Rethinking Anomaly Detection: Focus on business outcomes
From the trenches at Games24x7 — Sanjay, on how Reliability engineering should drive core business metrics
Sanjay Singh

Observability is dead, long live observability
No tool can magically offer you 99.999s. Observability is largely about the basics. And basics are boring. But, boring is hard. Boring is battle tested.
Aniket Rao

Self-managed Prometheus vs Managed Prometheus
What are the differences between Self-managed Prometheus and Managed Prometheus? How do you choose what works for you?
Last9

The importance of structured communication in the world of SRE
How you communicate helps build your 9s. In the world of Site Reliability Engineering, this is crucial. How do you do it?
Saurabh Hirani

The difference between DevOps, SRE, and Platform Engineering
In reliability engineering, three concepts keep getting talked about - DevOps, SRE and Platform Engineering. How do they differ?
Prathamesh Sonpatki

Golang's Stringer tool
Learn how to use, extend and auto-generate code with Golang's Stringer tool
Arjun Mahishi

How to improve Prometheus remote write performance at scale
Deep dive into how to improve the performance of Prometheus Remote Write at Scale based on real-life experiences
Saurabh Hirani

Prometheus vs InfluxDB
What are the differences between Prometheus and InfluxDB - use cases, challenges, advantages and how you should go about choosing the right tsdb
Last9

India vs Pakistan, Site Reliability Engineering, and Shannon Limit
How does one ‘detect change’ in a complex infrastructure, so you don’t lose out on critical revenues — A short SRE story
Satyajeet Jadhav

Battling Alert Fatigue
What Alert Fatigue is, and techniques to reduce it
Last9

Guide to Service Level Indicators and Setting Service Level Objectives
A guide to setting practical Service Level Objectives (SLOs) & Service Level Indicators (SLIs) for your Site Reliability Engineering practices.
Last9

Kubernetes Monitoring with Prometheus and Grafana
A guide to help you implement Prometheus and Grafana in your Kubernetes cluster
Last9

Why we auto-delete Slack messages - killing tribal knowledge at Last9
At Last9, we auto-delete Slack messages after 2 days on all personal Direct Messages. These retention policies force teams to improve documentation, kill tribal knowledge and drive accountability for mistakes and errors.
Nishant Modak

Static Threshold vs. Dynamic Threshold Alerting
What's the difference between Static Threshold and Dynamic Threshold Alerting? Do you really know when and how to use each threshold type?
Last9

How we won Dukaan over
5 meetings. 1 month. From introductions, to a demo, and ultimately winning Dukaan over. Subhash and his team’s velocity on decision-making, moving fast, and radical candor, is a breath of fresh air in the Indian startup ecosystem.
Aniket Rao

How to restart Kubernetes Pods with kubectl
This query keeps popping up, so we decided to write a simple reckoner on how to restart a Kubernetes pod with kubectl
Last9

How to calculate HTTP content-length metrics on cli
A simple guide to crunch numbers for understanding overall HTTP content length metrics.
Saurabh Hirani

Choosing Effective SLIs
Practical advice to choose an effective SLI.
Akshay Chugh

Running a Database on EC2 is Slowing It Down
Learn everything about the advantages of EC2, its use cases and how to optimize EC2 further.
Jayesh Bapu Ahire, Akshay Chugh

Doing SRE the Right Way!
A well-thought-out approach to SRE, which will help site reliability engineers and software engineers develop and maintain a useful, consistent, and effective SRE strategy for their products!
Piyush Verma

Microservices - Tracking Dependencies
Quick primer into microservices architecture and the importance of tracking dependencies
Akshay Chugh, Jayesh Bapu Ahire

SLOs eased
You can either love running or hate running, but you will definitely love this analogy - take a fresh look at SLOs!
Piyush Verma, Saurabh Hirani

Rescuing a SPAghetti React project
I gave a talk at react.geekle.us [https://react.geekle.us] today about improving the reliability of our React app. Here are the slides of that talk. Here is the transcript of the talk. Hello all, my name is Prathamesh Sonpatki. I work at Last9 [https://last9.io] building a world-class operational intelligence platform for SREs. The Last9 p
Prathamesh Sonpatki

One year at Last9
I completed one year at Last9 today. When I joined Last9 on April 20, 2020, last year, I was unsure how it would pan out. I only knew Nishant and Piyush - founders of Last9 from the Pune tech community. But I had never worked with them before. I was also unsure about the product Last9 was building. I had never worked in the SRE domain. I didn't know anything about the problems the SREs of the programming world face. I was also a Rails developer, having worked primarily on Ruby on Rails for 6-7
Prathamesh Sonpatki