Observability

Learn all about observability and how it can transform your systems! Our blog posts cover key concepts, benefits, and practical tips to help you gain deeper insights and improve performance.

Why Cloud Security Monitoring is Crucial for Your Business

Cloud security monitoring is essential to protect data, ensure compliance, and safeguard against growing cyber threats in cloud environments.

Anjali Udasi

DNS Monitoring: Everything You Need to Know

DNS monitoring ensures your domain records are accurate, secure, and performing well, helping prevent outages and attacks.

Anjali Udasi

LLM Observability: Importance, Best Practices, and Steps

LLM observability is key to ensuring model performance. Learn its importance, best practices, and actionable steps for optimal results and reliability.

Anjali Udasi

MongoDB vs Elasticsearch: Key Differences Explained

Learn the key differences between MongoDB and Elasticsearch, and understand when to use each for your database and search needs.

Anjali Udasi

A Beginner's Guide to GCP Monitoring

Learn how to monitor and optimize your GCP resources effortlessly. Simplify performance tracking and keep your services running smoothly.

Prathamesh Sonpatki, Anjali Udasi

Fluentd vs Fluent Bit – A Comprehensive Overview

Fluentd vs Fluent Bit: Discover the key differences, use cases, and how to choose the right tool for your log processing needs.

Prathamesh Sonpatki, Anjali Udasi

Enhancing Observability with Fluent Bit and OpenTelemetry

Boost observability with Fluent Bit and OpenTelemetry! Collect, process, and export logs and metrics easily for smarter monitoring.

Prathamesh Sonpatki

Full-Stack Observability for Better Application Performance

Achieve better application performance with full-stack observability, gaining real-time insights to troubleshoot, optimize, and enhance user experience.

Anjali Udasi

A Complete Guide to Kubernetes Observability

Learn how to implement effective Kubernetes observability with metrics, logs, and traces to monitor and optimize your clusters at scale.

Prathamesh Sonpatki

Getting the Most Out of Tracing Tools for Observability

Maximize your observability with tracing tools to track requests, identify bottlenecks, and optimize system performance across services.

Anjali Udasi

Proactive Monitoring: What It Is, Why It Matters, & Use Cases

Proactive monitoring helps IT teams spot issues early, ensuring smooth operations, minimal disruptions, and a better user experience.

Anjali Udasi

OpenSearch vs. Elasticsearch: What’s the Real Difference?

OpenSearch and Elasticsearch are both powerful search engines, but OpenSearch offers an open-source alternative with community-driven development.

Anjali Udasi

Why Golden Signals Matter for Monitoring

Golden Signals—latency, traffic, error rate, and saturation—help SRE teams monitor system health and avoid costly performance issues.

Anjali Udasi
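To make the four signals concrete, here is a minimal Python sketch that derives each one from a small in-memory batch of requests. Several details are illustrative assumptions, not taken from the post or from any monitoring tool's API:

- the `Request` record and the `golden_signals` and `percentile` helpers;
- the nearest-rank p99 as the latency measure;
- counting 5xx responses as errors;
- treating a worker pool as the saturated resource.

In practice these values usually come from a metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record: how long it took and its HTTP status."""
    duration_ms: float
    status: int


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(requests, window_s, busy_workers, total_workers):
    """Compute the four golden signals for a single observation window."""
    if not requests:
        raise ValueError("no requests in window")
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        # Latency: 99th percentile request duration (nearest-rank method).
        "latency_p99_ms": percentile([r.duration_ms for r in requests], 99),
        # Traffic: requests per second over the window.
        "traffic_rps": len(requests) / window_s,
        # Error rate: fraction of requests that returned a 5xx status.
        "error_rate": errors / len(requests),
        # Saturation: share of a constrained resource (here a worker pool) in use.
        "saturation": busy_workers / total_workers,
    }


if __name__ == "__main__":
    batch = [Request(120, 200), Request(80, 200), Request(450, 503), Request(95, 200)]
    print(golden_signals(batch, window_s=60, busy_workers=6, total_workers=8))
```

Each signal answers a different question: how slow, how busy, how broken, and how full. That is why they are usually watched together rather than individually.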

Last9’s Single Pane for High Cardinality Observability

Last9’s Telemetry Warehouse now supports Logs and Traces, offering a unified view for high cardinality observability to simplify monitoring and troubleshooting.

Sahil Khan

How to Cut Down Amazon CloudWatch Costs

Check out these straightforward tips to manage your metrics and logs better. You can keep your monitoring effective while cutting down on costs!

Anjali Udasi

Prometheus Alternatives: Monitoring Tools You Should Know

What are the alternatives to Prometheus? A guide to comparing different Prometheus alternatives.

Gabriel Diaz

What is Prometheus Remote Write

Explore Prometheus Remote Write: scale your monitoring effortlessly. Learn how it works, its benefits, and top tips for cloud-native setups.

Prathamesh Sonpatki

Streaming Aggregation: Real-Time Data Processing in 2024

We break down the essentials of streaming aggregation and its impact on modern data processing.

Anjali Udasi

kube-state-metrics: Your Guide to Kubernetes Observability

This guide provides an in-depth look at kube-state-metrics setup and usage, helping you monitor and manage your Kubernetes clusters more efficiently.

Prathamesh Sonpatki, Anjali Udasi

2024's Best Cloud Monitoring Tools: Updated Insights

Get a detailed look at the top cloud monitoring tools of 2024. Compare leading solutions to understand their features and performance, helping you choose the best fit for your cloud infrastructure.

Anjali Udasi

Top Observability Best Practices for Microservices in 2024

Practical tips for monitoring, analyzing, and improving system performance.

Anjali Udasi

A Deep Dive into Log Aggregation Tools

The guide discusses the essential components, challenges, popular tools, and advanced techniques that define effective log aggregation.

Anjali Udasi

OpenTelemetry vs. Traditional APM Tools

This article explores OpenTelemetry vs. traditional APM tools, comparing their strengths, weaknesses, and use cases to help you choose wisely.

Anjali Udasi

The Anatomy of a Modern Observability System

This article breaks down the fundamentals, from data collection to analysis, to help you gain deeper insights into your applications.

Anjali Udasi

Observability vs. Telemetry vs. Monitoring

Observability is the continuous analysis of operational data; telemetry is the operational data that feeds that analysis; and monitoring acts like a radar for your system, watching everything about it and alerting when necessary.

Anjali Udasi

Think Data Warehouse, NOT Database.

The software monitoring world is broken because of the TSDB. We deserve a TSDW (Time Series Data Warehouse).

Aniket Rao

What is the OpenTelemetry Collector and How Does It Work?

The OpenTelemetry Collector simplifies data collection, processing, and export for metrics, logs, and traces. Learn about its architecture, deployment, and examples.

Prathamesh Sonpatki

What needs to change in software monitoring?

A wishlist of things that need to change in the world of software monitoring

Aniket Rao

Everything in software monitoring is dead, apparently

Chasing shiny new toys, as always ;)

Aniket Rao

Why your monitoring costs are high

If you want to bring down your monitoring costs, you need to break the decision paralysis in engineering.

Aniket Rao

Software Observability from the Lens of Radar and a Black Box

Observability is often a misunderstood and misused term. It has come to mean nothing and everything at this point. Read more on how Observability can be viewed through the lens of a Radar and a Black Box.

Nishant Modak

This arctic winter — time to repay your tech debt

We're in a peak tech winter. What should engineering teams focus on when product velocity dwindles?

Ajey Gore

Understanding the Rasmussen model for failures

What does the Rasmussen model teach us about Site Reliability Engineering?

Nishant Modak

How we tame High Cardinality by Sharding a stream

Using sharding to tame High Cardinality data in Levitate, our Time Series Data Warehouse.

Piyush Verma

OpenTelemetry for dummies: ELI5

What is OpenTelemetry? Why is it important? Do SREs need to adopt OTel? An Explain It Like I'm 5 (ELI5) guide.

Mohan Dutt Parashar

What Site Reliability Engineering Needs: A Swarm of Bees

If all companies are software companies, all companies need better Observability to understand how well their software performs.

Aniket Rao

QCon New York 2023 Recap

A recap of the QCon New York 2023 conference.

Prathamesh Sonpatki

What is High Cardinality

An overview of what high cardinality means in the context of monitoring with Prometheus and Grafana.

Prathamesh Sonpatki

What is OpenTelemetry

Learn what OpenTelemetry is: the open-source observability framework for collecting and processing telemetry data from applications and systems.

Last9

Observability is a practice, not a job

Engineering organizations that ship fast have Observability as part of their core DNA.

Aniket Rao

Metrics, Events, Logs, and Traces: Observability Essentials

Understanding Metrics, Logs, Events, and Traces: the key pillars of observability and their pros and cons for SRE and DevOps teams.

Prathamesh Sonpatki

SRE vs Platform Engineering

What's the difference between SREs and Platform Engineers? How do they differ in their daily tasks?

Last9

Streaming Aggregation vs Recording Rules

Streaming Aggregation and Recording Rules are two ways to tame High Cardinality. What are they? Why do we need them? How are they different?

Last9

Prometheus vs Datadog

A comparison of Prometheus and Datadog, two of the most popular monitoring tools on the market today.

Last9

SRE vs DevOps

What's the difference between SREs and DevOps professionals? How do they differ in their daily tasks?

Last9

High Cardinality for Dummies: ELI5

High Cardinality woes are common in today's cloud-native environments. What does it mean, and why is it such a pressing problem?

Mohan Dutt Parashar

Who should define Reliability — Engineering, or Product?

Whoever owns Reliability should define its parameters. But who owns the Reliability of a Product? Engineering? Product Management? Or the Customer Success team?

Piyush Verma

What do self-driving cars tell us about Site Reliability Engineering?

From Robocars to Reliability: looking at SRE through self-driving cars and mapping where the Observability space stands by comparison.

Mohan Dutt Parashar

Observability—OSS vs Paid vs Managed OSS

The Reliability industry needs a managed answer, free of vendor lock-in, to spiraling costs, high cardinality, and the toil of managing a TSDB.

Satyajeet Jadhav

High Cardinality? No Problem! Stream Aggregation FTW

Managing high cardinality in time series data is tough but crucial. Learn how Levitate’s streaming aggregations can help tackle it efficiently.

Piyush Verma

Recap of SRECon Americas 2023

SRECon is a conference hosted by USENIX and is focused on site reliability, distributed systems, and systems engineering at scale. A Recap of SRECon Americas 2023.

Last9

Understanding “Cricket Scale”

How does a DevOps/Site Reliability Engineer plan for "Cricket scale"? How do you warm up systems that are about to see 30+ million concurrent users?

Aniket Rao

Reliability Engineering for Dummies: ELI5

Explaining Reliability Engineering to a 5-year-old.

Mohan Dutt Parashar

Rethinking Anomaly Detection: Focus on business outcomes

From the trenches at Games24x7: Sanjay on how Reliability Engineering should drive core business metrics.

Sanjay Singh

Interesting talks on Observability from FOSDEM 2023

A recap of the talks from the Observability and Monitoring devroom at FOSDEM 2023.

Prathamesh Sonpatki

Observability is dead, long live observability

No tool can magically offer you 99.999% availability. Observability is largely about the basics. And basics are boring. But boring is hard. Boring is battle-tested.

Aniket Rao

Introducing Levitate: Uplift Your Metrics Management

Managing time series databases is hard. We've evolved to services, yet monitoring lags. Our solution powers critical workloads at a lower cost.

Nishant Modak

Best Practices Using and Writing Prometheus Exporters

This article will go over what Prometheus exporters are, how to properly find and utilize prebuilt exporters, and tips, examples, and considerations when building your own exporters.

Last9

The difference between DevOps, SRE, and Platform Engineering

In reliability engineering, three concepts keep coming up: DevOps, SRE, and Platform Engineering. How do they differ?

Prathamesh Sonpatki

How to improve Prometheus remote write performance at scale

A deep dive into improving Prometheus Remote Write performance at scale, based on real-life experience.

Saurabh Hirani

India vs Pakistan: SRE and the Shannon Limit

How do you ‘detect change’ in complex infrastructure so you don’t lose critical revenue? A short SRE story.

Satyajeet Jadhav

Challenges of Distributed Tracing

What are the challenges, benefits and use cases of distributed tracing?

Last9

Why MTTR should be a ‘business’ metric

A key challenge is aligning engineering health metrics with business goals. How can business measure engineering, and engineering show its value?

Sidu Ponnappa

Sample vs Metrics vs Cardinality

When dealing with time series databases, I always got confused by Sample vs Metrics vs Cardinality. Here’s an explanation as I have understood it.

Piyush Verma
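One common way to frame these terms follows the Prometheus data model:

- A metric is a name such as `http_requests_total`.
- Each unique combination of that name and its label values is a separate time series.
- A sample is a single timestamped value within a series.
- Cardinality is the number of distinct series.

The minimal Python sketch below counts each of these from a handful of data points. The `summarize` helper, metric names, labels, and values are all hypothetical. This is a general illustration, not the post's own example.

```python
from collections import defaultdict

# Hypothetical scraped points: (metric name, labels, timestamp in ms, value).
POINTS = [
    ("http_requests_total", {"method": "GET", "status": "200"}, 1000, 10.0),
    ("http_requests_total", {"method": "GET", "status": "200"}, 2000, 12.0),
    ("http_requests_total", {"method": "POST", "status": "500"}, 1000, 1.0),
    ("http_requests_total", {"method": "POST", "status": "500"}, 2000, 2.0),
    ("process_cpu_seconds_total", {}, 1000, 3.5),
]


def summarize(points):
    """Count metrics, time series, samples, and per-metric cardinality."""
    series = defaultdict(list)  # series key -> list of (timestamp, value) samples
    for name, labels, ts, value in points:
        # A series is identified by the metric name plus its full label set.
        key = (name, tuple(sorted(labels.items())))
        series[key].append((ts, value))

    cardinality = defaultdict(int)  # metric name -> number of distinct series
    for name, _labels in series:
        cardinality[name] += 1

    return {
        "metrics": len(cardinality),
        "series": len(series),
        "samples": sum(len(samples) for samples in series.values()),
        "cardinality": dict(cardinality),
    }


if __name__ == "__main__":
    print(summarize(POINTS))
```

Each new label value can combine with every value of the other labels. In the worst case, a metric's series count is therefore the product of the number of distinct values per label. That is why unbounded labels such as user IDs drive cardinality up so quickly.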

Comparing Popular Time Series Databases

A comparison of popular time series databases: Prometheus, InfluxDB, M3DB, and Levitate.

Abhi Puranam

Latency is the new downtime

In the early days of Google, a lot of users were asking for 30 results on the first page of search. So after long deliberation, Marissa Mayer, then the Product Manager for google.com, decided to run an A/B test of 10 vs. 30 results. When the results came in, the team was in for a surprise.

Sahil Khan

We’ve raised an $11M Series A led by Sequoia Capital India!

Exciting news! We've secured an $11M Series A funding round led by Sequoia Capital India to fuel our growth and innovation at Last9!

Nishant Modak

Why Service Level Objectives?

Understanding how to measure the health of your service, the benefits of using SLOs, how to set compliance targets, and much more...

Piyush Verma

The origin of Service Level Objectives

Service Level Objectives (SLOs) dominate the software industry, but where did they come from?

Akshay Chugh, Piyush Verma

Doing SRE the Right Way!

A well-thought-out approach to SRE, which will help site reliability engineers and software engineers develop and maintain a useful, consistent, and effective SRE strategy for their products!

Piyush Verma

SLOs eased

You may love running or hate it, but you will definitely love this analogy: take a fresh look at SLOs!

Piyush Verma, Saurabh Hirani

Latency SLO

How do you set latency-based alerts? A common approach is to require that 95% of requests complete within 350ms, but is it really that simple?

Piyush Verma
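The rule of thumb quoted above ("95% of requests complete within 350ms") can be written as a ratio check over a window of request durations. The minimal Python sketch below does that and also reports a simple error budget. The following are assumptions for illustration: the function name `evaluate_latency_slo`, the inclusive `<=` threshold, and a single fixed window. It is the simple version the blurb questions, not a complete answer.

```python
def evaluate_latency_slo(durations_ms, threshold_ms=350.0, target=0.95):
    """Evaluate an SLO of the form 'target share of requests finish within threshold_ms'."""
    if not durations_ms:
        raise ValueError("no requests in window")
    total = len(durations_ms)
    # Requests at or under the threshold count as "good" (inclusive, by assumption).
    fast = sum(1 for d in durations_ms if d <= threshold_ms)
    slow = total - fast
    # Error budget: how many slow requests the target tolerates in this window.
    allowed_slow = (1 - target) * total
    return {
        "good_ratio": fast / total,
        "met": fast / total >= target,
        "slow": slow,
        "budget_remaining": allowed_slow - slow,
    }


if __name__ == "__main__":
    window = [120.0] * 19 + [900.0]  # 19 fast requests and 1 slow one
    print(evaluate_latency_slo(window))
```

Expressing the objective as a ratio of good requests, rather than alerting on a single p95 value, gives a natural error budget. With a 95% target, one slow request in every twenty is still within budget.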

Services; not Server

Gone are the days of yore when we named our servers Etsy, Betsy, and Momo, fed them fish, and cleaned their poop.

Nishant Modak, Piyush Verma

Systems Observability

Observability is not just about being able to ask questions to your systems. It's also about getting those answers in minutes and not hours.

Nishant Modak, Piyush Verma