Recently, our Developer Evangelist Prathamesh Sonpatki gave a talk at a ClickHouse meetup titled "Less War, More Room: Confessions of a Reformed Alert Hoarder." As he described the all-too-familiar "3 AM War Room Bingo," there were knowing nods throughout the audience.
His presentation sparked further conversations with customers that I wanted to share as we explore how observability challenges impact teams across the industry.
Listen to Prathamesh's talk from the ClickHouse meetup
The 3 AM War Room Reality
Prathamesh opened his talk with a scenario all engineering teams have lived through: the dreaded 3 AM alert.
First, you dismiss it as "probably just a blip." Then, as more alerts arrive, you find yourself asking, "What changed?" Soon, you're running increasingly desperate commands while Slack notifications pile up.
By 3:30 AM, five people are on the emergency call, and someone inevitably suggests that Kubernetes is either the problem or the solution (depending on your current infrastructure). This isn't just an inconvenient awakening – it's a signal that your observability approach has fundamental gaps.
Descending into "Patal Log"
"Patal Log" perfectly captures the logging hell that teams unwittingly create for themselves. For those unfamiliar, it's a wordplay on Patal Lok, referring to the netherworld or underworld in South Asian cosmology.
Even in conversations with folks after the talk, I heard the same pain described repeatedly, just as we have so many times before – the gap between logging ideals and reality:
Log Everything (Ideal):
- Structured, consistent logging across services
- High-cardinality data that adds meaningful context
- Events that tell a coherent story about system behavior
Log Anything (Reality):
- Random `console.log("here")` statements scattered throughout codebases
- Unstructured text that's nearly impossible to parse
- Thoughtless severity levels that create noise
This divide between what you actually need and what you end up with invariably leads to future you cursing past you at 3 AM, when you can't find what you need in that underworld of meaningless log entries.
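To make the divide concrete, here's a minimal TypeScript sketch. The `logEvent` helper and the field names (service, tenant_id, order_id) are illustrative, not a prescribed schema; the point is the difference between a bare breadcrumb and a structured event that carries context.

```typescript
// "Log anything": a breadcrumb only its author understands, and only for a week.
console.log("here");

// "Log everything": a structured event with consistent fields and high-cardinality
// context, so future-you can filter and correlate at 3 AM. Field names are illustrative.
function logEvent(event: {
  level: "debug" | "info" | "warn" | "error";
  message: string;
  [attribute: string]: unknown;
}): void {
  console.log(JSON.stringify({ timestamp: new Date().toISOString(), ...event }));
}

logEvent({
  level: "warn",
  message: "payment provider retry",
  service: "checkout",
  tenant_id: "acme-co",
  order_id: "ord_8123",
  attempt: 3,
});
```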
The Observability Gap
Prathamesh highlighted how this gap is not just reflected but amplified in the distance between current observability practices and what teams actually need.
Current Reality:
- Multiple dashboards across different tools
- Telemetry data scattered across systems
- Configuration sprawl that no one fully understands
- Alert fatigue from noisy notifications
What Teams Actually Need:
- A single source of truth for operational data
- Unified telemetry across metrics, logs, and traces
- Centralized control over data processing
- Clear signal that cuts through the noise
Teams end up with different monitoring tools for various reasons — and many times, it's also because they're optimizing for cost and splitting telemetry across tools based on criticality.
This proliferation of tools without integration creates what Prathamesh aptly called "operation silos": so many tools, and yet teams still struggle to quickly identify the root cause during incidents.

Breaking the Alert Hoarding Cycle
What became clear from Prathamesh's talk is that most teams are stuck in a vicious cycle of alert hoarding: insufficient visibility leads to more alerts, which create more noise, which makes underlying issues harder to see, which leads to even more alerts as compensation.
We've observed this pattern repeatedly across organizations of all sizes. They know their observability is broken but don't see a practical path forward that doesn't involve rewriting applications or undergoing massive organizational change.
Technology Foundations Make a Difference
At Last9, our telemetry data platform leverages technologies like OpenTelemetry and ClickHouse as part of its foundation. Prathamesh touched on these technologies in his talk, particularly in the context of the ClickHouse meetup.
OpenTelemetry's standardization capabilities and ClickHouse's performance characteristics offer powerful building blocks for modern observability. OTel adoption has been ramping up rapidly and is now the second most active CNCF project — it allows teams to be vendor-neutral, brings in standardization by using the same agent across sources, and enables telemetry correlation.
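As a rough sketch of what that standardization looks like in application code, here's how a Node.js service might emit a span with the OpenTelemetry API. The service name, attribute keys, and the chargeCard function are illustrative assumptions, and a real setup would also wire up an SDK and exporter.

```typescript
// Sketch only: assumes @opentelemetry/api is installed and an SDK + exporter are
// configured elsewhere. Names ("checkout", "tenant.id", chargeCard) are illustrative.
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("checkout");

async function chargeCard(orderId: string, tenantId: string): Promise<void> {
  await tracer.startActiveSpan("charge-card", async (span) => {
    // Attributes become queryable, correlatable telemetry instead of free-text logs.
    span.setAttribute("order.id", orderId);
    span.setAttribute("tenant.id", tenantId);
    try {
      // ... call the payment provider here ...
      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      // The same instrumentation works no matter which backend receives the data.
      span.end();
    }
  });
}
```

Because the instrumentation is vendor-neutral, pointing the data at a different backend is an exporter change, not an application change.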
ClickHouse, used by teams at Cloudflare, Spotify, Lyft, and more, is one of the best data stores when it comes to speed and performance per dollar. Its engines for handling different telemetry types, control over schemas, and native SQL support make it a great option for bringing metrics, logs, and traces into one place.
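To give a flavor of the SQL side, here's a hedged sketch using the official Node.js client (@clickhouse/client). The otel_logs table and its columns are assumptions about a typical OTel-style log schema, not a fixed layout.

```typescript
// Sketch only: assumes a local ClickHouse server and a hypothetical OTel-style
// logs table (otel_logs) with ServiceName / SeverityText / Timestamp columns.
import { createClient } from "@clickhouse/client";

async function errorCountsByService(): Promise<void> {
  const client = createClient(); // defaults to http://localhost:8123
  const result = await client.query({
    query: `
      SELECT ServiceName, count() AS errors
      FROM otel_logs
      WHERE SeverityText = 'ERROR'
        AND Timestamp > now() - INTERVAL 1 HOUR
      GROUP BY ServiceName
      ORDER BY errors DESC
      LIMIT 10
    `,
    format: "JSONEachRow",
  });
  console.table(await result.json()); // plain SQL over logs, no bespoke query DSL
  await client.close();
}
```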
However, as we've seen with customers, these technologies alone don't solve the alert hoarding problem if teams still can't easily process and transform their existing telemetry data.
The Control Plane: Where Transformation Happens
Connecting back to "what teams actually need," the most engaging portion of Prathamesh's talk focused on the concept of a Control Plane for observability data — the layer that enhances, routes, and processes telemetry in transit without requiring application or instrumentation changes.
In subsequent customer discussions, I've repeatedly heard how transformative this approach has been. Here are the Last9 Control Plane capabilities that teams have found most valuable:
Extract & Remap
A media streaming customer recently shared how they transformed and standardized their CDN logs by extracting tenant IDs from request paths of one source and query parameters of another at the Control Plane level.
This made tenant-specific monitoring possible without modifying their CDN configuration or application code, allowing them to identify and address customer-specific issues before they became widespread.
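In the Control Plane this is configuration rather than code you deploy, but conceptually the remap amounts to something like the sketch below. The CdnLogRecord shape, the path layout, and the query parameter are hypothetical.

```typescript
// Conceptual sketch of an extract-and-remap step on CDN access logs. This is not
// Last9's API; it illustrates the transform applied in transit.
interface CdnLogRecord {
  source: "cdn-a" | "cdn-b";
  url: string;
  attributes: Record<string, string>;
}

function extractTenantId(record: CdnLogRecord): CdnLogRecord {
  const url = new URL(record.url, "https://placeholder.invalid");
  let tenantId: string | undefined;

  if (record.source === "cdn-a") {
    // cdn-a encodes the tenant in the request path, e.g. /t/acme-co/assets/logo.png
    tenantId = url.pathname.split("/")[2];
  } else {
    // cdn-b passes it as a query parameter, e.g. /assets/logo.png?tenant=acme-co
    tenantId = url.searchParams.get("tenant") ?? undefined;
  }

  return tenantId
    ? { ...record, attributes: { ...record.attributes, tenant_id: tenantId } }
    : record;
}
```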
Drop & Filter
Another customer discovered that a significant share of their logging volume consisted of DEBUG-level logs that were rarely queried but were costing them heavily in storage and processing.
By implementing filtering at the control plane, they maintained the ability to re-enable these logs when needed while dramatically reducing their baseline costs and noise.
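Again, this is expressed as a rule in the Control Plane rather than code, but the logic boils down to a predicate like the hypothetical one below, with a toggle so DEBUG logs can be re-enabled for a service while it's under investigation.

```typescript
// Conceptual sketch of a drop/filter rule, not Last9's actual configuration.
// DEBUG records are dropped in transit by default, but can be re-enabled per service.
interface LogRecord {
  service: string;
  severity: "DEBUG" | "INFO" | "WARN" | "ERROR";
  body: string;
}

const debugEnabledFor = new Set<string>(); // e.g. debugEnabledFor.add("checkout")

function shouldKeep(record: LogRecord): boolean {
  if (record.severity !== "DEBUG") return true;
  return debugEnabledFor.has(record.service);
}

// Applied before storage, so dropped records never add cost or noise downstream.
function filterBatch(batch: LogRecord[]): LogRecord[] {
  return batch.filter(shouldKeep);
}
```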
Forward & Rehydrate
Compliance requirements often force teams to retain certain logs for years. For one customer, their previous observability solution made this prohibitively expensive.
The ability to automatically forward specific data to cold storage while maintaining the option to rehydrate it when needed allowed them to meet compliance requirements without compromising their operational visibility or budget.
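Conceptually, the routing decision is small; the sketch below is only an illustration of the idea. The "retention.class" attribute and the tier names are assumptions, not Last9's configuration.

```typescript
// Conceptual routing sketch, not Last9's API: compliance-tagged records go to cheap
// cold object storage; everything else stays in the hot, queryable tier.
type StorageTier = "hot" | "cold-archive";

interface RoutableRecord {
  attributes: Record<string, string>;
}

function routeForRetention(record: RoutableRecord): StorageTier {
  return record.attributes["retention.class"] === "compliance" ? "cold-archive" : "hot";
}

// Rehydration is the reverse path: an archived time range is pulled back into the
// hot tier only when an audit or investigation actually needs to query it.
```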
Context & Correlation
Perhaps the most powerful capability, as Prathamesh emphasized, is seeing "what changed" alongside symptoms. A customer shared that before implementing a control plane approach, their average incident resolution time was 97 minutes.
After gaining the ability to standardize telemetry and extract its attributes with Last9, and to correlate metrics with Change Events such as deployments, configuration changes, and infrastructure scaling, they reduced that to 24 minutes: a 75% improvement.
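The correlation itself is conceptually simple: line the alert up against the change events that landed just before it. The sketch below is illustrative only; the ChangeEvent shape and the 30-minute lookback window are assumptions, not how Last9 implements it.

```typescript
// Conceptual sketch: given an alert timestamp, surface change events (deploys,
// config changes, scaling actions) that landed shortly before it.
interface ChangeEvent {
  kind: "deployment" | "config_change" | "infra_scaling";
  service: string;
  at: Date;
  description: string;
}

function whatChanged(
  alertAt: Date,
  events: ChangeEvent[],
  lookbackMinutes = 30
): ChangeEvent[] {
  const windowStart = alertAt.getTime() - lookbackMinutes * 60_000;
  return events
    .filter((e) => e.at.getTime() >= windowStart && e.at.getTime() <= alertAt.getTime())
    .sort((a, b) => b.at.getTime() - a.at.getTime()); // most recent change first
}
```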
Meeting Teams Where They Are
Never underestimate how important workflow compatibility is to successful observability. The best technical solution fails if it doesn't fit how teams actually work.
Take this example of a typical customer with three distinct technical teams:
- Their SRE team lives in Grafana dashboards they'd built over the years
- Their backend team prefers SQL-based analysis
- Their frontend team thinks in terms of user journeys and request flows
Without the flexibility to support each of these approaches, any attempt to consolidate on a single observability tool would have failed, because each team would lose capabilities it considered essential.
Last9 supports multiple interfaces (both a native UI and an embedded Grafana) and query languages (SQL, PromQL, LogQL, and TraceQL), making it easier to achieve unified observability without workflow disruption.
From War Rooms to Restful Nights
The journey from alert hoarding to intentional observability isn't about achieving perfection — it's about having just enough of the right information when it matters. It's about creating more room for thoughtful analysis and less war-room firefighting.
By breaking down operational silos and building bridges between disparate data sources, teams are finding they can understand what changed, why it matters, and how to fix it — often before anyone gets paged at 3 AM.
And that's a transformation worth making.