How We Cut Monitoring Costs and Deprecated Thanos at Replit

“Show me what it can do.”

It took us about three weeks to win Replit over, and we currently run their entire backend infrastructure monitoring on Last9, our managed time series data warehouse.

It all started at a conference when I met Matthew Iselin, who runs SRE at Replit. I did the usual chat about Last9 and spoke about “Cricket Scale”, and the kind of numbers Last9 can soak up.

As a fan of the Australian national cricket team, Matt got it instantly.

More than 20 million developers worldwide use Replit. The team needed a monitoring setup that could handle high cardinality data and support trend analysis on their infrastructure metrics for better capacity planning.

By default, Last9 offers a cardinality quota of 20 million time series per metric per day. To emphasize it again: this quota applies to each metric, not to all your metrics combined. For comparison, most TSDBs offer only up to 2-3 million cardinality for all your metrics together.
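To make the difference between a per-metric quota and a shared quota concrete, here is a minimal Python sketch. It is not how Last9 measures cardinality; the metric names, label values, and quota numbers are made up for illustration, and the daily window is ignored. It only counts the distinct label combinations (time series) for each metric and checks them against the two kinds of limit.

```python
from collections import defaultdict

# Hypothetical samples: (metric name, label pairs). All names are illustrative.
samples = [
    ("http_requests_total", (("method", "GET"), ("repl_id", "a1"))),
    ("http_requests_total", (("method", "GET"), ("repl_id", "b2"))),
    ("http_requests_total", (("method", "POST"), ("repl_id", "a1"))),
    ("http_requests_total", (("method", "GET"), ("repl_id", "a1"))),  # repeat series
    ("cpu_seconds_total", (("core", "0"),)),
    ("cpu_seconds_total", (("core", "1"),)),
]


def cardinality_per_metric(samples):
    """Count distinct label combinations (time series) for each metric."""
    series = defaultdict(set)
    for name, labels in samples:
        series[name].add(frozenset(labels))
    return {name: len(label_sets) for name, label_sets in series.items()}


def over_per_metric_quota(cardinality, quota):
    """Return the metrics whose own series count exceeds the quota."""
    return sorted(name for name, count in cardinality.items() if count > quota)


def over_global_quota(cardinality, quota):
    """Return True if all metrics together exceed a single shared quota."""
    return sum(cardinality.values()) > quota


cardinality = cardinality_per_metric(samples)
print(cardinality)
print(over_per_metric_quota(cardinality, quota=3))
print(over_global_quota(cardinality, quota=3))
```

With the same limit of 3, no single metric is over its per-metric quota, yet the combined total of 5 series would already exceed a shared quota.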

Last9 is designed to tame High Cardinality metrics. These limits can also be increased on demand. And we were excited about this challenge.

Matt was interested, but above all, he was a person of action who moves super quickly. Immediately after the meeting, we did a call, and Matt was ready to see Last9 in action.

This was refreshing for someone like me, who is used to prolonged ‘procurement’ cycles.

Talk is cheap.

Matt’s approach was a challenge.

His premise was fairly simple — “Does this thing work as you claim?” If yes, let’s put it to the test.

Matt wanted proof and wanted it quickly. This made sense: after all, he learned Site Reliability Engineering at Google, the gold standard in the space.

Matt had worked with multiple monitoring solutions in the past. With Thanos as the current setup, teams at Replit were finding it difficult to query their TSDB and get results quickly. The team had also tried hosted Cortex, which was simply too expensive for Replit’s cardinality requirements. Time and toil had taken a toll on the team.

The very next day, a Proof Of Concept (PoC) was set up.

But before that, we threw down a challenge to Matt’s team:

‘Send us your worst metrics, and let’s stress test Last9 to the limits’.

I was particularly nervous about this PoC because I’m a huge fan of Replit, and how polished their engineering teams are.

Let’s bring Last9 down.

Matt took this as a challenge. And our goal was to give Matt every opportunity to break Last9 😎.

If Last9 did break, we had to talk. If it did not, Matt would get to see what our control workflows for high cardinality were like. Within three days, the PoC was underway, and Last9 started working its magic.

We were handling about 10 million active time series with considerable ease.

These were nervous days for me, and I was checking with the engineering team to see how things were going.

I was a little too eager and messaged Matt to see if things were alright. That’s when I realized the radio silence was because select folks from Replit’s engineering team were testing Last9.

It was a full-fledged test to see if Last9 could hold up to all the talk about managing high cardinality. And then… this below from Matt 👇.

We sign the contract.

The entire process takes about three weeks to close.

Where are we now?

Last9 has completely replaced the self-hosted Thanos at Replit.

We were able to get rid of Thanos and all its associated hassles without a significant increase in TCO. It feels incredible to see this journey come together in record time and deliver tangible results.

Replit is empowering the next billion software creators. Their new Replit AI tool helps developers boost productivity and creativity while coding. Last9 powers this entire monitoring experience 😎.

For us, it’s a pleasure working with a company that prides itself on stellar engineering and cares about excellence.

Replit is a testament to our obsessiveness around solving complex engineering problems. It reminds me of our Rta, and I’m hoping to find more gems like Replit to power monitoring.

If you want to give Last9 a spin, get started here.

Want to chat about anything related to software monitoring? I’m on Twitter/X: https://twitter.com/prathamesh2_
