Last9

Jun 7th, ‘24 / 3 min read

How We Cut Monitoring Costs and Deprecated Thanos at Replit

Winning Replit over by taming High Cardinality data and deprecating Thanos

“Show me what it can do.”

It took us about three weeks to win Replit over, and we currently run their entire backend infrastructure monitoring on Last9, our managed time series data warehouse.

It all started at a conference when I met Matthew Iselin, who runs SRE at Replit. I gave my usual pitch about Last9 and spoke about “Cricket Scale” and the kind of numbers Last9 can soak up.

As a fan of the Australian national cricket team, Matt got it instantly.

More than 20 million developers worldwide use Replit. The team needed a monitoring setup to handle high cardinality data and perform trend analysis on their infrastructure metrics for better capacity planning.

By default, Last9 offers a cardinality quota of 20 million time series per metric per day. To emphasize it again: this quota applies to each metric individually, not to all your metrics combined. For comparison, most TSDBs offer only up to 2-3 million cardinality for all your metrics together.
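To make the per-metric distinction concrete, here is a minimal sketch of how cardinality can be counted per metric rather than in total. It assumes a time series is identified by its metric name plus its full label set; the metric names, labels, and the `cardinality_per_metric` helper are hypothetical illustrations, not Last9's actual accounting, and the sketch ignores the per-day window.

```python
from collections import defaultdict


def cardinality_per_metric(samples):
    """Count distinct label sets (i.e. time series) for each metric name."""
    series = defaultdict(set)
    for metric, labels in samples:
        # A series is the metric name plus its complete set of label pairs.
        series[metric].add(frozenset(labels.items()))
    return {metric: len(label_sets) for metric, label_sets in series.items()}


# Hypothetical samples: (metric name, labels)
samples = [
    ("http_requests_total", {"pod": "a", "status": "200"}),
    ("http_requests_total", {"pod": "a", "status": "500"}),
    ("http_requests_total", {"pod": "b", "status": "200"}),
    ("http_requests_total", {"pod": "a", "status": "200"}),  # same series again
    ("cpu_seconds_total", {"pod": "a"}),
]

per_metric = cardinality_per_metric(samples)
total = sum(per_metric.values())
print(per_metric)  # series count for each metric
print(total)       # series count across all metrics
```

A per-metric quota is checked against each value in `per_metric` separately, whereas a global quota is checked against `total`.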

Last9 is designed to tame High Cardinality metrics. These limits can also be increased on demand. And we were excited about this challenge. 

Last9's Super Cardinality Defaults

Matt was interested, but above all, he was a person of action who moved quickly. Immediately after the meeting, we did a call, and Matt was ready to see Last9 in action.

This was refreshing for someone like me, who is used to prolonged ‘procurement’ cycles.

Talk is cheap.

Matt’s approach was a challenge.

His premise was fairly simple — “Does this thing work as you claim?” If yes, let’s put it to the test.

Matt wanted proof and wanted it quickly. This made sense. After all, he learned Site Reliability Engineering at Google, the gold standard in the space.

Matt had worked with multiple monitoring solutions in the past. With Thanos as the current setup, teams at Replit were finding it difficult to query their TSDB and get results quickly. The team had also tried hosted Cortex, which was simply too expensive for Replit’s cardinality requirements. Time and toil had taken a toll on the team.

The very next day, a Proof Of Concept (PoC) was set up.

But before that, we threw down a challenge for Matt’s team:

‘Send us your worst metrics, and let’s stress test Last9 to the limits’.

I was particularly nervous about this PoC because I’m a huge fan of Replit, and how polished their engineering teams are. 

Let’s bring Last9 down.

Matt took this as a challenge. And our goal was to give Matt every opportunity to break Last9 😎.

If Last9 did break, we had to talk. If it did not, Matt would get to see what our control workflows for high cardinality were like. Within three days, the PoC was underway, and Last9 started working its magic.

We were handling about 10 million active time series with considerable ease.

These were nervous days for me, and I was checking with the engineering team to see how things were going.

I was a little too eager and messaged Matt to see if things were alright. That’s when I realized the radio silence was because select folks from Replit’s engineering team were testing Last9.

It was a full-fledged test to see if Last9 could hold up to all the talk about managing high cardinality. And then… this below from Matt 👇.

Matt's review about Last9

We sign the contract.

The entire process took about three weeks to close.

Where are we now?

Last9 has completely replaced the self-hosted Thanos at Replit.

We were able to get rid of Thanos and all its associated hassles without a significant increase in TCO. It feels incredible to see this journey come together in record time and deliver tangible results.

Replit is empowering the next billion software creators. Their new Replit AI tool helps developers boost productivity and creativity while coding. Last9 powers this entire monitoring experience 😎.

Matt's Review

For us, it’s a pleasure working with a company that prides itself on stellar engineering and cares for its excellence.

Replit is a testament to our obsessiveness around solving complex engineering problems. It reminds me of our Rta, and I’m hoping to find more gems like Replit to power monitoring.

If you want to give Last9 a spin, get started here.

Want to chat about anything related to software monitoring? I'm on twitter/X - https://twitter.com/prathamesh2_

Authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.