Jun 7th, ‘24/3 min read

How We Cut Monitoring Costs and Deprecated Thanos at Replit

Winning Replit over by taming High Cardinality data and deprecating Thanos

“Show me what it can do.”

It took us about three weeks to win Replit over, and we currently run their entire backend infrastructure monitoring on Levitate, our managed time series data warehouse.

It all started at a conference when I met Matthew Iselin, who runs SRE at Replit. I did the usual chat about Levitate and spoke about “Cricket Scale”, and the kind of numbers Levitate can soak up.

As a fan of the Australian national cricket team, Matt got it instantly.

More than 20 million developers worldwide use Replit. The team needed a monitoring setup that could handle high cardinality data and support trend analysis on their infrastructure metrics for better capacity planning.

By default, Levitate offers a cardinality quota of 20 million time series per metric per day. To emphasize: this quota applies to each metric, not to all your metrics combined. For comparison, most TSDBs offer only up to 2-3 million cardinality across all your metrics together.

Levitate is designed to tame High Cardinality metrics. These limits can also be increased on demand. And we were excited about this challenge. 
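To make the per-metric distinction concrete, here is a minimal sketch of counting cardinality per metric and checking each count against a quota. It is an illustration only, not how Levitate works internally: the function names, the sample metrics, and the labels are all hypothetical, and only the 20 million figure comes from the text above.

```python
from collections import defaultdict

# Figure quoted in the post: default quota of distinct series per metric per day.
PER_METRIC_QUOTA = 20_000_000


def cardinality_by_metric(samples):
    """Count distinct label sets (time series) seen for each metric name."""
    series = defaultdict(set)
    for metric, labels in samples:
        # A series is identified by its metric name plus its full label set.
        series[metric].add(frozenset(labels.items()))
    return {metric: len(label_sets) for metric, label_sets in series.items()}


def metrics_over_quota(cardinality, quota=PER_METRIC_QUOTA):
    """Return metric names whose own cardinality exceeds the per-metric quota."""
    return sorted(name for name, count in cardinality.items() if count > quota)


# Hypothetical samples: (metric name, labels) pairs.
samples = [
    ("http_requests_total", {"pod": "a", "path": "/"}),
    ("http_requests_total", {"pod": "b", "path": "/"}),
    ("http_requests_total", {"pod": "a", "path": "/"}),  # same series seen again
    ("cpu_seconds_total", {"pod": "a"}),
]

counts = cardinality_by_metric(samples)
print(counts)
# A toy quota of 1 is used here so the check has something to flag.
print(metrics_over_quota(counts, quota=1))
```

Each metric is compared against the quota on its own, so one noisy metric does not eat into the budget of the others.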

Matt was interested, but above all, he is a person of action who moves super quickly. Right after the meeting, we got on a call, and Matt was ready to see Levitate in action.

This was refreshing for someone like me, who is used to prolonged ‘procurement’ cycles.

Talk is cheap.

Matt’s approach was a challenge.

His premise was fairly simple: “Does this thing work as you claim? Let’s put it to the test.”

Matt wanted proof, and he wanted it quickly. This made sense. After all, he learned Site Reliability Engineering at Google, the gold standard in the space.

Matt had worked with multiple monitoring solutions in the past. With Thanos as the current setup, teams at Replit were finding it difficult to query their TSDB and get results quickly. The team had also tried hosted Cortex, which was simply too expensive for Replit’s cardinality requirements. Time and toil had taken a toll on the team.

The very next day, a Proof Of Concept (PoC) was set up.

But before that, we threw down a challenge to Matt’s team:

‘Send us your worst metrics, and let’s stress test Levitate to the limits’.

I was particularly nervous about this PoC because I’m a huge fan of Replit, and how polished their engineering teams are. 

Let’s bring Levitate down.

Matt took this as a challenge. And our goal was to give Matt every opportunity to break Levitate 😎.

If Levitate did break, we had to talk. If it did not, Matt would get to see what our control workflows for high cardinality were like. Within three days, the PoC was underway, and Levitate started working its magic.

We were handling about 10 million active time series with considerable ease.

These were nervous days for me, and I was checking with the engineering team to see how things were going.

I was a little too eager and messaged Matt to see if things were alright. That’s when I realized the radio silence was because select folks from Replit’s engineering team were testing Levitate.

It was a full-fledged test to see if Levitate could hold up to all the talk about managing high cardinality. And then… this below from Matt 👇.

We sign the contract.

The entire process took about three weeks to close.

Where are we now?

Levitate has completely replaced the self-hosted Thanos at Replit.

We were able to get rid of Thanos and all its associated hassles without a significant increase in TCO. It feels incredible to see this journey come together in record time and deliver tangible results.

Replit is empowering the next billion software creators. Their new Replit AI tool helps developers boost productivity and creativity while coding. Levitate powers this entire monitoring experience 😎.

For us, it’s a pleasure working with a company that prides itself on stellar engineering and cares for its excellence.

Replit is a testament to our obsession with solving complex engineering problems. It reminds me of our Rta, and I’m hoping to find more gems like Replit to power monitoring for.

If you want to give Levitate a spin, get started here.

Want to chat about anything related to software monitoring? I'm on Twitter/X: https://twitter.com/prathamesh2_


Authors

Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.
