One of the most fascinating stories of ‘infrastructure’ engineering is from India. It's not as globally recognized, or spoken about in the same breath as some of the more popular events. I call this the “Cricket Scale”.
Only a select few companies have witnessed this unprecedented scale, and far fewer understand the many technical challenges of orchestrating such large, high-stakes sporting events. Most peers I talk to outside Asia are stunned by this engineering extravaganza. And there are plenty of untold stories on this front.
What in the world is “Cricket Scale”? 🏏
Few events in the world attract a global audience. Football takes the cake: nearly half the world watched the 2022 World Cup. 113 million people watched the Super Bowl.
And then, there’s cricket.
~450 million people watched the last big cricketing event: the Indian Premier League. More striking still, it drew 30 million concurrent viewers on one app. That’s 30 million people watching a sporting event together on one application. Concurrency is the real killer here. You have to orchestrate your infra to handle load that shoots up and then drops off just as dramatically.
Then there are these fascinating engineering edge cases one must check out. One that I personally like. 👇
What does all this mean? 💥
For a Site Reliability Engineer (SRE), this is a BIG deal. The team has to warm its internal infrastructure to manage data at such a scale. A lot can go wrong, and much thinking goes into orchestrating an event at that scale.
How crazy do things get? 🚦
At a bare minimum, it’s 600+ million metrics a minute at an ingestion latency of 200 ms. And this is only half the story. War rooms get chaotic with multiple dashboards, static alerting rules go out the window, and read workloads are even more staggering than writes.
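To put the write volume in perspective, here is a quick back-of-the-envelope conversion of the 600 million metrics a minute figure. The function names are made up for illustration, and the per-hour number assumes the rate is sustained for the full hour, which real match traffic is not.

```python
# Figure quoted above: 600+ million metrics ingested per minute.
METRICS_PER_MINUTE = 600_000_000


def per_second(per_minute: int) -> float:
    """Convert a per-minute rate into a per-second rate."""
    return per_minute / 60


def per_hour(per_minute: int) -> int:
    """Total volume in one hour, assuming the rate never drops (an assumption)."""
    return per_minute * 60


if __name__ == "__main__":
    print(f"{per_second(METRICS_PER_MINUTE):,.0f} metrics per second")
    print(f"{per_hour(METRICS_PER_MINUTE):,} metrics per hour")
```

That works out to roughly 10 million metrics every second on the write path alone, before counting any dashboard or alerting reads.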
There are few playbooks on how to serve that scale. Not only do engineering teams have to create a playbook, but they also have to stay vigilant on service degradations and customer patterns.
Our time series data warehouse, Levitate, does this at an incredible scale.
Customer patterns? 🤔
Everyone has their favorites. When stars such as Indian cricketer Virat Kohli take the batting crease, traffic spikes. An SRE has to provision more servers, understand traffic, and ‘observe’ service degradations.
Engineers need a precise vocabulary for a system to tell them something is not right, and where things are not right. Because systems are complex, understanding what is happening is a complicated problem to solve.
How do SREs manage Cricket Scale?
The ability to map out your entire infrastructure is critical. After all, you can’t measure what you can’t see. Here’s a simple checklist of where to start. We’ve simplified this to the absolute basics:
Step 1: Get your instrumentation right. Declare the entities that need to be ‘observed’, and only the immediately critical ones. Measure what matters.
Step 2: Map out CDN configurations and the third-party tools powering your platform.
Step 3: Identify the critical infrastructure metrics that need immediate monitoring.
Step 4: Understand latency and availability baselines to write SLOs that matter.
Step 5: Create actionable alerting protocols for SLO degradations. This means understanding the “If This Then That” of outage restoration.
These are the absolute basics that need to come together days before the match starts, so you can load test and prepare for Cricket Scale. I’ve simplified this for folks who want to understand what it takes to manage this scale. There’s A LOT more that goes on behind the scenes.
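Steps 4 and 5 can be sketched as a tiny error-budget burn-rate check. This is a minimal illustration, not how Last9 or Levitate implements alerting: the `Window` type, the 99.9% availability target, the thresholds, and the action strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one evaluation window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def error_rate(w: Window) -> float:
    """Fraction of failed requests; an empty window counts as healthy."""
    if w.total_requests == 0:
        return 0.0
    return w.failed_requests / w.total_requests


def burn_rate(w: Window, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    error_budget = 1 - slo_target
    return error_rate(w) / error_budget


def action_for(rate: float) -> str:
    """The 'If This Then That' step, with made-up thresholds."""
    if rate >= 10:
        return "page on-call and fail over"
    if rate >= 2:
        return "open a ticket and watch"
    return "no action"


if __name__ == "__main__":
    window = Window(total_requests=100_000, failed_requests=500)
    rate = burn_rate(window, slo_target=0.999)
    print(f"burn rate {rate:.1f}: {action_for(rate)}")
```

The useful property is that the alert is tied to the SLO rather than to a static threshold on a raw metric, so it keeps making sense when traffic swings by an order of magnitude mid-match.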
Want to know how we build this step-by-step? Chat with us? 👇
What makes Last9 special compared to others?
Last9 has a time series data warehouse that helps you store, manage, and efficiently query data. We call it Levitate.
Levitate has powerful features to help you save costs, manage scale and rein in cardinality.
Levitate tiers data into different categories, so querying is fast and doesn’t crash the system. We have Policies & Governance to structure and trim your data. Levitate’s alerting tools give you a simple vocabulary to understand critical infrastructure. And… There’s a lot more.
Levitate can crunch 5 trillion data samples across 30 days with a maximum latency of 100 ms. On game days, it becomes the single most crucial and trusted pane to drive business goals.
For example, you may want to switch payment providers during a live match because one has degraded and is failing customers. Being aware of such degradations is the first part of the puzzle; being able to solve for these unpredictable outcomes is what makes Levitate a trustworthy tool to drive business goals.
Levitate is your war-time buddy during large-ticket events such as these.
It's difficult to grok such large amounts of data and understand when and where things can go wrong. Here’s an anecdotal example of how we're able to do this at scale: Shannon Limits and engineering reliability
Want to know more about Last9 and our products? Check out last9.io; we're building reliability tools to make running systems at scale fun and embarrassingly easy. 🟢