Reliable Observability for 25+ million concurrent live-streaming viewers
- Video Streaming
- 300+ engineers
- APAC
- Amazon Web Services
Last9 works with some of the world’s largest streaming companies. One of our customers streams movies, TV shows, and large-scale sporting events to its millions of subscribers. At cricket scale, hundreds of microservices, not dozens, work together to deliver the user experiences that keep viewers glued to their devices.
No dropped frames, never miss a moment
A cricket match can have over 25 million concurrent viewers. Games last about 3-4 hours, and systems are warmed up hours in advance in anticipation of the sudden surge in traffic. Significant ephemeral resources come online for the duration of the game and are torn down soon after. Hundreds of engineers work on the backend services and infrastructure that make live-streaming such high-stakes events possible.
When something goes wrong, the war room team needs to isolate the issue immediately. The next step is to draft a Root Cause Analysis (RCA) and route it to the appropriate team for further investigation. Every additional second spent diagnosing a problem hurts advertising revenue. Above all, it causes massive viewer dissatisfaction, because viewers may miss a major sporting moment. Teams often find themselves investigating leading indicators of failure. These problems surface on social media platforms and need to be triaged and fixed before they spread.
Growing Pains
Scale
Scaling in-house metrics with business growth
Uptime
Maintaining uptime and query guarantees
Toil
Managing a TSDB instead of focusing on Product
Standardization
Standard Telemetry across teams
The Multifaceted Infrastructure Platform
A diverse and complex infrastructure platform powers our customer’s scale. It comprises hundreds of microservices and a variety of data stores for persistent storage; some are fully managed by cloud hyperscalers, and the in-house team manages others. The scale-up that such games require brings many ephemeral resources online. All of these sources continuously emit metrics, and in large volumes, so that their health can be observed, which makes consistent and uniform Observability across such disparate sources incredibly challenging. Over time, the team’s engineers noticed a sprawl of metrics and monitoring techniques, making it hard to standardize telemetry and monitoring.
The team was using an in-house setup based on VictoriaMetrics (a popular open-source time series database) and InfluxDB for metrics management. Grafana was used for visualizing data and managing alerts.
Growing scale and concurrent access woes
Multiple teams had created thousands of dashboards, which presented unique challenges. Grafana dashboards and alerts were concurrently accessing the same underlying metrics storage. These underlying databases could not keep up with massive ingestion and simultaneous queries. Inevitably, the storage would go down, leaving the teams blind to the health of their infrastructure. Instead of focusing on features and innovating on the product, the engineering team spent countless hours keeping the Observability platform up.
To reliably support the team’s infrastructure growth, they needed a next-generation Observability platform. Given their unique challenges and scale, the team needed a product that could withstand “cricket scale”, sustain uptime, be globally available, and not explode costs.
The Last9 Advantage
Open Standards
Zero integration efforts
Superfast Ingestion
50% reduction in write latency
Data Tiering
Solves concurrent access woes and powers long term retention
Last9 is a globally available time series & events telemetry data platform designed for scale, high cardinality and long term retention.
Open Standards
Last9 ingests data from multiple open standards, such as Prometheus exposition, OpenTelemetry Metrics, OpenMetrics, and InfluxDB. This meant no migration effort was needed, even at our customer’s scale of hundreds of microservices. Given this interoperability and ease of integration, hundreds of engineers were onboarded to existing and new workloads on Last9 within weeks. Since Last9 is also fully compatible with open standards on the output layer, the team could keep using their existing dashboards and alerting workflows.
Within a month, Last9 was the source of truth for all metrics workloads across our customer’s teams.
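To make "open standards" concrete, here is a minimal sketch of what a sample looks like in the Prometheus text exposition format, one of the formats named above. The metric name, label names, and helper functions are hypothetical and chosen only for illustration; they are not taken from the customer's setup or from any Last9 API.

```python
from typing import Dict, Optional


def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_sample(
    name: str,
    labels: Dict[str, str],
    value: int,
    timestamp_ms: Optional[int] = None,
) -> str:
    """Render one sample line: name{label="value",...} value [timestamp]."""
    label_str = ",".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    line = f"{name}{{{label_str}}}" if label_str else name
    line += f" {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"
    return line


def render_gauge(name: str, help_text: str, samples: Dict[str, int]) -> str:
    """Render a gauge with HELP/TYPE headers and one sample per region (hypothetical label)."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for region, value in sorted(samples.items()):
        lines.append(render_sample(name, {"region": region}, value))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical metric, for illustration only.
    print(render_gauge("concurrent_viewers", "Viewers currently streaming.", {"apac": 25000000}))
```

Because any service that already exposes metrics this way can be scraped or forwarded unchanged, a backend that accepts the format does not require instrumentation changes on the service side.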
SLA Guarantees
Last9 is a managed service with Service Level Agreement (SLA) guarantees and clawbacks for both Read and Write workloads. This eliminated the toil of managing and scaling our customer’s in-house metrics setup.
Last9's Availability SLA Guarantees
Long Term Retention
With the previous in-house metrics setup, teams could not retain data beyond a month for critical analysis. Imagine having billions of data points of consumer behavior, but being unable to use them for growing business needs.
Last9's automatic data tiering and retention policies paved the way for long-term time series storage. This helped the team with capacity planning and business insights year after year. By default, the latest data is available in all tiers, but their retention policies vary.
Last9's Default Tier Retention
Last9's Data Tiering capability is also used on the query layer: policies can reserve the Blaze tier for alerting queries, leaving the other tiers for deeper exploration and analysis. This resolved the concurrent access issue the team faced with the in-house metrics setup.
Observability is a foundational building block and can unlock much goodness; however, it’s deviously complex to get right. The founders at Last9, aptly named, have been amazing partners in trying to make inroads on what a solid observability platform should be and hit most, if not all, of the building blocks.
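The idea of routing queries by tier can be sketched as follows. This is an illustration of the concept only, not Last9's implementation or configuration syntax: only the Blaze tier is named in the text, so the other tier names, every retention value, and the `pick_tier` function are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    name: str
    retention_days: int


# Assumption: only "blaze" appears in the text. The other tier names and all
# retention values below are placeholders, not Last9 defaults.
TIERS: List[Tier] = [
    Tier("blaze", 1),
    Tier("warm", 90),
    Tier("cold", 365),
]


def pick_tier(client: str, lookback_days: int) -> Tier:
    """Send alerting queries to the Blaze tier and everything else to a slower tier.

    Keeping dashboards and ad hoc exploration off the alerting tier means heavy
    read traffic cannot starve alert evaluation.
    """
    if client == "alerting":
        return TIERS[0]
    for tier in TIERS[1:]:
        if lookback_days <= tier.retention_days:
            return tier
    raise ValueError(f"no tier retains {lookback_days} days of data")


if __name__ == "__main__":
    print(pick_tier("alerting", 1).name)
    print(pick_tier("dashboard", 30).name)
```

The design choice being illustrated is isolation: alert evaluation and exploratory queries no longer compete for the same storage, which is the concurrent access problem described earlier.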
Key Results
Single Source of Truth
Single data source for all metrics workloads
Zero Toil, Better Performance
No toil of managing an in-house TSDB
Reduced TCO
Total Cost of Ownership reduced by 50%
Last9 has improved query speeds, reduced the Total Cost of Ownership (TCO) by 50%, and is currently the bedrock of the customer’s entire infrastructure.
Bring Your Own Cloud Model
Last9 supports a Bring Your Own Cloud (BYOC) model: we can deploy directly in our customer’s cloud, with all of Last9's features available.
With optimized auto-tiered storage, warehousing control levers, and availability guarantees, we’ve reduced the toil of managing a time series database and the engineering overhead that comes with it, a cost seldom factored in when calculating what it takes to run your own Observability team.
Talk to the Last9 team to understand how Last9 can unlock value for you as well. Get a demo or get started today.