One of the recurring problems I’ve noticed with potential customers we speak to is Cardinality. Over the last 3 years of working at Last9, this one word has come up more times in the last 9 months than in the previous 2 years put together. High Cardinality is a pressing problem, and one that seems to be getting out of control for most Site Reliability Engineering (SRE) teams.
There’s a tonne of material on High Cardinality online. But in keeping with my theme of simplifying complex information (Reliability engineering for dummies), here’s an Explain It Like I’m 5 (ELI5). This time, we’re talking about a problem that’s plaguing the Observability industry: High Cardinality.
What is High Cardinality?
The explosion of unique data combinations in a time series database is High Cardinality. As the number of unique values increases, so does the total number of combinations. This increases the cardinality of data. The more information you want to track, the more different data combinations you will have to instrument. Organizing different combinations of data gets harder because of the sheer volume of unique data points being managed. Eventually, you have High Cardinality data.
I’ve taken some liberties with that explanation to make it as ELI5 as possible. I know. It’s still all theoretical. The best way to understand this is with an example.
And what better way to talk about High Cardinality than…books 😊
Watchmen — A shelf to figure out Cardinality
Let’s say you have a whole bunch of books and want to arrange them on your bookshelf. How do you go about arranging these books? Let’s use this analogy loosely to understand High Cardinality.
- I have a friend who arranges her entire bookshelf based on just the ‘color’ of the cover of the book. If you do this, I have questions. Starting with, ‘What happened to you’ 😜 .
- Another friend arranges his bookshelf based on how ‘worn out’ the book is. If it’s tattered, it’s on the lower shelves, away from someone’s eye line.
- I arrange mine alphabetically based on the author’s first name. Plain and simple. My college mate does something similar, but the second name of the author takes precedence.
Then, there’s a colleague of mine:
He arranges books based on the first name of the author, but…
A. It’s divided based on the genre: Thriller, Fantasy, History, Satire. But…
B. He further divides it into male and female authors. But…
C. He also wants to color-code it after the above two segregations. But…
D. It’s again further subdivided into how worn out it is.
You see where I’m going with this? It’s a step-by-step process with priorities and filters. A has a higher priority than D, and C is impossible without ticking off B. And it goes on.
Something as simple as arranging books on a shelf gets complicated the more filters you have. Let’s delve deeper to understand how this can be translated to High Cardinality.
A case of exploding mangoes — Understanding High Cardinality
Given the explosion of unique attributes for each book, it’s hard to manage Cardinality. Take any book, for example. Let’s take genre as a key label. Even if you take just 4 of them:
Productivity | Sci-Fi | History | Satire |
---|---|---|---|
Your Cardinality = 4.
Let’s add another label into the fold: Durability of the book (worn out or not worn out).
Genre | Productivity | Sci-Fi | History | Satire |
---|---|---|---|---|
Durability | Yes | Yes | Yes | Yes |
Durability | No | No | No | No |
Your Cardinality = 8.
Hmmm… What if we add another label into the fold: ‘Aliens’ — Does it have alien civilizations or not?
What happens to your Cardinality now? Think about that carefully before you proceed further.
If you guessed 16, you’re wrong.
Another guess?
Here’s a clue:
The Alien label applies only to the Sci-Fi genre.
Take another guess.
The answer is:
Your Cardinality = 10.
Here’s what that looks like:
Genre | Productivity | Sci-Fi | History | Satire |
---|---|---|---|---|
Durability | Yes | Yes | Yes | Yes |
Durability | No | No | No | No |
Alien | | Yes | | |
Alien | | No | | |
For all those empty boxes, the Alien label simply does not apply. This particular label is only applicable to one genre: Sci-Fi. And suddenly, you realize how things can get crazy.
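If you want to check the arithmetic yourself, here is a minimal Python sketch of the bookshelf. The variable and function names are made up for illustration; the only rule it encodes is the one from the clue: the Alien label exists only for Sci-Fi books.

```python
from itertools import product

genres = ["Productivity", "Sci-Fi", "History", "Satire"]
durability = ["worn out", "not worn out"]
aliens = ["yes", "no"]

# The naive guess: every label applies to every book.
naive = len(list(product(genres, durability, aliens)))


def label_sets():
    """Yield each valid label combination; 'aliens' only exists for Sci-Fi."""
    for genre, worn in product(genres, durability):
        if genre == "Sci-Fi":
            for alien in aliens:
                yield (genre, worn, alien)
        else:
            yield (genre, worn)


combos = list(label_sets())
print(naive)        # 16
print(len(combos))  # 10
```

The gap between 16 and 10 is the point: cardinality depends on which label combinations can actually occur, not just on how many values each label has.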
Imagine increasing the number of values for a label, such as adding more genres.
Imagine increasing the number of variables in this list, such as Length: Under 300 pages, Weight: Under 200, True Story, etc… Suddenly, the dimensions explode. The possibilities are endless. This is High Cardinality.
And High Cardinality is inevitable. There’s no point trying to ‘manage’ it; one must simply live with the modern realities of a cloud-native world. What we can do is better instrument our data, get more organized, and prepare for what is an inevitability.
The Anarchy — is High Cardinality good?
There’s no good or bad with high cardinality. It’s a given in today’s cloud-native world. All our energy must be focused on how we can deal with this inevitability.
But before we get to some answers, let me explain this in real-world terms, without the bookshelf analogy. This will give you an idea of the problem with High Cardinality.
Context: I’m an engineer at a food delivery company.
Daily Active Users: 20 Million
Average-Orders-per-user-per-day: 3
Metric:
I have a metric called order_total, which has the labels order_id, user_id, and status (accepted | pending | completed | failed).
My system generates a unique order_id for each new order. Everything gets tracked in a database.
Total Unique possible values:
- order_id = 3 (per user per day)
- user_id = 20M (per day)
- status = 4 (accepted, pending, completed, failed)
- Combined = 3 × 20M × 4 = 240M (per day)
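The same back-of-the-envelope math in Python, using only the figures from the list above (3 orders per user, 20M users, 4 statuses). Treat the result as a worst-case upper bound: it assumes every order could show up under every status over the course of a day.

```python
# Figures taken from the food delivery example above.
daily_active_users = 20_000_000
orders_per_user_per_day = 3
statuses = ["accepted", "pending", "completed", "failed"]

# Every order gets its own order_id, so this is the count of unique order_ids.
orders_per_day = daily_active_users * orders_per_user_per_day

# Worst case: each order appears once under each status.
worst_case_series = orders_per_day * len(statuses)

print(f"{orders_per_day:,}")     # 60,000,000
print(f"{worst_case_series:,}")  # 240,000,000
```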
How do we read this?
Let’s query this data.
For example, we want to look at all accepted orders.
query: sum(order_total{status="accepted"}[1m]) by (order_id, user_id)
Just to process this query, the system will have to scan 20M * 3 = 60M unique data points.
The Dashboard will inevitably crash because of High Cardinality.
Worse: Some of this data won’t even be ingested.
These are real-world problems that engineers on the floor face every day because of cardinality explosion.
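To see why grouping by order_id and user_id hurts, here is a toy in-memory simulation, scaled down from 20M users to 1,000. It is not how a real time series database executes a query, and `sum_by` is a hypothetical helper written for this sketch. It only shows that when you group by a label that is unique per order, the "aggregation" returns one group per matching order and shrinks nothing.

```python
import random
from collections import defaultdict

random.seed(7)  # fixed seed so the run is repeatable
USERS = 1_000   # scaled down from 20M for illustration
ORDERS_PER_USER = 3
STATUSES = ["accepted", "pending", "completed", "failed"]

# One sample per order: (labels, value). order_id is unique per order.
samples = []
for user_id in range(USERS):
    for n in range(ORDERS_PER_USER):
        labels = {
            "order_id": f"{user_id}-{n}",
            "user_id": user_id,
            "status": random.choice(STATUSES),
        }
        samples.append((labels, random.uniform(5, 50)))


def sum_by(samples, keys, status):
    """Sum values of samples with the given status, grouped by `keys`."""
    groups = defaultdict(float)
    for labels, value in samples:  # every sample is scanned
        if labels["status"] == status:
            groups[tuple(labels[k] for k in keys)] += value
    return groups


accepted = sum(1 for labels, _ in samples if labels["status"] == "accepted")
per_order = sum_by(samples, ("order_id", "user_id"), "accepted")
overall = sum_by(samples, (), "accepted")

print(len(per_order) == accepted)  # True: one output group per accepted order
print(len(overall))                # 1: a single total when no ids are kept
```

Multiply the per-order result back up to 20M users and the dashboard is being asked to render tens of millions of groups.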
The Sympathiser — Save me from this metric explosion
Levitate, our time series data warehouse, is designed to help you with your High Cardinality data. First, we give our customers superior defaults so data never gets dropped. 👇
We also use something called ‘Streaming Aggregations’ to tame this data explosion. (More on this later)
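To make the general idea concrete, here is a rough Python sketch of aggregating samples as they arrive, so that only the low-cardinality labels are kept. This is a generic illustration and not Levitate’s implementation; `aggregate_stream` and the sample events are hypothetical.

```python
from collections import Counter


def aggregate_stream(events, keep=("status",)):
    """Fold incoming samples into totals keyed only by the labels in `keep`."""
    totals = Counter()
    for labels, value in events:
        # order_id and user_id are dropped here, before anything is stored.
        key = tuple((k, labels[k]) for k in keep)
        totals[key] += value
    return totals


events = [
    ({"order_id": "o-1", "user_id": "u-1", "status": "accepted"}, 1),
    ({"order_id": "o-2", "user_id": "u-1", "status": "failed"}, 1),
    ({"order_id": "o-3", "user_id": "u-2", "status": "accepted"}, 1),
    ({"order_id": "o-4", "user_id": "u-3", "status": "pending"}, 1),
]

totals = aggregate_stream(events)
print(len(events), "raw series ->", len(totals), "aggregated series")
print(totals[(("status", "accepted"),)])  # 2
```

The trade-off is that you can no longer ask about a single order_id from the aggregated data, so the choice of which labels to keep matters.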
For the curious lot who want to deep-dive and understand how we solve this using Streaming Aggregations, check out this piece from Piyush, my bossman and our CTO 👇
If you’re facing High Cardinality problems (I know you are; I can guarantee it 😜) check out our page for a more succinct explanation, and book a call with us here. Allow us to Levitate your woes 😉
P.S.
- I loved reading Watchmen — it’s a phenomenal graphic novel with such profound observations.
- Highly recommend you read ‘A Case of Exploding Mangoes’; it’s biting satire and stellar prose.
- The Anarchy is a non-fiction book on the East India Company and its marauding empire.
- The Sympathizer is a towering book being made into a series with Robert Downey Jr. in it. This one is going to be a riot: funny, mischievous, thrilling, and ironic, all packed into one.
Special thanks to Aniket and Sohom for vetting and helping with this story. Also, Chronosphere has an ELI5 on the topic, which I enjoyed reading.
Oh, also, join our Discord community to mingle with like-minded folks.