Sample vs Metrics vs Cardinality

When dealing with Time Series databases, I always got confused with Sample vs. Metrics vs. Cardinality. Here’s an explanation as I have understood it.

A sample is the most atomic record of observability that cannot be explored beyond its attributes.

A sample has a few key attributes.

Timestamp: Of the observation.
Value: Of the observation.
Name: What is being recorded?
Duration: Time between two recordings.

Dimensions are City: Austin, State: Texas, Country: USA, ZIP: 94043 Metric name is `total_water_consumed` The period is 19:00–19:10 on 20th January 2022

Key things

The data is not AT the time of observation, but the duration between two observations. Ex. 17th Jan 2022, 18:05 till 17th Jan 2022, 18:06
Some call this Period.
No matter your service, the data is always an aggregate for a minimum time range.

As your Observability grows, you face three key challenges.

More metrics are to be observed. Also called Coverage.
Same metric for more entities. Also called Cardinality.
Save metrics for a longer duration. Also called Retention.

Retention

As you start saving data for longer, at a given time, you have more historical data that can be used for trend analysis. But like any Data system, this adds additional strain to keep the system reliable for a larger volume of data.

We were earlier saving data for only 10 minutes. When we increase the retention to 30 minutes, 2 more samples will be preserved.

Cardinality

It is the number of unique attributes observed for a given metric. Cardinality is the most formidable challenge because the computation and memory needed to tackle queries grow exponentially.

We were monitoring total_water_consumed for a particular zip code, say we need 2 additional zip codes to be monitored, It would mean 2 more samples for the same metric + duration.

More Observability

Coverage increases when you observe more metrics or more dimensions. But, Saving more metrics does not mean better Observability; it may also just mean more unused data.

For the same duration AND for all the 3 ZIP Codes, we decide to observe total_power_consumed in addition to the total_water_consumed. Therefore, we will save 3 more samples, 1 for each zip_code AND timestamp.

Growth in Real Data

Modern Time Series systems don’t have to grow along a single axis of Cardinality, Coverage, or Retention alone. Instead, the rate of ingestion and exploration warrants an expansion on all three axes.

It’s crucial to understand how samples get involved while querying and the scale and performance limitations.

Say, I want water AND power consumption for 20 minutes but only for {austin, texas, USA, 94043}.

It will only involve one Dimension, two Coverage, and two Time units.

Say, I want ONLY water consumption for 20 minutes, but for {austin, texas, USA, 94043} AND {austin, texas, USA, 95058}

It only involves one Coverage, two Dimensions, and two Time units.

Conversely, If I want Power AND Water consumption across {austin, texas, USA, 94043} AND {austin, texas, USA, 95058} but ONLY for 10 minutes.

It will involve one Time, two Dimensions, and two Coverage units.

Whereas, If I want Water and Power consumed by all Females of {austin, texas, USA, 94043}, I won’t find any results.

Remember: A sample is the most atomic record of observability that cannot be explored beyond its attributes.

Concluding

For better observability, you need to have more exemplary attributes. Resulting in more samples.
For better coverage, you need more metrics. Resulting in more samples.
For historical knowledge, you need more retention. Resulting in more samples.

A typical time-series database is subjected to various workloads, all requesting different kinds and ranges of data.

Found this interesting? Got a different PoV? Reach out to me on Twitter — @realmeson10 . Oh, also, you should check out last9.io; we're building reliability tools to make running systems at scale, fun, and embarrassingly easy. Check us out. :)