Jul 27th, ‘23
Enhancing Metric Gateway Performance
This release aims to improve the metric gateway with a more performant and accurate Cardinality Limiter.
Key Features and Improvements
More Accurate Series Limiter
The new algorithm tracks time series per metric with exact counters. Previously, even already-ingested time series could be rejected during a surge, a limitation inherent to how probabilistic data structures function and their space requirements at our scale. This release fixes that limitation.
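To illustrate the idea (not Levitate's actual implementation), here is a minimal sketch of exact per-metric series tracking: each metric keeps a set of hashes of its label sets, so a series that was already ingested is always recognized and admitted, even during a surge. The class and method names are hypothetical.

```python
import hashlib


class SeriesLimiter:
    """Sketch: exact per-metric series tracking (illustrative only)."""

    def __init__(self, limit: int):
        self.limit = limit
        self.seen: dict[str, set[bytes]] = {}  # metric -> hashes of label sets

    def _series_key(self, labels: dict[str, str]) -> bytes:
        # Canonicalize labels so the same series always hashes identically.
        canonical = ",".join(f"{k}={labels[k]}" for k in sorted(labels))
        return hashlib.blake2b(canonical.encode(), digest_size=8).digest()

    def allow(self, metric: str, labels: dict[str, str]) -> bool:
        series = self.seen.setdefault(metric, set())
        key = self._series_key(labels)
        if key in series:
            return True  # already-ingested series are never rejected
        if len(series) >= self.limit:
            return False  # only genuinely new series hit the limit
        series.add(key)
        return True
```

Because membership is exact rather than probabilistic, a write for a known series can never be misclassified as new and dropped.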
Cardinality Metric Reporting Improvements
Constantly reporting cardinality numbers at our scale is expensive. The system now reports more timely signals: when a metric hits its cardinality limit and when its quota usage resets to zero.
Resource Efficiency & Threshold Heuristics
Counting distinct time series across a large stream is challenging and consumes a significant amount of memory: maintaining 1 million hash values alone requires 400 MB. Across clusters and tenants the problem compounds, making it even slower. The new algorithm instead initiates counting only when a metric crosses a heuristic threshold suggesting it may exceed its cardinality quota. This approach dramatically reduces the resources required.
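A rough sketch of the deferred-counting idea, under assumptions of ours (the threshold ratio, the use of raw write volume as the cheap signal, and all names are hypothetical): the expensive distinct-count structure is allocated only once a cheap heuristic says the metric looks risky.

```python
class LazyCardinalityCounter:
    """Sketch: defer expensive distinct counting until a cheap heuristic
    (raw write volume) suggests the metric might breach its quota."""

    def __init__(self, quota: int, threshold_ratio: float = 0.5):
        self.quota = quota
        self.threshold = int(quota * threshold_ratio)
        self.write_count = 0                 # cheap signal: raw write volume
        self.exact: set[int] | None = None   # allocated only when needed

    def observe(self, series_hash: int) -> None:
        self.write_count += 1
        if self.exact is None and self.write_count >= self.threshold:
            self.exact = set()               # start exact counting late
        if self.exact is not None:
            self.exact.add(series_hash)

    def over_quota(self) -> bool:
        # Counts distinct series seen since exact counting began, so this is
        # a conservative (under-counting) check, which is fine for a sketch.
        return self.exact is not None and len(self.exact) > self.quota
```

Metrics that never approach the threshold pay only one integer of memory instead of a full hash set.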
Lower Reset Spikes in Resource Usage and Latency
The new cardinality limiter does not follow a stop-the-world design. It resets gracefully in the background without bringing existing writes to a halt. This prevents P99 requests from timing out, resulting in no backpressure on remote-write clients sending data to Levitate.
Stop Ingesting Metric After Limit is Reached
When the cardinality limit for a specific metric is reached, a conditional switch halts the ingestion of that metric. This measure is taken to ensure the accuracy and integrity of the metric data, as partial ingestion results in partial histograms that yield meaningless query results. It's best to halt ingestion and encourage using high-cardinality workflows, such as relabeling and streaming aggregation, to handle high-cardinality metrics meaningfully.
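The conditional switch can be pictured as a per-metric trip flag (a hypothetical sketch; names are illustrative): once a metric exceeds its limit, every further sample for it is dropped, so queries never see a histogram with some buckets ingested and others missing.

```python
class IngestGate:
    """Sketch: once a metric trips its cardinality limit, drop all further
    samples for it to avoid partial histograms."""

    def __init__(self, limit: int):
        self.limit = limit
        self.series_count: dict[str, int] = {}
        self.tripped: set[str] = set()

    def record_new_series(self, metric: str) -> None:
        n = self.series_count.get(metric, 0) + 1
        self.series_count[metric] = n
        if n > self.limit:
            self.tripped.add(metric)  # flip the switch for this metric

    def accepts(self, metric: str) -> bool:
        return metric not in self.tripped
```

Dropping the whole metric rather than a slice of its series trades availability for correctness: an absent metric is obviously missing, while a partial histogram silently lies.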
Please read the blog post below to learn more about our design choices in achieving these results.