In our last piece, we talked about high cardinality, its impact on your observability setup, and why it doesn’t have to be seen as a problem. In this one, we’ll look at how cardinality and dimensionality are connected, some practical tips for managing high cardinality, and a few best practices for building observability systems that scale smoothly.
What is Dimensionality?
In monitoring, dimensionality means the labels you tag onto your metrics to give them context. Think of them as key-value pairs that help you filter and group your data:
- service_name="auth-api"
- instance_id="i-08f557b8d2"
- environment="prod"
- region="us-west-2"
- customer_id="12345"
- request_path="/api/v1/users"
These tags let you run queries like “show me error rates for the auth-api in prod” or “what’s the p95 latency in us-west-2?” The problem? Add too many tags and you’ll blow up your storage and make queries crawl.
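As a minimal sketch, assuming a Prometheus-style counter called http_requests_total that carries the labels above (plus a status label for the response code) and a matching latency histogram, those two questions translate into PromQL roughly like this:

```promql
# Rate of 5xx responses for the auth-api in prod (metric and label names are illustrative)
sum(rate(http_requests_total{service_name="auth-api", environment="prod", status=~"5.."}[5m]))

# p95 latency in us-west-2, assuming a histogram metric with the same labels exists
histogram_quantile(
  0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{region="us-west-2"}[5m]))
)
```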
A Quick Comparison: Dimensionality vs. Cardinality
| Aspect | Cardinality | Dimensionality |
|---|---|---|
| Definition | How many unique values exist for a dimension | How many different labels you attach to metrics |
| Example | customer_id might have 1M+ distinct values | Adding dimensions like region, instance_id, path |
| Technical impact | High-cardinality fields create massive index tables | Each dimension multiplies your possible time series |
| Query performance | High cardinality = larger scans = slower queries | More dimensions = more complex query planning |
What Happens When Cardinality Meets Dimensionality
Here’s where things get complicated. High cardinality combined with multiple dimensions creates a combinatorial explosion:
Consider tracking API latency with these dimensions:
- customer_id (100K values)
- endpoint (50 values)
- region (5 values)
- instance_id (100 values)
Simple math: 100,000 × 50 × 5 × 100 = 2.5 billion possible time series for just one metric.
Each dimension you add multiplies your storage needs and can turn millisecond queries into minutes-long operations that time out. Your monitoring system can quickly become more expensive than the application you’re monitoring.
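If you want to gauge how close your own setup is to this kind of blow-up, a quick check in Prometheus is to count active series per metric name (the second query uses an illustrative metric name):

```promql
# Ten metric names with the most active time series right now
topk(10, count by (__name__)({__name__=~".+"}))

# Series count for a single metric
count(http_requests_total)
```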
The Dimensionality Challenge for Time Series Databases
Understanding the relationship between cardinality and dimensionality is crucial because these concepts directly impact how time series databases (TSDBs) function. TSDBs are designed to efficiently store and query metrics over time, but they face unique challenges when dealing with high-cardinality data.
When dimensions multiply and cardinality explodes as we’ve described, TSDBs must work harder to index, compress, and query this data. The exponential growth in time series doesn’t just consume storage—it fundamentally changes how these specialized databases perform. While traditional databases might struggle with high cardinality in general, the problem is amplified in TSDBs because they’re optimized for time-based queries across numerous time series.
So, how can we manage this explosion of data while still maintaining the performance and insights we need? Let’s explore some practical strategies for handling high cardinality in TSDBs without sacrificing observability.
Dropping High-Cardinality Labels at the Source
One option is to drop high-cardinality labels right at the source to prevent the explosion in the first place. While this helps avoid the data bloat, it defeats the purpose of detailed monitoring — you’re essentially throwing away valuable context.
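If you do take this route with Prometheus, the usual mechanism is metric_relabel_configs on the scrape job. The sketch below drops a hypothetical customer_id label before samples are stored:

```yaml
scrape_configs:
  - job_name: auth-api
    static_configs:
      - targets: ["auth-api:9090"]
    metric_relabel_configs:
      # Drop the high-cardinality customer_id label from every scraped series
      - action: labeldrop
        regex: customer_id
```

Note that this only works cleanly if the remaining labels still uniquely identify each series; otherwise the label has to be aggregated away rather than simply removed.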
Pre-computing Common Queries
Another approach is to pre-compute common queries instead of calculating them on the fly. This is especially useful for dashboards and monitoring systems that refresh frequently. By storing the results as time series, you can save CPU cycles and significantly boost performance. Pre-computing metrics also reduces the need for CPU-intensive scans, preventing overload on time series backends and improving overall system efficiency.
There are two main ways to pre-compute your time series data:
- Recording Rules: These allow you to perform advanced operations on metrics after they’ve been ingested and stored, using PromQL-based rules.
- Streaming Aggregation: This runs aggregation rules on raw data before it’s stored, reducing the amount of data saved from the start.
Let’s break down these methods further.
Recording Rules: Pre-calculating Metrics in Prometheus
Recording rules are a Prometheus feature that enables you to define new time series based on existing ones. These rules allow you to aggregate, filter, and transform your metrics into something more useful, and then query them via PromQL.
Key Consideration: Recording rules run after data is ingested, which means they introduce a performance cost. Storing high-cardinality metrics is usually not the problem; reading them and generating new metrics from them is. These operations consume CPU and compute resources, and if rules are still being evaluated when queries come in, they can cause delays and degrade performance.
💡 Pro Tip: Recording rules are commonly used in the Prometheus ecosystem. Tools like Sloth, a Prometheus SLO generator, rely heavily on these rules. If you’re using third-party tools, make sure to audit the recording rules they add—they could impact your time series backend’s performance!
Example:
Let’s say you want to aggregate metrics from multiple targets into a single metric. For example, you want to track the total request rate for a specific endpoint across all targets. Here’s how to set it up in Prometheus:
Define the aggregation: You want to sum the request rates for the /api/v1/status endpoint across all targets. The query would look something like this:
sum(rate(http_requests_total{endpoint="/api/v1/status"}[5m])) by (job)
This calculates the total request rate for the /api/v1/status endpoint using a 5-minute window, grouped by job.
Create the recording rule: Now, set up the rule in your Prometheus config:
groups:
  - name: http_requests
    rules:
      - record: http_requests_total:sum
        expr: sum(rate(http_requests_total{endpoint="/api/v1/status"}[5m])) by (job)
This rule calculates the total request rate and stores it in a new metric called http_requests_total:sum. Now, you can query http_requests_total:sum directly for a simpler, faster result.
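Once the rule is in place, dashboards and alerts can reference the cheaper pre-computed series instead of re-running the aggregation. A sketch, with an arbitrary threshold:

```yaml
groups:
  - name: http_requests_alerts
    rules:
      - alert: HighStatusEndpointTraffic
        # Uses the series produced by the recording rule above; the threshold is illustrative
        expr: 'http_requests_total:sum > 1000'
        for: 10m
        labels:
          severity: warning
```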
Streaming Aggregation: Real-time Data Optimization
Streaming aggregation handles high-cardinality data before it’s stored. As data comes in, it is processed in real-time, and necessary aggregations are performed immediately. This prevents excessive time series from accumulating in storage, which could otherwise slow down your queries and overload the system.
The benefits are clear:
- Better Control: It helps manage the volume of high-cardinality data.
- Speed: Aggregating on the fly ensures faster queries and quicker responses.
- Cost Efficiency: By reducing the amount of data that needs to be processed, it can help reduce costs, particularly in cloud environments.
Instead of dealing with millions of unique time series, streaming aggregation allows you to simplify queries (e.g., sum(grpc_client_handling_seconds_bucket{})) without all those extra labels bogging down performance.
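Configuration formats differ between backends (Last9, VictoriaMetrics’ vmagent, and others each have their own syntax), but conceptually a streaming aggregation rule looks something like the sketch below: match incoming series, aggregate them over an interval, and drop the labels you don’t need. The field names here are illustrative, not a specific product’s API:

```yaml
# Illustrative streaming-aggregation rule; check your backend's docs for the exact syntax
- match: 'grpc_client_handling_seconds_bucket'  # which incoming series to aggregate
  interval: 1m                                  # aggregation window applied at ingestion
  without: [instance, pod, user_id]             # high-cardinality labels to drop
  outputs: [sum]                                # store only the aggregated sum
```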
Recording Rules vs. Streaming Aggregation: Which One Should You Choose?
Here’s a comparison of the two approaches to help you decide which one best fits your needs:
| What Matters | Recording Rules | Streaming Aggregation |
|---|---|---|
| When does it happen? | After the data is ingested | During data ingestion |
| Performance hit | Yes, rules are evaluated against stored data | Minimal on the backend, since the work happens at ingestion |
| Reliability | Fewer metrics to query, but more data processed | Less data stored = less risk |
| Scalability | Reduces query load but doesn’t eliminate scaling concerns | Handles huge volumes efficiently |
| Performance | Faster queries by pre-computing results | Faster queries by storing fewer series |
| Time to insights | Actionable, meaningful metrics | Real-time insights |
| Trade-offs | More expensive but lossless | Cheaper, but might lose some data |
The Trade-off: Cost vs. Data Fidelity
When choosing between Streaming Aggregation and Recording Rules, the main trade-off is cost versus data fidelity.
Streaming Aggregation is cost-efficient for managing high-cardinality metrics. By processing data as it arrives, it reduces the number of unique time series, cutting both storage and compute costs. However, this can lead to a loss of data fidelity if the original metric is discarded. You can still choose to store the original metric and use the aggregated one for most queries, balancing cost savings with the ability to keep finer detail when needed.
Recording Rules, on the other hand, preserve data fidelity by working with already-ingested data, so no points are lost. However, they come with a performance cost: they require additional compute resources to process and store results, which can introduce latency, especially with large datasets.
In summary, if you’re focused on cost efficiency and faster processing, streaming aggregation is the way to go. But if detailed data is critical and you can manage the added processing cost, recording rules offer more precise insights, though with a slight hit to performance.
You can also combine both approaches for the best of both worlds. For instance, streaming aggregation can handle high-cardinality metrics, like user_id or session_id, by aggregating them in real time. This reduces the number of unique time series, saving on storage and compute.
Afterward, you can apply recording rules to generate summary metrics, such as total_user_sessions, based on the aggregated data. This lets you keep the high-level overview while still having access to the original data for deeper dives when necessary.
Using both methods allows you to balance performance, cost, and data detail, making your observability setup both efficient and flexible.
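As a rough sketch of that combination: suppose streaming aggregation already writes a per-service aggregate called user_sessions:sum (the name is hypothetical). A recording rule can then roll it up into the summary metric mentioned above:

```yaml
groups:
  - name: session_summaries
    rules:
      # total_user_sessions rolls up the (hypothetical) pre-aggregated series
      - record: total_user_sessions
        expr: 'sum(user_sessions:sum)'
```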
Best Practices for Sustainable and Scalable Observability
Solving cardinality issues once is just the beginning. To prevent future explosions in data, you need guardrails and ongoing practices to keep things under control.
1. Design With Cardinality in Mind From Day One
- Create clear labeling guidelines: Define rules on which labels should be used and which ones should be avoided.
- Educate your team: Make sure everyone understands the impact of adding new labels and how they can affect the system.
- Audit regularly: Periodically review metrics for unused or redundant labels and clean them up.
2. Monitor Your Monitoring
- Track time series growth: Monitor the number of unique time series over time to catch spikes early.
- Set up alerts for cardinality changes: Get notified when the number of unique time series or dimensions grows unexpectedly (see the example rule after this list).
- Measure your monitoring system’s performance: Keep an eye on how your observability tools are performing and adjust accordingly.
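If Prometheus is part of your stack, it exposes its own active series count as prometheus_tsdb_head_series, which makes a simple growth alert easy to sketch (the 20% threshold and timings are arbitrary):

```yaml
groups:
  - name: cardinality_guardrails
    rules:
      - alert: TimeSeriesGrowthSpike
        # Fire when active series have grown more than 20% compared to the same time yesterday
        expr: prometheus_tsdb_head_series > 1.2 * (prometheus_tsdb_head_series offset 1d)
        for: 30m
        labels:
          severity: warning
```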
3. Set Boundaries
- Implement hard limits: Define maximum thresholds for the number of time series or unique labels (see the scrape config sketch after this list).
- Use soft warnings: Alert your team when they’re approaching these predefined limits.
- Have fallback plans: When the system is under pressure, consider switching to lower-cardinality versions of key metrics to ease the load.
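In Prometheus, the most direct hard limit is sample_limit on a scrape job: if a scrape returns more samples than the limit after relabeling, the whole scrape is treated as failed and its samples are dropped, which acts as a circuit breaker against sudden cardinality explosions. The numbers below are illustrative:

```yaml
scrape_configs:
  - job_name: auth-api
    sample_limit: 50000   # cap on samples accepted per scrape; tune to your environment
    label_limit: 30       # cap on the number of labels per scraped series
    static_configs:
      - targets: ["auth-api:9090"]
```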
4. Balance Detail and Practicality
- Prioritize what matters: Focus on dimensions that provide meaningful insights into your system.
- Think long-term: Consider whether a particular metric will still be useful in six months. If not, it might not be worth tracking.
- Use tiered retention: Keep high-cardinality data for shorter periods and aggregate data for longer-term storage.
5. Clean House Regularly
- Audit your metrics: Regularly check which metrics are still relevant and actively used, and remove the ones that aren’t.
- Remove outdated metrics: Don’t hesitate to delete metrics that no longer serve a purpose in your observability efforts.
- Standardize naming conventions: Consistent naming prevents duplication and confusion, making it easier to manage your data.
Manage High Cardinality With Last9—Without Breaking a Sweat
High cardinality and dimensionality are often treated as a headache, but with the right approach and the right tools, such as Last9, they can work in your favor.
At Last9, we help teams manage high-cardinality data at scale, without relying on sampling. Our built-in aggregation controls and Cardinality Explorer make it easy to spot trends, catch issues early, and maintain a smooth-running observability stack.
- Need to clean up noisy metrics? Quickly identify metrics approaching or exceeding cardinality limits with detailed reports using Cardinality Explorer.
- Want faster queries? Streaming Aggregations allow you to drop unnecessary labels or create lighter, scoped metrics just when you need them.
With a default quota of 20M time series per metric, you can manage high cardinality effortlessly. This means your observability system stays efficient, even as your infrastructure scales.
Now that we know how dimensionality and cardinality work together, our next piece will focus on streaming aggregation. We’ll break down how this method can help you manage high cardinality efficiently, without slowing down your system or getting buried in data.