LowCardinality is one of the first optimizations people reach for in ClickHouse, and one of the easiest to misapply. Wrap a service_name column and your storage drops and your GROUP BY gets faster. Wrap a trace_id column with the same instinct and you have just added overhead with nothing to show for it.
We run ClickHouse in production to store high-cardinality telemetry at Last9, so we spend a lot of time deciding which columns deserve LowCardinality and which ones it will quietly hurt. This post is the operator version of that decision: what the type actually does, where it wins, and the specific observability columns where it is the wrong call.
What LowCardinality is
LowCardinality(T) is a wrapper around an existing data type, most often String. Instead of storing the raw value on every row, ClickHouse stores each distinct value once in a dictionary and then stores a small integer index per row that points into that dictionary.
So a column with one million rows but only twenty distinct values stops being one million strings on disk. It becomes twenty strings plus one million small integers. The integers compress far better than repeated text, and a lot of query work can happen on the compact integer representation instead of the full strings.
The official ClickHouse documentation on LowCardinality describes it as dictionary encoding applied transparently at the column level. You declare it in the schema and ClickHouse handles the encoding and decoding for you:
```sql
CREATE TABLE logs
(
    timestamp DateTime,
    level LowCardinality(String),
    service LowCardinality(String),
    region LowCardinality(String),
    message String,
    trace_id String
)
ENGINE = MergeTree
ORDER BY (service, timestamp);
```

Here level, service, and region are good candidates. The message and trace_id columns are deliberately left as plain String, and the rest of this post is about why.
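Because the encoding is transparent, a query against a LowCardinality column reads the same as one against a plain String. A minimal sketch against the logs table above; the 'error' literal and the one-day window are illustrative assumptions, not values from our schema:

```sql
-- Filter and group on LowCardinality columns exactly as you would on String.
-- No casts or explicit dictionary lookups appear in the query text.
SELECT
    service,
    count() AS error_count
FROM logs
WHERE level = 'error'
  AND timestamp >= now() - INTERVAL 1 DAY
GROUP BY service
ORDER BY error_count DESC
LIMIT 10;
```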
How dictionary encoding works under the hood
The mechanics matter, because they explain both the wins and the failure modes.
ClickHouse builds the dictionary per part and processes data in blocks. For each block it keeps a dictionary of distinct values and replaces the actual values with positions in that dictionary. The position integers are sized to fit the dictionary, so a column with a handful of distinct values uses a very narrow integer per row.
Two consequences follow directly from this design:
- The benefit comes from repetition. The fewer distinct values relative to total rows, the more rows share each dictionary entry, and the more the integer indices compress.
- The dictionary is not free. ClickHouse has to store the dictionary, look up values through it, and keep it in memory during query execution. When the dictionary is small, that cost is trivial. When it is large, the bookkeeping starts to dominate.
ClickHouse also caps how large these dictionaries can grow through settings like low_cardinality_max_dictionary_size, which exists specifically to stop an out-of-control dictionary from eating memory. That cap is a hint about the intended use: this type was built for genuinely low-cardinality columns, not as a universal compression switch.
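To see which LowCardinality-related settings your server exposes and what they are currently set to, you can query system.settings. This is a sketch; the set of rows returned depends on your ClickHouse version:

```sql
-- List LowCardinality-related settings and whether they were changed from defaults.
SELECT
    name,
    value,
    changed,
    description
FROM system.settings
WHERE name LIKE 'low_cardinality%'
ORDER BY name;
```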
When LowCardinality wins
The rule of thumb from the ClickHouse documentation is that LowCardinality shows clear gains when a column has fewer than roughly 10,000 distinct values. Below that, you typically get both smaller storage and faster queries. The further below it you are, the bigger the win.
In observability data, plenty of columns fit comfortably:
- Log level: info, warn, error, debug. A handful of values across billions of rows.
- Service or job name: usually tens to low thousands, even in large fleets.
- Region or availability zone: a fixed, small set.
- HTTP method: GET, POST, PUT, and a few others.
- Status code: a bounded set of integers or short strings.
- Kubernetes namespace: typically bounded per cluster.
- Environment: prod, staging, dev.
These columns share a property: the number of distinct values stays roughly flat no matter how much data you ingest. You can pour ten times more logs in and level still has the same handful of values. That is exactly the shape dictionary encoding rewards.
For these columns the win is real on two axes at once. Storage drops because repeated short strings collapse to integers, and queries that filter or group on them run faster because ClickHouse can work on the integer indices and skip a lot of string comparison.
The whiteboard test: if adding more data does not add more distinct values to the column, it is a LowCardinality candidate.
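One way to run the whiteboard test against real data is to compare distinct counts per day with row counts per day. A sketch against the logs table from earlier, using service as the example column; the 30-day window is an arbitrary assumption:

```sql
-- Per-day row volume next to per-day distinct values for one candidate column.
-- uniq() is approximate, which is enough to see the shape of the trend.
SELECT
    toDate(timestamp) AS day,
    count() AS row_count,
    uniq(service) AS distinct_services
FROM logs
WHERE timestamp >= now() - INTERVAL 30 DAY
GROUP BY day
ORDER BY day;
```

If row_count swings widely while distinct_services stays roughly flat, the column has the shape dictionary encoding rewards.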
When LowCardinality hurts
This is the part most guides skip. LowCardinality is not free insurance. On the wrong column it costs you storage, memory, and query time while delivering no compression benefit.
The failure case is a column where the number of distinct values grows with the data. In observability, these are everywhere:
- Trace ID and span ID: effectively unique per request or operation.
- User ID, session ID, request ID: unbounded, growing with traffic.
- Full URLs or request paths with IDs in them: near-unique.
- Email addresses: high cardinality by nature.
- Raw log message bodies: almost always unique.
- IP addresses in many workloads: large and growing.
When you wrap one of these in LowCardinality, the dictionary tries to hold a huge and ever-growing set of distinct values. You pay to build and maintain a dictionary that is almost as large as the data it was supposed to compress, plus the index column on top. The compression ratio collapses toward nothing, and you have added dictionary lookups to every read.
The threshold is not a hard cliff, but the guidance is consistent across the ClickHouse docs and community write-ups: once a column climbs past roughly 10,000 distinct values, the benefit fades, and well past that point (the ClickHouse documentation cites more than 100,000 distinct values as the range where LowCardinality can perform worse than ordinary types) you are paying overhead for an optimization that no longer applies. For genuinely high-cardinality columns, a plain String with a good general-purpose codec usually beats LowCardinality.
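As a sketch of that alternative, a high-cardinality column can stay a plain String with an explicit compression codec. The table name logs_ids is hypothetical, and ZSTD(1) is only an example choice; measure against your own data before settling on a codec or level:

```sql
-- Hypothetical table: bounded dimension wrapped, near-unique columns left plain.
CREATE TABLE logs_ids
(
    timestamp DateTime,
    service   LowCardinality(String),   -- bounded: dictionary-encoded
    trace_id  String CODEC(ZSTD(1)),    -- near-unique: plain type, explicit codec
    message   String CODEC(ZSTD(1))     -- raw text: plain type, explicit codec
)
ENGINE = MergeTree
ORDER BY (service, timestamp);
```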
The most expensive LowCardinality column is the one wrapped around data that was never low cardinality to begin with.
Observability-specific guidance
High-cardinality data is the defining problem of observability, which is exactly why LowCardinality needs a careful hand here. The same dataset usually contains both ideal candidates and terrible ones, sitting in adjacent columns.
A practical way to split your telemetry schema:
Wrap these (bounded, slow-growing):
- level, severity
- service, job, component
- region, availability_zone, cluster
- environment
- http_method, status_code
- namespace, deployment (bounded per cluster)

Leave these as plain types (unbounded, fast-growing):
- trace_id, span_id
- user_id, session_id, request_id
- full url or path with embedded identifiers
- raw message bodies
- pod_id and other per-instance identifiers that churn (these look bounded at a snapshot but turn over constantly, so cumulative cardinality climbs)
That pod_id case is the one that trips people up. At any instant a cluster has a manageable number of pods, so the column looks like a LowCardinality candidate. But pods are recreated constantly, so over a retention window the set of distinct pod identifiers keeps growing. Cumulative cardinality, not point-in-time cardinality, is what the dictionary has to deal with.
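Cumulative cardinality can be measured directly by counting how many identifiers appear for the first time on each day and keeping a running total. This sketch assumes a pod_id column on the logs table, which the example schema in this post does not include, and a ClickHouse version with window function support. On a very large table, run it against a sample, because grouping by a high-cardinality column is memory-hungry:

```sql
-- For each day: identifiers seen for the first time, plus the running total.
SELECT
    first_seen,
    new_pods,
    sum(new_pods) OVER (
        ORDER BY first_seen ASC
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_pods
FROM
(
    SELECT
        first_seen,
        count() AS new_pods
    FROM
    (
        -- First day each pod_id (hypothetical column) shows up in the data.
        SELECT
            pod_id,
            min(toDate(timestamp)) AS first_seen
        FROM logs
        GROUP BY pod_id
    )
    GROUP BY first_seen
)
ORDER BY first_seen;
```

A cumulative_pods line that keeps climbing while the daily active count stays flat is the churn pattern described above.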
What this looks like in our own logs table
We do not have to argue this in the abstract. Here is how columns compress in our production OTEL logs table, where the bounded dimensions are wrapped in LowCardinality(String) and the high-cardinality columns are left as plain String:
| Column | Type | Compression ratio |
|---|---|---|
| ServiceName | LowCardinality(String) | ~372x |
| SeverityText | LowCardinality(String) | ~376x |
| Body | String (raw log text) | ~24x |
| TraceId | String | ~25x |
The bounded columns we wrap compress more than an order of magnitude better than the high-cardinality columns we leave alone. TraceId is the instructive one: it is stored as a plain String and still compresses about 25x from the codec alone, but wrapping it in LowCardinality would not improve that, because nearly every value is unique. There is no repetition for a dictionary to exploit. So it stays a plain String, right next to ServiceName, which earns its LowCardinality wrapper many times over.
That is the entire decision in one table: wrap the columns that repeat, leave the ones that do not.
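A common way to get per-column numbers like these for your own table is to read system.columns. This is a sketch rather than the exact query behind the table above; the table name logs is an assumption. The ratio uses ClickHouse's own uncompressed accounting for each column, so compare the absolute compressed sizes as well:

```sql
-- Per-column on-disk size and compression ratio for one table.
SELECT
    name,
    type,
    formatReadableSize(data_compressed_bytes) AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase()
  AND table = 'logs'
  AND data_compressed_bytes > 0
ORDER BY ratio DESC;
```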
This column-by-column discipline is the same principle we apply to metrics, where a single careless label can blow up cardinality across an entire series. We went deep on how the two engines handle this in High Cardinality Metrics: How Prometheus and ClickHouse Handle Scale, and on the Prometheus side specifically in How to Manage High Cardinality Metrics in Prometheus. If you want to try any of this locally, our guide to setting up ClickHouse with Docker Compose gets you a working instance in a few minutes.
How to check before you wrap
You do not have to guess. Measure the cardinality of a column before deciding:
```sql
SELECT
    uniqExact(service) AS distinct_services,
    uniqExact(trace_id) AS distinct_trace_ids,
    count() AS total_rows
FROM logs;
```

If distinct_services is in the hundreds against billions of rows, wrap it. If distinct_trace_ids tracks total_rows, leave it alone. For a quick read on a large table, uniq() is the approximate, cheaper version of uniqExact() and is usually accurate enough for this decision.
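The approximate variant might look like the sketch below, with a distinct-to-rows ratio added so the result reads at a glance. The alias names are illustrative:

```sql
-- Approximate distinct counts, plus the share of rows that carry a new value.
SELECT
    uniq(service) AS approx_services,
    uniq(trace_id) AS approx_trace_ids,
    count() AS total_rows,
    round(approx_services / total_rows, 6) AS service_ratio,
    round(approx_trace_ids / total_rows, 6) AS trace_id_ratio
FROM logs;
```

A ratio close to zero means heavy repetition; a ratio close to one means nearly every row carries a distinct value.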
When you are unsure, test both. Create two versions of the column, load a representative sample, and compare the on-disk size with system.columns. The numbers settle the argument faster than any rule of thumb.
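A side-by-side test can be sketched as two throwaway tables that differ only in the candidate column's type. The table names lc_test_plain and lc_test_wrapped are hypothetical, region stands in for whichever column you are evaluating, and one day of data is an arbitrary sample size:

```sql
-- Two scratch tables that differ only in the type of the candidate column.
CREATE TABLE lc_test_plain
(
    timestamp DateTime,
    candidate String
)
ENGINE = MergeTree
ORDER BY timestamp;

CREATE TABLE lc_test_wrapped
(
    timestamp DateTime,
    candidate LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY timestamp;

-- Load the same sample (yesterday's rows) into both tables.
INSERT INTO lc_test_plain
SELECT timestamp, region FROM logs
WHERE timestamp >= today() - 1 AND timestamp < today();

INSERT INTO lc_test_wrapped
SELECT timestamp, region FROM logs
WHERE timestamp >= today() - 1 AND timestamp < today();

-- Merge parts so the comparison is made on consolidated data.
OPTIMIZE TABLE lc_test_plain FINAL;
OPTIMIZE TABLE lc_test_wrapped FINAL;

-- Compare on-disk size of the candidate column in each table.
SELECT
    table,
    type,
    data_compressed_bytes,
    data_uncompressed_bytes
FROM system.columns
WHERE database = currentDatabase()
  AND table IN ('lc_test_plain', 'lc_test_wrapped')
  AND name = 'candidate'
ORDER BY table;
```

Drop both tables once you have the numbers.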
The takeaway
LowCardinality is a precise tool, not a default. It pays off handsomely on columns where the set of distinct values stays small and flat as data grows, which describes a lot of the dimensions in telemetry data. It costs you on columns where distinct values grow with volume, which describes most of the identifiers in that same data.
Wrap the bounded columns. Leave the unbounded ones as plain types. Measure when you are not sure. That is the whole discipline.
If you are running into high-cardinality pain in observability data more broadly, that is the problem Last9 is built to absorb. We store high-cardinality telemetry without sampling or pre-aggregation, so you can keep the dimensions you need instead of dropping them to stay afloat.
FAQ
What is LowCardinality in ClickHouse?
LowCardinality is a data type wrapper in ClickHouse that applies dictionary encoding to a column. Instead of storing the raw value on every row, ClickHouse stores each distinct value once in a dictionary and a small integer index per row. It reduces storage and speeds up queries on columns that have a small number of distinct values.
When should I use LowCardinality?
Use LowCardinality on columns with fewer than roughly 10,000 distinct values where that count stays flat as data grows. Good examples are log level, service name, region, HTTP method, status code, and environment. These give both smaller storage and faster filtering and grouping.
When should I not use LowCardinality?
Avoid LowCardinality on high-cardinality columns whose distinct values grow with the data, such as trace IDs, span IDs, user IDs, session IDs, full URLs, email addresses, and raw log messages. On these columns the dictionary becomes large, the compression benefit disappears, and you pay overhead for an optimization that no longer applies.
What is the cardinality threshold for LowCardinality?
The ClickHouse documentation suggests LowCardinality is most effective below roughly 10,000 distinct values per column. It is a guideline rather than a hard limit. The benefit fades gradually as cardinality rises, and well past that point a plain String with a general-purpose codec usually performs better.
Does LowCardinality always save storage?
No. It saves storage only when values repeat heavily. On a column where most values are unique, the dictionary is nearly as large as the data itself, and the index column is stored on top of it, so storage can grow rather than shrink. Repetition is what makes the encoding pay off.
How do I check a column's cardinality in ClickHouse?
Use uniqExact(column) for an exact distinct count or uniq(column) for a faster approximate count, compared against count() for total rows. If the distinct count is small relative to total rows, the column is a good LowCardinality candidate. If it tracks the row count, leave it as a plain type.
Is LowCardinality good for observability data?
It is good for the bounded dimensions in observability data, such as level, service, region, and status code, and bad for the unbounded identifiers, such as trace IDs and request IDs. Observability schemas usually contain both kinds of columns side by side, so apply it column by column rather than across the whole table.
What about pod_id and other Kubernetes identifiers?
Be careful. A pod_id column looks low cardinality at any single moment because a cluster has a bounded number of pods, but pods are recreated constantly, so the cumulative set of distinct pod identifiers grows over a retention window. Judge it by cumulative cardinality over time, not point-in-time cardinality, and in most cases leave it as a plain type.
