Your availability dashboard looks great under load. The moment your services run clean — zero errors, everything healthy — the metrics disappear. Gaps in the chart. "No Data" where you expect 100%.
This isn't a configuration problem. It's how PromQL works, and once you understand it, the fix is three lines.
The Setup
Say you're tracking availability for a set of microservices using trace-derived metrics from OpenTelemetry. A standard SLI approach computes availability as:
Availability = (1 - error_rate) × 100
Here, error rate is the ratio of 5xx responses to total requests. A typical first attempt:
(1 - (
sum by (service_name) (
trace_endpoint_count{
service_name=~"auth-service|billing-service|api-gateway|user-service",
env="prod",
span_kind="SPAN_KIND_SERVER",
http_status_code!="",
http_status_code=~"5.*"
}
)
/
(
sum by (service_name) (
trace_endpoint_count{
service_name=~"auth-service|billing-service|api-gateway|user-service",
env="prod",
span_kind="SPAN_KIND_SERVER"
}
) + 0.0000001
)
)) * 100
The + 0.0000001 in the denominator avoids division by zero. Looks reasonable.
The Problem
This query works while services are returning 5xx errors. The moment a service has zero 5xx responses (the good scenario), the numerator returns no data. Not 0, not NaN: an empty instant vector.
In PromQL, when no time series matches a selector, the result is an empty set. Dividing an empty vector by anything produces another empty vector. The entire expression for that service evaluates to nothing, and your dashboard shows a gap.
This is especially painful on SLI dashboards where a missing data point triggers alerts or unsettles stakeholders — precisely when the service is healthiest.
Why PromQL Works This Way
PromQL is a set-based language. Every selector returns a set of time series. Arithmetic operators work on matching series across sets. If a series doesn't exist in one operand, there's nothing to match, so no result is produced.
This differs from SQL, where COUNT(*) on an empty result set returns 0. In PromQL, no matching series means no output. See the PromQL cheat sheet for a full reference on how vector matching works.
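A quick way to see this behavior for yourself, sketched against the same metric as above. The shortened label filters are only for illustration, and each expression is meant to be run on its own:

```promql
# count() over a selector that matches nothing returns an empty result,
# not 0. Try it while no prod service has any 5xx series:
count(
  trace_endpoint_count{env="prod", http_status_code=~"5.*"}
)

# Arithmetic on an empty vector stays empty: there is no sample to add 1 to.
trace_endpoint_count{env="prod", http_status_code=~"5.*"} + 1
```

Both expressions come back empty in the healthy case, which is the same gap the dashboard shows.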
The Fix: The * 0 Fallback Pattern
Ensure the 5xx selector always returns a series — even when there are no 5xx errors — by using or with a zero-valued version of a series you know exists (total requests):
(1 - (
sum by (service_name) (
trace_endpoint_count{
service_name=~"auth-service|billing-service|api-gateway|user-service",
env="prod",
span_kind="SPAN_KIND_SERVER",
http_status_code!="",
http_status_code=~"5.*"
}
or
trace_endpoint_count{
service_name=~"auth-service|billing-service|api-gateway|user-service",
env="prod",
span_kind="SPAN_KIND_SERVER"
} * 0
)
/
(
sum by (service_name) (
trace_endpoint_count{
service_name=~"auth-service|billing-service|api-gateway|user-service",
env="prod",
span_kind="SPAN_KIND_SERVER"
}
) + 0.0000001
)
)) * 100
How it works
The key addition is:
or
trace_endpoint_count{...all services, env="prod", span_kind="SPAN_KIND_SERVER"} * 0
Step by step:
- 5xx errors exist: The first selector returns matching series. or keeps them and drops any fallback series with an identical label set. Fallback series for other status codes still pass through, but each has a value of 0, so they add nothing to the sum. Normal path.
- No 5xx errors: The first selector returns nothing. or falls through to the fallback: total request count multiplied by zero. This produces time series with the correct service_name label and a value of 0.
- sum by (service_name) collapses correctly, giving 0 for the error count.
- Final result: (1 - 0/total) * 100 = 100%, exactly right for a healthy service.
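To check the fallback in isolation, it can help to evaluate the numerator on its own before wiring it into the full ratio. This is the same expression as above with nothing new added (the redundant http_status_code!="" matcher is left out for brevity):

```promql
# Numerator only: expect one series per service, with value 0 for
# services that have no 5xx series at the evaluation time.
sum by (service_name) (
    trace_endpoint_count{
      service_name=~"auth-service|billing-service|api-gateway|user-service",
      env="prod",
      span_kind="SPAN_KIND_SERVER",
      http_status_code=~"5.*"
    }
  or
    trace_endpoint_count{
      service_name=~"auth-service|billing-service|api-gateway|user-service",
      env="prod",
      span_kind="SPAN_KIND_SERVER"
    } * 0
)
```

If a service with traffic is still missing from this result, the fallback selector is probably narrower than intended.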
Why not or vector(0)?
vector(0) returns a single sample with value 0 and no labels. Inside sum by (service_name), that sample lands in a group with an empty service_name, and PromQL can't match it against the labeled denominator series. There is no error message, but healthy services still come back empty.
The * 0 pattern preserves the original label set. The fallback series carries the same service_name, env, and other labels as the real data, so all grouping and matching works correctly.
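For contrast, here is a sketch of the vector(0) variant of the numerator, with the service filter omitted for brevity:

```promql
# Anti-pattern sketch: vector(0) as the fallback inside a grouped sum.
sum by (service_name) (
    trace_endpoint_count{
      env="prod",
      span_kind="SPAN_KIND_SERVER",
      http_status_code=~"5.*"
    }
  or
    vector(0)
)
# With no 5xx series, the result is one series with an empty label set
# and value 0. It has no service_name, so it cannot pair with any
# per-service series in the denominator.
```

vector(0) is still a reasonable fallback when both sides of a ratio are aggregated with a plain sum(...) and carry no labels at all.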
Other Approaches
clamp_min(..., 0): Sets a floor value on existing series. Doesn't help when the series doesn't exist at all.
Recording rules: Pre-compute error count with a recording rule that handles the zero case. Works but adds operational overhead and another artifact to maintain.
absent() function: Returns a single series with value 1 when the input is empty. You could construct (your_query or absent(your_query) * 0), but absent() only copies labels from equality matchers, so a regex-matched service_name is lost. It also fires only when the whole selector is empty, not per service.
The * 0 pattern is the simplest approach that correctly handles labels without extra infrastructure.
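The label problem with absent() can be seen in a small sketch. The two-service filter is just an illustrative subset:

```promql
# Evaluated while no matching 5xx series exist:
absent(
  trace_endpoint_count{
    service_name=~"auth-service|billing-service",
    env="prod",
    http_status_code=~"5.*"
  }
)
# Expected shape: a single series {env="prod"} with value 1.
# service_name is matched by regex, so it is not copied to the output.
```

Because the result carries no service_name, it cannot stand in for a per-service zero.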
The General Pattern
Any time you compute a ratio in PromQL where the numerator might legitimately be empty:
sum by (label) (
metric_with_specific_filter{...}
or
metric_with_broader_filter{...} * 0
)
/
sum by (label) (
metric_with_broader_filter{...}
)
The broader filter should match a superset of what the specific filter matches: same labels, without the restrictive condition. * 0 gives you the right labels with a zero value. or keeps every series from the primary selector and adds a fallback series only for label sets the primary side lacks.
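As one possible instance of the pattern, here is a sketch using rate() over a hypothetical counter. The metric name http_requests_total and its job and code labels are assumptions for illustration, not metrics from this article:

```promql
# Hypothetical counter http_requests_total with labels job and code.
# Error ratio per job; evaluates to 0 when a job has no 5xx series.
sum by (job) (
    rate(http_requests_total{code=~"5.."}[5m])
  or
    rate(http_requests_total[5m]) * 0
)
/
sum by (job) (
    rate(http_requests_total[5m])
)
```

If a job's series exist but received no requests in the window, the division is 0/0 and yields NaN, which is why the availability queries above pad the denominator with a small constant.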
Build SLIs That Hold Up in Production
Availability queries are the foundation of any practical SLO implementation. The empty-numerator gap is one of several query-correctness issues that only surface in production, when traffic patterns hit edge cases your staging environment never saw.
If you're using OpenTelemetry trace-derived metrics, Last9 stores trace_endpoint_count natively from your OTLP pipeline — the query above works out of the box without recording rules or custom aggregations.
Get started with Last9 or check the OpenTelemetry integration docs.
