Choosing the wrong metric type in Prometheus can lead to inaccurate dashboards, false positives in alerting, and missed indicators of system failure. Gauge metrics are intended for tracking values that can go up and down, such as memory usage, queue depth, or the number of active connections.
Unlike counters, which only increment (or reset on restart), gauges reflect the current state of a resource at scrape time. Misusing gauges, or using them without proper lifecycle handling, can result in stale data, incorrect assumptions about system health, and alerts that trigger without a real issue.
This blog covers when to use gauges instead of counters, how to instrument them correctly, and implementation patterns that surface system issues earlier and more reliably.
Common Mistakes That Break Gauge Metrics in Production
Before getting into implementation, it’s worth understanding the most frequent (and costly) misuse patterns developers run into when working with Prometheus gauges:
- Treating gauges like counters: Using `Inc()` to track events — such as HTTP requests — breaks reset detection and invalidates `rate()` or `increase()` calculations. Gauges don't track history, so you lose the ability to compute accurate trends (see the sketch after this list).
- Relying on stale values: If a service crashes or stops updating metrics, Prometheus continues scraping the last reported value. Dashboards and alerts may appear normal, even though the data is no longer valid.
- High-cardinality label usage: Adding labels like `user_id`, `request_path`, or `session_id` to gauges creates a separate time series for each unique combination. This can lead to unbounded memory growth and slow query performance.
- Missing initialization on startup: If a gauge isn't explicitly reset or set when the service starts, Prometheus may retain the last value from a previous run. This can make current-state metrics incorrect or misleading.
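To make the first mistake concrete, here is a minimal Go sketch using `prometheus/client_golang` that contrasts the broken pattern (counting HTTP requests with a gauge) against the counter it should be. The metric and function names are illustrative:
package main

import "github.com/prometheus/client_golang/prometheus"

// Anti-pattern: a gauge incremented per request. After a restart the value
// silently drops, and rate()/increase() cannot treat that as a counter reset.
var requestsGauge = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "http_requests_current",
    Help: "Requests counted with a gauge (do not do this)",
})

// Correct: a counter. Prometheus recognizes restarts as resets, so
// rate(http_requests_total[5m]) stays accurate.
var requestsCounter = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total HTTP requests handled",
})

func init() {
    prometheus.MustRegister(requestsGauge, requestsCounter)
}

func handleRequest() {
    requestsGauge.Inc()   // breaks trend queries and reset detection
    requestsCounter.Inc() // what you actually want for event counts
}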
Metric Types in Prometheus
Prometheus supports four core metric types: counter, gauge, histogram, and summary. Each serves a different purpose and affects how you write queries.
- Counter: A cumulative value that only increases (or resets on restart). Suitable for metrics like `total_requests_handled` or `errors_total`.
- Gauge: Represents a value that can increase or decrease, such as `memory_usage_bytes`, `cpu_temperature`, or `queue_length`.
- Histogram: Samples observations into configurable buckets to approximate distributions (e.g., request durations). Requires `rate()` for meaningful query patterns.
- Summary: Similar to histograms but pre-computes quantiles on the client side. Less commonly used due to limitations in aggregation and scaling.
Choosing the correct type is essential; it directly impacts how metrics behave over time, how they’re queried, and how accurately they reflect system state.
Here’s a quick comparison of the four Prometheus metric types:
| Metric Type | Value Behavior | Common Use Cases | Query Patterns | Notes |
|---|---|---|---|---|
| Counter | Increases only (resets on restart) | Requests served, errors, jobs processed | `rate()`, `increase()` | Good for tracking totals. Must use rate-based functions to be useful. |
| Gauge | Goes up and down | Memory usage, queue depth, temperature | Direct queries (`>`, `avg()`, etc.) | Use for current-state metrics. Must manage staleness and resets. |
| Histogram | Buckets values into ranges | Request durations, payload sizes | `rate(metric_bucket[5m])` + `histogram_quantile()` | Good for distribution analysis. Requires post-processing in PromQL. |
| Summary | Calculates quantiles locally | Latency quantiles, custom percentiles | Limited query support | Not aggregatable across instances. Use with caution at scale. |
Gauges vs. Counters: Behavior and Reset Semantics
Prometheus supports multiple metric types, but counters and gauges are the most commonly used. Understanding how they behave, especially during application restarts, is key to writing correct queries and avoiding misleading data.
Counter Behavior
Counters are monotonic: they only increase and reset to zero when the application restarts.
http_requests_total{method="GET"} 1547
Prometheus handles resets by detecting a drop in value between scrapes. In most cases, this works well, but there are edge cases:
- The previous value was low, making the drop ambiguous.
- Multiple restarts happen between scrapes.
- Clock skew affects timestamp order.
Because counters accumulate over time, you typically don’t query their raw values. Instead, use rate functions:
rate(http_requests_total[5m]) # requests per second
increase(http_requests_total[1h]) # total requests in the last hour
Gauge Behavior
Gauges represent a value that can go up or down; they capture the current state of something that changes over time.
memory_usage_bytes{instance="web-01"} 2147483648
active_connections{service="api"} 42
queue_depth{queue="orders"} 15
Gauges are set explicitly by your application, so they don’t reset automatically on restart. If a gauge isn’t reinitialized, Prometheus may continue scraping an outdated value or drop the time series altogether, depending on scrape timing.
This makes lifecycle handling critical:
- Always set gauges on startup.
- Don’t assume zero is the default.
- If your update loop crashes or stalls, the value will freeze.
Recovery Strategies for Reset Handling
To handle edge cases where counter reset detection or gauge freshness breaks down:
- Use `resets(metric_name)` to count how often a counter reset is detected.
- Monitor the `up` metric to detect if the scrape target is down or stale.
- Set gauge values explicitly on startup and at regular intervals.
- Use alerting logic that checks for stale data (e.g., using `deriv()` or `timestamp()`).
Querying Gauges vs. Counters
Gauges: Direct queries are usually sufficient.
avg(memory_usage_bytes) # average across instances
max(cpu_usage_percent) # peak CPU
queue_depth{queue="payments"} > 100 # alert threshold
Counters: Always use rate-based functions for meaningful insights.
rate(http_requests_total[5m])
increase(errors_total[30m])
The key difference: counters track events over time, while gauges report instantaneous values. Choose based on whether you care about how much something happened vs. where it stands right now.
How to Instrument Gauges in Your Application (By Language)
Gauges must be updated explicitly by your application. Prometheus clients exist for most languages, but each has slightly different patterns for setting and exposing gauge values.
Go: Set Values Explicitly
In Go, use the `prometheus/client_golang` library. Define a `GaugeVec` for labeled metrics and update it using `.Set()` when the value changes.
var queueDepth = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "message_queue_depth",
        Help: "Current number of messages in queue",
    },
    []string{"queue_name", "priority"},
)

func init() {
    // Register the gauge so it is exposed on the /metrics endpoint.
    prometheus.MustRegister(queueDepth)
}

// Call Set() wherever the queue length changes, e.g. after enqueue/dequeue:
queueDepth.WithLabelValues("orders", "high").Set(float64(len(highPriorityQueue)))
Use `.Set()` when your application can retrieve the current value directly, such as from in-memory queues or counters.
Python: Direct Value Updates
With `prometheus_client`, define the gauge and call `.set()` directly.
from prometheus_client import Gauge
queue_depth = Gauge('message_queue_depth', 'Queue size', ['queue_name', 'priority'])
queue_depth.labels(queue_name='orders', priority='high').set(len(high_priority_queue))
For system metrics, use a library like `psutil` to pull values:
import psutil
memory_gauge = Gauge('memory_usage_bytes', 'Used memory in bytes')
memory_gauge.set(psutil.virtual_memory().used)
Schedule updates using a background loop or a scheduled job if the value changes over time.
Java (Micrometer): Register with a Callback
Micrometer gauges typically use a function that returns the current value. This callback is evaluated at scrape time.
@Component
public class QueueMetrics {
    private final Queue<Message> messageQueue = new ConcurrentLinkedQueue<>();

    public QueueMetrics(MeterRegistry meterRegistry) {
        Gauge.builder("message_queue_depth", messageQueue, Queue::size)
            .description("Current number of messages in queue")
            .tags("queue_name", "orders")
            .register(meterRegistry);
    }
}
This approach works well for metrics that can be derived from the current in-memory state.
Node.js: Set in Loops or Event Handlers
The `prom-client` package uses `.set()` to update values. These updates can happen during events or at fixed intervals.
const client = require('prom-client');
const queueDepthGauge = new client.Gauge({
    name: 'message_queue_depth',
    help: 'Queue size',
    labelNames: ['queue_name', 'priority']
});
queueDepthGauge.labels('orders', 'high').set(highPriorityQueue.length);

const memoryGauge = new client.Gauge({
    name: 'memory_usage_bytes',
    help: 'Used memory'
});

setInterval(() => {
    memoryGauge.set(process.memoryUsage().heapUsed);
}, 10000);
Use `setInterval` for system-level gauges or any value that changes periodically.
When to Use .set() vs. a Callback
| Language | When to Use `.set()` | When to Use Callback / Binding |
|---|---|---|
| Go | For explicit value updates | Rarely used |
| Python | For direct updates or polling | Use a background job for periodic data |
| Java | Prefer callback-based gauges | Standard in Micrometer |
| Node.js | Use `.set()` with event loops | No native callback support |
This model ensures gauge values are kept current, without relying on Prometheus to infer or track changes.
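That said, Go's client does offer a callback-style option: `prometheus.NewGaugeFunc` evaluates a function at scrape time, much like Micrometer's builder. A minimal sketch, assuming an in-memory channel as the queue (names are illustrative):
package metrics

import "github.com/prometheus/client_golang/prometheus"

type Job struct{} // placeholder payload type

var jobQueue = make(chan Job, 1000) // illustrative in-memory queue

// The callback runs on every scrape, so the reported value is always the
// current queue length, with no update loop to maintain.
var queueDepthFunc = prometheus.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "message_queue_depth",
        Help: "Current number of messages in queue",
    },
    func() float64 { return float64(len(jobQueue)) },
)

func init() {
    prometheus.MustRegister(queueDepthFunc)
}
Callback gauges can't carry per-call labels, so they suit single-series metrics; for labeled series, stick with `GaugeVec` and `.Set()`.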
For a quick refresher on `rate()`, `increase()`, and `deriv()`, see this reference on Prometheus functions.
How to Monitor System and Application Resources with Gauges
Once you're familiar with how gauges behave, the next step is applying them to practical resource monitoring. Gauges are well-suited for tracking current usage levels and capacity across both system infrastructure and application internals.
System-Level Metrics
Most teams start with basic host metrics (CPU, memory, and disk), typically collected via exporters like `node_exporter`.
node_memory_available_bytes
node_filesystem_free_bytes
cpu_usage_percent
These metrics help detect infrastructure-level problems such as:
- Memory exhaustion
- Disk space running low
- CPU saturation across cores
They’re useful for alerting on system-wide resource constraints that could eventually affect application performance.
Application-Level Capacity Gauges
System metrics don't reveal how your application is handling load. For that, you need application-specific gauges that expose internal resource usage, such as thread pools, queues, and caches.
connection_pool_active_connections
worker_pool_busy_workers
cache_size_bytes
These metrics provide visibility into:
- Whether a connection pool is consistently at or near its limit
- If the worker pool is fully utilized, suggesting contention
- How fast an in-memory cache is growing, which could signal pressure on memory
Monitoring these values helps identify issues like thread starvation, queue backlogs, or degraded throughput, even when system metrics appear normal.
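As a concrete example of an application-level capacity gauge, here is a hedged Go sketch that samples Go's standard `database/sql` pool statistics into gauges. The metric names and the sampling interval are arbitrary choices:
package metrics

import (
    "database/sql"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    poolActive = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "connection_pool_active_connections",
        Help: "Connections currently in use",
    })
    poolIdle = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "connection_pool_idle_connections",
        Help: "Idle connections in the pool",
    })
)

// WatchPool copies db.Stats() into the gauges on a fixed interval.
func WatchPool(db *sql.DB, interval time.Duration) {
    prometheus.MustRegister(poolActive, poolIdle)
    go func() {
        for range time.Tick(interval) {
            stats := db.Stats()
            poolActive.Set(float64(stats.InUse))
            poolIdle.Set(float64(stats.Idle))
        }
    }()
}
client_golang also ships a ready-made DB stats collector (`collectors.NewDBStatsCollector`), which is worth checking before hand-rolling gauges like this.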
Performance Implications of Gauge vs Counter Metrics
The metric type you choose, gauge or counter, affects performance across memory usage, query speed, and storage footprint. Here’s how they differ:
Memory Consumption
Gauges with high-cardinality labels (e.g., `user_id`, `session_id`) can create a large number of time series. Each unique label combination results in a separate series, increasing memory usage on both the Prometheus server and remote storage systems.
# High-cardinality pattern (should be avoided)
active_sessions{user_id="123456"} 1
Counters also create time series per label combination, but they're typically used in lower-cardinality contexts like error codes or endpoints.
Query Performance
Gauges are often faster to query because they return the latest scraped value without needing time-window computation. For example:
avg(cpu_usage_percent) # Gauge query
In contrast, counter queries usually require functions like `rate()` or `increase()` over a time window, which involves more CPU and memory for query execution:
rate(http_requests_total[5m])
Storage Overhead
Counters benefit from delta encoding: Prometheus only stores changes between values, which makes them more efficient over time.
Gauges store the raw value at each scrape interval. For metrics that change frequently (e.g., memory usage, connection counts), this can lead to higher storage usage.
Scrape Behavior
Gauges represent the instantaneous state. If a value changes rapidly between scrapes, those changes can be lost; Prometheus will only see the latest value at scrape time.
Counters accumulate values between scrapes. Even if events occur and resolve between scrapes, the counter will reflect the total change.
If you're instrumenting metrics for high-frequency events or need precise change tracking, counters are more reliable. If you're tracking the current system or application state, gauges are more appropriate, but need careful lifecycle and cardinality management.

Monitor Queue Depth and Backpressure with Gauges
Queue depth is one of the most valuable application-level metrics to track using a gauge. It reflects how much work is waiting to be processed and serves as an early indicator of backpressure.
Backpressure occurs when the rate of incoming tasks exceeds the system’s processing capacity. As a result, queues or buffers start to grow, even before error rates, latency spikes, or dropped requests become visible.
How to Instrument
Use a gauge to track queue size, and update it in real time:
- In-memory queues: Call `.Set()` with the current length as the queue changes (see the sketch below this list).
- External systems (e.g., Redis, RabbitMQ): Query the queue depth via client APIs or system commands, then update the gauge.
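For the in-memory case, a reliable way to keep the gauge exact is to update it inside the enqueue and dequeue paths themselves rather than on a timer. A minimal Go sketch; the `Queue` type and metric name are illustrative:
package queue

import (
    "sync"

    "github.com/prometheus/client_golang/prometheus"
)

var queueDepth = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "message_queue_depth",
        Help: "Messages waiting in queue",
    },
    []string{"queue_name"},
)

func init() { prometheus.MustRegister(queueDepth) }

// Queue updates its gauge on every state change, so the metric never lags
// behind the real depth between scrapes.
type Queue struct {
    mu    sync.Mutex
    name  string
    items []string
}

func (q *Queue) Push(item string) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.items = append(q.items, item)
    queueDepth.WithLabelValues(q.name).Set(float64(len(q.items)))
}

func (q *Queue) Pop() (string, bool) {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.items) == 0 {
        return "", false
    }
    item := q.items[0]
    q.items = q.items[1:]
    queueDepth.WithLabelValues(q.name).Set(float64(len(q.items)))
    return item, true
}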
Avoid relying on background timers to update queue depth; the value can change rapidly, and you may miss key transitions if updates aren't applied immediately.
Patterns to Watch
- Steady growth: Suggests consumers are consistently slower than producers — a capacity or scaling issue.
- Sudden spikes: May be harmless for bursty traffic, but can also signal retries, batching issues, or upstream throttling.
- Slow or uneven draining: Indicates downstream latency, inefficient consumers, or long-tail processing delays.
Tracking queue depth as a gauge gives you earlier insight into load-related issues than waiting for errors or saturation signals. It’s often the first metric to change when throughput begins to degrade.
Workflow State Tracking with Labeled Gauges
Gauges are a good fit for tracking entities across known states, such as jobs, orders, or connections, in systems where items progress through a lifecycle. This pattern makes it easy to identify bottlenecks or failure points by observing state distributions over time.
Examples
Track counts of items in each state by exposing labeled gauges:
# Order workflow
orders_by_status{status="pending"} 23
orders_by_status{status="processing"} 8
orders_by_status{status="completed"} 1542
# Database connections
db_connections{state="active"} 45
db_connections{state="idle"} 155
db_connections{state="waiting"} 3
Each label value (`status`, `state`, etc.) corresponds to a specific stage in the workflow.
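One straightforward way to keep such a gauge in sync is to recompute the per-state counts on each refresh and overwrite every known label value in one pass. A Go sketch, assuming the counts come from your datastore:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var ordersByStatus = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "orders_by_status",
        Help: "Orders currently in each workflow state",
    },
    []string{"status"},
)

func init() { prometheus.MustRegister(ordersByStatus) }

// updateOrderStates overwrites every known state on each refresh, including
// zeros, so stages that drain completely don't keep reporting a stale count.
func updateOrderStates(counts map[string]int) {
    for _, status := range []string{"pending", "processing", "completed"} {
        ordersByStatus.WithLabelValues(status).Set(float64(counts[status]))
    }
}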
What to Look for
- A rising count in an early stage (e.g., `pending`) without corresponding growth in the next stage (`processing`) usually signals a stall.
- Imbalances across states may indicate:
  - Workers not picking up tasks
  - Throttling or rate limits
  - Downstream services failing or timing out
Where This Pattern Applies
- Job queues and task runners
- Order or transaction pipelines
- Connection pools and session managers
- Batch workflows with retries, delays, or staged transitions
This technique helps surface operational problems that don’t always show up in latency or error metrics. Each gauge reflects the current number of items in a specific state, updated as entities transition.
Set vs Add: Choose the Correct Gauge Update Method
Prometheus gauges can be updated in two primary ways: by setting an explicit value using `Set()`, or by applying a delta using `Add()`, `Inc()`, or `Dec()`. The right method depends on what your application can reliably observe.
If your application has direct access to the current value — for example, memory usage, queue depth, or cache size — use `Set()`. These metrics represent a concrete state, and it's common to sample and report them periodically or whenever a change occurs. A typical case would be:
memoryGauge.Set(getCurrentMemoryUsage())
In contrast, if the application only observes discrete events — such as a new connection being opened or a request being completed — use `Inc()` or `Dec()`. These methods are useful when you can't compute the full value directly, but you can track the changes over time:
connectionGauge.Inc() // when a new connection is accepted
connectionGauge.Dec() // when a connection closes
This pattern works well for in-flight request counts, open sessions, or active workers, where the value depends on events rather than a snapshot of state.
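For in-flight request tracking in Go, the usual shape is an HTTP middleware that increments on entry and decrements on exit; a minimal sketch:
package middleware

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
)

var inFlight = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "http_requests_in_flight",
    Help: "Requests currently being served",
})

func init() { prometheus.MustRegister(inFlight) }

// trackInFlight raises the gauge when a request starts and lowers it when the
// handler returns; the deferred Dec() also runs if the handler panics, which
// limits drift.
func trackInFlight(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        inFlight.Inc()
        defer inFlight.Dec()
        next.ServeHTTP(w, r)
    })
}
client_golang's `promhttp` package also provides a ready-made wrapper along these lines (`promhttp.InstrumentHandlerInFlight`) if you'd rather not maintain your own.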
However, be cautious with `Inc()`/`Dec()`: if your process crashes or misses events, there's no automatic reconciliation. Over time, the gauge can drift unless you reset or reinitialize it. When accuracy is critical and you can query the actual value, `Set()` is the safer option.
Gauge Staleness and Alert Failures
Prometheus does not automatically expire metric values. If a gauge stops updating (because of a crashed exporter, a paused job, or missing startup logic), the last known value continues to be scraped and stored. From Prometheus's point of view, the metric is still valid, even if it's outdated.
Why Alerts Miss Stale Gauge Values
Alert logic often assumes that gauges reflect the current system state. When a gauge silently goes stale, the metric might look normal while the underlying system is failing:
- A `queue_depth` gauge shows `0`, but the service emitting it is no longer running.
- A `memory_usage_bytes` gauge remains constant even while memory usage spikes or the container crashes.
- An `active_sessions` gauge hasn't changed for hours, but nothing in Prometheus indicates it's stale.
Because Prometheus stores the last observed value unless the series disappears entirely, these stale readings often lead to missed alerts.
Ways to Detect Gauge Staleness
Two approaches help prevent alert failures caused by stale values:
Use Time-Based Validity Checks
When a metric is expected to change regularly, you can check whether its value has actually moved within a recent window:
changes(memory_usage_bytes[5m]) == 0
This fires if the value hasn't changed in the past five minutes — a strong indicator that the metric is stale.
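Another option on the instrumentation side is to publish an explicit heartbeat next to the real gauge: a second gauge set to the current time on every update, which an alert can compare against `time()`. A minimal Go sketch; the metric names are illustrative:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    memoryUsage = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "memory_usage_bytes",
        Help: "Used memory in bytes",
    })
    memoryUpdatedAt = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "memory_usage_last_update_timestamp_seconds",
        Help: "Unix time of the last memory_usage_bytes update",
    })
)

func init() {
    prometheus.MustRegister(memoryUsage, memoryUpdatedAt)
}

func recordMemory(usedBytes uint64) {
    memoryUsage.Set(float64(usedBytes))
    // The heartbeat only advances when the real gauge is written, so
    // time() minus this metric keeps growing if the update loop stalls.
    memoryUpdatedAt.SetToCurrentTime()
}
An alert such as `time() - memory_usage_last_update_timestamp_seconds > 300` then catches a stalled updater even while the main gauge is still being scraped.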
Use the up Metric to Check Target Health
Prometheus automatically exposes an `up` metric per target. If a scrape fails, `up == 0`:
up == 0
This should be part of any alerting strategy — it directly tells you that the exporter or service is unreachable, and all metrics from that target may be unreliable.
Ways to Keep Gauge Values Fresh
You can prevent most gauge staleness issues with a few implementation practices:
Refresh metrics on a schedule if they reflect sampled or polled state:
go func() {
    ticker := time.NewTicker(10 * time.Second)
    for range ticker.C {
        memoryGauge.Set(getMemoryUsage())
        diskGauge.Set(getDiskUsage())
    }
}()
Set an initial value on startup so that old values aren’t reused after a restart:
activeSessionsGauge.Set(0)
Avoid workarounds like embedding timestamps in labels — that introduces high cardinality and degrades performance. Stick with static label sets and use PromQL to assess freshness when needed.
Gauge Cardinality Limits and Label Design
In Prometheus, each unique combination of label values creates a separate time series. When gauges use high-cardinality labels like `user_id`, `session_id`, or `request_path`, the number of series can grow rapidly, putting pressure on memory, storage, and query performance.
Examples
# High-cardinality pattern — creates one time series per user
active_sessions{user_id="12345"} 1
# Safer pattern — groups by user type
active_sessions{status="authenticated"} 1547
active_sessions{status="guest"} 234
Even a single gauge with a dynamic label can result in millions of time series if not handled carefully.
Label Strategy for Controlling Cardinality
- Avoid unbounded labels: Never use labels that take on user-generated or highly variable values (e.g., email, UUIDs, full URLs, timestamps).
- Group by stable attributes: Use fields with limited value sets like `status`, `region`, or `role`.
- Push detailed data to logs or traces: If you need per-user or per-request visibility, capture it outside of metrics.
Cardinality issues often show up only after dashboards begin to time out or storage usage spikes. The safest approach is to treat label selection as an architectural decision: define allowed label keys and review new metrics for explosion risk before deployment.
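One defensive pattern is to normalize label values through an allow-list before they ever reach the metric, so a misbehaving caller can't mint new series. A Go sketch; the allowed set is an example:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var activeSessions = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "active_sessions",
        Help: "Sessions grouped by account type",
    },
    []string{"status"},
)

var allowedStatuses = map[string]bool{"authenticated": true, "guest": true}

func init() { prometheus.MustRegister(activeSessions) }

// setSessions collapses any unexpected status into "other", so the number of
// time series stays bounded no matter what callers pass in.
func setSessions(status string, count int) {
    if !allowedStatuses[status] {
        status = "other"
    }
    activeSessions.WithLabelValues(status).Set(float64(count))
}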
How to Migrate from Gauges to Counters
If you've been using a gauge to track a cumulative value, like request counts, error totals, or bytes transferred, you're likely missing out on proper rate calculations and reset handling.
Prometheus counters are designed for this purpose, and migrating to them improves both accuracy and reliability. But the switch requires planning to avoid breaking dashboards and alerts.
A safe migration usually follows a phased, backward-compatible strategy:
- Deploy the new counter metric alongside the existing gauge. This avoids breaking existing consumers of the gauge while giving you time to transition (see the sketch after this list).
- Update dashboards and alerts to reference the new counter. Use `rate()` or `increase()` functions as appropriate, and verify the results match your expectations.
- Keep both metrics active for a defined period. This gives you time to compare behaviors, confirm alert triggers, and validate accuracy under load.
- Remove the old gauge after full validation. Once you're confident the counter works as expected and no consumers rely on the gauge, you can safely deprecate it.
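During the overlap period, the simplest approach is to emit both metrics from the same code path. A Go sketch of the first step; the metric names are illustrative:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // Legacy gauge, kept until dashboards and alerts have moved over.
    requestsGauge = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "requests_handled",
        Help: "DEPRECATED: request count tracked as a gauge",
    })
    // Replacement counter with proper reset semantics.
    requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "requests_handled_total",
        Help: "Total requests handled",
    })
)

func init() {
    prometheus.MustRegister(requestsGauge, requestsTotal)
}

func onRequestHandled() {
    requestsGauge.Inc() // existing consumers keep working
    requestsTotal.Inc() // new dashboards use rate(requests_handled_total[5m])
}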
In some cases, you may want to A/B test both approaches before committing:
- Run two service versions in parallel — one emitting a gauge, the other a counter.
- Compare how each metric behaves under different traffic patterns.
- Measure the impact on scrape performance, storage usage, and dashboard responsiveness.
Migration isn't just about correctness; it's also about observability hygiene. Using the right metric type makes queries easier to reason about, reduces operational surprises, and aligns with Prometheus best practices.
Quick Start: 5-Minute Gauge Setup
If you're new to gauges, start with these essential patterns:
1. Memory usage gauge (any language):
memoryGauge := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "memory_usage_bytes",
    Help: "Current memory usage",
})
2. Queue depth gauge:
queueGauge := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "queue_depth",
    Help: "Messages waiting in queue",
})
3. Active connections gauge:
connectionsGauge := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "active_connections",
    Help: "Currently active connections",
})
Update these gauges every 10-30 seconds, matching your Prometheus scrape interval.
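To wire any of these up end to end, you still need to register the gauge, refresh it, and expose a /metrics endpoint. A minimal sketch using client_golang's `promhttp` handler; the port, the 15-second interval, and the use of Go heap size via `runtime.ReadMemStats` (which is not the same as container memory) are illustrative choices:
package main

import (
    "log"
    "net/http"
    "runtime"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var memoryGauge = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "memory_usage_bytes",
    Help: "Current memory usage",
})

func main() {
    prometheus.MustRegister(memoryGauge)

    // Refresh the gauge roughly once per scrape interval.
    go func() {
        for range time.Tick(15 * time.Second) {
            var m runtime.MemStats
            runtime.ReadMemStats(&m)
            memoryGauge.Set(float64(m.Alloc))
        }
    }()

    // Prometheus scrapes this endpoint.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":2112", nil))
}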
Final Thoughts
Gauge behavior varies across observability platforms, especially with high-cardinality labels or metrics that update inconsistently. This impacts query reliability, alert accuracy, and system performance.
Last9 handles this by ingesting Prometheus and OpenTelemetry metrics natively, optimizing storage for high-cardinality gauges, and keeping queries fast as your data scales. It helps teams track application state reliably without dealing with stale values, bloated storage, or degraded dashboards.
Getting started with Last9 takes just a few minutes — no changes required to your existing setup.
FAQs
How often should I update gauge values?
Match your scrape interval, typically 10-30 seconds. Updating more frequently wastes resources; updating less frequently might miss short-lived spikes.
Can gauges go negative?
Yes, Prometheus gauges support negative values. This works well for metrics like temperature, account balances, or any measurement that can legitimately drop below zero.
How do I handle gauges when instances restart?
Reset gauges to their correct current state when the service starts. Don't assume they'll be 0—read from your data store or recalculate from the current system state. Unlike counters, gauge resets aren't automatically detected by monitoring systems.
Should I use gauges for percentage values?
Usually yes, if you're tracking current utilization. But consider whether you want the raw values (bytes used, bytes total) as separate metrics for more flexible queries.
Can I use a gauge like a counter?
Technically, yes, but it's not recommended. You lose automatic reset detection and the rate calculation features that make counters valuable for cumulative metrics. Stick to the right tool for the job.
What happens if a counter decreases?
Most monitoring systems assume the counter is reset to zero and adjust calculations accordingly. This helps maintain accurate metrics even after application restarts—a behavior that gauges don't provide.
How do I detect when my gauges become stale?
Combine your gauge alerts with freshness checks such as `(time() - timestamp(your_gauge)) < 300` to ensure the metric was updated within the last 5 minutes. Also, monitor the `up` metric for your scrape targets.