Choosing the wrong metric type in Prometheus can lead to inaccurate dashboards, false positives in alerting, and missed indicators of system failure. Gauge metrics are intended for tracking values that can go up and down, such as memory usage, queue depth, or the number of active connections.
Unlike counters, which only increment (or reset on restart), gauges reflect the current state of a resource at scrape time. Misusing gauges, or using them without proper lifecycle handling, can result in stale data, incorrect assumptions about system health, and alerts that trigger without a real issue.
This blog covers when to use gauges instead of counters, how to instrument them correctly, and implementation patterns that surface system issues earlier and more reliably.
Common Mistakes That Break Gauge Metrics in Production
Before getting into implementation, it’s worth understanding the most frequent (and costly) misuse patterns developers run into when working with Prometheus gauges:
- Treating gauges like counters: Using `Inc()` to track events — such as HTTP requests — breaks reset detection and invalidates `rate()` or `increase()` calculations. Gauges don't track history, so you lose the ability to compute accurate trends (see the sketch after this list).
- Relying on stale values: If a service crashes or stops updating metrics, Prometheus continues scraping the last reported value. Dashboards and alerts may appear normal, even though the data is no longer valid.
- High-cardinality label usage: Adding labels like `user_id`, `request_path`, or `session_id` to gauges creates a separate time series for each unique combination. This can lead to unbounded memory growth and slow query performance.
- Missing initialization on startup: If a gauge isn't explicitly reset or set when the service starts, Prometheus may retain the last value from a previous run. This can make current-state metrics incorrect or misleading.
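To make the first mistake concrete, here is a minimal Go sketch using `prometheus/client_golang` that contrasts the broken pattern (counting HTTP requests with a gauge) against the counter it should be. The metric and function names are illustrative:
package main

import "github.com/prometheus/client_golang/prometheus"

// Anti-pattern: a gauge incremented per request. After a restart the value
// silently drops, and rate()/increase() cannot treat that as a counter reset.
var requestsGauge = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "http_requests_current",
    Help: "Requests counted with a gauge (do not do this)",
})

// Correct: a counter. Prometheus recognizes restarts as resets, so
// rate(http_requests_total[5m]) stays accurate.
var requestsCounter = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "http_requests_total",
    Help: "Total HTTP requests handled",
})

func init() {
    prometheus.MustRegister(requestsGauge, requestsCounter)
}

func handleRequest() {
    requestsGauge.Inc()   // breaks trend queries and reset detection
    requestsCounter.Inc() // what you actually want for event counts
}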
Metric Types in Prometheus
Prometheus supports four core metric types: counter, gauge, histogram, and summary. Each serves a different purpose and affects how you write queries.
- Counter: A cumulative value that only increases (or resets on restart). Suitable for metrics like `total_requests_handled` or `errors_total`.
- Gauge: Represents a value that can increase or decrease, such as `memory_usage_bytes`, `cpu_temperature`, or `queue_length`.
- Histogram: Samples observations into configurable buckets to approximate distributions (e.g., request durations). Requires `rate()` for meaningful query patterns.
- Summary: Similar to histograms but pre-computes quantiles on the client side. Less commonly used due to limitations in aggregation and scaling.
Choosing the correct type is essential; it directly impacts how metrics behave over time, how they’re queried, and how accurately they reflect system state.
Here’s a quick comparison of the four Prometheus metric types:
| Metric Type | Value Behavior | Common Use Cases | Query Patterns | Notes |
|---|---|---|---|---|
| Counter | Increases only (resets on restart) | Requests served, errors, jobs processed | `rate()`, `increase()` | Good for tracking totals. Must use rate-based functions to be useful. |
| Gauge | Goes up and down | Memory usage, queue depth, temperature | Direct queries (`>`, `avg()`, etc.) | Use for current-state metrics. Must manage staleness and resets. |
| Histogram | Buckets values into ranges | Request durations, payload sizes | `rate(metric_bucket[5m])` + `histogram_quantile()` | Good for distribution analysis. Requires post-processing in PromQL. |
| Summary | Calculates quantiles locally | Latency quantiles, custom percentiles | Limited query support | Not aggregatable across instances. Use with caution at scale. |
Gauges vs. Counters: Behavior and Reset Semantics
Prometheus supports multiple metric types, but counters and gauges are the most commonly used. Understanding how they behave, especially during application restarts, is key to writing correct queries and avoiding misleading data.
Counter Behavior
Counters are monotonic: they only increase and reset to zero when the application restarts.
http_requests_total{method="GET"} 1547
Prometheus handles resets by detecting a drop in value between scrapes. In most cases, this works well, but there are edge cases:
- The previous value was low, making the drop ambiguous.
- Multiple restarts happen between scrapes.
- Clock skew affects timestamp order.
Because counters accumulate over time, you typically don’t query their raw values. Instead, use rate functions:
rate(http_requests_total[5m]) # requests per second
increase(http_requests_total[1h]) # total requests in the last hour
Gauge Behavior
Gauges represent a value that can go up or down; they capture the current state of something that changes over time.
memory_usage_bytes{instance="web-01"} 2147483648
active_connections{service="api"} 42
queue_depth{queue="orders"} 15
Gauges are set explicitly by your application, so they don’t reset automatically on restart. If a gauge isn’t reinitialized, Prometheus may continue scraping an outdated value or drop the time series altogether, depending on scrape timing.
This makes lifecycle handling critical:
- Always set gauges on startup.
- Don’t assume zero is the default.
- If your update loop crashes or stalls, the value will freeze.
Recovery Strategies for Reset Handling
To handle edge cases where counter reset detection or gauge freshness breaks down:
- Use `resets(metric_name)` to count how often a counter reset is detected.
- Monitor the `up` metric to detect if the scrape target is down or stale.
- Set gauge values explicitly on startup and at regular intervals.
- Use alerting logic that checks for stale data (e.g., using `deriv()` or `timestamp()`).
Querying Gauges vs. Counters
Gauges: Direct queries are usually sufficient.
avg(memory_usage_bytes) # average across instances
max(cpu_usage_percent) # peak CPU
queue_depth{queue="payments"} > 100 # alert threshold
Counters: Always use rate-based functions for meaningful insights.
rate(http_requests_total[5m])
increase(errors_total[30m])
The key difference: counters track events over time, while gauges report instantaneous values. Choose based on whether you care about how much something happened vs. where it stands right now.
How to Instrument Gauges in Your Application (By Language)
Gauges must be updated explicitly by your application. Prometheus clients exist for most languages, but each has slightly different patterns for setting and exposing gauge values.
Go: Set Values Explicitly
In Go, use the `prometheus/client_golang` library. Define a `GaugeVec` for labeled metrics and update it using `.Set()` when the value changes.
var queueDepth = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "message_queue_depth",
        Help: "Current number of messages in queue",
    },
    []string{"queue_name", "priority"},
)

func init() {
    // Register the gauge so it is exposed on the /metrics endpoint.
    prometheus.MustRegister(queueDepth)
}

// Call Set() wherever the queue length changes, e.g. after enqueue/dequeue:
queueDepth.WithLabelValues("orders", "high").Set(float64(len(highPriorityQueue)))
Use `.Set()` when your application can retrieve the current value directly, such as from in-memory queues or counters.
Python: Direct Value Updates
With `prometheus_client`, define the gauge and call `.set()` directly.
from prometheus_client import Gauge
queue_depth = Gauge('message_queue_depth', 'Queue size', ['queue_name', 'priority'])
queue_depth.labels(queue_name='orders', priority='high').set(len(high_priority_queue))
For system metrics, use a library like `psutil` to pull values:
import psutil
memory_gauge = Gauge('memory_usage_bytes', 'Used memory in bytes')
memory_gauge.set(psutil.virtual_memory().used)
Schedule updates using a background loop or a scheduled job if the value changes over time.
Java (Micrometer): Register with a Callback
Micrometer gauges typically use a function that returns the current value. This callback is evaluated at scrape time.
@Component
public class QueueMetrics {
    private final Queue<Message> messageQueue = new ConcurrentLinkedQueue<>();

    public QueueMetrics(MeterRegistry meterRegistry) {
        Gauge.builder("message_queue_depth", messageQueue, Queue::size)
            .description("Current number of messages in queue")
            .tags("queue_name", "orders")
            .register(meterRegistry);
    }
}
This approach works well for metrics that can be derived from the current in-memory state.
Node.js: Set in Loops or Event Handlers
The `prom-client` package uses `.set()` to update values. These updates can happen during events or at fixed intervals.
const client = require('prom-client');
const queueDepthGauge = new client.Gauge({
    name: 'message_queue_depth',
    help: 'Queue size',
    labelNames: ['queue_name', 'priority']
});
queueDepthGauge.labels('orders', 'high').set(highPriorityQueue.length);

const memoryGauge = new client.Gauge({
    name: 'memory_usage_bytes',
    help: 'Used memory'
});

setInterval(() => {
    memoryGauge.set(process.memoryUsage().heapUsed);
}, 10000);
Use `setInterval` for system-level gauges or any value that changes periodically.
When to Use .set() vs. a Callback
| Language | When to Use `.set()` | When to Use Callback / Binding |
|---|---|---|
| Go | For explicit value updates | Rarely used |
| Python | For direct updates or polling | Use a background job for periodic data |
| Java | Prefer callback-based gauges | Standard in Micrometer |
| Node.js | Use `.set()` with event loops | No native callback support |
This model ensures gauge values are kept current, without relying on Prometheus to infer or track changes.
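That said, Go's client does offer a callback-style option: `prometheus.NewGaugeFunc` evaluates a function at scrape time, much like Micrometer's builder. A minimal sketch, assuming an in-memory channel as the queue (names are illustrative):
package metrics

import "github.com/prometheus/client_golang/prometheus"

type Job struct{} // placeholder payload type

var jobQueue = make(chan Job, 1000) // illustrative in-memory queue

// The callback runs on every scrape, so the reported value is always the
// current queue length, with no update loop to maintain.
var queueDepthFunc = prometheus.NewGaugeFunc(
    prometheus.GaugeOpts{
        Name: "message_queue_depth",
        Help: "Current number of messages in queue",
    },
    func() float64 { return float64(len(jobQueue)) },
)

func init() {
    prometheus.MustRegister(queueDepthFunc)
}
Callback gauges can't carry per-call labels, so they suit single-series metrics; for labeled series, stick with `GaugeVec` and `.Set()`.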
For a quick refresher on `rate()`, `increase()`, and `deriv()`, see this reference on Prometheus functions.
How to Monitor System and Application Resources with Gauges
Once you're familiar with how gauges behave, the next step is applying them to practical resource monitoring. Gauges are well-suited for tracking current usage levels and capacity across both system infrastructure and application internals.
System-Level Metrics
Most teams start with basic host metrics (CPU, memory, and disk), typically collected via exporters like `node_exporter`.
node_memory_available_bytes
node_filesystem_free_bytes
cpu_usage_percent
These metrics help detect infrastructure-level problems such as:
- Memory exhaustion
- Disk space running low
- CPU saturation across cores
They’re useful for alerting on system-wide resource constraints that could eventually affect application performance.
Application-Level Capacity Gauges
System metrics don't reveal how your application is handling load. For that, you need application-specific gauges that expose internal resource usage, such as thread pools, queues, and caches.
connection_pool_active_connections
worker_pool_busy_workers
cache_size_bytes
These metrics provide visibility into:
- Whether a connection pool is consistently at or near its limit
- If the worker pool is fully utilized, suggesting contention
- How fast an in-memory cache is growing, which could signal pressure on memory
Monitoring these values helps identify issues like thread starvation, queue backlogs, or degraded throughput, even when system metrics appear normal.
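As a concrete example of an application-level capacity gauge, here is a hedged Go sketch that samples Go's standard `database/sql` pool statistics into gauges. The metric names and the sampling interval are arbitrary choices:
package metrics

import (
    "database/sql"
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var (
    poolActive = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "connection_pool_active_connections",
        Help: "Connections currently in use",
    })
    poolIdle = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "connection_pool_idle_connections",
        Help: "Idle connections in the pool",
    })
)

// WatchPool copies db.Stats() into the gauges on a fixed interval.
func WatchPool(db *sql.DB, interval time.Duration) {
    prometheus.MustRegister(poolActive, poolIdle)
    go func() {
        for range time.Tick(interval) {
            stats := db.Stats()
            poolActive.Set(float64(stats.InUse))
            poolIdle.Set(float64(stats.Idle))
        }
    }()
}
client_golang also ships a ready-made DB stats collector (`collectors.NewDBStatsCollector`), which is worth checking before hand-rolling gauges like this.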
Performance Implications of Gauge vs Counter Metrics
The metric type you choose, gauge or counter, affects performance across memory usage, query speed, and storage footprint. Here’s how they differ:
Memory Consumption
Gauges with high-cardinality labels (e.g., `user_id`, `session_id`) can create a large number of time series. Each unique label combination results in a separate series, increasing memory usage on both the Prometheus server and remote storage systems.
# High-cardinality pattern (should be avoided)
active_sessions{user_id="123456"} 1
Counters also create time series per label combination, but they're typically used in lower-cardinality contexts like error codes or endpoints.
Query Performance
Gauges are often faster to query because they return the latest scraped value without needing time-window computation. For example:
avg(cpu_usage_percent) # Gauge query
In contrast, counter queries usually require functions like `rate()` or `increase()` over a time window, which involves more CPU and memory for query execution:
rate(http_requests_total[5m])
Storage Overhead
Counters benefit from delta encoding: Prometheus only stores changes between values, which makes them more efficient over time.
Gauges store the raw value at each scrape interval. For metrics that change frequently (e.g., memory usage, connection counts), this can lead to higher storage usage.
Scrape Behavior
Gauges represent the instantaneous state. If a value changes rapidly between scrapes, those changes can be lost; Prometheus will only see the latest value at scrape time.
Counters accumulate values between scrapes. Even if events occur and resolve between scrapes, the counter will reflect the total change.
If you're instrumenting metrics for high-frequency events or need precise change tracking, counters are more reliable. If you're tracking the current system or application state, gauges are more appropriate, but need careful lifecycle and cardinality management.

Monitor Queue Depth and Backpressure with Gauges
Queue depth is one of the most valuable application-level metrics to track using a gauge. It reflects how much work is waiting to be processed and serves as an early indicator of backpressure.
Backpressure occurs when the rate of incoming tasks exceeds the system’s processing capacity. As a result, queues or buffers start to grow, even before error rates, latency spikes, or dropped requests become visible.
How to Instrument
Use a gauge to track queue size, and update it in real time:
- In-memory queues: Call `.Set()` with the current length as the queue changes (see the sketch below this list).
- External systems (e.g., Redis, RabbitMQ): Query the queue depth via client APIs or system commands, then update the gauge.
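For the in-memory case, a reliable way to keep the gauge exact is to update it inside the enqueue and dequeue paths themselves rather than on a timer. A minimal Go sketch; the `Queue` type and metric name are illustrative:
package queue

import (
    "sync"

    "github.com/prometheus/client_golang/prometheus"
)

var queueDepth = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "message_queue_depth",
        Help: "Messages waiting in queue",
    },
    []string{"queue_name"},
)

func init() { prometheus.MustRegister(queueDepth) }

// Queue updates its gauge on every state change, so the metric never lags
// behind the real depth between scrapes.
type Queue struct {
    mu    sync.Mutex
    name  string
    items []string
}

func (q *Queue) Push(item string) {
    q.mu.Lock()
    defer q.mu.Unlock()
    q.items = append(q.items, item)
    queueDepth.WithLabelValues(q.name).Set(float64(len(q.items)))
}

func (q *Queue) Pop() (string, bool) {
    q.mu.Lock()
    defer q.mu.Unlock()
    if len(q.items) == 0 {
        return "", false
    }
    item := q.items[0]
    q.items = q.items[1:]
    queueDepth.WithLabelValues(q.name).Set(float64(len(q.items)))
    return item, true
}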
Avoid relying on background timers to update queue depth; the value can change rapidly, and you may miss key transitions if updates aren't applied immediately.
Patterns to Watch
- Steady growth: Suggests consumers are consistently slower than producers — a capacity or scaling issue.
- Sudden spikes: May be harmless for bursty traffic, but can also signal retries, batching issues, or upstream throttling.
- Slow or uneven draining: Indicates downstream latency, inefficient consumers, or long-tail processing delays.
Tracking queue depth as a gauge gives you earlier insight into load-related issues than waiting for errors or saturation signals. It’s often the first metric to change when throughput begins to degrade.
Workflow State Tracking with Labeled Gauges
Gauges are a good fit for tracking entities across known states, such as jobs, orders, or connections, in systems where items progress through a lifecycle. This pattern makes it easy to identify bottlenecks or failure points by observing state distributions over time.
Examples
Track counts of items in each state by exposing labeled gauges:
# Order workflow
orders_by_status{status="pending"} 23
orders_by_status{status="processing"} 8
orders_by_status{status="completed"} 1542
# Database connections
db_connections{state="active"} 45
db_connections{state="idle"} 155
db_connections{state="waiting"} 3
Each label value (`status`, `state`, etc.) corresponds to a specific stage in the workflow.
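One straightforward way to keep such a gauge in sync is to recompute the per-state counts on each refresh and overwrite every known label value in one pass. A Go sketch, assuming the counts come from your datastore:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var ordersByStatus = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "orders_by_status",
        Help: "Orders currently in each workflow state",
    },
    []string{"status"},
)

func init() { prometheus.MustRegister(ordersByStatus) }

// updateOrderStates overwrites every known state on each refresh, including
// zeros, so stages that drain completely don't keep reporting a stale count.
func updateOrderStates(counts map[string]int) {
    for _, status := range []string{"pending", "processing", "completed"} {
        ordersByStatus.WithLabelValues(status).Set(float64(counts[status]))
    }
}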
What to Look for
- A rising count in an early stage (e.g., `pending`) without corresponding growth in the next stage (`processing`) usually signals a stall.
- Imbalances across states may indicate:
  - Workers not picking up tasks
  - Throttling or rate limits
  - Downstream services failing or timing out
Where This Pattern Applies
- Job queues and task runners
- Order or transaction pipelines
- Connection pools and session managers
- Batch workflows with retries, delays, or staged transitions
This technique helps surface operational problems that don’t always show up in latency or error metrics. Each gauge reflects the current number of items in a specific state, updated as entities transition.
Set vs Add: Choose the Correct Gauge Update Method
Prometheus gauges can be updated in two primary ways: by setting an explicit value using `Set()`, or by applying a delta using `Add()`, `Inc()`, or `Dec()`. The right method depends on what your application can reliably observe.
If your application has direct access to the current value — for example, memory usage, queue depth, or cache size — use `Set()`. These metrics represent a concrete state, and it's common to sample and report them periodically or whenever a change occurs. A typical case would be:
memoryGauge.Set(getCurrentMemoryUsage())
In contrast, if the application only observes discrete events — such as a new connection being opened or a request being completed — use `Inc()` or `Dec()`. These methods are useful when you can't compute the full value directly, but you can track the changes over time:
connectionGauge.Inc() // when a new connection is accepted
connectionGauge.Dec() // when a connection closes
This pattern works well for in-flight request counts, open sessions, or active workers, where the value depends on events rather than a snapshot of state.
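For in-flight request tracking in Go, the usual shape is an HTTP middleware that increments on entry and decrements on exit; a minimal sketch:
package middleware

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
)

var inFlight = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "http_requests_in_flight",
    Help: "Requests currently being served",
})

func init() { prometheus.MustRegister(inFlight) }

// trackInFlight raises the gauge when a request starts and lowers it when the
// handler returns; the deferred Dec() also runs if the handler panics, which
// limits drift.
func trackInFlight(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        inFlight.Inc()
        defer inFlight.Dec()
        next.ServeHTTP(w, r)
    })
}
client_golang's `promhttp` package also provides a ready-made wrapper along these lines (`promhttp.InstrumentHandlerInFlight`) if you'd rather not maintain your own.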
However, be cautious with `Inc()`/`Dec()`: if your process crashes or misses events, there's no automatic reconciliation. Over time, the gauge can drift unless you reset or reinitialize it. When accuracy is critical and you can query the actual value, `Set()` is the safer option.
Gauge Staleness and Alert Failures
Prometheus does not automatically expire metric values. If a gauge stops updating (because of a crashed exporter, a paused job, or missing startup logic), the last known value continues to be scraped and stored. From Prometheus's point of view, the metric is still valid, even if it's outdated.
Why Alerts Miss Stale Gauge Values
Alert logic often assumes that gauges reflect the current system state. When a gauge silently goes stale, the metric might look normal while the underlying system is failing:
- A `queue_depth` gauge shows `0`, but the service emitting it is no longer running.
- A `memory_usage_bytes` gauge remains constant even while memory usage spikes or the container crashes.
- An `active_sessions` gauge hasn't changed for hours, but nothing in Prometheus indicates it's stale.
Because Prometheus stores the last observed value unless the series disappears entirely, these stale readings often lead to missed alerts.
Ways to Detect Gauge Staleness
Two approaches help prevent alert failures caused by stale values:
Use Time-Based Validity Checks
When a metric is expected to change regularly, you can check whether its value has actually moved within a recent window:
changes(memory_usage_bytes[5m]) == 0
This fires if the value hasn't changed in the past five minutes — a strong indicator that the metric is stale.
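Another option on the instrumentation side is to publish an explicit heartbeat next to the real gauge: a second gauge set to the current time on every update, which an alert can compare against `time()`. A minimal Go sketch; the metric names are illustrative:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    memoryUsage = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "memory_usage_bytes",
        Help: "Used memory in bytes",
    })
    memoryUpdatedAt = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "memory_usage_last_update_timestamp_seconds",
        Help: "Unix time of the last memory_usage_bytes update",
    })
)

func init() {
    prometheus.MustRegister(memoryUsage, memoryUpdatedAt)
}

func recordMemory(usedBytes uint64) {
    memoryUsage.Set(float64(usedBytes))
    // The heartbeat only advances when the real gauge is written, so
    // time() minus this metric keeps growing if the update loop stalls.
    memoryUpdatedAt.SetToCurrentTime()
}
An alert such as `time() - memory_usage_last_update_timestamp_seconds > 300` then catches a stalled updater even while the main gauge is still being scraped.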
Use the up Metric to Check Target Health
Prometheus automatically exposes an `up` metric per target. If a scrape fails, `up == 0`:
up == 0
This should be part of any alerting strategy — it directly tells you that the exporter or service is unreachable, and all metrics from that target may be unreliable.
Ways to Keep Gauge Values Fresh
You can prevent most gauge staleness issues with a few implementation practices:
Refresh metrics on a schedule if they reflect sampled or polled state:
go func() {
    ticker := time.NewTicker(10 * time.Second)
    for range ticker.C {
        memoryGauge.Set(getMemoryUsage())
        diskGauge.Set(getDiskUsage())
    }
}()
Set an initial value on startup so that old values aren’t reused after a restart:
activeSessionsGauge.Set(0)
Avoid workarounds like embedding timestamps in labels — that introduces high cardinality and degrades performance. Stick with static label sets and use PromQL to assess freshness when needed.
Gauge Cardinality Limits and Label Design
In Prometheus, each unique combination of label values creates a separate time series. When gauges use high-cardinality labels like `user_id`, `session_id`, or `request_path`, the number of series can grow rapidly, putting pressure on memory, storage, and query performance.
Examples
# High-cardinality pattern — creates one time series per user
active_sessions{user_id="12345"} 1
# Safer pattern — groups by user type
active_sessions{status="authenticated"} 1547
active_sessions{status="guest"} 234
Even a single gauge with a dynamic label can result in millions of time series if not handled carefully.
Label Strategy for Controlling Cardinality
- Avoid unbounded labels: Never use labels that take on user-generated or highly variable values (e.g., email, UUIDs, full URLs, timestamps).
- Group by stable attributes: Use fields with limited value sets like `status`, `region`, or `role`.
- Push detailed data to logs or traces: If you need per-user or per-request visibility, capture it outside of metrics.
Cardinality issues often show up only after dashboards begin to time out or storage usage spikes. The safest approach is to treat label selection as an architectural decision: define allowed label keys and review new metrics for explosion risk before deployment.
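One defensive pattern is to normalize label values through an allow-list before they ever reach the metric, so a misbehaving caller can't mint new series. A Go sketch; the allowed set is an example:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var activeSessions = prometheus.NewGaugeVec(
    prometheus.GaugeOpts{
        Name: "active_sessions",
        Help: "Sessions grouped by account type",
    },
    []string{"status"},
)

var allowedStatuses = map[string]bool{"authenticated": true, "guest": true}

func init() { prometheus.MustRegister(activeSessions) }

// setSessions collapses any unexpected status into "other", so the number of
// time series stays bounded no matter what callers pass in.
func setSessions(status string, count int) {
    if !allowedStatuses[status] {
        status = "other"
    }
    activeSessions.WithLabelValues(status).Set(float64(count))
}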
How to Migrate from Gauges to Counters
If you've been using a gauge to track a cumulative value, like request counts, error totals, or bytes transferred, you're likely missing out on proper rate calculations and reset handling.
Prometheus counters are designed for this purpose, and migrating to them improves both accuracy and reliability. But the switch requires planning to avoid breaking dashboards and alerts.
A safe migration usually follows a phased, backward-compatible strategy:
- Deploy the new counter metric alongside the existing gauge. This avoids breaking existing consumers of the gauge while giving you time to transition (see the sketch after this list).
- Update dashboards and alerts to reference the new counter. Use `rate()` or `increase()` functions as appropriate, and verify the results match your expectations.
- Keep both metrics active for a defined period. This gives you time to compare behaviors, confirm alert triggers, and validate accuracy under load.
- Remove the old gauge after full validation. Once you're confident the counter works as expected and no consumers rely on the gauge, you can safely deprecate it.
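During the overlap period, the simplest approach is to emit both metrics from the same code path. A Go sketch of the first step; the metric names are illustrative:
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
    // Legacy gauge, kept until dashboards and alerts have moved over.
    requestsGauge = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "requests_handled",
        Help: "DEPRECATED: request count tracked as a gauge",
    })
    // Replacement counter with proper reset semantics.
    requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "requests_handled_total",
        Help: "Total requests handled",
    })
)

func init() {
    prometheus.MustRegister(requestsGauge, requestsTotal)
}

func onRequestHandled() {
    requestsGauge.Inc() // existing consumers keep working
    requestsTotal.Inc() // new dashboards use rate(requests_handled_total[5m])
}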
In some cases, you may want to A/B test both approaches before committing:
- Run two service versions in parallel — one emitting a gauge, the other a counter.
- Compare how each metric behaves under different traffic patterns.
- Measure the impact on scrape performance, storage usage, and dashboard responsiveness.
Migration isn't just about correctness; it's also about observability hygiene. Using the right metric type makes queries easier to reason about, reduces operational surprises, and aligns with Prometheus best practices.
Quick Start: 5-Minute Gauge Setup
If you're new to gauges, start with these essential patterns:
1. Memory usage gauge (any language):
memoryGauge := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "memory_usage_bytes",
    Help: "Current memory usage",
})
2. Queue depth gauge:
queueGauge := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "queue_depth",
    Help: "Messages waiting in queue",
})
3. Active connections gauge:
connectionsGauge := prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "active_connections",
    Help: "Currently active connections",
})
Update these gauges every 10-30 seconds, matching your Prometheus scrape interval.
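To wire any of these up end to end, you still need to register the gauge, refresh it, and expose a /metrics endpoint. A minimal sketch using client_golang's `promhttp` handler; the port, the 15-second interval, and the use of Go heap size via `runtime.ReadMemStats` (which is not the same as container memory) are illustrative choices:
package main

import (
    "log"
    "net/http"
    "runtime"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var memoryGauge = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "memory_usage_bytes",
    Help: "Current memory usage",
})

func main() {
    prometheus.MustRegister(memoryGauge)

    // Refresh the gauge roughly once per scrape interval.
    go func() {
        for range time.Tick(15 * time.Second) {
            var m runtime.MemStats
            runtime.ReadMemStats(&m)
            memoryGauge.Set(float64(m.Alloc))
        }
    }()

    // Prometheus scrapes this endpoint.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":2112", nil))
}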
Final Thoughts
Gauge behavior varies across observability platforms, especially with high-cardinality labels or metrics that update inconsistently. This impacts query reliability, alert accuracy, and system performance.
Last9 handles this by ingesting Prometheus and OpenTelemetry metrics natively, optimizing storage for high-cardinality gauges, and keeping queries fast as your data scales. It helps teams track application state reliably without dealing with stale values, bloated storage, or degraded dashboards.
Getting started with Last9 takes just a few minutes — no changes required to your existing setup.
FAQs
How often should I update gauge values?
Match your scrape interval, typically 10-30 seconds. Updating more frequently wastes resources; updating less frequently might miss short-lived spikes.
Can gauges go negative?
Yes, Prometheus gauges support negative values. This works well for metrics like temperature, account balances, or any measurement that can legitimately drop below zero.
How do I handle gauges when instances restart?
Reset gauges to their correct current state when the service starts. Don't assume they'll be 0—read from your data store or recalculate from the current system state. Unlike counters, gauge resets aren't automatically detected by monitoring systems.
Should I use gauges for percentage values?
Usually yes, if you're tracking current utilization. But consider whether you want the raw values (bytes used, bytes total) as separate metrics for more flexible queries.
Can I use a gauge like a counter?
Technically, yes, but it's not recommended. You lose automatic reset detection and the rate calculation features that make counters valuable for cumulative metrics. Stick to the right tool for the job.
What happens if a counter decreases?
Most monitoring systems assume the counter is reset to zero and adjust calculations accordingly. This helps maintain accurate metrics even after application restarts—a behavior that gauges don't provide.
How do I detect when my gauges become stale?
Combine your gauge alerts with freshness checks such as `(time() - timestamp(your_gauge)) < 300` to ensure the metric was updated within the last 5 minutes. Also, monitor the `up` metric for your scrape targets.