Dec 13th, ‘23/8 min read

Prometheus Metrics Types - A Deep Dive

A deep dive on different metric types in Prometheus and best practices

This blog post dives deep into the four metric types supported by Prometheus, along with their use cases and the PromQL functions that can be used to query them.

At its core, a metric in the context of monitoring and system performance is a quantifiable measure that is used to track and assess the status of a specific process or activity.

In Prometheus, every metric represents one or more time series, with each time series consisting of a metric name, a set of labels, and a series of data points (each with a timestamp and a value). Time series data in monitoring refers to data points indexed in time order.

Metrics Structure in Prometheus:

The structure of a metric typically includes the following key components:

  • Metric Name: An explicit identifier for the metric, often reflecting what it measures. For example, http_requests_total.
  • Labels: Key-value pairs that provide additional dimensions to the metric, enabling more detailed and specific tracking. An example label for http_requests_total could be {method="GET", endpoint="/api"}.
  • Metric Value: The actual data point representing the measurement, which could be a count, a duration, etc.
  • Timestamp: The point in time when the metric value was recorded (often added automatically by the monitoring system).

Consider a metric named user_logins_total.

This metric could have labels like {user_type="admin", location="EU"} and a numerical value indicating the total count of logins. The timestamp would denote when this count was recorded.
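The structure described above can be sketched in a few lines of Python. This is only an illustrative model, not how Prometheus stores data internally; the TimeSeries class, the timestamp, and the value 42 are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class TimeSeries:
    """One time series: a metric name, a label set, and timestamped samples."""
    name: str
    labels: dict[str, str]
    # Each sample is a (unix_timestamp, value) pair.
    samples: list[tuple[float, float]] = field(default_factory=list)

    def identity(self) -> str:
        # The metric name plus the sorted label set identifies the series.
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(self.labels.items()))
        return f"{self.name}{{{label_str}}}"


series = TimeSeries("user_logins_total", {"user_type": "admin", "location": "EU"})
series.samples.append((1702460000.0, 42.0))  # hypothetical timestamp and value
print(series.identity(), series.samples[-1][1])
```

Changing any label value produces a different identity, which is why each distinct label combination is a separate time series.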

Types of Prometheus Metrics

Prometheus primarily deals with four types of metrics:

  1. Counter: A metric that only increases or resets to zero on restart. Ideal for tracking the number of requests, tasks completed, or errors.
  2. Gauge: This represents a value that can go up or down, like temperature or current memory usage.
  3. Histogram: Measures the distribution of events over time, such as request latencies or response sizes.
  4. Summary: Similar to histograms in that it tracks a total count and sum of observed values, but it also calculates configurable quantiles on the client side.

Let's dive deeper into each Prometheus metric type.

Counters

A Counter is a cumulative metric used primarily for tracking quantities that increase over time. Said simply, a counter counts! Counters are ideal for monitoring the rate of events, like the total number of requests to a web server, task completions, or error occurrences. A counter is designed to only increase, which means its value should never decrease (except when reset to zero, usually due to a restart or reset of the process generating it).
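As a rough illustration of the "only increases" rule, here is a minimal Python sketch of a counter. The Counter class below is a hypothetical stand-in written for this post, not the API of any Prometheus client library.

```python
class Counter:
    """A toy counter: it can only go up while the process is running."""

    def __init__(self) -> None:
        # A freshly started process always begins at zero.
        self._value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        # Rejecting negative amounts is what keeps the counter monotonic.
        if amount < 0:
            raise ValueError("counters can only increase")
        self._value += amount

    @property
    def value(self) -> float:
        return self._value


requests_total = Counter()
requests_total.inc()     # one request handled
requests_total.inc(2.5)  # counters may also grow by arbitrary non-negative amounts
```

Creating a new Counter object models a process restart: the value starts again from zero.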

Counters are often visualized in dashboards to show trends over time, like the rate of requests to a web service. They can also trigger alerts if the rate of errors or specific events exceeds a threshold, indicating potential issues in the monitored system.

node_network_receive_bytes_total in Node Exporter is an example of a counter that can be used to track the total number of bytes received on a network interface.

Counters are usually queried with PromQL functions like rate() or increase() rather than read as raw values.

The rate() function in PromQL is used to calculate a metric's per-second average rate of increase over a given time interval.

To determine the total change in a metric over a specific time period, PromQL increase() calculates the cumulative increase of a metric over a given time range.

The resets() function returns the number of times a counter has been reset during a given time period.
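The idea behind counting resets can be sketched in Python as follows. This is a simplified illustration over a plain list of sample values, not the PromQL implementation; the function name count_resets and the sample data are assumptions.

```python
def count_resets(samples: list[float]) -> int:
    """Count how often the value drops between two consecutive samples."""
    # Any decrease between neighbouring samples is treated as one reset.
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


# Hypothetical counter samples: the value drops twice (9 -> 2 and 4 -> 1).
observed = [5.0, 9.0, 2.0, 4.0, 1.0, 3.0]
print(count_resets(observed))
```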

Note on Counter Reset:

There are scenarios where a counter can reset. The most common reason for a counter reset is when the process generating the metric restarts. This could be due to a service restart, deployment, or a system reboot. When this happens, the counter starts from zero again.

💡
An upcoming feature in Prometheus adds a created timestamp to metrics to address long-standing issues with counter resets. See the talk from PromCon 2023.

This reset behavior is crucial for understanding how to interpret counter data, especially when using functions like rate() or increase() in PromQL. These functions are designed to account for counter resets. They can detect a reset by identifying when the counter value decreases from one scrape interval to the next. Upon detecting a reset, these functions assume the counter has been set to zero and then started increasing again.

It's important to be aware of counter resets because they can impact how you interpret the data. A sharp drop in a counter value followed by an increase could be misinterpreted as a decrease in the metric being measured, when in reality, it's just a reset. Understanding this behavior is key to accurately monitoring and analyzing metrics in Prometheus.
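To make the reset handling concrete, here is a simplified Python sketch of how an increase can be computed across a reset. It deliberately leaves out the extrapolation to the window boundaries that the real rate() and increase() functions perform; the function names and sample values are assumptions.

```python
def simple_increase(samples: list[float]) -> float:
    """Total increase across samples, treating any drop as a reset to zero."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # Reset detected: assume the counter restarted from zero,
            # so everything counted since the restart equals `cur`.
            total += cur
        else:
            total += cur - prev
    return total


def simple_rate(samples: list[float], window_seconds: float) -> float:
    """Per-second average increase over the window (no extrapolation)."""
    return simple_increase(samples) / window_seconds


# Hypothetical samples with one restart between 160 and 10.
counter_samples = [100.0, 130.0, 160.0, 10.0, 40.0]
total_increase = simple_increase(counter_samples)      # 30 + 30 + 10 + 30
per_second = simple_rate(counter_samples, 60.0)
```

A naive "last minus first" calculation on the same data would give -60, which is exactly the misinterpretation described above.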

Gauges

Gauges represent a metric that can increase or decrease, akin to a thermometer. Gauges are versatile and can measure values like memory usage, temperature, or queue sizes, giving a snapshot of a system's state at a specific point in time.

Gauges are straightforward in terms of updating their value. They can be set to a particular value at any given time, incremented or decremented, based on the metric they are tracking.
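The three update operations can be pictured with a small Python sketch. The Gauge class here is a hypothetical toy, not a client library API, and the queue-size numbers are made up.

```python
class Gauge:
    """A toy gauge: the value can be set directly or moved up and down."""

    def __init__(self) -> None:
        self._value = 0.0

    def set(self, value: float) -> None:
        # Overwrite the current value, e.g. with a fresh measurement.
        self._value = value

    def inc(self, amount: float = 1.0) -> None:
        self._value += amount

    def dec(self, amount: float = 1.0) -> None:
        self._value -= amount

    @property
    def value(self) -> float:
        return self._value


queue_size = Gauge()
queue_size.set(10)  # ten items currently queued
queue_size.inc(5)   # five more arrive
queue_size.dec(3)   # three are processed
```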

Gauges are often visualized using line graphs in dashboards to depict their value changes over time. They are handy for observing the current state and trends of what's being measured rather than the rate of change.

For example, in Java applications monitored with the JMX Exporter, the gauge jvm_threads_current can be used to track the number of active threads in a JVM.

When working with gauges, which can fluctuate up and down, specific functions are typically used to calculate statistical measures over a time series.

These functions include:

  • avg_over_time() - for computing the average
  • max_over_time() - for finding the maximum value
  • min_over_time() - for finding the minimum value
  • quantile_over_time() - for determining percentiles within the specified period
  • delta() - for the difference between the first and last gauge values in the specified period

These functions are instrumental in analyzing the trends and variations of gauge metrics, providing valuable insights into the performance and state of monitored systems.
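What these functions compute can be approximated in Python over a window of gauge samples. This is a sketch under simplifying assumptions: the sample values are hypothetical, the quantile uses linear interpolation between neighbouring values, and the delta here is simply last minus first, whereas PromQL's delta() also extrapolates to the edges of the time range.

```python
def quantile(values: list[float], q: float) -> float:
    """Quantile with linear interpolation between the two nearest values."""
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)
    lower = int(rank)                         # index just below the rank
    upper = min(lower + 1, len(ordered) - 1)  # index just above the rank
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


# Hypothetical memory-usage samples (in percent) scraped over one window.
window = [40.0, 55.0, 70.0, 65.0, 50.0]

window_avg = sum(window) / len(window)  # like avg_over_time()
window_max = max(window)                # like max_over_time()
window_min = min(window)                # like min_over_time()
window_p90 = quantile(window, 0.9)      # like quantile_over_time(0.9, ...)
window_delta = window[-1] - window[0]   # simplified delta(): last minus first
```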

Histogram

Histograms are used to sample and aggregate distributions, such as latencies. Histograms are excellent for understanding the distribution of metric values and helpful in performance analysis, like tracking request latencies or response sizes.

Histograms efficiently categorize measurement data into defined intervals, known as buckets, and tally the number (i.e., a counter) of measurements that fit into each of these buckets. These buckets are pre-defined during the instrumentation stage.

A key thing to note in the Prometheus Histogram type is that the buckets are cumulative. This means each bucket counts all values less than or equal to its upper bound, providing a cumulative distribution of the data. Simply put, each bucket contains the counts of all prior buckets. We will explore this in the example below.

Let's take an example of observing response times with buckets.

We could classify request times into meaningful cumulative buckets like:

  • 0 to 200ms: le="200" (less than or equal to 200)
  • 0 to 300ms: le="300" (less than or equal to 300)
  • … and so on

Prometheus also adds a +Inf bucket by default, which counts every observation.

Let's say the first response time we observe for our API is 175ms; the count values for the buckets will look something like this:

Bucket (ms)    Count
0 - 200        1
0 - 300        1
0 - 500        1
0 - 1000       1
0 - +Inf       1

Here, you can see how the cumulative nature of the histogram works.

Let's say in the following observation our API’s response time is 300ms; the count values will look like this:

Bucket (ms)    Count
0 - 200        1
0 - 300        2
0 - 500        2
0 - 1000       2
0 - +Inf       2
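The two tables can be reproduced with a short Python sketch of a cumulative histogram. The Histogram class is a hypothetical illustration of the exposed counts; real client libraries may store the counts differently internally.

```python
import math


class Histogram:
    """A toy histogram that keeps cumulative bucket counts."""

    def __init__(self, upper_bounds: list[float]) -> None:
        # A +Inf bucket is always appended so every observation is counted.
        self.upper_bounds = sorted(upper_bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.upper_bounds)
        self.sum = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # Cumulative: increment every bucket whose upper bound covers the value.
        for i, bound in enumerate(self.upper_bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.sum += value
        self.count += 1


response_time_ms = Histogram([200, 300, 500, 1000])
response_time_ms.observe(175)
after_first = list(response_time_ms.bucket_counts)   # matches the first table
response_time_ms.observe(300)
after_second = list(response_time_ms.bucket_counts)  # matches the second table
```

Note that the 300ms observation lands in the le="300" bucket because the bound is inclusive.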

It is essential to note the histogram-type metric's structure for properly querying it.

Each bucket is available as a “counter,” which can be accessed by adding a _bucket suffix and the le label. Series with the _count and _sum suffixes are also generated by default to help with quantitative calculations.

_count is a counter with the total number of measurements available for the said metric.

_sum is a counter with the total (or the sum) of all values of the measurement.

For example:

http_request_duration_seconds_sum{host="example.last9.io"} 9754.113
http_request_duration_seconds_count{host="example.last9.io"} 6745

http_request_duration_seconds_bucket{host="example.last9.io", le="200"} 300
http_request_duration_seconds_bucket{host="example.last9.io", le="300"} 424
...
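One common use of _sum and _count is computing the average observed value. The Python sketch below does this with the sample values shown above; the helper name average_observation is an assumption.

```python
import math


def average_observation(total_sum: float, total_count: float) -> float:
    """Average observed value, i.e. _sum divided by _count."""
    if total_count == 0:
        # No observations yet, so the average is undefined.
        return math.nan
    return total_sum / total_count


# Values taken from the _sum and _count sample lines above.
# In PromQL the same idea is usually written over a time window:
#   rate(..._sum[5m]) / rate(..._count[5m])
average = average_observation(9754.113, 6745)
print(round(average, 3))
```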

The histogram_quantile() function calculates quantiles (e.g., medians, 95th percentiles) from histograms. It takes a quantile (a value between 0 and 1) and a histogram metric as arguments and computes the estimated value at that quantile across the histogram's buckets.

For instance, histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) estimates the value below which 95% of the observations from the last 5 minutes fall, providing insights into distribution tails like request latencies.
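The estimate comes from finding the bucket that contains the requested rank and interpolating linearly inside it. The Python sketch below illustrates that idea under simplifying assumptions: the first bucket is assumed to start at 0, the bucket counts are hypothetical, and edge cases handled by the real function are omitted.

```python
import math


def bucket_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with at least two entries and +Inf as the last bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, cumulative = buckets[index]
    if math.isinf(upper):
        # The rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    in_bucket = cumulative - below
    if in_bucket == 0:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


# Hypothetical cumulative bucket counts (upper bounds in ms).
example_buckets = [(200, 60), (300, 90), (500, 98), (1000, 100), (math.inf, 100)]
p50 = bucket_quantile(0.5, example_buckets)
p95 = bucket_quantile(0.95, example_buckets)
```

Because of the even-spread assumption, the accuracy of the estimate depends on how well the bucket boundaries match the real distribution.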

The histogram data type can also be aggregated, i.e., combining multiple histograms into a single histogram. Suppose you're monitoring response times across different servers. Each server emits a histogram of response times. You would aggregate these individual histograms to understand the overall response time distribution across all servers. This aggregation is done by summing up the counts in corresponding buckets across all histograms.

For example, you could use a PromQL query like this -

sum by (le) (rate(http_request_duration_seconds_bucket{endpoint="payment"}[5m]))

In this example, the sum by (le) part aggregates the counts in each bucket (le label) across all instances of the endpoint labeled "payment". The rate function is applied over a 5-minute interval ([5m]), calculating the per-second rate of increase for each bucket, which is needed because each bucket is itself a counter. This query gives a unified view of the request duration distribution across all servers for the specified endpoint.
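The aggregation step itself is simple addition per bucket. A Python sketch of what sum by (le) does, using hypothetical per-server bucket rates and made-up server names:

```python
from collections import defaultdict


def sum_by_le(per_instance: dict[str, dict[str, float]]) -> dict[str, float]:
    """Add up bucket values across instances, keeping only the le label."""
    totals: dict[str, float] = defaultdict(float)
    for buckets in per_instance.values():
        for le, value in buckets.items():
            totals[le] += value
    return dict(totals)


# Hypothetical per-second bucket rates for two servers.
bucket_rates = {
    "server-a": {"200": 3.0, "300": 4.5, "+Inf": 5.0},
    "server-b": {"200": 1.0, "300": 2.5, "+Inf": 3.0},
}
combined = sum_by_le(bucket_rates)
```

This only works when every server uses the same bucket boundaries, since counts are matched on the le label.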

Native Histograms

Starting from Prometheus version 2.40, an experimental feature provides support for native histograms. With native histograms, you only need a single time series, which includes a variable number of buckets along with the sum and count of observations. This feature offers significantly higher resolution while being more cost-effective.

Summary

Summaries track the size and number of events. They are suited to calculating quantiles and averages for metrics like request latency or transaction duration, where the quantiles are computed by the instrumented application itself.

A summary metric automatically calculates and stores quantiles (e.g., 50th, 90th, 95th percentiles) over a sliding time window. This means it tracks both the number of observations (like requests) and their sizes (like latency), and then computes the quantiles of these observations in real-time. A Prometheus summary typically consists of three parts: the count (_count) of observed events, the sum of these events' values (_sum), and the calculated quantiles.

Example -

# HELP http_request_duration_seconds The duration of HTTP requests in seconds
# TYPE http_request_duration_seconds summary
http_request_duration_seconds{quantile="0.5"} 0.055
http_request_duration_seconds{quantile="0.9"} 0.098
http_request_duration_seconds{quantile="0.95"} 0.108
http_request_duration_seconds{quantile="0.99"} 0.15
http_request_duration_seconds_sum 600
http_request_duration_seconds_count 10000
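A rough Python sketch of the sliding-window idea follows. It stores every observation in the window and sorts on demand, which is an assumption made for clarity; client libraries typically rely on streaming approximations instead. The class name and the observed values are hypothetical.

```python
from collections import deque


class SlidingWindowSummary:
    """A toy summary: lifetime count and sum, quantiles over a recent window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._observations: deque[tuple[float, float]] = deque()  # (time, value)
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float, now: float) -> None:
        self._observations.append((now, value))
        # _count and _sum keep growing; only quantiles use the window.
        self.count += 1
        self.sum += value
        self._expire(now)

    def _expire(self, now: float) -> None:
        # Drop observations that have aged out of the window.
        cutoff = now - self.window_seconds
        while self._observations and self._observations[0][0] <= cutoff:
            self._observations.popleft()

    def quantile(self, q: float, now: float) -> float:
        self._expire(now)
        values = sorted(v for _, v in self._observations)
        if not values:
            return float("nan")
        # Nearest-rank quantile over the values still in the window.
        return values[min(len(values) - 1, int(q * len(values)))]


latency = SlidingWindowSummary(window_seconds=60)
latency.observe(0.05, now=0)
latency.observe(0.10, now=30)
latency.observe(0.20, now=70)  # the observation from t=0 has now expired
median_now = latency.quantile(0.5, now=70)
```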

Summaries are better suited when you need accurate quantiles for individual instances or components and don't intend to aggregate these quantiles across different dimensions or labels. Histograms, by comparison, are helpful when aggregating data across multiple instances or dimensions, like calculating global request latency across several servers.

A significant limitation of summaries is that you cannot aggregate their quantiles across multiple instances. While you can sum the counts and sums, the quantiles are only meaningful within the context of a single summary instance.
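A small Python example shows why averaging quantiles is misleading. The two instances and their latency values are hypothetical, and a simple nearest-rank quantile is used for illustration.

```python
def nearest_rank_quantile(values: list[float], q: float) -> float:
    """Nearest-rank quantile over a list of observations."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(q * len(ordered)))]


# Hypothetical latencies in ms: a quiet fast instance and a busy slow one.
instance_a = [10.0] * 10
instance_b = [100.0] * 90

p50_a = nearest_rank_quantile(instance_a, 0.5)
p50_b = nearest_rank_quantile(instance_b, 0.5)

# Averaging the per-instance medians ignores how many requests each served.
averaged_p50 = (p50_a + p50_b) / 2

# The true median must be computed from all observations together.
true_p50 = nearest_rank_quantile(instance_a + instance_b, 0.5)
```

Here the averaged median is 55ms while the true median across all requests is 100ms, because 90% of the traffic went to the slower instance.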

Summaries can be more resource-intensive since they compute quantiles on the fly and keep a sliding window of observations. Histograms can be more efficient regarding memory and CPU usage, especially when dealing with high-cardinality data. Since the bucket configuration is fixed, they can also be optimized for storage.

Summing up

The fundamentals we have covered in this post around metrics types in Prometheus will hopefully help you better grasp your monitoring setup.

In previous posts, we have covered the fundamentals of Prometheus Monitoring and Prometheus Cardinality.

If you or your team are looking to get started with Prometheus, you can consider a hosted and managed Prometheus offering that can help eliminate your cardinality and long-term storage woes while reducing your monitoring cost significantly.

Authors

Tripad Mishra
