This blog post dives deep into the four metric types supported by Prometheus, their use cases, and the PromQL functions that can be used to query them.
At its core, a metric in the context of monitoring and system performance is a quantifiable measure that is used to track and assess the status of a specific process or activity.
In Prometheus, every metric represents one or more time series, with each time series consisting of a metric name, a set of labels, and a series of data points (each with a timestamp and a value). Time series data in monitoring refers to data points indexed in time order.
Metrics Structure in Prometheus:
The structure of a metric typically includes the following key components:
- Metric Name: An explicit identifier for the metric, often reflecting what it measures. For example, `http_requests_total`.
- Labels: Key-value pairs that provide additional dimensions to the metric, enabling more detailed and specific tracking. An example label set for `http_requests_total` could be `{method="GET", endpoint="/api"}`.
- Metric Value: The actual data point representing the measurement, which could be a count, a duration, etc.
- Timestamp: The point in time when the metric value was recorded (often added automatically by the monitoring system).
Consider a metric named `user_logins_total`. This metric could have labels like `{user_type="admin", location="EU"}` and a numerical value indicating the total count of logins. The timestamp would denote when this count was recorded.
Types of Prometheus Metrics
Prometheus primarily deals with four types of metrics:
- Counter: A metric that only increases or resets to zero on restart. Ideal for tracking the number of requests, tasks completed, or errors.
- Gauge: This represents a value that can go up or down, like temperature or current memory usage.
- Histogram: Measures the distribution of events over time, such as request latencies or response sizes.
- Summary: Similar to histograms, but calculates quantiles on the client side and exposes them along with a total count and sum of observed values.
Let's dive deeper into each of the Prometheus metric types.
Counters
A Counter is a cumulative metric used primarily for tracking quantities that increase over time. Simply put, a counter counts! Counters are ideal for monitoring the rate of events, like the total number of requests to a web server, task completions, or error occurrences. A counter is designed to only increase, which means its value should never decrease (except when reset to zero, usually due to a restart of the process generating it).
Counters are often visualized in dashboards to show trends over time, like the rate of requests to a web service. They can also trigger alerts if the rate of errors or specific events exceeds a threshold, indicating potential issues in the monitored system.
`node_network_receive_bytes_total` in Node Exporter is an example of a counter that tracks the total number of bytes received on a network interface.
PromQL functions like `rate()` and `increase()` can be used to calculate the per-second average rate of increase of a counter over a period of time.
The `rate()` function in PromQL is used to calculate a metric's per-second average rate of increase over a given time interval.
To determine the total change in a metric over a specific time period, PromQL's `increase()` function calculates the cumulative increase of a metric over a given time range.
The `resets()` function identifies the number of times a counter has been reset during a given time period.
Note on Counter Reset:
There are scenarios where a counter can reset. The most common reason for a counter reset is when the process generating the metric restarts. This could be due to a service restart, deployment, or a system reboot. When this happens, the counter starts from zero again.
💡
An upcoming feature in Prometheus adds a created timestamp to metrics to solve the long-standing issues with counter resets. See the talk from PromCon 2023.
This reset behavior is crucial for understanding how to interpret counter data, especially when using functions like `rate()` or `increase()` in PromQL. These functions are designed to account for counter resets. They detect a reset by identifying when the counter value decreases from one scrape interval to the next. Upon detecting a reset, these functions assume the counter has been set to zero and then started increasing again.
It's important to be aware of counter resets because they can impact how you interpret the data. A sharp drop in a counter value followed by an increase could be misinterpreted as a decrease in the metric being measured, when in reality, it's just a reset. Understanding this behavior is key to accurately monitoring and analyzing metrics in Prometheus.
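To make the reset handling concrete, here is a minimal Python sketch of how an `increase()`/`rate()`-style calculation can compensate for a counter reset between scrapes. This is an illustration of the idea only, not how Prometheus is actually implemented (PromQL additionally extrapolates to the window boundaries), and the scrape data is hypothetical:

```python
def increase(samples):
    """Total increase of a counter over a list of (timestamp, value)
    samples. When the value drops between two samples, a reset is
    assumed and the post-reset value is counted from zero."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # Reset detected: the counter restarted at zero.
            total += curr
    return total

def rate(samples):
    """Per-second average rate of increase over the sampled window."""
    window = samples[-1][0] - samples[0][0]
    return increase(samples) / window

# Counter scraped every 15s; it resets between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]
print(increase(scrapes))  # 110.0 (30 + 30 + 20 + 30), not a negative jump
print(rate(scrapes))      # ~1.83 per second
```

Without the reset check, the naive difference `50 - 100` would suggest the counter went down, which is exactly the misinterpretation described above.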
Gauges
Gauges represent a metric that can increase or decrease, akin to a thermometer. Gauges are versatile and can measure values like memory usage, temperature, or queue sizes, giving a snapshot of a system's state at a specific point in time.
Gauges are straightforward in terms of updating their value. They can be set to a particular value at any given time, incremented or decremented, based on the metric they are tracking.
Gauges are often visualized using line graphs in dashboards to depict their value changes over time. They are handy for observing the current state and trends of what's being measured rather than the rate of change.
From the JMX Exporter, which is used for Java applications, a Gauge might be employed to monitor the number of active threads in a JVM, labeled as `jvm_threads_current`.
When working with gauges, which can fluctuate up and down, specific functions are typically used to calculate statistical measures over a time series.
These functions include:
- `avg_over_time()` - for computing the average
- `max_over_time()` - for finding the maximum value
- `min_over_time()` - for the minimum value
- `quantile_over_time()` - for determining percentiles within the specified period
- `delta()` - for the difference in the gauge value over the time series
These functions are instrumental in analyzing the trends and variations of gauge metrics, providing valuable insights into the performance and state of monitored systems.
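As a rough intuition for what these `*_over_time()` functions compute, here is a Python sketch over a hypothetical window of gauge samples (e.g., memory usage in MiB). This only mirrors the arithmetic; in PromQL the inputs are time series selected from the TSDB, not plain lists:

```python
import statistics

# Hypothetical gauge samples within one range window.
window = [512.0, 540.5, 498.2, 610.0, 575.3]

avg_over_time = statistics.fmean(window)      # average value in the window
max_over_time = max(window)                   # peak value
min_over_time = min(window)                   # lowest value
delta = window[-1] - window[0]                # net change first -> last sample
p90 = statistics.quantiles(window, n=10)[-1]  # rough 90th percentile

print(avg_over_time, max_over_time, min_over_time, delta)
```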
Histogram
Histograms are used to sample and aggregate distributions, such as latencies. Histograms are excellent for understanding the distribution of metric values and helpful in performance analysis, like tracking request latencies or response sizes.
Histograms efficiently categorize measurement data into defined intervals, known as buckets, and tally the number (i.e., a counter) of measurements that fit into each of these buckets. These buckets are pre-defined during the instrumentation stage.
A key thing to note in the Prometheus Histogram type is that the buckets are cumulative. This means each bucket counts all values less than or equal to its upper bound, providing a cumulative distribution of the data. Simply put, each bucket contains the counts of all prior buckets. We will explore this in the example below.
Let's take an example of observing response times with buckets. We could classify request times into meaningful time buckets like:
- 0 to 200ms - `le="200"` (less than or equal to 200)
- 200ms to 300ms - `le="300"` (less than or equal to 300)
- ... and so on
Prometheus also adds a `+Inf` bucket by default.
Let's say our API's observed response time is 175ms; the count values for the buckets will look something like this:

| Bucket | Count |
| --- | --- |
| 0 - 200 | 1 |
| 0 - 300 | 1 |
| 0 - 500 | 1 |
| 0 - 1000 | 1 |
| 0 - +Inf | 1 |
Here, you can see how the cumulative nature of the histogram works.
Let's say in the following observation our API's response time is 300ms; the count values will look like this:

| Bucket | Count |
| --- | --- |
| 0 - 200 | 1 |
| 0 - 300 | 2 |
| 0 - 500 | 2 |
| 0 - 1000 | 2 |
| 0 - +Inf | 2 |
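The cumulative update rule behind these tables can be sketched in a few lines of Python. The bucket bounds are the hypothetical ones from the example above; on each observation, every bucket whose upper bound is at or above the value is incremented:

```python
# Hypothetical bucket upper bounds in ms, mirroring the tables above.
bounds = [200, 300, 500, 1000, float("inf")]

def observe(counts, value):
    """Record one observation into cumulative buckets: every bucket with
    an upper bound >= value is incremented, so each bucket also contains
    the counts of all smaller buckets."""
    for i, le in enumerate(bounds):
        if value <= le:
            counts[i] += 1

counts = [0] * len(bounds)
observe(counts, 175)  # increments every bucket from le=200 upward
observe(counts, 300)  # skips le=200, increments le=300 and above
print(counts)         # [1, 2, 2, 2, 2], matching the second table
```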
It is essential to note the histogram metric's structure in order to query it properly.
Each bucket is available as a counter, which can be accessed by adding a `_bucket` suffix and the `le` label. The `_count` and `_sum` series are generated by default to help with quantile calculation.
- `_count` is a counter with the total number of measurements available for the metric.
- `_sum` is a counter with the total (or the sum) of all measured values.
For example:

```
http_request_duration_seconds_sum{host="example.last9.io"} 9754.113
http_request_duration_seconds_count{host="example.last9.io"} 6745
http_request_duration_seconds_bucket{host="example.last9.io", le="200"} 300
http_request_duration_seconds_bucket{host="example.last9.io", le="300"} 524
...
```
The `histogram_quantile()` function calculates quantiles (e.g., medians, 95th percentiles) from histograms. It takes a quantile (a value between 0 and 1) and a histogram metric as arguments and computes the estimated value at that quantile across the histogram's buckets.
For instance, `histogram_quantile(0.95, metric_name_here)` estimates the value below which 95% of the observations in the histogram fall, providing insights into distribution tails like request latencies.
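The estimation itself is a linear interpolation inside the bucket that contains the target rank. Here is a simplified Python sketch of that idea, with hypothetical bounds and counts; PromQL's real `histogram_quantile()` additionally has special handling for the `+Inf` bucket and other edge cases:

```python
def histogram_quantile(q, upper_bounds, cum_counts):
    """Estimate the q-quantile from cumulative histogram buckets using
    linear interpolation within the bucket containing the target rank."""
    total = cum_counts[-1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in zip(upper_bounds, cum_counts):
        if rank <= count:
            # Interpolate the rank's position inside this bucket.
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * frac
        prev_bound, prev_count = bound, count
    return upper_bounds[-1]

upper_bounds = [0.1, 0.25, 0.5, 1.0]       # bucket upper bounds in seconds
cum_counts = [50, 80, 95, 100]             # cumulative counts, 100 total
print(histogram_quantile(0.95, upper_bounds, cum_counts))  # ~0.5s
```

Because the answer is interpolated from bucket boundaries, its accuracy depends entirely on how well the buckets were chosen at instrumentation time.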
The histogram data type can also be aggregated, i.e., combining multiple histograms into a single histogram. Suppose you're monitoring response times across different servers. Each server emits a histogram of response times. You would aggregate these individual histograms to understand the overall response time distribution across all servers. This aggregation is done by summing up the counts in corresponding buckets across all histograms.
For example, you could use a PromQL query like this:

```
sum by (le) (rate(http_request_duration_seconds_bucket{endpoint="payment"}[5m]))
```

In this example, the `sum by (le)` part aggregates the counts in each bucket (the `le` label) across all instances of the endpoint labeled `"payment"`. The `rate` function is applied over a 5-minute interval (`[5m]`), calculating the per-second rate of increase for each bucket, which is helpful for histograms derived from counters. This query gives a unified view of the request duration distribution across all servers for the specified endpoint.
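The aggregation step reduces to an element-wise sum per `le` label. A small Python sketch with made-up per-server bucket counts shows why this works: because each bucket is just a counter, the combined histogram is valid as long as all servers share the same bucket bounds:

```python
# Hypothetical cumulative bucket counts from two servers with identical
# `le` bounds for the same metric.
server_a = {"200": 120, "300": 200, "500": 240, "+Inf": 250}
server_b = {"200": 80,  "300": 150, "500": 190, "+Inf": 200}

# Aggregation = element-wise sum per `le` label, like `sum by (le)`.
combined = {le: server_a[le] + server_b[le] for le in server_a}
print(combined)  # {'200': 200, '300': 350, '500': 430, '+Inf': 450}
```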
Native Histograms
Starting with Prometheus version 2.40, an experimental feature provides support for native histograms. With native histograms, you only need a single time series, which includes a variable number of buckets along with the sum and count of observations. This feature offers significantly higher resolution while being more cost-effective.
Summary
Summaries track the size and number of events. They are ideal for calculating quantiles and averages, and are used for metrics where tracking distributions over time is essential, like request latency or transaction duration.
A summary metric automatically calculates and stores quantiles (e.g., 50th, 90th, 95th percentiles) over a sliding time window. This means it tracks both the number of observations (like requests) and their sizes (like latency), and then computes the quantiles of these observations in real time. A Prometheus summary typically consists of three parts: the count (`_count`) of observed events, the sum of these events' values (`_sum`), and the calculated quantiles.
Example:

```
# HELP http_request_duration_seconds The duration of HTTP requests in seconds
# TYPE http_request_duration_seconds summary
http_request_duration_seconds{quantile="0.5"} 0.055
http_request_duration_seconds{quantile="0.9"} 0.098
http_request_duration_seconds{quantile="0.95"} 0.108
http_request_duration_seconds{quantile="0.99"} 0.15
http_request_duration_seconds_sum 600
http_request_duration_seconds_count 10000
```
Summaries are better suited when you need accurate quantiles for individual instances or components and don't intend to aggregate these quantiles across different dimensions or labels. Histograms, by comparison, are helpful when aggregating data across multiple instances or dimensions, like calculating global request latency across several servers.
A significant limitation of summaries is that you cannot aggregate their quantiles across multiple instances. While you can sum the counts and sums, the quantiles are only meaningful within the context of a single summary instance.
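A tiny Python demonstration (with made-up latency samples) shows why combining per-instance quantiles goes wrong: the average of two instances' medians is generally not the median of the combined data.

```python
import statistics

# Hypothetical request latencies observed on two separate instances.
instance_a = [1, 2, 3, 4, 100]   # median 3, skewed by one slow request
instance_b = [5, 6, 7, 8, 9]     # median 7

# Naively averaging per-instance medians...
avg_of_medians = (statistics.median(instance_a) +
                  statistics.median(instance_b)) / 2   # 5.0

# ...does not equal the median of the pooled observations.
true_median = statistics.median(instance_a + instance_b)  # 5.5

print(avg_of_medians, true_median)  # 5.0 vs 5.5 -- they disagree
```

With a histogram, the raw bucket counts can be summed first and the quantile computed afterwards, which is why histograms aggregate correctly where summaries cannot.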
Summaries can be more resource-intensive since they compute quantiles on the fly and keep a sliding window of observations. Histograms can be more efficient regarding memory and CPU usage, especially when dealing with high-cardinality data. Since the bucket configuration is fixed, they can also be optimized for storage.
Summing up
The fundamentals we have covered in this post around metric types in Prometheus will hopefully help you better grasp your monitoring setup.
In previous posts, we have covered the fundamentals of Prometheus Monitoring and Prometheus Cardinality.
If you or your team is looking to get started with Prometheus, you can consider a hosted and managed Prometheus offering that can help eliminate your cardinality and long-term storage woes while significantly reducing your monitoring costs.