This blog post dives deep into the four metric types supported by Prometheus, along with their use cases and the PromQL functions that can be used to query them.
At its core, a metric in the context of monitoring and system performance is a quantifiable measure that is used to track and assess the status of a specific process or activity.
In Prometheus, every metric represents one or more time series, with each time series consisting of a metric name, a set of labels, and a series of data points (each with a timestamp and a value). Time series data in monitoring refers to data points indexed in time order.
As an open-source monitoring solution, Prometheus provides a robust data model for storing and querying metric data.
Metric Structure in Prometheus
Before diving into metric types, it's important to understand how Prometheus handles metric data. The Prometheus data model is designed to efficiently store time-series data, with each data point representing a timestamped sample of an observation from your systems.
The structure of a metric typically includes the following key components:
Metric Name: An explicit identifier for the metric, often reflecting what it measures. For example, http_requests_total.
Labels: Key-value pairs that provide additional dimensions to the metric, enabling more detailed and specific tracking. An example label for http_requests_total could be {method="GET", endpoint="/api"}.
Metric Value: The actual data point representing the measurement, which could be a count, a duration, etc.
Timestamp: The point in time when the metric value was recorded (often added automatically by the monitoring system).
Consider a metric named user_logins_total —
This metric could have labels like {user_type="admin", location="EU"} and a numerical value indicating the total count of logins. The timestamp would denote when this count was recorded.
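For illustration, such a metric might appear in the Prometheus exposition format like this (the label values and the count shown here are hypothetical):

```
# HELP user_logins_total Total number of user logins
# TYPE user_logins_total counter
user_logins_total{user_type="admin", location="EU"} 42
```

The timestamp is usually attached by Prometheus at scrape time rather than written into the exposition output.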
Types of Prometheus Metrics
Prometheus, through its various client libraries including Python, Go, and Java clients, primarily deals with four types of metrics:
Counter: A metric that only increases or resets to zero on restart. Ideal for tracking the number of requests, tasks completed, or errors.
Gauge: This represents a value that can go up or down, like temperature or current memory usage.
Histogram: Measures the distribution of events over time, such as request latencies or response sizes.
Summary: Similar to a histogram, but calculates configurable quantiles on the client side, in addition to a total count and sum of observed values.
Let's dive deeper into each of the Prometheus metric types.
Counters
A counter is a cumulative metric used primarily for tracking quantities that increase over time. Said simply, a counter counts! A common example is tracking the total number of HTTP requests to your server. Counters are ideal for monitoring the rate of events, like requests to a web server, task completions, or error occurrences. A counter is designed to only increase, which means its value should never decrease (except when it resets to zero, usually because the process generating it restarts).
Counters are often visualized in dashboards to show trends over time, like the rate of requests to a web service. They can also trigger alerts if the rate of errors or specific events exceeds a threshold, indicating potential issues in the monitored system.
node_network_receive_bytes_total in Node Exporter is an example of a counter that can be used to track the total number of bytes received on a network interface.
PromQL functions like rate() and increase() are typically used to analyze how a counter changes over a period of time.
The rate() function in PromQL is used to calculate a metric's per-second average rate of increase over a given time interval.
To determine the total change in a metric over a specific time period, PromQL increase() calculates the cumulative increase of a metric over a given time range.
The resets() function identifies the number of times a counter has been reset during a given time period.
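As a sketch, here is how these functions can be applied to the Node Exporter counter mentioned above (the time windows are arbitrary choices for illustration):

```promql
# Per-second average receive rate over the last 5 minutes
rate(node_network_receive_bytes_total[5m])

# Total bytes received over the last hour
increase(node_network_receive_bytes_total[1h])

# Number of counter resets in the last day
resets(node_network_receive_bytes_total[1d])
```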
Note on Counter Reset:
There are scenarios where a counter can reset. The most common reason for a counter reset is when the process generating the metric restarts. This could be due to a service restart, deployment, or a system reboot. When this happens, the counter starts from zero again.
💡
An upcoming feature in Prometheus adds a created timestamp to metrics to solve the long-standing issues with counter resets. See the talk from PromCon 2023.
This reset behavior is crucial for understanding how to interpret counter data, especially when using functions like rate() or increase() in PromQL. These functions are designed to account for counter resets. They can detect a reset by identifying when the counter value decreases from one scrape interval to the next. Upon detecting a reset, these functions assume the counter has been set to zero and then started increasing again.
It's important to be aware of counter resets because they can impact how you interpret the data. A sharp drop in a counter value followed by an increase could be misinterpreted as a decrease in the metric being measured, when in reality, it's just a reset. Understanding this behavior is key to accurately monitoring and analyzing metrics in Prometheus.
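To make the reset-handling behavior concrete, here is a minimal sketch in Python (a simplified model, not Prometheus's actual implementation) of how an increase-style calculation can account for resets by summing only the growth between consecutive samples:

```python
def increase_with_resets(samples):
    """Estimate the total increase of a counter from raw samples,
    treating any decrease as a counter reset (restart from zero).

    samples: list of counter values in scrape order.
    """
    total = 0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            # Normal monotonic growth: add the delta.
            total += curr - prev
        else:
            # Counter reset detected: assume the counter restarted from
            # zero, so the entire current value counts as new growth.
            total += curr
    return total

# A counter that grows to 100, resets on a restart, then grows to 30:
print(increase_with_resets([10, 60, 100, 5, 30]))  # 120
```

Without the reset handling, the drop from 100 to 5 would wrongly subtract 95 from the total instead of contributing 5.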
Gauges
Gauges represent a metric that can increase or decrease, akin to a thermometer. Gauges are versatile and can measure values like memory usage, temperature, or queue sizes, giving a snapshot of a system's state at a specific point in time.
Gauges are straightforward in terms of updating their value. They can be set to a particular value at any given time, incremented or decremented, based on the metric they are tracking.
Gauges are often visualized using line graphs in dashboards to depict their value changes over time. They are handy for observing the current state and trends of what's being measured rather than the rate of change.
From the JMX Exporter, which is used for Java applications, a gauge such as jvm_threads_current might be employed to monitor the number of active threads in a JVM.
When working with gauges, which can fluctuate up and down, specific functions are typically used to calculate statistical measures over a time series.
These functions include:
avg_over_time() - for computing the average
max_over_time() - for finding the maximum value
min_over_time() - for the minimum value
quantile_over_time() - for determining percentiles within the specified period
delta() - for the difference in the gauge value over the time series
These functions are instrumental in analyzing the trends and variations of gauge metrics, providing valuable insights into the performance and state of monitored systems.
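For example, applied to a hypothetical memory-usage gauge over a 10-minute window, these functions might look like:

```promql
avg_over_time(node_memory_Active_bytes[10m])
max_over_time(node_memory_Active_bytes[10m])
quantile_over_time(0.95, node_memory_Active_bytes[10m])
delta(node_memory_Active_bytes[10m])
```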
Histograms
Histograms are used to sample and aggregate distributions, such as latencies. They use configurable buckets to sort measurements into predefined ranges, which can be adjusted based on your monitoring needs. Histograms are excellent for understanding the distribution of metric values and helpful in performance analysis, like tracking request latencies or response sizes.
Histograms efficiently categorize measurement data into defined intervals, known as buckets, and tally the number (i.e., a counter) of measurements that fit into each of these buckets. These buckets are pre-defined during the instrumentation stage.
A key thing to note in the Prometheus Histogram type is that the buckets are cumulative. This means each bucket counts all values less than or equal to its upper bound, providing a cumulative distribution of the data. Simply put, each bucket contains the counts of all prior buckets. We will explore this in the example below.
Let's take an example of observing response times with buckets —
We could classify request times into meaningful time buckets like
0 to 200ms - le="200" (less or equal to 200)
200ms to 300ms - le="300" (less or equal to 300)
… and so on
Prometheus also adds a +Inf bucket by default.
Let's say our API's observed response time is 175ms; the count values for the buckets will look something like this:
Bucket    | Count
----------|------
0 - 200   | 1
0 - 300   | 1
0 - 500   | 1
0 - 1000  | 1
0 - +Inf  | 1
Here, you can see how the cumulative nature of the histogram works.
Now suppose the next observed response time of our API is 300ms; the count values will look like this:
Bucket    | Count
----------|------
0 - 200   | 1
0 - 300   | 2
0 - 500   | 2
0 - 1000  | 2
0 - +Inf  | 2
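The cumulative bucketing walked through above can be sketched in a few lines of Python (a simplified model, not the actual client library implementation):

```python
import math

# Upper bounds of the buckets, in milliseconds; +Inf is always present.
BOUNDS = [200, 300, 500, 1000, math.inf]

def observe(buckets, value):
    """Record one observation: every bucket whose upper bound is
    >= the value gets incremented (buckets are cumulative)."""
    for bound in BOUNDS:
        if value <= bound:
            buckets[bound] += 1

buckets = {bound: 0 for bound in BOUNDS}
observe(buckets, 175)   # first response time: 175 ms
observe(buckets, 300)   # second response time: 300 ms

# After both observations the counts match the tables above:
# {200: 1, 300: 2, 500: 2, 1000: 2, inf: 2}
print(buckets)
```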
It is essential to understand the structure of the histogram metric type in order to query it properly.
Each bucket is exposed as a counter, which can be accessed by adding a _bucket suffix and the le label. The _count and _sum series are generated by default to help with quantile and aggregate calculations.
_count is a counter with the total number of measurements available for the said metric.
_sum is a counter with the total (or the sum) of all values of the measurement.
The histogram_quantile() function calculates quantiles (e.g., medians, 95th percentiles) from histograms. It takes a quantile (a value between 0 and 1) and a histogram metric as arguments and computes the estimated value at that quantile across the histogram's buckets.
For instance, histogram_quantile(0.95, metric_name_here) estimates the value below which 95% of the observations in the histogram fall, providing insights into distribution tails like request latencies.
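As a rough sketch of what histogram_quantile() does under the hood, the following Python (with hypothetical bucket data) locates the bucket containing the target rank and linearly interpolates within it; the real implementation handles edge cases this sketch omits:

```python
def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, sorted by
    upper bound, with the last entry being the +Inf bucket.
    """
    total = buckets[-1][1]   # the +Inf bucket counts all observations
    rank = q * total         # position of the target observation
    lower_bound, count_below = 0.0, 0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            in_bucket = cum_count - count_below
            # Linear interpolation inside the matching bucket.
            return lower_bound + (upper_bound - lower_bound) * (rank - count_below) / in_bucket
        lower_bound, count_below = upper_bound, cum_count

# Hypothetical latency buckets (upper bound in ms, cumulative count):
buckets = [(100, 10), (200, 30), (500, 40), (float("inf"), 40)]
print(estimate_quantile(0.5, buckets))  # 150.0 -- the median lands in the 100-200ms bucket
```

Because the result is interpolated from bucket boundaries rather than raw observations, its accuracy depends on how well the bucket layout matches the data.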
The histogram data type can also be aggregated, i.e., combining multiple histograms into a single histogram. Suppose you're monitoring response times across different servers. Each server emits a histogram of response times. You would aggregate these individual histograms to understand the overall response time distribution across all servers. This aggregation is done by summing up the counts in corresponding buckets across all histograms.
For example, you could use a PromQL query like this -
sum by (le) (rate(http_request_duration_seconds_bucket{endpoint="payment"}[5m]))
In this example, the sum by (le) part aggregates the counts in each bucket (le label) across all instances of the endpoint labeled "payment". The rate function is applied over a 5-minute interval ([5m]), calculating the per-second rate of increase for each bucket, which is helpful for histograms derived from counters. This query gives a unified view of the request duration distribution across all servers for the specified endpoint.
Native Histograms
Starting from Prometheus version 2.40, an experimental feature provides support for native histograms. With native histograms, you only need a single time series, which includes a variable number of buckets along with the sum and count of observations. This feature offers significantly higher resolution while being more cost-effective.
Summaries
Summaries track the size and number of events and are commonly used to calculate percentiles, like the 99th percentile for latency monitoring. The total sum and count are automatically maintained for each summary metric. Summaries are ideal for calculating quantiles and averages, and are used for metrics where aggregating over time is essential, like request latency or transaction duration.
A summary metric automatically calculates and stores quantiles (e.g., 50th, 90th, 95th percentiles) over a sliding time window. This means it tracks both the number of observations (like requests) and their sizes (like latency), and then computes the quantiles of these observations in real-time. A Prometheus summary typically consists of three parts: the count (_count) of observed events, the sum of these events' values (_sum), and the calculated quantiles.
Example -
# HELP http_request_duration_seconds The duration of HTTP requests in seconds
# TYPE http_request_duration_seconds summary
http_request_duration_seconds{quantile="0.5"} 0.055
http_request_duration_seconds{quantile="0.9"} 0.098
http_request_duration_seconds{quantile="0.95"} 0.108
http_request_duration_seconds{quantile="0.99"} 0.15
http_request_duration_seconds_sum 600
http_request_duration_seconds_count 10000
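From the _sum and _count series shown above, an average request duration over a window can be derived with a query along these lines:

```promql
# Average request duration over the last 5 minutes
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
```

The same pattern works for histograms, since they expose _sum and _count series as well.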
Summaries are better suited when you need accurate quantiles for individual instances or components and don't intend to aggregate these quantiles across different dimensions or labels. Histograms, by comparison, are helpful when aggregating data across multiple instances or dimensions, like calculating global request latency across several servers.
A significant limitation of summaries is that you cannot aggregate their quantiles across multiple instances. While you can sum the counts and sums, the quantiles are only meaningful within the context of a single summary instance.
Summaries can be more resource-intensive since they compute quantiles on the fly and keep a sliding window of observations. Histograms can be more efficient regarding memory and CPU usage, especially when dealing with high-cardinality data. Since the bucket configuration is fixed, they can also be optimized for storage.
Visualization and Integration
While Prometheus provides powerful querying capabilities through PromQL, many organizations use Grafana as their primary visualization tool for Prometheus metrics. Grafana offers rich dashboarding capabilities and seamless integration with Prometheus data sources. You can also use Last9 to explore these metrics through user-friendly navigation and dashboards.
Summing up
The fundamentals we have covered in this post around metrics types in Prometheus will hopefully help you better grasp your monitoring setup.
In previous posts, we have covered the fundamentals of Prometheus Monitoring and Prometheus Cardinality.
If you or your team is looking to get started with Prometheus, consider a hosted and managed Prometheus offering that can help eliminate your cardinality and long-term storage woes while significantly reducing your monitoring costs.
Additional Resources
For more detailed information about implementing these metric types, refer to the official Prometheus documentation and the client library documentation for your preferred programming language. The Prometheus client libraries provide comprehensive examples of the different metric implementations across languages like Python, Go, and Java.