Prometheus is a popular open-source monitoring system. In this blog, we'll cover the basics of Prometheus monitoring, including its architecture, key features, and alternatives.
Jan 31st, ‘23 / 7 min read
Thanks to their role in observability, time-series databases (TSDBs) have become prevalent over the past few years. They store the metrics that help teams track and understand application performance. According to 2022 research from the Enterprise Strategy Group, mastering observability enables developers to release 60% more products than other developers.
Prometheus stands out among the different open-source time series databases and has become the industry standard for end-to-end monitoring solutions.
This article explores the following topics.
- Basics of Prometheus monitoring
- Prometheus Architecture
- Prometheus multi-dimensional data model
- Long-term storage of telemetry data
- Prometheus alerting
- Prometheus exporters
- Alternatives to Prometheus
What is Prometheus?
Prometheus is a systems and service monitoring system for cloud-native environments. Initially developed at SoundCloud in 2012, it joined the Cloud Native Computing Foundation (CNCF) in 2016 as the second hosted project after Kubernetes.
The Prometheus server collects metrics from targets at specified intervals and flattens all data into untyped time series. Prometheus offers a pull-based model for scraping metrics, built-in alerting through Alertmanager, a multi-dimensional data model, PromQL (the Prometheus query language), a large number of integrations, and a robust open-source community, all of which have helped the tool become widely used in cloud-native monitoring. The use cases for Prometheus span several industries, including DevOps, finance, healthcare, and real-time tracking.
Prometheus tracks and reports a program's performance by scraping metrics data from one or more targets over HTTP endpoints. A target could be an in-house instrumented application or a third-party system.
Prometheus stores the metrics data and examines scraped data using its query language, PromQL. Here is a graphical illustration of the Prometheus architecture:
Multi-dimensional Data Model
Prometheus's data model is different from that of other monitoring systems. Every scraped sample is assigned a metric name and a set of key-value pairs called labels, which lets you tell one time series from another.
A time series is a stream of timestamped values belonging to the same metric and the same set of labelled dimensions. Besides stored time series, Prometheus may generate temporary derived time series as the result of queries.
The metric name and labels uniquely identify each time series. A metric name is an identifier for the kind of data being recorded, e.g. http_requests_total, which denotes the total number of HTTP requests.
Labels power Prometheus's dimensional data model. Each combination of labels (key-value pairs) for the same metric name identifies a time series uniquely. For example, all HTTP requests that used the method POST to the /api/users route will carry labels such as method="POST" and path="/api/users" on the metric http_requests_total. The query language allows filtering and aggregation based on these dimensions. Changing any label value, including adding or removing a label, will create a new time series.
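To make the identity rule concrete, here is a minimal sketch in plain Python. It is not how Prometheus stores data internally; the helper names (series_key, select, sum_by) and the sample values are made up for illustration. It only shows that the metric name plus the full label set acts as the key of a series, and that filtering and aggregation operate on those labels.

```python
from collections import defaultdict


def series_key(name, labels):
    """Identity of a time series: metric name plus its complete label set."""
    return (name, frozenset(labels.items()))


# Hypothetical scraped samples: (metric name, labels, value).
samples = [
    ("http_requests_total", {"method": "POST", "path": "/api/users"}, 3),
    ("http_requests_total", {"method": "GET", "path": "/api/users"}, 10),
    ("http_requests_total", {"method": "POST", "path": "/api/users"}, 4),
    ("http_requests_total", {"method": "POST", "path": "/api/orders"}, 1),
]

# Keep the latest value per series. Samples 1 and 3 share a key, so they
# belong to the same series; a different label value means a new series.
store = {}
for name, labels, value in samples:
    store[series_key(name, labels)] = value


def select(store, name, **matchers):
    """Return the series of `name` whose labels contain every matcher."""
    wanted = set(matchers.items())
    return {k: v for k, v in store.items() if k[0] == name and wanted <= k[1]}


def sum_by(selected, label):
    """Aggregate values across series, grouped by one label."""
    totals = defaultdict(int)
    for (_, labels), value in selected.items():
        totals[dict(labels).get(label, "")] += value
    return dict(totals)


posts = select(store, "http_requests_total", method="POST")
by_method = sum_by(select(store, "http_requests_total"), "method")
print(len(store))   # 3 distinct series
print(by_method)    # {'POST': 5, 'GET': 10}
```

The select and sum_by helpers loosely mirror what a label matcher and a sum by (...) aggregation do in PromQL, under the simplifying assumption that only the latest value of each series matters.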
Prometheus Query Language
Prometheus lets you query data with the Prometheus Query Language (PromQL). Query results can be consumed in three ways: as a graph for visualization, as a table, or through the HTTP API.
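As a rough illustration of the HTTP API route, the sketch below builds the URL for an instant query against the /api/v1/query endpoint. The server address http://localhost:9090 and the example PromQL expression are assumptions, and the helper name instant_query_url is made up; the code only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build the URL of an instant query; the expression goes in `query`."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


url = instant_query_url(
    "http://localhost:9090",  # assumed address of a local Prometheus server
    'sum by (path) (rate(http_requests_total{method="POST"}[5m]))',
)
print(url)
```

Sending a GET request to that URL on a running server should return a JSON document with the query result, which is the same data the graph and table views are built from.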
Prometheus Metric Types
Simply put, metrics are numerical measurements that change over time. The metrics you need depend on the program. If you are developing a web server, your metric could be the number of HTTP requests. But if you are building a database, it could be the number of open connections, queries, and read/write latency.
Metrics help you understand why your application behaves the way it does. The four main metric types in the Prometheus client library are as follows:
Counters monitor how often an event occurs within a program. They represent values that only increase (or reset to zero when the process restarts) and are exposed as time series. For example, http_requests_total, which reports the total number of HTTP requests to an endpoint in an API, is a counter metric.
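Because a counter only ever goes up, the interesting number is usually how fast it grows. The sketch below estimates a per-second rate from a handful of scraped (timestamp, value) pairs and treats any drop in value as a restart of the monitored process. It is a simplified illustration with made-up sample data, not a reimplementation of PromQL's rate() function, which handles window boundaries more carefully.

```python
def increase(samples):
    """Total increase across (timestamp, value) samples, tolerating resets."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter restarted from zero, so the whole
        # current value counts as new increase.
        total += curr - prev if curr >= prev else curr
    return total


def per_second_rate(samples):
    """Average increase per second between the first and last sample."""
    elapsed = samples[-1][0] - samples[0][0]
    return increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 seconds; the process restarts before t=45.
scraped = [(0, 100), (15, 130), (30, 160), (45, 20), (60, 50)]

print(increase(scraped))         # 110.0
print(per_second_rate(scraped))  # 110 requests over 60 seconds
```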
Gauges periodically measure a metric or take a snapshot at a specific moment. A gauge's value is not ever-increasing; it can go up or down over time. Gauges help you query metrics with rise-and-fall numerical tendencies. An example of a measure you can query with gauges is the temperature or CPU utilization percentage.
Histograms sample observations, such as request durations or response sizes, and count them in configurable buckets. This captures the distribution of a value, for example latency, across many events. You can choose the bucket boundaries yourself or let the Prometheus client library use a set of default buckets. Each bucket is exposed as its own time series, and you can define additional buckets if your application needs a wider range of values.
Histograms have a few drawbacks, the biggest being that you must pre-define the boundary values of the buckets. Because changing the boundaries requires a code change, you need to estimate the expected latency range up front, and because every bucket adds a time series, you need to limit the number of buckets to keep your program's cardinality in check.
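The bucket mechanics can be sketched in a few lines of plain Python. The class below is an illustration rather than the client library's implementation, and the boundaries and observed values are made up. It shows two properties worth remembering: each observation lands in the first bucket whose upper bound is greater than or equal to it, and exposed bucket counts are cumulative, ending with a +Inf bucket that equals the total count.

```python
import bisect
import math


class Histogram:
    """Toy histogram with pre-defined upper bounds (illustrative only)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds) + [math.inf]  # the +Inf bucket catches the rest
        self.counts = [0] * len(self.bounds)       # per-bucket, not yet cumulative
        self.total = 0.0                           # running sum of observations

    def observe(self, value):
        # bisect_left returns the index of the first bound >= value.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def cumulative(self):
        """Return (upper bound, cumulative count) pairs, as they are exposed."""
        running, out = 0, []
        for bound, count in zip(self.bounds, self.counts):
            running += count
            out.append((bound, running))
        return out


latency = Histogram([0.1, 0.5, 1.0])  # hypothetical boundaries in seconds
for seconds in (0.05, 0.1, 0.3, 0.7, 2.5):
    latency.observe(seconds)

print(latency.cumulative())
# [(0.1, 2), (0.5, 3), (1.0, 4), (inf, 5)]
```

Note that the 2.5 second observation is only visible in the +Inf bucket. If slow requests like that matter, the boundaries have to be chosen to cover them before the code ships, which is the drawback described above.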
Summaries, like histograms, capture the distribution of an attribute across multiple events. However, unlike histograms, they report quantile values calculated in the client rather than estimates derived from buckets at query time. They work best when you always need the same quantile, such as P99, and want to avoid the hassle of setting histogram buckets.
Although they sound more efficient, do not reach for summaries as a default replacement for histograms. Quantile values (provided by summaries) cannot be averaged or otherwise aggregated across instances, and the time window they cover is set in the client, so it cannot be adjusted at query time.
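A small worked example shows why averaging quantiles goes wrong. The numbers are invented: one instance serves 1000 fast requests and another serves 10 slow ones. The percentile helper uses the simple nearest-rank method, which is an assumption made for this sketch and not the algorithm any particular client library uses.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # integer form of ceil(pct * n / 100)
    return ordered[rank - 1]


instance_a = [10] * 1000   # 1000 requests that took 10 ms each
instance_b = [1000] * 10   # 10 requests that took 1000 ms each

p99_a = percentile(instance_a, 99)                   # 10
p99_b = percentile(instance_b, 99)                   # 1000
averaged = (p99_a + p99_b) / 2                       # 505.0
true_p99 = percentile(instance_a + instance_b, 99)   # 10

print(averaged, true_p99)
```

The average of the two per-instance P99 values is 505 ms, while the P99 over all 1010 requests is 10 ms. Histogram buckets do not have this problem because their counts can be summed across instances before a quantile is estimated.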
How Prometheus scrapes metrics
Prometheus uses a pull-based model: it scrapes its targets periodically to collect new metric samples. The scrape_interval setting controls how frequently Prometheus scrapes.
Prometheus exporters help monitor programs that cannot be instrumented directly with Prometheus. They gather metrics from a particular third-party program and expose them for Prometheus servers to collect. In contrast to traditional monitoring systems, which rely on agents or embedded instrumentation to gather data and "push" metrics to the monitoring backend, Prometheus servers pull data from instrumented applications and Prometheus exporters. One of the most popular exporters is the Node Exporter, which collects metrics from Linux machines.
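At its core, an exporter translates whatever a third-party program reports into the text format that Prometheus scrapes. The sketch below shows only that translation step, using the standard library. The queue_stats dictionary, the queue_depth metric, and the render helper are hypothetical; a real exporter would normally use an official client library and serve the result over HTTP.

```python
def escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render(metric, help_text, metric_type, samples):
    """Render one metric family in the Prometheus text exposition format."""
    lines = [f"# HELP {metric} {help_text}", f"# TYPE {metric} {metric_type}"]
    for labels, value in samples:
        if labels:
            pairs = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{metric}{{{pairs}}} {value}")
        else:
            lines.append(f"{metric} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical statistics read from a third-party job queue.
queue_stats = {"emails": 12, "reports": 3}

body = render(
    "queue_depth",
    "Jobs waiting in each queue.",
    "gauge",
    [({"queue": name}, depth) for name, depth in queue_stats.items()],
)
print(body)
# # HELP queue_depth Jobs waiting in each queue.
# # TYPE queue_depth gauge
# queue_depth{queue="emails"} 12
# queue_depth{queue="reports"} 3
```

Each scrape would call something like render again, so the values always reflect the current state of the third-party program rather than a cached push.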
You can find many more exporters on GitHub in the awesome-prometheus repository.
Kubernetes monitoring using Prometheus
For Kubernetes-based systems, you can use the Prometheus Operator. It enables dynamic discovery of scrape targets in a Kubernetes environment based on the metadata of pods, services, and other resources. The purpose of this project is to simplify and automate the configuration of a Prometheus-based monitoring stack for Kubernetes clusters.
Prometheus alerting
The Prometheus server periodically evaluates PromQL expressions known as alerting rules. When a user-defined alerting rule fires, the Prometheus server sends the resulting alerts to the Alertmanager. The Alertmanager takes care of deduplicating, grouping, inhibiting, silencing, and routing alerts to the appropriate recipient.
Alertmanager can route the alerts to the proper notification channels, such as email, or third-party alerting applications such as PagerDuty, Slack, or OpsGenie. Recently support was also added for modern destinations such as Discord.
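Deduplication and grouping are easier to picture with a toy example. The sketch below is a conceptual illustration in plain Python, not Alertmanager's implementation; the alert names, labels, and helper functions are invented. It drops alerts with an identical label set and then bundles the remainder by a chosen set of labels, so that one notification can cover several related alerts.

```python
from collections import defaultdict


def fingerprint(alert):
    """Two alerts with the same label set are treated as the same alert."""
    return frozenset(alert["labels"].items())


def deduplicate(alerts):
    """Keep the first occurrence of each distinct label set."""
    seen, unique = set(), []
    for alert in alerts:
        fp = fingerprint(alert)
        if fp not in seen:
            seen.add(fp)
            unique.append(alert)
    return unique


def group_by(alerts, label_names):
    """Bundle alerts that share the same values for the grouping labels."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(name, "") for name in label_names)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts arriving from a Prometheus server.
incoming = [
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "a"}},
    {"labels": {"alertname": "HighLatency", "service": "api", "instance": "b"}},
    {"labels": {"alertname": "DiskFull", "service": "db", "instance": "c"}},
]

notifications = group_by(deduplicate(incoming), ["alertname", "service"])
for key, alerts in notifications.items():
    print(key, len(alerts))
# ('HighLatency', 'api') 2
# ('DiskFull', 'db') 1
```

Four incoming alerts collapse into two notifications here: the repeated alert for instance a is dropped, and both HighLatency alerts for the api service travel together.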
Prometheus has an expression browser where you can enter any query and view the results as a table or a graph.
Recently, PromLens was donated to the Prometheus project and will be part of upcoming Prometheus releases to improve the built-in visualization.
Visualization using Grafana
Alternatively, you can use Grafana to create dashboards that visualize metrics stored in Prometheus. Grafana is a multi-platform open-source analytics and interactive visualization web application. It can be installed independently as an open-source package, and a Prometheus data source can be configured inside it to fetch time series data from Prometheus using PromQL.
Long term storage
Prometheus includes a local on-disk time series database for storage. As per the Prometheus docs:
"Again, Prometheus's local storage is not intended to be durable long-term storage; external solutions offer extended retention and data durability."
Prometheus supports long-term storage by remote-writing data to third-party services such as Mimir, Cortex, and Levitate. You can also run Prometheus in agent mode and remote-write the data to long-term storage, which helps reduce the operational overhead of running Prometheus.
Alternatives to Prometheus
There is a plethora of alternatives to Prometheus, from InfluxDB, Splunk, and Zabbix to Cassandra and VictoriaMetrics. However, these tools address different facets of the problem. Some assist with analytics, while others work with logs. Some serve as data aggregators, while others offer a monitoring interface and complete monitoring solutions.
These tools, like Prometheus, have specific use cases and drawbacks. For example, while Prometheus has scaling issues that arise from the predetermined buckets explained earlier, InfluxDB does not have an equivalent of Alertmanager for alerting. With 60% of software engineers agreeing that most monitoring tools fail to provide a unified and complete view of an application's performance, it is clear that these issues are widespread. It has become essential to use more capable alternatives to these well-known TSDBs.