Jun 12th, '23 / 10 min read

Prometheus vs Thanos

Everything you want to know about Prometheus and Thanos, their differences, and how they can work together.

In the world of monitoring and observability, Prometheus and Thanos have emerged as two powerful tools for handling time series data. Both of these systems offer robust features and functionalities that help organizations gain valuable insights into their infrastructure and applications.

But deciding between Prometheus and Thanos can be a daunting task, as each has a unique set of advantages, as well as drawbacks. In this blog, we'll delve into the characteristics, differentiators, and challenges associated with Prometheus and Thanos, in order to help you make an informed decision between these two titans.

What is Prometheus?

Prometheus is an open-source monitoring and alerting system. It was originally developed at SoundCloud, the online music streaming and distribution platform, when the company found that its existing metrics and monitoring tools could not meet its needs.

So, in developing Prometheus, SoundCloud designed it to collect and store time series data and provide real-time metrics for monitoring and analysis. Prometheus uses a pull-based model to scrape metrics from targets such as applications, services, and infrastructure components.

With its flexible query language called PromQL, Prometheus allows users to retrieve and analyze the collected metrics efficiently. It also provides robust support for alerting, enabling users to define custom rules and receive notifications when certain conditions are met.

Prometheus is designed as a single-server architecture, where each instance is responsible for collecting, storing, and querying its own data. Targets expose their metrics over HTTP, and the Prometheus server scrapes them at a configured interval. Recent samples are held in memory and persisted to local disk, and data older than the configured retention period (15 days by default) is deleted automatically.
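To make the pull model concrete, here is a minimal sketch of what a scrape has to do with the text a target returns. It assumes the plain text exposition format and only handles simple sample lines; the payload and the parse_exposition helper are made up for illustration and are not part of Prometheus itself.

```python
import re

# One sample line: metric name, optional {labels}, then a value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse simple exposition-format text into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot handle
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical payload, as a target might serve it on its metrics endpoint.
payload = """
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3
process_open_fds 42
"""

for name, labels, value in parse_exposition(payload):
    print(name, labels, value)
```

A real scraper also handles escaped label values, optional timestamps, and histogram and summary types, which this sketch leaves out.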

Prometheus Components

Prometheus Server: Responsible for collecting time series data through scraping targets, storing the data locally in its time series database (TSDB), and evaluating alerts and rules defined by the user.
Exporters: Specialized components that expose metrics of various systems and services in a format that Prometheus can understand. Exporters allow Prometheus to monitor a wide range of technologies, such as databases, web servers, and cloud platforms.
Alertmanager: Handles alert notifications generated by Prometheus based on predefined rules. It allows for advanced alert routing, deduplication, grouping, and silencing, ensuring timely and efficient delivery of alerts to the appropriate channels.
Pushgateway: Provides a way to push time series data to Prometheus instead of using the standard scraping mechanism. It is useful for short-lived jobs or batch processes that cannot be scraped directly.
Grafana (optional): A popular visualization tool that is not part of Prometheus but is commonly paired with it. Grafana allows users to build custom, interactive dashboards on top of Prometheus metrics.
Service Discovery: Prometheus supports various service discovery mechanisms, including static config files, DNS-based discovery, and integrations with orchestration platforms like Kubernetes. These mechanisms help Prometheus dynamically discover and monitor targets without manual configuration.
Prometheus Alerting Rules: Users can define alerting rules in Prometheus using the PromQL query language. These rules are evaluated continuously against the collected time series data, generating alerts when specified conditions are met.
Federation: Prometheus supports federation, allowing multiple Prometheus servers to be connected and share data. This enables a hierarchical and distributed monitoring setup, where a central Prometheus server can aggregate data from multiple remote instances.
Remote Read and Write APIs: Prometheus can forward the samples it ingests to a remote endpoint (remote write) and read samples back from one (remote read). This is how it integrates with external systems such as long-term storage backends.
PromQL: The query language of Prometheus, PromQL, allows users to retrieve and manipulate time series data. It provides powerful functions and operators for filtering, aggregating, and transforming metrics, facilitating advanced data analysis and visualization.

These components form the core architecture of Prometheus, enabling it to collect, store, and analyze time series data, as well as generate alerts and provide insights into the monitored systems and services.
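As an illustration of how an alerting rule behaves over time, the sketch below models a rule with a threshold and a hold duration (the "for" clause of a rule). It is a simplified model, not Prometheus code: the alert_states function and the sample data are hypothetical, and timestamps are plain seconds.

```python
def alert_states(samples, threshold, for_seconds):
    """Return (timestamp, state) for each evaluation of `value > threshold`.

    An alert is "pending" while the condition holds but has not yet held for
    `for_seconds`, and "firing" once it has.
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            state = "firing" if ts - active_since >= for_seconds else "pending"
        else:
            active_since = None  # condition cleared, reset the timer
            state = "inactive"
        states.append((ts, state))
    return states


# Hypothetical error-ratio samples, evaluated once per minute.
samples = [(0, 0.2), (60, 0.9), (120, 0.95), (180, 0.97), (240, 0.1)]
for ts, state in alert_states(samples, threshold=0.8, for_seconds=120):
    print(ts, state)
```

The hold duration is what keeps a single noisy sample from paging anyone: the alert only reaches Alertmanager once it is firing.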

What is Thanos?

Thanos, also an open-source project, is an extension built for Prometheus that aims to address some of the challenges associated with long-term storage and high availability.

Thanos provides a highly available Prometheus setup with extended storage capabilities, which enables organizations to store and query historical data efficiently. To achieve this, it introduces several additional components, including Thanos Sidecar, Thanos Query, Thanos Store, and Thanos Compact, which work together to create a distributed, fault-tolerant, and scalable time series system on top of Prometheus.

By leveraging a distributed architecture and integrating with object storage systems like Amazon S3 or Google Cloud Storage, it allows for seamless horizontal scalability. Thanos enables federated queries across multiple Prometheus instances, making it ideal for handling massive volumes of time series data.

Thanos Components

Thanos Sidecar: Runs alongside each Prometheus instance. It uploads the TSDB blocks that Prometheus writes to local disk into object storage, and it exposes the instance's recent data to Thanos Querier for global queries.
Thanos Querier (Thanos Query): Serves as the central query engine in the Thanos architecture. It fans queries out to Sidecars, Store instances, and other data sources, merges the results, and provides a unified view of time series data through a Prometheus-compatible API.
Thanos Store (Store Gateway): Provides read access to the blocks held in object storage, such as Amazon S3 or Google Cloud Storage, so that Thanos Querier can query historical data. It does not ingest data itself and keeps only a small amount of block information locally.
Thanos Compactor (Thanos Compact): Compacts blocks in object storage by merging smaller blocks into larger ones, creates downsampled copies of older data, and applies retention policies. This reduces storage overhead and improves query performance over long time ranges.
Thanos Ruler: Evaluates recording and alerting rules against Thanos Querier rather than a single Prometheus instance, so rules can span data from multiple Prometheus servers.
Thanos Receiver (Thanos Receive): Provides an endpoint that accepts Prometheus remote write requests, stores the incoming data locally, and uploads it to object storage as blocks. It is the alternative to the Sidecar for setups where Prometheus pushes its data out.
Thanos Query Frontend: Sits in front of Thanos Querier, receiving query requests and speeding them up by splitting long-range queries into smaller ones and caching results.
Thanos Bucket tools: Command-line utilities (the thanos tools bucket subcommands) for inspecting and verifying the blocks stored in an object storage bucket.
Downsampling: Not a separate component, and not something the Sidecar does. The Compactor produces lower-resolution copies of older data, which improves query performance for longer time ranges.

These components collectively form the Thanos architecture, providing enhanced scalability, fault tolerance, long-term storage, and global querying capabilities to Prometheus deployments.
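To give a feel for what compaction does, here is a rough sketch that groups small blocks into larger ones by time range. It is a simplification under stated assumptions: the Block type, the hour-based timestamps, and the 8-hour target range are chosen for illustration, and the real Compactor's planning is considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    min_time: int  # start of the block, in hours (inclusive)
    max_time: int  # end of the block, in hours (exclusive)


def plan_compaction(blocks, target_range):
    """Merge blocks that start inside the same `target_range`-sized window."""
    groups = {}
    for block in sorted(blocks, key=lambda b: b.min_time):
        window = block.min_time // target_range
        groups.setdefault(window, []).append(block)
    compacted = []
    for _, members in sorted(groups.items()):
        # The merged block spans from the earliest start to the latest end.
        compacted.append(
            Block(min(b.min_time for b in members), max(b.max_time for b in members))
        )
    return compacted


# Six hypothetical 2-hour blocks covering hours 0 to 12.
blocks = [Block(h, h + 2) for h in range(0, 12, 2)]
for block in plan_compaction(blocks, target_range=8):
    print(block)
```

Fewer, larger blocks mean fewer objects to open and fewer indexes to consult when a query spans a long time range.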

Differences between Prometheus and Thanos

Prometheus and Thanos have several key differences that set them apart in terms of functionality and use cases. Here are some of the primary differentiators:

Scaling and Long-term Storage: Prometheus stores data on local disk, which ties capacity to a single machine and limits how much history one server can hold. Thanos extends Prometheus with an object storage layer, allowing scalable, long-term storage and querying of historical data.
High Availability: A single Prometheus server is a single point of failure. The usual remedy is to run two or more identical replicas, which then hold duplicate data. Thanos makes this practical by querying all replicas through Thanos Querier and deduplicating their series at query time, so queries keep working when one replica fails.
Querying and Analysis: Prometheus offers a powerful query language called PromQL for retrieving and analyzing time series data. Thanos does not change the language. Instead, Thanos Querier runs the same PromQL across multiple Prometheus instances and object storage, enabling a global view and cross-instance aggregation.
Retention: Prometheus primarily relies on local disk storage for short-term retention of time series data. Thanos keeps data in object storage such as Amazon S3 or Google Cloud Storage, allowing organizations to retain vast amounts of data for long periods cost-effectively.
Downsampling: Prometheus does not downsample stored data itself. Thanos Compactor creates lower-resolution copies of older data in object storage, which speeds up queries over long time ranges.
Recording Rules: Prometheus supports recording rules, which allow users to pre-calculate and store frequently used queries as new time series. This can optimize query performance and simplify complex calculations. Thanos stays compatible with these rules and can also evaluate them across instances through Thanos Ruler.
Integration and Ecosystem: Prometheus has a rich ecosystem with numerous integrations and exporters available, making it well-suited for monitoring Kubernetes and cloud-native environments. Thanos, being an extension of Prometheus, inherits many of these integrations while providing additional functionality for scalability and long-term storage.
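Downsampling is easier to picture with a small example. The sketch below groups raw samples into fixed windows and keeps a few aggregates per window. The downsample function and the data are illustrative only. As far as we know, Thanos keeps a similar set of aggregates per window (plus one for counters, which is omitted here), so treat this as a model of the idea rather than of the on-disk format.

```python
def downsample(samples, window_seconds):
    """Aggregate (timestamp, value) samples into fixed-size time windows."""
    windows = {}
    for ts, value in samples:
        start = ts - (ts % window_seconds)  # align to the window boundary
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return dict(sorted(windows.items()))


# Hypothetical raw samples, one per minute, reduced to 5-minute windows.
raw = [(0, 1.0), (60, 3.0), (120, 2.0), (300, 5.0), (360, 4.0)]
for start, agg in downsample(raw, window_seconds=300).items():
    print(start, agg)
```

Keeping several aggregates instead of a single average is what lets a query over downsampled data still answer questions about minimums, maximums, and averages.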

Advantages of using Prometheus over Thanos

Simplicity: Prometheus is relatively easy to set up and operate, making it an excellent choice for small to medium-sized deployments. Its single-server mode allows for straightforward installation and configuration without the need for additional components.
Real-time Monitoring: Prometheus excels at real-time monitoring, providing instant visibility into the state of your systems and applications. With its powerful alerting system, you can set up custom rules to receive notifications and take immediate action when anomalies or issues occur.
Rich Query Language: PromQL, the query language used by Prometheus, offers a wide range of functions and operators that allow for complex data analysis and aggregation. This makes it easier to extract valuable insights from your time series data and perform advanced monitoring tasks.

Extensive Ecosystem: Prometheus has a thriving community and a vast ecosystem of exporters, integrations, and tools. It integrates seamlessly with popular technologies like Kubernetes, making it a go-to choice for monitoring containerized environments. The extensive ecosystem ensures that you can find plugins and solutions for almost any use case.

Advantages of using Thanos over Prometheus

Scalability and High Availability: Thanos addresses one of the main limitations of Prometheus by providing horizontal scalability and high availability. With Thanos, you can scale your Prometheus deployments and handle larger workloads without sacrificing performance or risking data loss.
Long-term Storage: Thanos introduces the ability to store and query historical data over extended periods. By leveraging object storage solutions, you can retain data for months or even years, allowing for trend analysis, capacity planning, and compliance requirements.
Fault Tolerance and Disaster Recovery: Thanos employs a distributed architecture with redundancy and fault tolerance mechanisms. This ensures that even if a Prometheus instance or component fails, data remains available and queryable, reducing the risk of data loss and ensuring business continuity.
Global View and Federation: Thanos enables federation across multiple Prometheus instances, providing a global view of your metrics and facilitating centralized monitoring and analysis. This is particularly useful in large-scale deployments with geographically distributed clusters.
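High availability with Thanos usually means running two Prometheus replicas that scrape the same targets and differ only in a replica label. The sketch below shows the two ideas behind query-time deduplication: ignoring the replica label when identifying a series, and filling gaps in one replica from the other. It is deliberately simplified and is not the algorithm Thanos actually uses; the function names, the "replica" label name, and the data are assumptions for illustration.

```python
def series_identity(labels, replica_label="replica"):
    """Identify a series by its labels, ignoring the replica label."""
    return tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))


def fill_gaps(primary, secondary, scrape_interval):
    """Keep the primary replica's samples and fill its gaps from the secondary.

    A secondary sample is used only when no primary sample lies within half a
    scrape interval of it, i.e. when the primary apparently missed a scrape.
    """
    primary_times = [ts for ts, _ in primary]
    merged = list(primary)
    for ts, value in secondary:
        covered = any(abs(ts - p) <= scrape_interval / 2 for p in primary_times)
        if not covered:
            merged.append((ts, value))
    return sorted(merged)


# Replica "a" missed the scrapes around t=45 and t=60; replica "b" did not.
replica_a = [(0, 1.0), (15, 1.0), (30, 1.0), (75, 1.0), (90, 1.0)]
replica_b = [(5, 1.0), (20, 1.0), (35, 1.0), (50, 1.0), (65, 1.0), (80, 1.0), (95, 1.0)]

same = series_identity({"job": "api", "replica": "a"}) == series_identity(
    {"job": "api", "replica": "b"}
)
print(same)
print(fill_gaps(replica_a, replica_b, scrape_interval=15))
```

In a real deployment, the replica label is whatever external label distinguishes the Prometheus replicas, and the Querier is told which label that is.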

Prometheus with Thanos

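Prometheus and Thanos can work together in two ways. In the Sidecar model, a Thanos Sidecar runs next to each Prometheus server and uploads its blocks to object storage. In the remote write model, Prometheus uses its remote write functionality to push samples to Thanos Receive. Here's how the components collaborate: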

1. Prometheus Configuration:

In the Prometheus configuration file, you can configure remote write settings to specify the endpoint where Prometheus should send its time series data. With Thanos, this endpoint is a Thanos Receive instance. Thanos Sidecar and Thanos Store do not accept remote write requests.
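As a sketch of what that configuration might look like, the helper below renders a small prometheus.yml fragment, assuming the remote write target is a Thanos Receive instance. The hostname, cluster name, and replica name are placeholders, and the port and path shown are the commonly documented Thanos Receive defaults, so check them against your own deployment.

```python
def remote_write_snippet(receive_url, cluster, replica):
    """Render a prometheus.yml fragment that forwards samples to `receive_url`."""
    return "\n".join(
        [
            "global:",
            "  external_labels:",  # labels that identify this Prometheus instance
            f'    cluster: "{cluster}"',
            f'    replica: "{replica}"',
            "remote_write:",
            f'  - url: "{receive_url}"',
        ]
    )


# Placeholder endpoint; replace with the address of your own receiver.
print(
    remote_write_snippet(
        "http://thanos-receive.example.internal:19291/api/v1/receive",
        cluster="eu-west",
        replica="prometheus-0",
    )
)
```

The external labels matter as much as the URL: they are how Thanos tells apart data from different Prometheus instances and replicas.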

2. Thanos Sidecar:

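Thanos Sidecar is the alternative to remote write. Rather than receiving pushed data, it runs alongside Prometheus, watches the blocks Prometheus writes to local disk, and uploads them to the designated object storage system, such as Amazon S3 or Google Cloud Storage. It also serves the instance's most recent data to Thanos Query. In the remote write model, Thanos Receive plays this role instead: it accepts the pushed samples and uploads them to object storage as blocks.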

3. Thanos Store:

Thanos Store (the Store Gateway) makes the time series data in object storage queryable. It does not receive data from Prometheus or Thanos Sidecar directly. Instead, it reads the blocks that Thanos Sidecar or Thanos Receive have uploaded and serves them to Thanos Query. The stored data can then be used for analysis, visualization, or long-term historical monitoring.
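One reason querying object storage stays efficient is that each block records the time range it covers, so only the relevant blocks need to be touched. Here is a minimal sketch of that selection step. The block identifiers and the dictionary layout are made up for illustration, and times are in seconds.

```python
def blocks_for_query(blocks, query_start, query_end):
    """Return the blocks whose [min_time, max_time) range overlaps the query."""
    return [
        block
        for block in blocks
        if block["min_time"] <= query_end and block["max_time"] > query_start
    ]


# Three hypothetical 2-hour blocks in object storage.
blocks = [
    {"id": "block-a", "min_time": 0, "max_time": 7200},
    {"id": "block-b", "min_time": 7200, "max_time": 14400},
    {"id": "block-c", "min_time": 14400, "max_time": 21600},
]

selected = blocks_for_query(blocks, query_start=8000, query_end=15000)
print([block["id"] for block in selected])
```

Everything outside the selected blocks can be skipped without downloading any sample data.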

4. Querying and Analysis:

Thanos Query, the central query engine of Thanos, can perform global queries across multiple Prometheus instances and Thanos Stores. It provides a unified view of the time series data, allowing users to analyze metrics from both real-time and historical perspectives. Users can utilize PromQL, the query language of Prometheus, to execute queries and retrieve the desired information.
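Because Thanos Query exposes the same HTTP query API as Prometheus, existing tooling can usually point at it unchanged. The sketch below builds an instant-query URL without sending it anywhere. The hostname and the instant_query_url helper are placeholders, and the dedup parameter is a Thanos-specific extension that we believe the query API accepts, so verify it against the Thanos version you run.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def instant_query_url(base_url, promql, dedup=True):
    """Build a URL for the Prometheus-compatible instant query endpoint."""
    params = {
        "query": promql,
        "dedup": "true" if dedup else "false",  # assumed Thanos-specific parameter
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


# Placeholder address for a Thanos Query instance.
url = instant_query_url(
    "http://thanos-query.example.internal:10902",
    "sum by (cluster) (rate(http_requests_total[5m]))",
)
print(url)

# Decoding the URL again recovers the original PromQL expression.
print(parse_qs(urlsplit(url).query)["query"][0])
```

The PromQL expression itself is exactly what you would send to a single Prometheus server. The difference is that the answer now covers every instance and store the Querier knows about.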

By combining Prometheus and Thanos through remote write integration, organizations can achieve the following benefits:

- Long-term Storage: Prometheus data is offloaded to object storage, where Thanos Store makes it queryable, allowing for cost-effective and scalable long-term storage of metrics.

- Global Querying: Thanos Query enables users to perform queries that span across multiple Prometheus instances and Thanos Stores, providing a consolidated view of the time series data. This facilitates efficient analysis and monitoring across distributed environments and extended time periods.

- Scalability: Thanos leverages its distributed architecture and object storage systems to scale storage and query capabilities, accommodating growing amounts of data and ensuring optimal performance.

- High Availability: The fault-tolerant design of Thanos, combined with the use of remote write, ensures data reliability and availability, even in the presence of failures in individual Prometheus instances or Thanos components.

In summary, by utilizing remote write integration, Prometheus can seamlessly work with Thanos, leveraging its long-term storage and global querying capabilities. This collaboration enhances the scalability, durability, and analytical capabilities of the monitoring infrastructure, providing a comprehensive solution for handling time series data.

Conclusion

While Prometheus and Thanos can complement each other effectively, there are a couple of challenges that organizations might face when using Thanos with Prometheus:

Complexity and Learning Curve

Integrating Thanos with Prometheus introduces additional components, configurations, and dependencies, which can increase the complexity of the monitoring infrastructure. Administrators and operators need to understand the architecture and deployment considerations of both Prometheus and Thanos. There might be a learning curve associated with setting up and managing the Thanos components, especially for those who are new to Thanos. Adequate documentation, training resources, and community support can help mitigate this challenge.

Increased Operational Overhead

Introducing Thanos alongside Prometheus adds operational overhead. Managing and maintaining a distributed architecture, including Thanos Sidecar, Thanos Store, and Thanos Query, requires additional monitoring, upgrades, and troubleshooting. Organizations need to allocate resources and expertise to ensure the smooth operation of the Thanos components. The increased complexity and dependency on external storage systems, such as object storage, also require careful configuration and monitoring to avoid data loss or performance issues.

Levitate, Last9's time series data warehouse, can be a good fit instead of Thanos for long-term storage in such cases, as it provides a managed offering with SLAs and global availability, plus support for high cardinality with automatic data tiering.

Recommended reading - do read our blog InfluxDB vs Prometheus for a similar analysis. In another blog, we compare all the popular time series databases. Go check them out.

The Last9 promise — We will reduce your TCO by about 50%. Our managed time series ~~database~~ data warehouse, Levitate, comes with streaming aggregation, data tiering, and the ability to manage high cardinality. If this sounds interesting, talk to us.

Authors

Last9

Last9 helps businesses gain insights into the Rube Goldberg of micro-services. Levitate - our managed time series data warehouse is built for scale, high cardinality, and long-term retention.
