What is Thanos and How Does it Scale Prometheus?

Prometheus has become the go-to metric TSDB, but its long-term storage (LTS) limitations mean it is often paired with third-party tools that scale it. Thanos is one such tool, and this guide explains what Thanos is and how you can use it to scale Prometheus.

What is Thanos?

Thanos is an open-source CNCF project designed to extend Prometheus’s capabilities. It helps to overcome Prometheus’ limitations through the following features.

Data storage: Thanos provides long-term storage by letting users choose from various object storage solutions such as Amazon S3, Google Cloud Storage, and Microsoft Azure Blob Storage, although managing it can become cumbersome.
Global Query View: Large Prometheus setups are usually sharded across several servers to improve performance, so each server sees only its own metrics. Thanos lets you query metrics from multiple sources, such as Prometheus servers in different clusters or namespaces, through a single endpoint.
High Availability: Thanos lets you run replicated Prometheus instances and deduplicates their data at query time, so metrics stay available when one replica becomes problematic. Data uploaded to object storage also survives the loss of a Prometheus instance.

💡

Check out Levitate, which provides high cardinality workflows using streaming aggregation and long-term retention, and is built for HA and scalability as a potent alternative to Thanos. Get started for free today.

These features empower Thanos to scale and handle large workloads horizontally. To fully integrate with and enhance Prometheus’s functionality, Thanos is built on the following components: Sidecar, Store Gateway, Compactor, Receiver, Ruler, Querier, and Query Frontend. The components and their functionalities are explained in the table below.

Component	Description
Sidecar	Runs alongside each Prometheus instance and connects it to Thanos. It exposes the Prometheus data through the StoreAPI for querying and uploads completed TSDB blocks to object storage.
Store Gateway	Implements the StoreAPI on top of the historical data kept in object storage buckets, which makes that data available to the Querier.
Compactor	Is responsible for deduplicating and downsampling data stored in storage buckets to reduce storage requirements and improve query performance. It also applies retention policies to manage the lifecycle of metric data.
Receiver	Is a stateful layer built on top of the Prometheus TSDB that accepts and stores data sent through Prometheus's remote write API. It can also upload the data to object storage for redundancy and long-term storage.
Ruler/Rule	Evaluates recording and alerting rules against a selected query API and can upload the results to the chosen long-term storage solution. It works alongside Prometheus’ Alertmanager to ensure consistent alerting across short and long-term data. You can define alerting rules in Prometheus instances or in the Thanos Ruler.
Querier/Query	Is a stateless component that implements the Prometheus HTTP API and aggregates metric data from multiple Prometheus instances and other underlying StoreAPIs. It eases the querying of long-term time-series data and provides a unified view of metrics across the distributed system. It optimizes query performance by fetching only relevant data from underlying storage blocks.
Query Frontend	Acts as a proxy in front of the Querier's Prometheus-compatible API. It caches responses and can split long-range queries into shorter ones, for example one per day. This improves query performance by reducing the load on the Querier.

Prometheus and Thanos constitute a powerful monitoring stack that lets you monitor and analyze large workloads and scale your monitoring infrastructure horizontally.

But maintaining all these components yourself is not easy. Compare that with Levitate. Just change the remote-write endpoint in a Prometheus running in agent mode.

Scaling with Thanos

To get started, you need a working Prometheus deployment. Install Thanos as an additional component and configure its components based on your specific requirements and settings (e.g., Prometheus endpoints, object storage credentials and compaction rules).

Integrate Thanos via the following steps.

Step 1: Clone Thanos Repository

Clone the Thanos repository from GitHub.

git clone https://github.com/thanos-io/thanos.git

If you use Docker, navigate to the cloned Thanos repository and build the Thanos Docker images via the Dockerfiles using this command.

cd thanos
docker build -t thanos:latest .

Step 2: Run Prometheus with Thanos Sidecar

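Start Prometheus with a configuration that the Thanos Sidecar can work with. The Sidecar reads data directly from Prometheus, so no remote_write section is needed, but Thanos does require external labels that uniquely identify each Prometheus instance. Create a prometheus.yaml file with the following content.

global:
  scrape_interval: 15s
  # External labels identify this Prometheus instance to Thanos; the values are examples.
  external_labels:
    cluster: demo
    replica: "0"
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']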

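Run the following command to start Prometheus and mount the prometheus.yaml file. The two block-duration flags disable Prometheus's local compaction, which Thanos recommends when a Sidecar uploads blocks to object storage.

docker run -d -p 9090:9090 -v /path/to/prometheus.yaml:/etc/prometheus/prometheus.yml --name prometheus prom/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/prometheus --storage.tsdb.min-block-duration=2h --storage.tsdb.max-block-duration=2h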

Step 3: Run Thanos Sidecar

Start the Thanos Sidecar to connect to Prometheus and upload its data to object storage. The Sidecar needs access to the Prometheus data directory and to the object-storage.yml file described in Step 4. Run the following command.

docker run -d -p 9201:10901 --link prometheus:prometheus --volumes-from prometheus -v /path/to/object-storage.yml:/etc/config/object-storage.yml --name sidecar thanos:latest sidecar --prometheus.url=http://prometheus:9090 --tsdb.path=/prometheus --objstore.config-file=/etc/config/object-storage.yml

Step 4: Configure Object Storage

Set up an account with your preferred object storage provider and obtain the required access credentials. Create an object-storage.yml file to configure the object storage integration. Replace my-bucket with your bucket name and add the credentials for your chosen object storage provider. Here's an example.

type: S3
config:
  bucket: my-bucket
  endpoint: s3.amazonaws.com
  access_key: <your-access-key>
  secret_key: <your-secret-key>
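If you would rather not keep credentials in a file under version control, one option is to generate object-storage.yml at deploy time. The following is a minimal Python sketch of that idea, not part of Thanos itself. The function name and the environment variable names THANOS_S3_ACCESS_KEY and THANOS_S3_SECRET_KEY are assumptions made for this example.

```python
import json
import os


def render_objstore_config(bucket, endpoint, env=os.environ):
    """Render an S3 object storage config with credentials taken from env.

    THANOS_S3_ACCESS_KEY and THANOS_S3_SECRET_KEY are hypothetical variable
    names chosen for this sketch; use whatever your deployment provides.
    """
    values = {
        "bucket": bucket,
        "endpoint": endpoint,
        "access_key": env["THANOS_S3_ACCESS_KEY"],
        "secret_key": env["THANOS_S3_SECRET_KEY"],
    }
    lines = ["type: S3", "config:"]
    for key, value in values.items():
        # JSON string quoting is also valid YAML, so special characters are safe.
        lines.append(f"  {key}: {json.dumps(value)}")
    return "\n".join(lines) + "\n"
```

Writing the returned text to /path/to/object-storage.yml gives a file with the same structure as the example above.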

Step 5: Run Thanos Store

Start the Thanos Store component by running the following command.

docker run -d -p 10901:10901 -v /path/to/object-storage.yml:/etc/config/object-storage.yml --name store thanos:latest store --data-dir=store --objstore.config-file=/etc/config/object-storage.yml

Step 6: Query Thanos

Run the Thanos Query component to query the metrics served by the Store component. Add one --endpoint flag for every additional StoreAPI source, such as a Sidecar, that you want to include in the global view. Execute the following command, which maps the Query component's HTTP port (10902) to port 9091 on the host.

docker run -d -p 9091:10902 --link store:store --name query thanos:latest query --endpoint=store:10901

Step 7: Access Thanos Web UI

Open your web browser and visit http://localhost:9091 to access the Thanos Query Web UI. From there, you can explore and query the aggregated metric data.
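Besides the web UI, Thanos Query serves a Prometheus-compatible HTTP API, so the same data can be fetched from scripts. The sketch below is a minimal Python example using only the standard library. It assumes the Query component is reachable at http://localhost:9091 as in this guide, and the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql, dedup=True):
    """Build an instant-query URL for the Prometheus-compatible /api/v1/query API."""
    params = {
        "query": promql,
        # Thanos-specific parameter: merge series coming from replicated sources.
        "dedup": "true" if dedup else "false",
    }
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode(params)}"


def instant_query(base_url, promql, dedup=True, timeout=10):
    """Send the query and return the decoded JSON response."""
    with urlopen(build_query_url(base_url, promql, dedup), timeout=timeout) as resp:
        return json.load(resp)
```

For example, instant_query("http://localhost:9091", "up") would return the current value of the up metric for every target visible to Thanos Query, assuming the containers from the previous steps are running.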

Step 8: Explore Advanced Features

Once the basic setup is working, you can explore advanced features like data compaction with Compactor, rule evaluation with Ruler, and query caching and splitting with Query Frontend.

How Does Thanos Scale Prometheus?

Thanos scales Prometheus through the mechanisms below.

1. Global Querying

Traditionally, Prometheus can only query data from a single instance at a time. Thanos enables the aggregation and querying of data from multiple Prometheus instances, allowing for cross-cluster and cross-region querying. This global visibility enables efficient querying and analysis of metric data across the entire distributed system.
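When the same targets are scraped by replicated Prometheus instances, a global view has to merge the duplicate series. The Python sketch below illustrates the idea in a deliberately naive way: it drops a replica label and fills gaps in one replica with samples from another. It is not Thanos's actual deduplication algorithm, and the label name "replica" and the data layout are assumptions for this example.

```python
from collections import defaultdict


def deduplicate(series_list, replica_label="replica"):
    """Merge series that differ only by the replica label.

    series_list holds (labels, samples) pairs, where labels is a dict and
    samples is a list of (timestamp, value) tuples.
    """
    merged = defaultdict(dict)  # labels without replica -> {timestamp: value}
    for labels, samples in series_list:
        key = frozenset((k, v) for k, v in labels.items() if k != replica_label)
        for ts, value in samples:
            # The first replica that provides a timestamp wins.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}
```

In this sketch, a sample that one replica missed while it was restarting is filled in from the other replica, so the merged series has no gap.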

2. Long-term Data Storage

Prometheus is primarily designed for short-term monitoring and storage. Thanos addresses Prometheus' limited local storage capacity by uploading data to object storage systems, which frees up local storage resources. But Thanos can downsample older data, and if you retain only the downsampled data, you may not have access to granular data over the long term. Levitate, our managed time series data warehouse, provides one year of long-term retention of metrics without any downsampling based on data tiering technology.

3. Query Federation

Thanos allows you to query data from multiple Prometheus instances and object storage backends as a unified view. When a query is made through Thanos, the Querier fans the request out to the relevant StoreAPI endpoints, such as Sidecars and Store Gateways, selected by their external labels and time ranges. It then collects the results from each endpoint and combines them to form a comprehensive response. This distributed query approach allows Thanos to handle global queries efficiently, even in distributed environments with large amounts of data.

4. Data Compaction

The Compactor merges the blocks stored in object storage, deduplicates data, and downsamples older data. Downsampling mainly speeds up queries over long time ranges, and storage requirements drop once retention policies remove the raw data.
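To make downsampling more concrete, the Python sketch below groups raw samples into fixed five-minute windows and keeps a few aggregates per window instead of every sample. It is only an illustration of the concept under simplified assumptions. The function name and data layout are made up, and the real block format used by Thanos is more involved.

```python
def downsample(samples, window_seconds=300):
    """Aggregate (timestamp, value) samples into fixed windows.

    Returns {window_start: {"count", "sum", "min", "max"}}, ordered by window.
    """
    windows = {}
    for ts, value in samples:
        start = ts - (ts % window_seconds)  # align to the window boundary
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["count"] += 1
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return dict(sorted(windows.items()))
```

A query over a long time range can then read one aggregate per window rather than every raw sample, which is why downsampled data is faster to query but less granular.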

Thanos Best Practices

When working with Thanos, here are some best practices to consider.

Plan for scalability

Consider the expected growth in data volume and query load while deploying Thanos. Scale the individual components, such as Sidecar, Store, and Querier, based on your requirements, and test each change to confirm it has the intended effect. Also, consider performance, cost, and durability factors when choosing object storage systems because storage costs can grow quickly.
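A rough capacity estimate can help with this planning. The Python sketch below applies the rule of thumb from the Prometheus documentation: required space is roughly retention time multiplied by ingested samples per second and by bytes per sample, with about 1 to 2 bytes per sample. The function name and the example numbers are assumptions, and the result covers raw data only, ignoring downsampled blocks and index overhead.

```python
def estimate_storage_bytes(active_series, scrape_interval_s, retention_days,
                           bytes_per_sample=2.0):
    """Estimate raw storage needs from series count, scrape interval, and retention."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return samples_per_second * retention_seconds * bytes_per_sample


if __name__ == "__main__":
    # Example: 1 million active series scraped every 15 seconds, kept for a year.
    size = estimate_storage_bytes(1_000_000, 15, 365)
    print(f"{size / 1e12:.1f} TB")
```

With these example inputs the estimate comes to about 4.2 TB of raw data per year, which gives a starting point for comparing object storage pricing.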

Configure query caching

Enable caching in the Query Frontend component to reduce the load on the Querier. The Query Frontend can cache the responses to common queries to improve query performance.
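The combination of splitting and caching can be sketched in a few lines of Python. This is a simplified illustration of the idea, not the Query Frontend's implementation: a range query is split at day boundaries, and each sub-range result is cached so that overlapping queries reuse earlier work. The class and function names are made up for this example.

```python
DAY = 24 * 60 * 60


def split_by_day(start, end):
    """Split the range [start, end) in Unix seconds into day-aligned sub-ranges."""
    ranges = []
    cursor = start
    while cursor < end:
        next_boundary = (cursor // DAY + 1) * DAY
        ranges.append((cursor, min(next_boundary, end)))
        cursor = next_boundary
    return ranges


class CachingFrontend:
    """Wrap a query function and cache results per day-aligned sub-range."""

    def __init__(self, run_query):
        self._run_query = run_query  # callable(query, start, end) -> result
        self._cache = {}

    def query_range(self, query, start, end):
        results = []
        for sub_start, sub_end in split_by_day(start, end):
            key = (query, sub_start, sub_end)
            if key not in self._cache:
                self._cache[key] = self._run_query(query, sub_start, sub_end)
            results.append(self._cache[key])
        return results
```

In this sketch, a dashboard that asks for days 1 to 2 and then for days 2 to 3 only sends one new sub-query for day 3 to the backend, because day 2 is served from the cache.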

Levitate supports delta caching based on data tiering to make queries perform faster.

Optimize compaction and retention policies

Tune Thanos' compaction and retention features to balance the trade-off between resource utilization and query performance. Experiment with different configurations to find the optimal settings for your workload and storage capacity.
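The Compactor lets you set a separate retention period for raw data and for each downsampled resolution (5 minutes and 1 hour). The Python sketch below shows how such a policy determines which resolutions can still answer a query for data of a given age. The retention values are hypothetical examples, not recommendations, and the function name is made up for illustration.

```python
from datetime import timedelta

# Hypothetical retention settings, one per resolution the Compactor manages.
RETENTION = {
    "raw": timedelta(days=30),
    "5m": timedelta(days=90),
    "1h": timedelta(days=365),
}


def available_resolutions(age, retention=RETENTION):
    """Return the resolutions that still hold data of the given age."""
    return [name for name, keep in retention.items() if age <= keep]
```

With these example values, data from last week is available at every resolution, data from two months ago only in downsampled form, and data older than a year not at all. Walking through a few ages like this is a quick way to check whether a policy matches how far back your dashboards and alerts need to look.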

Replication, backup, and disaster recovery

To enable high availability, create one or two extra replicas of every Prometheus pod in various hosting regions. In addition, regularly back up object storage data and store the backups in a separate location. Test the restoration process periodically to ensure data recoverability. This will result in additional costs though, so plan it according to your observability budget.

Stay up-to-date

Participate in Thanos community forums and mailing lists to share ideas, ask questions, and learn from others' experiences. Consult the official Thanos documentation for detailed instructions, configuration options, troubleshooting guides, and best practices. Follow Thanos’ release notes and changelog to stay informed about bug fixes, performance improvements and new features. If you don't have a dedicated SRE/DevOps team to manage the cluster, it may be beneficial to consider one of the managed Prometheus solutions.

Conclusion

As software monitoring requirements evolve, Prometheus alone cannot provide the scalability, high availability, and long-term data storage required. Adopting Thanos enables organizations to benefit from the complementary features of Prometheus and Thanos. This makes application monitoring smoother and helps keep metrics highly available. But running Thanos comes with its challenges around maintenance and upkeep. In such cases, a hosted Prometheus solution like Levitate is a better option.