Dec 8th, 2022

How to improve Prometheus remote write performance at scale

Deep dive into how to improve the performance of Prometheus Remote Write at Scale based on real-life experiences

Contents

Prometheus remote write is the de-facto method for sending metrics from your local Prometheus setup to your chosen long-term metric storage. This post walks through a scenario explaining how to reduce excess parallelism in remote writes without impacting the throughput of ingested samples.

Prometheus Remote write basics

These are the typical actions involved in a Prometheus remote write cycle:

  1. Prometheus sender scrapes targets as per scrape config.
  2. It stores samples as per the locally defined retention period.
  3. It remote writes samples as per remote_write config block to remote write receiver (e.g. Last9, Grafana cloud, Victoriametrics, etc.)
Untitled-2022-11-13-1826.png

For the purpose of brevity, we will refer to Prometheus remote write sender as sender and Prometheus remote write receiver as receiver for the rest of the post.

One of the most common failure scenarios in the remote write system is an increased sender ingestion rate. This can happen due to the following reasons:

  1. New targets were discovered as new infra got provisioned e.g. new product launch, autoscaling existing infra, etc.
  2. An application caused a cardinality explosion by increasing dynamic values for one or more labels e.g customer ids spike caused by bulk onboarding customers.

This increased ingestion rate causes a spike in remote write rate, which can cause one of the following common scenarios:

  1. Both the sender and receiver continue operating normally.
  2. The sender couldn’t keep up with the increased ingestion rate and it crashed before it could remote write.
  3. The sender survived the surge but the receiver couldn’t keep up with the increased remote write throughput or concurrency and crashed, thereby causing backlog build-up on the sender end. When the receiver came up, it was hit with current metrics as well as the previous backlog (provided the sender didn’t discard the backlog due to timeouts).

For scenarios 2 and 3 - the solutions seem obvious - when the sender fails, scale up resources on the sender end and when the receiver fails, do the same for the receiver. This may do the trick in the interim, but you cannot keep on scaling infinitely and the obvious tunables for scale-up are limited e.g. CPU and memory.

What if the obvious tunable scale-ups don’t fix the problem? What if the tunable isn’t CPU or memory but network?

Problem

The last question is exactly the scenario we ran into. We had a customer who had around 180 Prometheus instances and planned to remote write to Last9. Each customer Prometheus instance had an ingestion rate of around 250K samples per minute causing the total throughput to Last9 to be 180 X 250000 = 45 million samples per minute. You can also find out your Prometheus ingestion rate by running a simple Prometheus query

__PLACEHOLDER_26__

Ingesting N million samples from a single sender ≠ Ingestion N/M million samples from M senders.

The former can take advantage of the predictability of the single sender’s traffic while the latter has increased concurrency and a higher probability of a subset of M senders deviating from the expected pattern.

Untitled-2022-11-27-2243.png

During our load tests in staging infra, we found that along with some boilerplate infra scale-up work, we also needed to reduce network communication overhead e.g. reduce connection establishment, teardown, etc. by increasing payload size and reducing parallelism, without impacting samples ingested throughput. This means fine-tuning the sender's Prometheus remote write config.

To illustrate the need, we wanted to do the following:

Untitled-2022-11-27-2252.png

Solution

The tunables to achieve this is called out in the Prometheus remote write tuning doc:

__PLACEHOLDER_27____PLACEHOLDER_28__

as well as in Grafana’s remote writing tuning post:

__PLACEHOLDER_29__

Let us try to understand this better.

Untitled-2022-11-27-2331.png
  1. Write ahead log writes to multiple queues - Q1, Q2, Q3, etc.
  2. Queues have shards - sample-1, sample-2, etc.
  3. Shards have metric samples - sample1, sample2, sample3, etc.
  4. capacity controls shard memory queue size per shard.
  5. max_samples_per_send control batch size of a single remote write.
  6. max_shards controls how many batches are sent in parallel.

Reducing max_shards will reduce the concurrency and increase capacity and max_samples_per_send will increase the packet size, thereby balancing throughput.

There is one more vital factor to consider - how tuning these parameters has an impact on the sender’s memory. As we are tuning queue size and remote write batch size, we are playing with what is held in memory. This is clearly explained here

__PLACEHOLDER_36__

Let us tweak these numbers and see the impact on memory:

Sr. No. Metric name default suggested change
1. capacity 2500 5000
2. max_shards 200 100
3. max_samples_per_send 500 1000
4. max_shards * (capacity + max_sampler_per_send) 600,000 600,000

We reduced parallelism (max_shards)  and increased payload size (capacity and max_samples_per_send ) without impacting memory utilization - 600,000 =~ 500 KB in both cases.

We did the change and achieved the same throughput with reduced network overhead.

Fin. Nope! This is where it gets interesting.

Don’t take my word for it. Let’s create a docker-compose setup which actually does this scenario.

Last9spin

Demo setup

Clone https://github.com/saurabh-hirani/prometheus-remote-write-tuning-parallelism

This docker-compose setup contains

  1. avalanche container to generate random metrics.
  2. prometheus container to scrape metrics generated by avalanche.
  3. remote-write-receiver container to accept remote write requests and dump. the requests received in the first minute to the local disk.
Untitled-2022-11-27-2304.png

Default config demo

./run.sh default

This uses prometheus.yaml.default  which has the following remote write queue config

__PLACEHOLDER_47__

After a minute’s worth of running, you will see the following output

__PLACEHOLDER_48__

Check ./output/default

__PLACEHOLDER_50__

  • With the default config, a minute's worth of remote write sent total samples of 1.1 MB, divided amongst 55 requests each of an average size of 15.93 KB.

Updated config demo

./run.sh updated

This uses prometheus.yaml.updated which has the following remote write queue config

__PLACEHOLDER_53__

After a minute’s worth of running, you will see the following output

__PLACEHOLDER_54__

Check ./output/updated

__PLACEHOLDER_56__

  • With the updated config, a minute's worth of remote write sent total samples of 960 KB, divided amongst 30 requests each of an average size of 29 KB.

Conclusion

Sr. No. Description default config updated config
1. Total files 55 30
2. Total directory size 1.1 MB 960 KB
3. Average file size 15 KB 29 KB

The updated Prometheus config which doubles capacity and max_samples_per_send plus halves max_shards leads to roughly the same amount of throughput being served in half the number of requests. This reduces network overhead issues as mentioned earlier.

Key Learnings

  1. Don’t just do something - stand there! - think about the problem at hand. Throwing resources at a problem without understanding the underlying cause is both easy and dangerous.
  2. Managed services need to be managed - working with customers to fine-tune their remote write settings, reducing cardinality explosion, etc. are extremely important parts of your onboarding cycle. Installing the latest observability fad will not give you extra 9s in your reliability - similarly, blindly opening your flood gates to managed TSDBs without knowing your tunables will lead to unknown problems.
  3. Demo setups ftw! - in absence of live production traffic, if you can create a demo setup to crunch those numbers, invest that time. You never know which new thing you will learn.
  4. Exposing the right system tunables requires a lot of thought and rigour. Hat tip to the Prometheus team for doing a great job of it!
About the authors
Saurabh Hirani

Saurabh Hirani

Solution Architect and SRE at Last9

Last9 keyboard illustration

Start observing for free. No lock-in.

OPENTELEMETRY • PROMETHEUS

Just update your config. Start seeing data on Last9 in seconds.

DATADOG • NEW RELIC • OTHERS

We've got you covered. Bring over your dashboards & alerts in one click.

BUILT ON OPEN STANDARDS

100+ integrations. OTel native, works with your existing stack.

Gartner Cool Vendor 2025 Gartner Cool Vendor 2025
High Performer High Performer
Best Usability Best Usability
Highest User Adoption Highest User Adoption