Last9

Sep 16th, ‘24 / 4 min read

Optimizing Prometheus Remote Write Performance: Guide

Master Prometheus remote write optimization. Learn queue tuning, cardinality management, and relabeling strategies to scale your monitoring infrastructure efficiently.

Prometheus has become a cornerstone of modern observability stacks, particularly in cloud-native environments. As organizations scale their infrastructure, the ability to efficiently store and query long-term metrics becomes crucial. This is where Prometheus remote write shines, allowing you to send metrics to external storage systems for long-term retention and analysis.

However, as your metrics volume grows, you may encounter performance bottlenecks in your remote write pipeline. This guide will walk you through various strategies to optimize your Prometheus remote write performance, ensuring your monitoring system scales alongside your infrastructure.

Understanding Remote Write Performance Bottlenecks

Before diving into optimization techniques, it's essential to understand common performance issues and the factors that affect remote write performance.

Common Performance Issues:

  • High CPU and memory usage on Prometheus servers
  • Increased network bandwidth consumption
  • Growing remote write queues
  • Delays in metric availability in remote storage

Factors Affecting Performance:

  • Volume of metrics being collected
  • Cardinality of time series
  • Network latency and reliability
  • Remote storage write capacity
  • Prometheus server resources (CPU, memory, disk I/O)

Key Metrics to Monitor:

  • prometheus_remote_storage_samples_in_total
  • prometheus_remote_storage_samples_dropped_total
  • prometheus_remote_storage_samples_pending (exposed as prometheus_remote_storage_queue_length in older Prometheus releases)
  • prometheus_remote_storage_sent_batch_duration_seconds

Queue Configuration Optimization

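The remote write queue acts as a buffer between Prometheus and your remote storage (for example, Last9). Samples are read from the write-ahead log (WAL) into a set of parallel shards, and each shard sends batches to the remote endpoint, so proper queue configuration is crucial for smooth operation.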

Key Queue Parameters:

  • capacity: Number of samples buffered per shard before Prometheus stops reading more samples from the WAL
  • max_samples_per_send: Maximum number of samples per request
  • batch_send_deadline: Maximum time a sample will wait in the buffer before it is sent
  • min_shards and max_shards: Lower and upper bounds on the number of shards sending in parallel

Best Practices:

  • Start with conservative values and adjust based on observed performance
  • Increase capacity to absorb longer bursts, but keep in mind that memory usage grows with capacity multiplied by the number of shards
  • Adjust max_samples_per_send based on your network and remote storage capabilities
  • Set batch_send_deadline to balance between latency and efficiency

Example configuration:

remote_write:
  - url: 'http://remote-write-endpoint'
    queue_config:
      capacity: 100000
      max_samples_per_send: 10000
      batch_send_deadline: 5s
      min_shards: 1
      max_shards: 10
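
As a rough sizing sketch, the same configuration can be annotated with back-of-the-envelope throughput numbers. The 200 ms request latency is an assumed figure for illustration, not a measurement; substitute the batch duration you observe for your own endpoint.

```yaml
# Assumption: one remote write request takes about 200 ms end to end.
remote_write:
  - url: 'http://remote-write-endpoint'
    queue_config:
      max_samples_per_send: 10000  # 10,000 samples / 0.2 s = ~50,000 samples/s per shard
      max_shards: 10               # 10 shards x 50,000 = ~500,000 samples/s ceiling
      capacity: 100000             # per shard: up to 10 x 100,000 = 1,000,000 samples buffered
      min_shards: 1
      batch_send_deadline: 5s
```

If your ingest rate approaches the estimated ceiling, raising max_shards or reducing request latency is usually a better first step than raising capacity, which only adds buffer and memory.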

Data Cardinality Management

High cardinality can severely impact remote write performance. Each unique combination of labels creates a new time series, potentially leading to millions of series.

Strategies for Reducing Cardinality:

  1. Review and refine your labeling strategy
  2. Use recording rules to pre-aggregate high cardinality metrics
  3. Cap cardinality at scrape time, for example with sample_limit in your scrape configuration
  4. Use the Last9 control plane to make changes at runtime without touching your instrumentation
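
A scrape-time limit could look like the sketch below. The job name, target, and limit values are hypothetical; sample_limit and label_limit are standard scrape_config settings, and a scrape that exceeds either one is treated as failed, so choose limits with some headroom.

```yaml
scrape_configs:
  - job_name: 'api'            # hypothetical job name
    sample_limit: 50000        # fail the scrape if it returns more samples than this
    label_limit: 30            # fail the scrape if any sample carries more labels than this
    static_configs:
      - targets: ['api.example.internal:9100']  # hypothetical target
```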

Example Recording Rule:

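groups:
  - name: example
    rules:
      # Take the rate of each counter first, then sum, so counter resets are handled correctly
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

This rule pre-aggregates the http_requests_total metric into one series per job. The raw series are still sent to remote storage unless you also drop them with write_relabel_configs, so the rule reduces remote write cardinality only when the two are combined.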
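
To get the cardinality benefit on the remote side, the raw metric can be dropped at remote write time while the recorded series is still forwarded. This is a minimal sketch that assumes nothing downstream needs the raw http_requests_total series:

```yaml
remote_write:
  - url: 'http://remote-write-endpoint'
    write_relabel_configs:
      # The regex is fully anchored, so only the raw metric name matches;
      # the recorded series has a different name and is still sent.
      - source_labels: [__name__]
        regex: 'http_requests_total'
        action: drop
```

The raw series remain queryable in the local Prometheus for as long as local retention allows.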

Effective Use of Relabeling

Relabeling allows you to modify labels before metrics are sent to remote storage, helping to reduce data volume and cardinality.

Relabeling Strategies:

  1. Drop unnecessary metrics
  2. Remove high-cardinality labels
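  3. Send pre-aggregated series (such as recording rule outputs) in place of the raw series they summarize, since relabeling itself cannot aggregate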

Example configuration:

remote_write:
  - url: 'http://remote-write-endpoint'
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'temp.*'
        action: drop
      - regex: '(id|uuid)'
        action: labeldrop

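This configuration drops all metrics whose names start with "temp" and removes the "id" and "uuid" labels from every remaining series. After a labeldrop, check that each series is still uniquely labeled; series that end up with identical label sets will collide.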
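
The opposite approach is an allowlist: forward only the series you know you need and drop everything else. The pattern below is an assumption for illustration (recording rule outputs prefixed with "job:" plus the up metric); adjust it to your own naming.

```yaml
remote_write:
  - url: 'http://remote-write-endpoint'
    write_relabel_configs:
      # keep discards every series whose metric name does not match the regex
      - source_labels: [__name__]
        regex: 'job:.*|up'
        action: keep
```

An allowlist is stricter than a denylist: new metrics are not shipped until someone adds them, which keeps remote cardinality predictable at the cost of some maintenance.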

Network and Data Transfer Optimization

Efficient data transfer is crucial for remote write performance, especially when dealing with high-latency or unreliable networks.

Optimization Techniques:

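  1. Rely on the built-in compression: remote write request bodies are always Snappy-compressed, so there is no setting to enable
  2. Use persistent connections
  3. Implement retry mechanisms with exponential backoff (min_backoff and max_backoff in queue_config)

Example configuration with a request timeout and retry backoff:

remote_write:
  - url: 'http://remote-write-endpoint'
    remote_timeout: 30s
    queue_config:
      min_backoff: 30ms
      max_backoff: 5s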

Remote Storage Considerations

The choice of remote storage can significantly impact your remote write performance.

Factors to Consider:

  • Write throughput capacity
  • Query performance
  • Data retention policies
  • Scalability and operational complexity

Popular remote storage options include Thanos, Cortex, Last9 and VictoriaMetrics. Each has its strengths and trade-offs, so choose based on your specific requirements.

Monitoring and Troubleshooting Remote Write

Proactive monitoring of your remote write pipeline is essential for maintaining performance.

Key Metrics to Watch:

  • prometheus_remote_storage_samples_in_total
  • prometheus_remote_storage_samples_dropped_total
  • prometheus_remote_storage_samples_pending (exposed as prometheus_remote_storage_queue_length in older Prometheus releases)
  • prometheus_remote_storage_sent_batch_duration_seconds

Set up alerts for abnormal values in these metrics to catch issues early.
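
As a starting point, alerting rules on two of these metrics might look like the sketch below. The group name, alert names, thresholds, and durations are all assumptions to tune against your own baseline.

```yaml
groups:
  - name: remote_write_health  # hypothetical group name
    rules:
      - alert: RemoteWriteSlowBatches
        # p99 time to send one batch, computed from the histogram buckets
        expr: |
          histogram_quantile(
            0.99,
            sum by (le) (rate(prometheus_remote_storage_sent_batch_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'p99 remote write batch duration is above 1s'
      - alert: RemoteWriteDroppingSamples
        # This counter also includes samples removed by write_relabel_configs,
        # so the threshold should sit above your normal relabeling baseline.
        expr: rate(prometheus_remote_storage_samples_dropped_total[5m]) > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: 'Remote write is dropping more samples than expected'
```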

Troubleshooting Tips:

  1. Check Prometheus logs for error messages
  2. Verify network connectivity to the remote endpoint
  3. Analyze remote write metrics for bottlenecks
  4. Review recent configuration changes

Case Study: Optimizing Remote Write at Scale

At Last9, we worked with a client who was struggling with remote write performance as their infrastructure grew to over 10,000 nodes. Here's how we optimized their setup:

  1. Implemented relabeling to reduce cardinality through the Last9 control plane
  2. Set up streaming aggregation to pre-aggregate high-cardinality metrics
  3. Tuned queue settings based on observed traffic patterns

Results:

  • 60% reduction in remote write CPU usage
  • 45% decrease in network bandwidth consumption
  • 70% improvement in remote write latency

Best Practices and Future Considerations

As you optimize your Prometheus remote write setup, keep these best practices in mind:

  1. Start with conservative settings and adjust gradually
  2. Regularly review and refine your labeling strategy
  3. Monitor remote write performance continuously
  4. Keep your Prometheus version updated to benefit from ongoing improvements

Looking ahead, the Prometheus community is working on features like streaming remote write and improved backpressure handling, which promise to further enhance remote write performance.

Conclusion

Optimizing Prometheus remote write performance is an ongoing process that requires a deep understanding of your monitoring needs and infrastructure. By applying the techniques discussed in this guide – from queue configuration and cardinality management to effective relabeling and careful remote storage selection – you can significantly improve the efficiency and reliability of your Prometheus remote write setup.

Remember, there's no one-size-fits-all solution. Continuously monitor your system's performance, be prepared to make adjustments, and stay informed about new developments in the Prometheus ecosystem. With these strategies in hand, you'll be well-equipped to scale your monitoring infrastructure alongside your growing business needs.

If you want to talk through specific settings, join the Last9 Discord server.

Authors
Gabriel Diaz

Software Engineer at Last9