Prometheus has become an essential part of modern observability stacks, providing powerful time-series data collection and alerting capabilities. However, as organizations scale their infrastructure, they often encounter limitations with Prometheus' single-instance storage model.
This is where Remote Write comes in: it lets Prometheus stream metrics to external storage systems in near real time while retaining its collection and querying capabilities.
What is Prometheus Remote Write?
Prometheus Remote Write is a protocol that enables Prometheus to send metrics data to compatible external storage systems in real-time. This feature addresses several critical needs:
- Long-term storage: Retain metrics beyond Prometheus' local retention limits
- High-availability: Create redundant copies of your metrics data
- Centralization: Collect metrics from multiple Prometheus instances in a single location
- Specialized storage: Leverage databases optimized for specific query patterns
The observability ecosystem has widely adopted this protocol: storage engines such as Cortex and Thanos, along with various cloud provider offerings, now expose Prometheus-compatible remote write endpoints.
Key Components of Remote Write Architecture
The Remote Write architecture consists of three primary components:
- Prometheus Server: The source of metrics data, responsible for scraping targets and forwarding metrics
- Remote Write Protocol: A well-defined HTTP-based protocol that sends Snappy-compressed Protocol Buffers payloads for efficient serialization
- Remote Write Endpoint: The destination system that receives, processes, and stores the metrics
This architecture maintains Prometheus' pull-based collection model while adding a push-based capability for storage, creating a flexible and scalable observability pipeline.
Configuring Remote Write in Prometheus
Remote Write is configured in the Prometheus configuration file (usually `prometheus.yml`) using YAML. Here's a basic example:
```yaml
remote_write:
  - url: "https://remote-write-endpoint.example.com/api/v1/write"
    basic_auth:
      username: "prometheus"
      password: "secret"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "unwanted_metric.*"
        action: drop
```
This configuration instructs Prometheus to:
- Send metrics to the specified URL
- Authenticate using basic authentication
- Apply relabeling rules to filter metrics before sending
Using External Labels for Source Identification
When aggregating metrics from multiple Prometheus instances, it's crucial to identify the source of each metric. External labels add global metadata to all metrics sent from a specific Prometheus instance:
```yaml
global:
  external_labels:
    region: "us-west-1"
    environment: "production"
    cluster: "main-cluster"
```
These labels help distinguish metrics from different Prometheus instances when they're aggregated in a central system.
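Once metrics from several instances land in a central store, these labels can drive aggregation and routing. As a hedged sketch, a recording rule evaluated in the central system might group availability by the labels above (the rule group name and expression are illustrative, not from the original configuration):

```yaml
groups:
  - name: per-region-availability   # Hypothetical rule group
    rules:
      - record: region:up:avg
        # "region" and "environment" exist here only because each Prometheus
        # instance attached them via external_labels before remote writing
        expr: avg by (region, environment) (up)
```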
Write Relabeling for Filtering and Transformation
Write relabeling allows you to modify or filter metrics before they're sent to the remote endpoint:
```yaml
write_relabel_configs:
  - source_labels: [__name__, job]
    separator: ";"
    regex: "node_.*sockets;node_exporter"
    action: keep
```
This is powerful for:
- Reducing data volume by dropping unnecessary metrics
- Normalizing labels across different sources
- Adding or modifying metadata before storage
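For instance, normalizing labels might mean copying an inconsistently named label into a canonical one before sending. A hedged sketch, assuming a hypothetical source label named dc that should become datacenter:

```yaml
write_relabel_configs:
  # Copy the hypothetical "dc" label into a canonical "datacenter" label
  - source_labels: [dc]
    target_label: datacenter
    action: replace
  # Drop the old label so only the normalized form reaches remote storage
  - regex: dc
    action: labeldrop
```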
Critical Settings for Optimal Performance
Scrape Interval vs. Evaluation Interval: What's the difference?
Two important configuration parameters in Prometheus are often confused but serve distinct purposes:
Scrape Interval: Controlling Data Collection Frequency
The `scrape_interval` setting defines how frequently Prometheus collects metrics from monitored targets:
```yaml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: "node-exporter"
    scrape_interval: 5s  # Overrides the global setting for this job
```
Key points about scrape interval:
- Affects data resolution and storage requirements
- Can be set globally and overridden per job
- Shorter intervals provide more detail but increase resource usage
- Should align with the dynamics of the metrics you're collecting
Evaluation Interval: Managing Rule Processing
The `evaluation_interval` setting determines how frequently Prometheus evaluates recording and alerting rules:
```yaml
global:
  evaluation_interval: 30s

rule_files:
  - "rules/recording_rules.yml"
  - "rules/alerting_rules.yml"
```
Key differences from scrape interval:
- Controls rule processing frequency, not data collection
- Affects alert responsiveness and resource consumption
- Typically longer than the scrape interval to reduce computational load
- Should be tuned based on the urgency of your alerting needs
Balancing Intervals for Optimal Performance
Choosing appropriate intervals requires balancing several factors:
- Lower intervals increase resolution but consume more resources
- Scrape interval should be shorter than the shortest-lived phenomena you want to observe
- Evaluation interval should be shorter than the acceptable delay for alerts
- Both should be consistent with your retention and query needs
A common pattern is to use shorter scrape intervals for critical infrastructure (5-10s) and longer intervals for less dynamic systems (30-60s), as the sketch below illustrates.
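A minimal sketch of that pattern; the job names and targets are hypothetical:

```yaml
global:
  scrape_interval: 30s          # Default for less dynamic systems
  evaluation_interval: 30s

scrape_configs:
  - job_name: "payment-api"     # Hypothetical critical service
    scrape_interval: 5s         # Higher resolution where it matters
    static_configs:
      - targets: ["payment-api:8080"]
  - job_name: "batch-reports"   # Hypothetical slow-moving system
    scrape_interval: 60s
    static_configs:
      - targets: ["batch-reports:9100"]
```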
Remote Write vs. Federation: Choosing the Right Approach
When scaling Prometheus beyond a single instance, you have two primary options: Remote Write and Federation. Understanding the differences is crucial for designing an effective monitoring architecture.
Prometheus Federation: Hierarchical Metric Collection
Federation allows a Prometheus server to scrape selected time series from another Prometheus server:
```yaml
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets:
          - 'source-prometheus:9090'
```
Federation is useful for:
- Building hierarchical Prometheus deployments
- Aggregating metrics from multiple Prometheus instances
- Creating global views across different environments
Key Differences Between Remote Write and Federation
| Feature | Remote Write | Federation |
| --- | --- | --- |
| Data Flow | Push-based | Pull-based |
| Latency | Low (real-time) | Higher (depends on scrape interval) |
| Completeness | All metrics | Selected metrics only |
| Storage | External system | Local Prometheus storage |
| Resource Impact | Network and CPU on sender | Network and CPU on receiver |
| High Availability | Built for HA setups | Requires additional configuration |
| Scalability | Highly scalable | Limited by single-instance constraints |
Remote Write in Kubernetes Environments
Kubernetes presents specific considerations for Remote Write:
- Resource Management: Configure appropriate limits and requests for Prometheus pods to ensure stable operation
- Network Policies: Ensure outbound connectivity from Prometheus to the remote write endpoint (a NetworkPolicy sketch follows this list)
- Authentication: Use Kubernetes secrets for secure credential management
- High Cardinality: Be cautious with Kubernetes labels that can cause high cardinality issues
- Monitoring the Monitoring: Use metrics like `prometheus_remote_storage_*` to monitor the health of your remote write setup
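To make the network-policy point concrete, here is a hedged sketch of an egress rule for Prometheus pods; the namespace, pod labels, and port are assumptions for illustration:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-remote-write-egress
  namespace: monitoring                    # Assumed namespace
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus   # Assumed pod label
  policyTypes:
    - Egress
  egress:
    - ports:
        - protocol: TCP
          port: 443                        # HTTPS to the remote write endpoint
```

In practice you would also need to allow DNS egress, and you may want to restrict destinations further with an ipBlock selector.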
When using tools like Prometheus Operator, Remote Write can be configured through custom resources:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  remoteWrite:
    - url: "https://remote-write.example.com/api/v1/write"
      basicAuth:
        username:
          name: remote-write-auth
          key: username
        password:
          name: remote-write-auth
          key: password
```
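The basicAuth block above references a Secret named remote-write-auth. A matching manifest might look like the following sketch; the namespace and credential values are placeholders:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: remote-write-auth
  namespace: monitoring   # Assumed; must match the Prometheus resource's namespace
type: Opaque
stringData:               # stringData avoids manual base64 encoding
  username: prometheus
  password: secret
```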
How to Integrate with OpenTelemetry
The PrometheusRemoteWriteExporter in OpenTelemetry provides a bridge between OpenTelemetry and Prometheus ecosystems, allowing metrics collected by OpenTelemetry to be sent to any Prometheus-compatible remote write endpoint.
Setting Up the OpenTelemetry Collector
The OpenTelemetry Collector acts as a central hub for telemetry data, processing and forwarding it to various backends:
```yaml
extensions:
  basicauth/remote:
    client_auth:
      username: "${REMOTE_WRITE_USERNAME}"
      password: "${REMOTE_WRITE_PASSWORD}"

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1000

exporters:
  prometheusremotewrite:
    endpoint: "https://remote-write-endpoint.example.com/api/v1/write"
    auth:
      authenticator: basicauth/remote
    resource_to_telemetry_conversion:
      enabled: true

service:
  extensions: [basicauth/remote]  # The authenticator must be listed here to be started
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```
Advanced Configuration Options
The PrometheusRemoteWriteExporter supports several advanced configuration options:
Queue Management
Control how metrics are buffered and sent:
```yaml
exporters:
  prometheusremotewrite:
    endpoint: "https://remote-write-endpoint.example.com/api/v1/write"
    remote_write_queue:
      enabled: true
      queue_size: 10000  # Maximum number of metrics buffered in the queue
      num_consumers: 5   # Concurrent workers draining the queue
```
Retry Handling
Configure retry behavior for resilience:
```yaml
exporters:
  prometheusremotewrite:
    endpoint: "https://remote-write-endpoint.example.com/api/v1/write"
    retry_on_failure:
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 5m
```
TLS Configuration
Secure your connection with TLS:
```yaml
exporters:
  prometheusremotewrite:
    endpoint: "https://remote-write-endpoint.example.com/api/v1/write"
    tls:
      ca_file: "/path/to/ca.crt"
      cert_file: "/path/to/client.crt"
      key_file: "/path/to/client.key"
      insecure: false  # Set to true only for testing
```
Advanced Remote Write Topics: Scaling and Optimization
Remote Write Filtering and Sampling for High-Volume Metrics
For high-volume Prometheus deployments, sending all metrics to remote storage may be impractical. Implement strategic filtering:
```yaml
remote_write:
  - url: "https://critical-metrics.example.com/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'critical_.*|up|instance:.*'
        action: keep
  - url: "https://all-metrics.example.com/api/v1/write"
    queue_config:
      capacity: 20000
```
For extremely high-volume use cases, consider sampling. Prometheus relabeling cannot randomly drop individual samples, but the hashmod action can deterministically keep a fixed fraction of series:

```yaml
remote_write:
  - url: "https://sampled-metrics.example.com/api/v1/write"
    write_relabel_configs:
      # Hash each series into one of 10 buckets, then keep bucket 0 (~10% of series)
      - source_labels: [__name__, instance]
        modulus: 10
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"
        action: keep
      # Drop the temporary hash label before sending
      - regex: __tmp_hash
        action: labeldrop
```
Multi-Endpoint Remote Write Strategy
Sending to multiple remote endpoints provides redundancy and specialized storage:
```yaml
remote_write:
  - url: "https://long-term-storage.example.com/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'business_.*|sla_.*'
        action: keep
  - url: "https://alert-metrics.example.com/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: '.*_alerts|up|.*_status'
        action: keep
  - url: "https://all-metrics.example.com/api/v1/write"
    queue_config:
      max_samples_per_send: 1000
```
This approach enables:
- Different retention policies for different metric types
- Specialized query engines for specific use cases
- Cost optimization by routing high-value metrics to premium storage
Troubleshooting Common Remote Write Issues and Solutions
Connection and Authentication Issues
Timeout Problems
If you encounter timeouts:
- Check network connectivity and firewall rules
- Validate endpoint reachability with a simple HTTP request (the endpoint only accepts POSTs, so an error response such as 405 Method Not Allowed still confirms it is reachable):

```bash
curl -v https://remote-write-endpoint.example.com/api/v1/write
```

- Increase timeout settings:

```yaml
remote_write:
  - url: "https://remote-write-endpoint.example.com/api/v1/write"
    remote_timeout: 60s  # Default is 30s
```
Authentication Failures
Common authentication issues:
- Check for URL-encoding issues in passwords
- Verify that credentials have appropriate permissions
- Validate credentials separately using curl:

```bash
curl -u "${USERNAME}:${PASSWORD}" https://remote-write-endpoint.example.com/api/v1/write
```
Data Quality and Performance Issues
Label Value Problems
Prometheus and remote write endpoints impose requirements on labels:
- Label names must match the pattern [a-zA-Z_][a-zA-Z0-9_]*; label values may be any UTF-8 string
- Many remote endpoints enforce limits on label name and value length
- Some endpoints may have additional restrictions
Monitor for label validation errors in the Prometheus logs:
```
level=warn ts=... component=remote msg="Remote storage returned HTTP status 400 Bad Request; error: invalid label value..."
```
High Cardinality
Watch for exploding cardinality, which can overwhelm remote storage:
- Monitor metrics like `prometheus_tsdb_head_series`
- Be cautious with automatically generated labels in Kubernetes environments
- Use relabeling to reduce cardinality:

```yaml
write_relabel_configs:
  # Strip the port from the instance label, e.g. "10.0.0.1:9100" becomes "10.0.0.1"
  - source_labels: [instance]
    target_label: instance
    regex: '(.*):.*'
    replacement: '$1'
```
Monitoring Remote Write Performance
Key metrics to monitor:
- `prometheus_remote_storage_samples_pending`: Samples waiting to be sent
- `prometheus_remote_storage_failed_samples_total`: Samples that couldn't be sent
- `prometheus_remote_storage_sent_batch_duration_seconds`: Time taken to send batches
- `prometheus_remote_storage_succeeded_samples_total`: Successfully sent samples
- `prometheus_remote_storage_retried_samples_total`: Samples that required retries
Create a dedicated dashboard for these metrics to quickly identify issues.
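Beyond dashboards, these same metrics can feed alerting rules. A hedged sketch; the thresholds and windows are illustrative, not recommendations:

```yaml
groups:
  - name: remote-write-health
    rules:
      - alert: RemoteWriteBacklog
        expr: prometheus_remote_storage_samples_pending > 10000   # Illustrative threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Remote write queue is backing up on {{ $labels.instance }}"
      - alert: RemoteWriteFailures
        expr: rate(prometheus_remote_storage_failed_samples_total[5m]) > 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Remote write samples are failing on {{ $labels.instance }}"
```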
Best Practices for Production Deployments
Architecture and Planning
- Start with Clear Requirements:
  - Define retention periods for different metric types
  - Identify query patterns and performance needs
  - Establish SLAs for monitoring availability
- Choose the Right Tools:
  - Select appropriate remote storage based on scale and query needs
  - Consider managed services vs. self-hosted options
  - Evaluate cost implications for different retention periods
- Design for High Availability:
  - Implement redundant Prometheus instances
  - Use multiple remote write endpoints for critical metrics
  - Plan for failure scenarios with appropriate retention
Configuration and Tuning
- Optimize Resource Usage:
  - Filter unnecessary metrics using write relabeling
  - Use appropriate scrape and evaluation intervals
  - Configure queue settings based on load testing
- Security Best Practices:
  - Use TLS for all remote write connections
  - Rotate authentication credentials regularly
  - Apply the principle of least privilege for remote write accounts
- Monitoring Your Monitoring:
  - Set up alerts for remote write failures
  - Monitor queue sizes and batch durations
  - Create dashboards for remote write performance metrics
Operational Excellence
- Documentation and Knowledge Sharing:
  - Document your remote write architecture
  - Create runbooks for common failure scenarios
  - Share best practices across teams
- Regular Audits:
  - Review what metrics are being sent and their value
  - Analyze storage usage and costs
  - Identify opportunities for optimization
- Continuous Improvement:
  - Stay updated with Prometheus and remote storage developments
  - Test new features in non-production environments
  - Refine your approach based on operational experience
Conclusion
Remote Write is a foundational capability for scaling Prometheus beyond a single instance, enabling enterprises to build comprehensive and resilient observability platforms.
FAQs
What does Prometheus remote write do?
Prometheus remote write allows Prometheus to send metrics data in real-time to external storage systems. It enables long-term storage beyond Prometheus' local retention limits, creates high-availability setups, centralizes metrics from multiple Prometheus instances, and integrates with specialized time-series databases optimized for specific workloads.
This capability is essential for ingesting metrics at scale, particularly when you want to visualize Prometheus metrics in tools like Grafana.
What is the remote write spec?
The remote write spec is a protocol definition that enables Prometheus to send metrics to compatible external systems. It uses HTTP as the transport layer and Protocol Buffers for efficient data serialization.
The spec defines how metrics, labels, and timestamps are encoded, compressed, and transmitted to ensure compatibility between Prometheus and various storage backends. While the wire format uses Protocol Buffers, you can inspect the data structure in JSON format for debugging purposes.
What is the difference between scrape_interval and evaluation_interval?
- scrape_interval: Determines how frequently Prometheus collects metrics from monitored targets. It affects data resolution and storage requirements.
- evaluation_interval: Controls how frequently Prometheus evaluates recording and alerting rules. It affects alert responsiveness and rule processing load.
While scrape_interval focuses on data collection, evaluation_interval deals with processing that data through rules. The official Prometheus docs on GitHub provide detailed explanations of these settings and their impact on performance.
What is the difference between Prometheus remote write and federation?
Prometheus remote write pushes metrics to external storage in real-time, while federation pulls selected metrics from other Prometheus servers.
Remote write offers lower latency, complete metrics collection, and is built for high-availability setups. Federation is pull-based, can only collect selected metrics, and is more suitable for hierarchical deployments with limited metric needs.
Many organizations use remote write to ingest Prometheus metrics into cloud platforms like AWS Managed Service for Prometheus or Azure Monitor.
How do I configure Prometheus remote write to send metrics to a remote storage system?
Add a remote_write section to your Prometheus configuration:
```yaml
remote_write:
  - url: "https://remote-storage-system.example.com/api/v1/write"
    basic_auth:
      username: "prometheus"
      password: "secret"
```
This configuration sends all metrics to the specified endpoint with authentication. For AWS or Azure cloud environments, you'll typically need to configure specific authentication mechanisms as outlined in their respective docs.
How can I configure Prometheus remote write for high availability?
Configure multiple Prometheus instances to send metrics to the same remote storage:
```yaml
remote_write:
  - url: "https://remote-storage.example.com/api/v1/write"
    queue_config:
      max_shards: 10   # Increase for higher throughput
      capacity: 20000  # Buffer capacity during outages
```
Also, add unique external labels to identify the source Prometheus:
```yaml
global:
  external_labels:
    prometheus_replica: "replica1"
    datacenter: "us-east"
```
This lets you maintain consistent metrics ingestion even if an individual Prometheus instance fails, keeping Grafana dashboards uninterrupted.
How do I configure Prometheus remote write for long-term storage?
Configure remote write to send metrics to a storage system designed for long-term retention:
```yaml
remote_write:
  - url: "https://long-term-storage.example.com/api/v1/write"
    write_relabel_configs:
      - source_labels: [__name__]
        regex: 'important_metric.*'
        action: keep  # Only send important metrics for long-term storage
```
Consider using filtering to reduce storage costs for long-term retention. AWS Timestream and Azure Data Explorer are popular cloud services for this purpose, offering tiered storage options for cost-effective long-term metrics storage.
How can I configure Prometheus remote write to send data to a specific endpoint?
Specify the exact endpoint URL in your configuration:
```yaml
remote_write:
  - url: "https://specific-endpoint.example.com/api/v1/write"
    authorization:
      type: Bearer
      credentials: "${BEARER_TOKEN}"  # Bearer token authentication
    headers:
      X-Tenant-ID: "tenant123"  # Add any required custom headers
```
Many endpoints support bearer token authentication as an alternative to basic auth. The GitHub repository for Prometheus contains extensive documentation on all supported authentication methods.
How can I visualize metrics sent via remote write?
Grafana is the most popular tool for visualizing Prometheus metrics stored in remote write destinations. Configure Grafana to connect to your remote write endpoint:
- Add a new data source in Grafana
- Select the appropriate data source type (Prometheus, AWS Managed Service for Prometheus, Azure Monitor, etc.)
- Configure the connection details, including authentication
- Create dashboards that query your metrics using PromQL
Grafana provides pre-built dashboards for common Prometheus metrics that you can import and customize.
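If you manage Grafana declaratively, the data source can also be provisioned from a file. A sketch assuming a basic-auth, Prometheus-compatible endpoint; the URL, user, and environment variable are placeholders:

```yaml
# e.g. /etc/grafana/provisioning/datasources/remote.yaml
apiVersion: 1
datasources:
  - name: Remote Prometheus
    type: prometheus
    access: proxy
    url: https://remote-storage.example.com
    basicAuth: true
    basicAuthUser: grafana
    secureJsonData:
      basicAuthPassword: "${GRAFANA_REMOTE_PASSWORD}"  # Injected from the environment
```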
Can I use remote write with cloud provider observability solutions?
Yes, major cloud providers support Prometheus remote write:
- AWS: AWS Managed Service for Prometheus offers a fully managed Prometheus-compatible monitoring service with remote write endpoints
- Azure: Azure Monitor supports Prometheus remote write through its metrics endpoint
- Google Cloud: Cloud Monitoring (formerly Stackdriver) provides a Prometheus remote write adapter
Each cloud provider's docs contain specific configuration details for their remote write implementations, including authentication and endpoint formats.
How can I troubleshoot JSON-related issues with remote write?
If you're experiencing issues with the remote write protocol:
- Enable debug logging in Prometheus to see the data being sent
- Use the `/debug/pprof/heap` endpoint to check for memory issues
- Check for JSON parsing errors in your remote write endpoint logs
- Query the Prometheus HTTP API, which returns JSON, to check runtime status:

```bash
curl -s http://prometheus:9090/api/v1/status/runtimeinfo | jq .
```
Remember that while you can inspect the protocol data in JSON format, the actual wire format uses Protocol Buffers for efficiency.