Sep 6th, ‘24/7 min read

Microservices Monitoring with the RED Method

This blog introduces the RED method, an approach that simplifies microservices monitoring by homing in on request rate, errors, and duration.

As someone who's spent years navigating the complexities of microservices, I’ve definitely hit some bumps when it comes to monitoring.

That’s why I’m excited to dive into the RED method—a straightforward and effective approach that’s completely changed how we handle observability and troubleshooting. It’s made life a lot easier, and I’m here to share how it can do the same for you.

What is the RED Method?

The RED method, introduced by Tom Wilkie, is a monitoring philosophy tailored for microservices. It focuses on three key metrics:

  1. Rate - The number of requests per second
  2. Errors - The number of failed requests per second
  3. Duration - The amount of time each request takes

These metrics provide a comprehensive view of your service's health and performance, making it easier to identify and resolve issues quickly.
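To make the three definitions concrete, here is a minimal sketch that derives them from a list of request records. The `Request` type, the `red_summary` helper, and the sample values are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float  # how long the request took, in seconds
    failed: bool       # True if the request ended in an error


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize Rate, Errors, and Duration for requests seen in one window."""
    total = len(requests)
    failures = sum(1 for r in requests if r.failed)
    mean_duration = sum(r.duration_s for r in requests) / total if total else 0.0
    return {
        "rate_per_s": total / window_s,       # Rate
        "errors_per_s": failures / window_s,  # Errors
        "mean_duration_s": mean_duration,     # Duration
    }


# Hypothetical sample: four requests observed during a 2-second window.
sample = [
    Request(0.10, False),
    Request(0.20, False),
    Request(0.30, True),
    Request(0.40, False),
]
summary = red_summary(sample, window_s=2.0)
```

In a real service you would not keep raw request records like this; a metrics library aggregates them into counters and histograms for you.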

The Origins of the RED Method

Before we discuss the details of the RED method, it's worth understanding its origin and development. The method was introduced by Tom Wilkie, a prominent figure in the observability space who later founded Kausal, a company that was subsequently acquired by Grafana Labs.


Wilkie first presented the RED method at the Prometheus meetup in London in 2015. At the time, he worked at Weaveworks, a company specializing in tooling for container-based applications. The challenges of monitoring complex distributed systems led him to develop this simplified yet powerful approach.

The RED method was born out of the need for a monitoring methodology that could handle the complexity of microservices architectures while remaining simple enough for teams to implement and understand quickly.

It was inspired by the USE method (Utilization, Saturation, Errors) developed by Brendan Gregg for resource monitoring, but tailored specifically for service monitoring.

The method quickly gained traction in the cloud-native community, particularly among teams using Prometheus for monitoring. Its simplicity and effectiveness made it a popular choice for organizations adopting microservices and containerized applications.

Why RED for Microservices?

Traditional monitoring approaches often fall short in a microservices environment. With dozens or even hundreds of services interacting, it's easy to get lost in a sea of metrics. The RED method cuts through the noise, giving you a clear picture of what matters most.

As an SRE on an operations team managing a large-scale microservices architecture, I've found that RED metrics significantly reduce the cognitive load during troubleshooting. This is especially crucial when you're on-call and need to quickly diagnose issues affecting end-users.

The RED Method vs. Other Monitoring Methodologies

Before diving deeper into the RED method, let’s talk about the other popular monitoring methodologies, particularly the Four Golden Signals popularized by Google's Site Reliability Engineering (SRE) book.

RED vs. The Four Golden Signals

The Four Golden Signals consist of:

  1. Latency
  2. Traffic
  3. Errors
  4. Saturation

While there's overlap between RED and the Four Golden Signals, the RED method is more focused and tailored for microservices.

The main difference lies in the 'Saturation' signal, which RED doesn't explicitly cover. However, in practice, I've found that saturation issues often manifest as increased latency or error rates, which RED captures effectively.

USE Method

Another methodology worth mentioning is the USE method, developed by Brendan Gregg. USE stands for:

  • Utilization
  • Saturation
  • Errors

The USE method is particularly useful for monitoring system resources like CPU, memory, and I/O. While it's not specifically designed for microservices, it can complement the RED method when you need to dive deeper into resource-level issues.


Implementing RED Metrics

Let's talk about how you can implement RED metrics in your microservices. I'll share code examples in both Java and Python to cover a broad range of use cases.

Java Implementation

Here's a simple example using Spring Boot and Micrometer to implement RED metrics:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class ExampleController {

    private final MeterRegistry meterRegistry;

    public ExampleController(MeterRegistry meterRegistry) {
        this.meterRegistry = meterRegistry;
    }

    @GetMapping("/api/example")
    public String example() {
        Timer.Sample sample = Timer.start(meterRegistry);
        try {
            // Your API logic here
            return "Hello, RED!";
        } catch (Exception e) {
            meterRegistry.counter("api.errors").increment();
            throw e;
        } finally {
            sample.stop(meterRegistry.timer("api.duration"));
            meterRegistry.counter("api.requests").increment();
        }
    }
}

In this example, we're tracking all three RED metrics:

  • Rate: Derived from the api.requests counter, which is incremented on every request
  • Errors: Tracked by the api.errors counter, which is incremented on every failed request
  • Duration: Measured by the api.duration timer

Python Implementation

For Python services, you can use the Prometheus client library:

from flask import Flask, Response
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

app = Flask(__name__)

# One metric per RED signal
REQUEST_COUNT = Counter('api_requests_total', 'Total API requests')
ERROR_COUNT = Counter('api_errors_total', 'Total API errors')
REQUEST_DURATION = Histogram('api_request_duration_seconds', 'API request duration')

@app.route('/api/example')
@REQUEST_DURATION.time()  # Duration: observe how long each call takes
def example():
    REQUEST_COUNT.inc()  # Rate: count every request
    try:
        # Your API logic here
        return "Hello, RED!"
    except Exception:
        ERROR_COUNT.inc()  # Errors: count failed requests
        raise  # re-raise with the original traceback

@app.route('/metrics')
def metrics():
    # Serve the metrics with the content type Prometheus expects
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)

if __name__ == '__main__':
    app.run(port=8000)

This Flask app exposes RED metrics that can be scraped by Prometheus.
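If you have many handlers, repeating the same counting and timing code in each one gets tedious. One option is to move it into a decorator. The sketch below shows the pattern using only the standard library; the in-memory dictionaries and the `red_instrumented` name are hypothetical stand-ins for a real metrics client, not part of any library.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-ins for a metrics backend (hypothetical, for illustration).
request_count = defaultdict(int)
error_count = defaultdict(int)
durations = defaultdict(list)


def red_instrumented(name):
    """Record Rate, Errors, and Duration for every call to the wrapped function."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            request_count[name] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                error_count[name] += 1
                raise  # re-raise so callers still see the failure
            finally:
                # Runs on success and on failure, so every call is timed.
                durations[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


@red_instrumented("example")
def handler(fail=False):
    if fail:
        raise ValueError("simulated failure")
    return "Hello, RED!"
```

Because every handler goes through the same wrapper, the metric names and semantics stay consistent across the service.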


Integrating with Observability Tools

To make the most of your RED metrics, you'll want to integrate them with modern observability tools. In my experience, the combination of Prometheus for metrics collection and Grafana for visualization works exceptionally well in a cloud-native environment.

Here's a sample Prometheus configuration to scrape metrics from our Python service:

scrape_configs:
  - job_name: 'python_app'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8000']

And a basic Grafana dashboard query to visualize request rate:

sum(rate(api_requests_total[5m])) by (job)
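If `rate()` feels opaque, it helps to see roughly what it computes: the per-second increase of a counter across the samples in a window. The sketch below is a simplified model, not Prometheus's actual implementation. It treats a drop in value as a counter reset but skips the extrapolation Prometheus performs at the window boundaries. The `simple_rate` name and the sample scrapes are hypothetical.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over (timestamp, value) samples.

    A simplified model of PromQL's rate(): a drop in value is treated as a
    counter reset, and no extrapolation to the window edges is done.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # After a reset the counter restarts from zero, so the new value
        # itself is the increase since the reset.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 seconds; the counter resets before t=45.
scrapes = [(0.0, 100.0), (15.0, 160.0), (30.0, 220.0), (45.0, 30.0), (60.0, 90.0)]
requests_per_second = simple_rate(scrapes)
```

This is why counters work well for the Rate and Errors signals: a service restart resets the counter, but the computed rate stays meaningful.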

Real-world Use Case

Let me share a real-world scenario where RED metrics saved the day. We had a microservice that was experiencing intermittent high latency. By analyzing our RED dashboard, we noticed that the request rate was spiking during certain hours, causing increased duration across the board.

Using this information, we implemented auto-scaling based on the request rate metric:

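apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-service
  # Illustrative bounds; maxReplicas is a required field
  minReplicas: 2
  maxReplicas: 10
  metrics:
  # Assumes a custom metrics adapter (e.g. Prometheus Adapter) exposes a
  # per-pod request rate derived from api_requests_total under this name
  - type: Pods
    pods:
      metric:
        name: api_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"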

This Kubernetes HorizontalPodAutoscaler configuration allowed us to automatically scale our service based on the request rate, effectively handling the traffic spikes and maintaining low latency. This approach significantly improved our service's scalability and resilience.
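The scaling decision itself follows the rule documented for the Kubernetes HorizontalPodAutoscaler: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up. The sketch below reproduces that arithmetic. The function name and the replica bounds are hypothetical, and the real controller adds details such as a tolerance band and stabilization windows that are left out here.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, min_replicas: int,
                     max_replicas: int) -> int:
    """desired = ceil(current_replicas * current_value / target_value),
    clamped to the configured replica bounds."""
    desired = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical numbers: a target of 100 requests per second per pod.
scale_up = desired_replicas(3, current_value=180, target_value=100,
                            min_replicas=2, max_replicas=10)
scale_down = desired_replicas(6, current_value=40, target_value=100,
                              min_replicas=2, max_replicas=10)
capped = desired_replicas(2, current_value=1000, target_value=100,
                          min_replicas=2, max_replicas=10)
```

With three pods each handling 180 requests per second, the rule asks for six pods. When traffic drops to 40 requests per second across six pods, it scales back to three.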

Advanced Monitoring Techniques

While the RED method provides a solid foundation, there are several advanced techniques you can employ to enhance your monitoring strategy:

Black Box Monitoring

In addition to the internal metrics provided by RED, it's crucial to implement black box monitoring. This involves testing your services from an external perspective and simulating real user interactions.

Tools like Prometheus Blackbox Exporter can help you monitor HTTP endpoints, ensuring your services are not only internally healthy but also accessible and responsive to end-users.
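To illustrate what a black box probe measures, here is a minimal sketch using only the Python standard library. It is not how the Blackbox Exporter is implemented; the `probe` function and its return shape are hypothetical. It issues one HTTP GET from outside the service and records whether the endpoint answered, with what status, and how long it took.

```python
import time
import urllib.error
import urllib.request


def probe(url: str, timeout_s: float = 5.0) -> dict:
    """Issue one HTTP GET from the outside and report what a user would see."""
    start = time.perf_counter()
    status = None
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, but with a 4xx/5xx status
    except (urllib.error.URLError, OSError):
        pass  # connection refused, DNS failure, timeout, and similar
    return {
        "up": status is not None and 200 <= status < 400,
        "status": status,
        "duration_s": time.perf_counter() - start,
    }
```

Calling `probe("http://localhost:8000/api/example")` on a schedule against the Flask example above would give you an external view of the same Errors and Duration signals that the service reports internally.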

Distributed Tracing

While RED metrics give you a high-level view of your services, distributed tracing allows you to follow a request as it travels through your microservices architecture.

Tools like OpenTelemetry provide a standardized way to implement tracing across different languages and frameworks. This can be particularly useful when debugging complex interactions between services.
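What ties the spans of one request together is a trace ID that every service passes along. The sketch below shows the idea using the W3C Trace Context `traceparent` header format (version, trace ID, span ID, flags). In practice an OpenTelemetry SDK generates and propagates this header for you; the helper names here are hypothetical and exist only to show the mechanics.

```python
import secrets


def new_traceparent() -> str:
    """Start a trace: version 00, random trace and span IDs, sampled flag 01."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Continue a trace in a downstream call: same trace ID, new span ID."""
    version, trace_id, _parent_span_id, flags = parent.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


# The first service starts the trace; the next hop keeps the trace ID
# and replaces only the span ID.
incoming = new_traceparent()
outgoing = child_traceparent(incoming)
```

Because the trace ID stays constant across hops, a tracing backend can stitch the spans from every service into one end-to-end view of the request.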

Real-time Monitoring and Alerting

To make the most of your RED metrics, implement real-time monitoring and alerting. Tools like Prometheus Alertmanager or Last9 Alerting can help you set up sophisticated alerting rules based on your RED metrics.

For example, you could set up an alert for when the error rate exceeds a certain threshold or when the 95th percentile of request duration starts to climb.

Levitate Change Events are a boon in such scenarios because they allow you to quickly triage the change.
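The two alert conditions mentioned above, an error rate over a threshold and a climbing 95th percentile, can be sketched as a small evaluation function. The thresholds, alert names, and sample data are hypothetical, and a real setup would express these as alerting rules in your monitoring system rather than in application code.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% of n) when sorted."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def evaluate_alerts(total: int, failed: int, durations_s: list[float],
                    max_error_ratio: float = 0.05,
                    max_p95_s: float = 0.5) -> list[str]:
    """Return the names of the alert conditions that currently hold."""
    firing = []
    if total and failed / total > max_error_ratio:
        firing.append("HighErrorRatio")
    if durations_s and percentile(durations_s, 95) > max_p95_s:
        firing.append("HighP95Duration")
    return firing


# Hypothetical window: 100 requests, 8 failures, 10 slow requests.
window_durations = [0.1] * 90 + [0.9] * 10
alerts = evaluate_alerts(total=100, failed=8, durations_s=window_durations)
```

In this sample the mean duration is roughly 0.18 seconds, well under the 0.5-second threshold, yet the 95th percentile is 0.9 seconds. Alerting on the percentile catches the slow tail that an average would hide.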

Challenges and Lessons

Implementing RED metrics isn't without its challenges. Here are a few lessons I've learned along the way:

  1. Consistency is key: Ensure all your services use the same naming conventions for RED metrics. This makes aggregation and alerting much easier.
  2. Watch out for high cardinality: Be cautious when adding labels to your metrics. High cardinality can put a strain on your monitoring system, unless your backend, such as Levitate, is built to handle it.
  3. Don't neglect percentiles: Averages hide outliers, so track percentiles (e.g., 95th, 99th) for duration to catch the slow requests that affect user experience.
  4. Integrate with tracing: RED metrics are powerful, but they don't tell the whole story. Integrating with distributed tracing tools like OpenTelemetry can provide deeper insights when needed.
  5. Monitor the full stack: While RED focuses on service-level metrics, don't forget to monitor the underlying infrastructure. Keep an eye on CPU usage, memory, disk I/O, and network bandwidth. Tools like Node Exporter for Prometheus can help you collect these system-level metrics.
  6. Use time series databases: For efficient storage and querying of your metrics, consider using specialized time series databases. Levitate, Prometheus's TSDB, and VictoriaMetrics are popular choices in the cloud-native ecosystem.
  7. Leverage cloud provider tools: If you're running on cloud platforms like AWS, take advantage of their native monitoring tools. AWS CloudWatch, for example, can be integrated with your RED metrics to provide a comprehensive view of your application and infrastructure.
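The cardinality warning in the list above is easier to appreciate with some arithmetic. In the worst case, one metric produces a separate time series for every combination of label values, so the series count is the product of the number of distinct values per label. The label names and counts below are hypothetical.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_values.values())


# Hypothetical label sets for a request counter.
bounded = {"service": 20, "method": 5, "status_class": 5}
unbounded = {**bounded, "user_id": 100_000}

bounded_series = series_count(bounded)      # 20 * 5 * 5
unbounded_series = series_count(unbounded)  # the same, times 100,000 users
```

Adding a single unbounded label such as a user ID multiplies the series count by the number of users, which is why identifiers like these usually belong in logs or traces rather than in metric labels.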

Conclusion

The RED method has been a game-changer in how we approach microservices monitoring. By focusing on these three critical metrics, we've significantly improved our ability to detect and respond to issues, ultimately delivering a better experience for our end-users.

As you implement RED metrics, you'll likely find yourself diving into other aspects of observability, from logging to tracing to advanced debugging techniques.

💹
We’d love to hear your stories about reliability, observability, or monitoring! Join the chat and share your insights with us in the SRE Discord community.

What are RED metrics?

RED metrics focus on three critical performance indicators for microservices: Rate (requests per second), Errors (failed requests), and Duration (latency). These metrics help in monitoring the health and performance of services effectively.

What are metrics in monitoring?

Metrics in monitoring refer to quantitative measurements that track the performance and health of systems, such as CPU usage, memory consumption, request rates, and error rates. They provide insights into how well a system is operating.

What are the four golden signals vs RED metrics?

The four golden signals include Latency, Traffic, Errors, and Saturation, which monitor system performance broadly. RED metrics (Rate, Errors, Duration) specifically target microservices performance, focusing on request handling.

What are the four golden signals?

The four golden signals are Latency, Traffic, Errors, and Saturation. They offer a comprehensive view of system health, tracking the speed of requests, the volume of traffic, failure rates, and resource utilization.

What are the benefits of monitoring?

Monitoring helps in identifying system issues early, improving performance, reducing downtime, ensuring reliability, and supporting capacity planning. It also aids in optimizing resources and enhancing user experiences.

What is the difference between a metric and a golden signal?

A metric is a quantitative measurement, such as CPU usage or error rates, while a golden signal is a specific set of key metrics (Latency, Traffic, Errors, and Saturation) used to evaluate the overall health of a system.

What is the difference between golden signals and RED metrics?

Golden signals monitor overall system performance with a broader focus, while RED metrics specifically measure the performance of microservices through Rate, Errors, and Duration, providing more targeted insights for service-level monitoring.

Authors

Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.
