When it comes to keeping systems running smoothly, SRE teams need a way to track what really matters. That’s where the Golden Signals come in: a small set of key metrics that capture the most important aspects of a system’s performance and health.
In this blog, we’ll break down what the Golden Signals are, why they’re so important, and how they apply to modern setups like microservices and distributed systems.
What Are the Golden Signals?
The Golden Signals, popularized by Google’s Site Reliability Engineering book, are four crucial metrics that every monitoring system should track.
These metrics are designed to help you quickly diagnose problems in your systems, maintain reliability, and improve performance over time.
The four Golden Signals are:
Latency: How long it takes for a request to be processed by the system. If your latency is too high, users will experience slow performance, which leads to frustration and, ultimately, user churn.
Saturation: The measure of how much demand is being placed on your system compared to its capacity. Saturation is about understanding when your system is approaching its limits. Too much saturation leads to performance degradation, resource constraints, and, eventually, service failure.
Error Rate: The percentage of requests that result in failure. If your error rate rises significantly, it's a strong indicator that something has gone wrong in your system, whether it's a server issue, a broken application, or a misconfigured load balancer.
Traffic: The number of requests your system is receiving. Monitoring traffic helps determine if the system is being overloaded or if resources need to be scaled up or down to meet demand.
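To make these definitions concrete, here’s a minimal sketch (assuming a Python service instrumented with the prometheus_client library) of what tracking all four signals can look like. The metric names and the handle_request() handler are illustrative placeholders, not any particular product’s API.

```python
# Minimal sketch: instrumenting the four Golden Signals with prometheus_client.
# Metric names and handle_request() are hypothetical placeholders.
import random
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Traffic: total requests received")
ERRORS = Counter("app_request_errors_total", "Error rate numerator: failed requests")
LATENCY = Histogram("app_request_latency_seconds", "Latency: time spent handling a request")
IN_FLIGHT = Gauge("app_requests_in_flight", "Saturation proxy: requests currently in progress")

def handle_request():
    REQUESTS.inc()                                   # traffic
    IN_FLIGHT.inc()                                  # one rough saturation proxy
    start = time.time()
    try:
        time.sleep(random.uniform(0.01, 0.1))        # stand-in for real work
        if random.random() < 0.02:                   # simulate occasional failures
            raise RuntimeError("backend error")
    except RuntimeError:
        ERRORS.inc()                                 # error rate
    finally:
        LATENCY.observe(time.time() - start)         # latency
        IN_FLIGHT.dec()

if __name__ == "__main__":
    start_http_server(8000)                          # exposes /metrics for Prometheus to scrape
    while True:
        handle_request()
```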
Why Are These Signals So Important?
The importance of the Golden Signals cannot be overstated. By tracking these metrics, SRE teams can:
Ensure System Health: Monitoring latency, error rate, saturation, and traffic helps teams understand the overall health of a system. If one of these signals deviates from the expected values, it’s a clear sign that something is wrong.
Enable Proactive Monitoring: By observing these metrics in real-time, teams can act before users experience significant problems. This proactive approach helps to avoid costly downtime and improves user experience.
Improve Capacity Planning: By tracking saturation and traffic, SREs can anticipate potential bottlenecks and plan for future capacity needs. Understanding these metrics allows teams to scale resources in advance, ensuring that the system can handle increased demand without failure.
Optimize Performance: If you’re experiencing high latency or a rising error rate, it’s a signal that something is affecting the user experience. These signals allow teams to drill into the root cause and optimize the system for better performance.
How Do These Golden Signals Apply to SRE?
Site Reliability Engineering is all about ensuring that systems are reliable, scalable, and performant.
The Golden Signals are directly aligned with SRE principles, which emphasize measuring, monitoring, and improving the reliability of services. In SRE, reliability is often quantified by Service Level Objectives (SLOs) and Service Level Indicators (SLIs), and the Golden Signals provide the most relevant SLIs to focus on.
Here’s how:
Latency: If your system’s latency exceeds the SLO, then you know something is wrong. SRE teams use latency to gauge how quickly a service is responding to user requests. This impacts user satisfaction and system performance.
Saturation: Understanding saturation levels ensures that your system has enough capacity to handle its traffic without degrading performance. SRE teams monitor this to avoid bottlenecks, such as CPU running hot or memory nearing exhaustion.
Error Rate: If the error rate rises above a certain threshold, it’s time to investigate. High error rates can indicate issues such as failed HTTP requests, problems with APIs, or backend failures, all of which impact reliability and user experience.
Traffic: The number of requests can tell you if you’re seeing unexpected surges in traffic, which can lead to saturation and eventually system failure. Traffic monitoring helps SRE teams anticipate and handle these surges.
In short, by keeping a close eye on these signals, SRE teams can effectively manage reliability, ensure smooth operations, and maintain system performance under various conditions.
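As a small illustration of how a Golden Signal becomes an SLI, the sketch below turns the error-rate signal into an availability SLI and checks it against a hypothetical 99.9% SLO; the request counts and the SLO target are invented for the example.

```python
# Hypothetical example: turning the error-rate signal into an SLI and
# checking it against an SLO target. The numbers here are made up.
SLO_TARGET = 0.999  # 99.9% of requests should succeed

def availability_sli(total_requests: int, failed_requests: int) -> float:
    """SLI: fraction of requests that succeeded over the measurement window."""
    if total_requests == 0:
        return 1.0
    return (total_requests - failed_requests) / total_requests

total, failed = 1_200_000, 1_500            # e.g. counts over a 30-day window
sli = availability_sli(total, failed)
error_budget_used = (1 - sli) / (1 - SLO_TARGET)

print(f"SLI: {sli:.5f}, SLO: {SLO_TARGET}, error budget used: {error_budget_used:.0%}")
if sli < SLO_TARGET:
    print("SLO breached: investigate the error-rate signal first.")
```

Many teams track the error-budget figure rather than the raw SLI, since it shows how much headroom remains before the SLO is breached.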
Using Monitoring Tools for Golden Signals
To effectively monitor these signals, SRE teams rely on a variety of monitoring tools like Prometheus, Grafana, APM solutions, and Last9. These tools help teams track key metrics in real-time, making it easier to spot and resolve issues quickly. Let’s dive into how each of these tools supports monitoring the Golden Signals:
Prometheus: A powerful open-source tool for collecting and querying metrics. It integrates seamlessly with Kubernetes, making it a perfect fit for containerized environments. Prometheus excels at gathering critical metrics like latency, traffic, and error rates, which Grafana can then query and visualize (see the query sketch after this list).
Grafana: Known for its ability to turn raw data into clear, actionable insights, Grafana lets you visualize your Golden Signals on custom dashboards. Whether you're tracking latency, traffic, or error rates, Grafana helps you get a quick overview of your system's health.
APM Tools: Tools like Last9, New Relic, Datadog, and others offer deep dives into application performance. They let teams monitor error rates and latency, and pinpoint specific performance issues within the application code or infrastructure.
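To show what pulling these signals out of a monitoring backend can look like, here’s a sketch that queries Prometheus’ standard /api/v1/query HTTP endpoint for traffic, error ratio, and p95 latency. The Prometheus address and the http_requests_total / http_request_duration_seconds metric names are assumptions; substitute whatever your exporters actually expose.

```python
# Sketch: querying Prometheus' HTTP API for three Golden Signals.
# Assumes Prometheus at localhost:9090 and common http_* metric names;
# swap in the metric names your exporters actually use.
import requests

PROMETHEUS = "http://localhost:9090/api/v1/query"

QUERIES = {
    # Traffic: requests per second over the last 5 minutes
    "traffic_rps": 'sum(rate(http_requests_total[5m]))',
    # Error rate: share of 5xx responses over the last 5 minutes
    "error_ratio": 'sum(rate(http_requests_total{status=~"5.."}[5m])) '
                   '/ sum(rate(http_requests_total[5m]))',
    # Latency: 95th percentile from a histogram
    "latency_p95": 'histogram_quantile(0.95, '
                   'sum(rate(http_request_duration_seconds_bucket[5m])) by (le))',
}

for name, query in QUERIES.items():
    resp = requests.get(PROMETHEUS, params={"query": query}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    value = float(result[0]["value"][1]) if result else float("nan")
    print(f"{name}: {value:.4f}")
```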
Best Practices for Monitoring the Golden Signals
Here are some best practices for monitoring these Golden Signals:
Set Up Alerts for Anomalies: Use automated alerts to notify you when latency, error rates, or traffic exceeds predefined thresholds. This helps your team react quickly to issues.
Use Percentiles for Latency: Rather than focusing on average response time, track latency at percentiles such as the 95th or 99th. Averages hide the slow tail, while percentiles show what your slowest users actually experience (see the example after this list).
Automate Scaling: For saturation and traffic monitoring, set up auto-scaling to ensure that your system has enough resources to handle spikes in demand.
Correlate Signals: Look at how latency, error rates, and traffic interact. For example, a spike in traffic might cause an increase in latency and error rates, signaling that your system is reaching capacity.
Utilize Distributed Tracing: For microservices and distributed systems, distributed tracing helps you track requests as they move through the system, providing deeper insight into performance bottlenecks and failures.
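To illustrate the percentile tip above, the snippet below compares the average against p50/p95/p99 on an invented, skewed set of response times; the slow tail barely moves the mean but dominates the upper percentiles.

```python
# Illustration of why percentiles beat averages for latency (sample data is invented).
import random
import statistics

random.seed(42)
latencies_ms = (
    [random.gauss(120, 20) for _ in range(950)]     # most requests are fast
    + [random.gauss(900, 150) for _ in range(50)]   # a slow tail the average hides
)

mean = statistics.mean(latencies_ms)
q = statistics.quantiles(latencies_ms, n=100)        # 99 cut points
p50, p95, p99 = q[49], q[94], q[98]

print(f"mean: {mean:.0f} ms")   # looks acceptable on its own
print(f"p50:  {p50:.0f} ms")    # the typical request
print(f"p95:  {p95:.0f} ms")    # where the slow tail starts to show
print(f"p99:  {p99:.0f} ms")    # the experience that drives complaints
```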
How Golden Signals Help You Troubleshoot and Optimize
Now that we’ve covered the fundamentals of the Golden Signals, let’s explore how these metrics apply in real-world scenarios—especially in complex systems like microservices and distributed systems.
Plus, we’ll tackle some frequently asked questions that will help you gain a deeper understanding of why these signals are essential for keeping your system healthy and performing well.
Golden Signals in Microservices and Distributed Systems
In microservices and distributed systems, tracking the Golden Signals becomes even more crucial. These environments can have many moving parts—APIs, databases, front-end and back-end components, etc.—making it challenging to get a clear picture of overall system health.
Here’s how the Golden Signals can be applied:
Latency in Microservices:
In a microservices environment, latency could be introduced between services, especially when one service has to call another to fulfill a request.
Latency can stack up as requests travel through multiple services, databases, and APIs. Tracking latency across services helps pinpoint where delays are happening, enabling teams to optimize and troubleshoot specific parts of the system.
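A minimal tracing sketch, assuming the OpenTelemetry Python SDK with a console exporter, shows how spans attribute latency to each hop; the service and span names (checkout, call-payment-service, call-inventory-service) are invented for the example.

```python
# Minimal OpenTelemetry sketch: spans show where cross-service latency accumulates.
# Requires: pip install opentelemetry-sdk
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("checkout"):               # the user-facing request
    time.sleep(0.02)                                          # local work
    with tracer.start_as_current_span("call-payment-service"):
        time.sleep(0.15)                                      # a slow downstream call stands out
    with tracer.start_as_current_span("call-inventory-service"):
        time.sleep(0.03)
# Each printed span's start/end timestamps show exactly where the delay accumulated.
```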
Saturation in Distributed Systems:
Distributed systems often rely on various nodes to handle requests. If any of these nodes reach their saturation point, it can cause cascading failures or slowdowns. Monitoring saturation helps teams identify bottlenecks—whether it’s CPU usage, memory, or network bandwidth—and prevent systems from going into overload.
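As one rough illustration, the sketch below samples host-level saturation proxies with the psutil library and flags anything above an 85% threshold; the threshold is an assumption (echoing the rule of thumb in the FAQ below), and real systems usually watch per-node and per-resource limits rather than a single number.

```python
# Sketch: sampling host-level saturation proxies with psutil.
# The 85% threshold is an illustrative rule of thumb, not a universal value.
import psutil

THRESHOLD = 85.0  # percent

def check_saturation() -> None:
    cpu = psutil.cpu_percent(interval=1)           # CPU utilization over 1s
    mem = psutil.virtual_memory().percent          # memory in use
    disk = psutil.disk_usage("/").percent          # root filesystem usage
    for name, value in [("cpu", cpu), ("memory", mem), ("disk", disk)]:
        status = "SATURATED" if value >= THRESHOLD else "ok"
        print(f"{name:>6}: {value:5.1f}% [{status}]")

check_saturation()
```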
Error Rates in Complex Architectures:
With microservices, failure in one service can lead to a failure in another, affecting the overall error rate. By tracking error rates in different services, teams can identify which service is failing and why. It could be an API returning an HTTP 500 error, a backend database becoming unresponsive, or a misconfigured load balancer.
Traffic Across Microservices:
As traffic increases or changes, it’s important to monitor traffic at both the individual service level and the system-wide level. Spikes in traffic may affect specific microservices, causing overloads, or may require scaling up to handle the additional requests.
Troubleshooting Performance Degradation
When you notice performance degradation or issues like high latency or increasing error rates, the Golden Signals act as your first line of defense. Here’s how to troubleshoot effectively using these metrics:
Latency Spikes:
If you see a latency spike, first check the underlying infrastructure—are resources maxed out?
Next, look at the service-level metrics: which microservice or component is introducing the delay? It could be a database query taking too long, a network bottleneck, or inefficient code.
Error Rate Increases:
An uptick in error rates might point to a faulty service, a misconfigured API, or a problem with resource availability.
If errors are mainly HTTP 500s or 503s, it might indicate server overload or a failure in the backend services. Check the error logs and service-level health checks for more details.
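As a small illustration of narrowing this down, the sketch below tallies 5xx responses per service from a list of (service, status) records; the records are invented, and in practice they would come from your access logs or tracing backend.

```python
# Sketch: tallying error responses per service to narrow down a failing component.
# The records below are invented; in practice they come from access logs or traces.
from collections import Counter

records = [
    ("payments", 500), ("payments", 503), ("payments", 200),
    ("inventory", 200), ("inventory", 200), ("inventory", 404),
    ("gateway", 502), ("gateway", 200),
]

errors_by_service = Counter(
    (service, status) for service, status in records if status >= 500
)

for (service, status), count in errors_by_service.most_common():
    print(f"{service}: {count} x HTTP {status}")
```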
Saturation and Resource Utilization:
If you see saturation levels rising or system resources nearing full utilization, this often indicates that your system is struggling to keep up with the demand.
Check your CPU usage, memory allocation, and network bandwidth. Are your services hitting their limits? It might be time to scale your system or optimize resource usage.
Traffic Spikes:
If traffic suddenly spikes, your system might not be ready to handle the increase. Check your monitoring dashboards for patterns—was the spike due to a marketing campaign, new product release, or sudden user interest?
This can help you anticipate and prepare for future traffic increases.
Effective troubleshooting involves looking at the root cause and not just the symptoms. The Golden Signals give you a framework for understanding system performance, so you can prioritize fixes based on impact.
Conclusion
Focusing on the Golden Signals—latency, saturation, error rate, and traffic—helps SRE teams improve system reliability, performance, and scalability.
Today, monitoring and observability are more critical than ever. With tools like Prometheus, Grafana, Last9, and other monitoring systems, SRE teams can gain real-time visibility into system health and take action to ensure the best possible user experience.
If you’d like to chat more on golden signals or any other topic, join our community on Discord! We have a dedicated channel where you can discuss your use case with other developers.
FAQs
Q1: What are the Golden Signals?
The Golden Signals are four key metrics for monitoring the health and performance of systems: latency, saturation, error rate, and traffic. These signals help SRE teams understand how well their systems are performing and identify potential problems before they affect users.
Q2: What are the 4 signals of SRE?
The four signals of SRE are the same as the Golden Signals: latency, saturation, error rate, and traffic. These metrics are central to monitoring and maintaining system reliability.
Q3: How do the Golden Signals relate to the RED and USE methods?
The RED method (Rate, Errors, Duration) is a request-focused subset of the Golden Signals: rate maps to traffic, errors to error rate, and duration to latency. The USE method (Utilization, Saturation, Errors) takes a resource-centric view and covers the ground the saturation signal addresses. In practice, the Golden Signals combine both perspectives.
Q4: What are the 4 Golden Rules of Observability?
While the Golden Signals provide a foundation for observability, the four Golden Rules could refer to principles like measure what matters, ensure full coverage, monitor in real-time, and act on the data. These rules help teams ensure they’re collecting the right data and using it to drive decisions.
Q5: Why are the Golden Signals important?
They are critical because they provide a quick and effective way to monitor system performance, identify issues, and prevent downtime. By focusing on these four signals, SRE teams can ensure high availability and optimize user experience.
Q6: What is a REST API?
A REST API is an API that follows the REST architectural style, letting clients and servers communicate over HTTP using standard methods like GET and POST. It’s an essential part of modern web and mobile applications, where services rely on APIs to send and receive data.
Q7: Why isn’t liveness part of the Golden Signals?
While liveness checks if a service is up and running, the Golden Signals focus on the performance of the service—latency, error rate, saturation, and traffic. Liveness checks are valuable, but they don’t directly correlate to how well the system is performing.
Q8: What level of saturation ensures service performance and availability for customers?
The right level of saturation varies by system. However, a good rule of thumb is to aim to keep your system below 80-85% of its capacity to ensure enough resources for handling unexpected surges without compromising performance.
Q9: Are customers consistently experiencing slow page loads or latency lag?
Tracking latency and traffic will help answer this. If users consistently experience slow page loads, check your latency metrics across your services. Identify whether it’s due to a backend service or a resource constraint.
Q10: How do the Golden Signals apply to a microservices architecture?
In a microservices setup, the Golden Signals help track the performance of each individual service, making it easier to identify bottlenecks or failures. They help teams ensure that requests are processed efficiently and that services scale appropriately under load.
Q11: How do the Golden Signals apply to monitoring distributed systems?
In distributed systems, the Golden Signals offer insight into how each component is performing, helping teams identify issues that may affect the entire system. Latency, for example, can increase as requests pass through multiple services, while traffic can give you an indication of system load at different points in the architecture.