What is Saturation and why should you think about it as an SLO?
The Google SRE book has some of the most practical advice for us SREs: the four golden signals of monitoring. It states that if you can only measure four metrics of your user-facing system, you should focus on latency, traffic, errors and saturation.
And while the first three are fairly intuitive when thinking about user-facing systems, saturation SLOs have long been a point of discussion in the SRE community. Today, we attempt to clarify that.
Quick definition for the uninitiated - saturation essentially describes how “full” your service is. Quoting the SRE bible, it is “a measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O).”
Saturation can be understood as the load on your network and server resources. This typically translates to metrics like CPU utilization, memory usage, disk capacity, operations per second and many more.
In complex systems, saturation can also be understood through higher-level load measurement. “Can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives?” is a typical way to think about the saturation level of a complex system. Most services, simple or complex, depend on other resources in the stack (such as CPU and network bandwidth), so it is key to know how much capacity remains at each layer of the stack.
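To make the "double the traffic or only 10% more" question concrete, here is a minimal sketch. It assumes, purely for illustration, that each resource's utilization grows linearly with traffic, which real systems often do not. The function name `traffic_headroom` and the sample numbers are hypothetical.

```python
def traffic_headroom(utilization: dict[str, float]) -> tuple[str, float]:
    """Return the most constrained resource and the traffic multiplier it allows.

    `utilization` maps a resource name to its current utilization (0.0 to 1.0).
    Assumption: utilization scales linearly with traffic.
    """
    if not utilization:
        raise ValueError("need at least one resource")
    # The resource closest to full is the one that limits the whole service.
    resource, used = max(utilization.items(), key=lambda item: item[1])
    if used <= 0:
        return resource, float("inf")
    # At 80% utilization the service can take 1 / 0.8 = 1.25x its current traffic.
    return resource, 1.0 / used


# Hypothetical snapshot of a service's current utilization.
current = {"cpu": 0.45, "memory": 0.80, "network": 0.30}
bottleneck, multiplier = traffic_headroom(current)
print(f"{bottleneck} limits the service to about {multiplier:.2f}x current traffic")
```

In this made-up snapshot, memory is the constraint, so the service has room for roughly a quarter more traffic, nowhere near double.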
It is also important to measure saturation because many systems degrade in performance before they reach 100% utilization. Because saturation metrics are typically leading indicators, measuring them lets you adjust and improve capacity before performance actually degrades. Enter the Saturation SLO.
As a side note, it is interesting that one of the other golden signals, latency, is often a leading indicator of saturation. SREs often measure response time over some small window (the SRE book suggests the 99th percentile over one minute) to get a very early signal of saturation.
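As a rough illustration of watching response time over a small window, the sketch below buckets latency samples into one-minute windows and reports a high percentile for each. The 99th percentile, the nearest-rank method and the `(timestamp, latency_ms)` sample shape are assumptions made for the example. In practice your monitoring system would usually compute this for you.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p99_per_window(samples, window_seconds: int = 60) -> dict[int, float]:
    """Group (timestamp_seconds, latency_ms) samples into fixed windows.

    Returns a mapping of window start time to the p99 latency in that window.
    """
    buckets: dict[int, list[float]] = {}
    for timestamp, latency_ms in samples:
        start = int(timestamp // window_seconds) * window_seconds
        buckets.setdefault(start, []).append(latency_ms)
    return {start: percentile(vals, 99) for start, vals in sorted(buckets.items())}


# Hypothetical samples: the second minute shows a latency spike.
samples = [(0, 10.0), (30, 20.0), (61, 500.0)]
print(p99_per_window(samples))
```

A sudden jump in a window's p99 like the one in the second minute is the kind of early signal worth checking against your saturation metrics.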
Who is the key owner of the Saturation SLO?
While we cannot deny the importance of end users (and the other three golden signals cater to them), they are not exactly our “true” customers when it comes to a saturation SLO. Why, you ask? Because customers only care about using the product and receiving a seamless service. They are less concerned with what happens behind the scenes, such as how much space remains on the disk, as long as the page loads.
You might think your internal development teams, who rely on stable infrastructure to publish their applications, are the customers, but that’s not correct either. They only want their applications to deploy as expected. Your only “true customers” are the team that gets paged when a resource becomes oversaturated. Thus, before deciding on any threshold or metric for a saturation SLO, the stakeholders you should consult are the team that owns the saturation metric (typically an infrastructure team). Only they can say whether they are comfortable raising hell when 1% of disk space remains, or 5%, or 10%. This will of course be refined as you monitor and understand the metric better, and it doesn’t have to be perfect from the beginning, but the first educated guess has to come from the owning team.
How to create a Saturation SLO
So we have defined what saturation is, why it should be tracked and who the key owner of this SLO is. The next logical question is how to actually set the saturation SLO.
Once you have an initial threshold, the easiest way to begin tracking the saturation SLO is by counting minutes of “good” utilization. At the same time, you should ensure that the SLO is realistic and that the team is capable of meeting it. For example, you can say that your “good” events (the numerator) are the minutes in which space utilization was below the threshold you set. The “valid” events (the denominator) would ideally be every minute of the day. Every minute in which utilization crosses the threshold then eats into your error budget.
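The good-minutes-over-valid-minutes idea can be sketched in a few lines. Everything specific here is an assumption for illustration: the 90% utilization threshold, the 99% SLO target, one utilization sample per minute, and the function name `saturation_slo_report`.

```python
def saturation_slo_report(samples: list[float], threshold: float, slo_target: float) -> dict:
    """Summarize a saturation SLO from one utilization sample per minute.

    samples:    utilization per minute, each between 0.0 and 1.0
    threshold:  utilization below this counts as a "good" minute
    slo_target: fraction of minutes that must be good, strictly between 0 and 1
    """
    if not samples:
        raise ValueError("need at least one sample")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    valid = len(samples)                              # denominator: every minute
    good = sum(1 for u in samples if u < threshold)   # numerator: minutes under threshold
    bad = valid - good
    allowed_bad = (1.0 - slo_target) * valid          # error budget, in minutes

    return {
        "sli": good / valid,
        "bad_minutes": bad,
        "allowed_bad_minutes": allowed_bad,
        "budget_remaining": 1.0 - bad / allowed_bad,  # goes negative once the budget is blown
    }


# Hypothetical day: 1436 minutes at 70% utilization, 4 minutes at 95%.
day = [0.70] * 1436 + [0.95] * 4
report = saturation_slo_report(day, threshold=0.90, slo_target=0.99)
print(report)
```

With a 99% target over a 1440-minute day, the budget is about 14.4 bad minutes, so the four bad minutes in this made-up day spend a bit more than a quarter of it.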
Now we are in familiar territory. Use the error budget to balance stability with velocity - operational work with shipping the latest features. The ever-constant battle of our lives - SREs assemble!!