Service Level Objectives (SLOs) serve as an objective measure of your system's performance, and when designed well, they help you direct engineering efforts effectively. Whether you work at a startup or a tech giant, there is always a natural tension between the speed of product development and the operational stability of the product.
The problem is that the more you change your existing systems, the higher the chances of issues cropping up. In today's hyper-competitive markets, organizations cannot risk a bad customer experience. Well-designed SLOs help you make data-driven decisions about future product development without compromising the existing customer experience.
SLOs act as the bridge that aligns development and operations teams around the common goal of service excellence. They help developers appreciate the impact of new work on existing systems, and they help operations teams proactively mitigate malfunctions caused by new feature development.
A well-designed SLO helps organizations with:
- Monitoring the health and improving the observability of existing services.
- Collecting data points to make informed business decisions.
- Proactively notifying concerned teams of system issues.
Types of SLOs
There are two main types of SLOs: request-based and window-based. Let's understand these in detail.
- Request-Based SLOs
Request-based SLOs form a measure of the successful completion of system requests as part of the customer journey. These depend on an SLI (Service Level Indicator), defined as the ratio of the number of favorable outcomes to the total number of requests.
For instance, a request-based SLO could be defined as "Response time (latency) is below 100 ms for at least 90% of total requests." This implies that requests with a response time under 100 ms meet the SLO. The service is compliant if the ratio of requests under 100 ms to the total number of requests is at least 0.90 (90%).
A request-based SLO gives a bird's-eye view of the percentage of satisfactory outcomes over the entire compliance period.
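As a quick sketch of the arithmetic, the SLI and compliance check described above can be expressed in a few lines. The latency values, the 100 ms threshold, and the 90% target below are illustrative, not taken from any real service:

```python
# Minimal sketch of a request-based SLO check. Thresholds and data
# are hypothetical examples matching the article's 100 ms / 90% SLO.

def request_slo_met(latencies_ms, threshold_ms=100, target_ratio=0.90):
    """Return True if enough requests finished under the latency threshold."""
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    sli = good / len(latencies_ms)  # Service Level Indicator: good / total
    return sli >= target_ratio

# 9 of these 10 requests are under 100 ms (an SLI of 0.90), so the SLO is met.
print(request_slo_met([50, 80, 95, 60, 70, 99, 40, 85, 90, 150]))  # True
```

The SLI is just a ratio, which is why a request-based SLO compresses the whole compliance period into a single percentage.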
- Window-Based SLOs
Window-based SLOs measure the efficacy of a system over a period (window), defined as the ratio of the time intervals that meet performance criteria versus the total number of time intervals considered.
For instance, an SLO can be defined as "99% of 10-minute windows should have 90% request latency below 100 ms." A window-based SLO can be understood as the performance measurement of request-based SLOs over a period.
Let's take another example, considering the performance of a system over a period of 30 days with 1-minute intervals. If your SLO is to have 99% favorable outcomes, then your service must produce 42,768 "good" results out of the 43,200 minutes in 30 days.
Every SLO is measured based on the compliance period. A compliance period can be understood as the period for which an SLO is measured. The 30 days in the above-mentioned window-based SLO example was the compliance period for that SLO.
There are two types of compliance periods:
- Calendar Period
A calendar period is the measurement of an SLO over a set time. It is measured from a starting date to an end date. It could be any time of the month or year.
The compliance period and error budget reset when a calendar period is over. With this measurement logic, we measure the performance of the service at the end of the calendar period. Compliance ratings for calendar periods are generated only once, giving all stakeholders an overall view of service performance.
- Rolling Period
A rolling-window period is a comparative measurement over the trailing window ending on the current date, typically the previous 1 to 30 days. Rolling windows help development and operations teams get a sense of the trajectory of current and new practices/developments.
With a rolling-window period, you get multiple compliance measurements; this means you get the measure of the performance of a service for each day of the last 10/15/30 days. A rolling window gives you a more recent view of the compliance of your services.
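The rolling measurement can be sketched as follows; the daily good/total counts are hypothetical inputs, and the trailing-window length is a parameter:

```python
# Sketch of rolling-period compliance: each day re-measures compliance
# over the trailing N days, so you get one measurement per day rather
# than a single end-of-period rating.
from collections import deque

def rolling_compliance(daily_good, daily_total, window_days=30):
    """Yield (day_index, compliance_ratio) for each trailing window."""
    good, total = deque(), deque()
    for day, (g, t) in enumerate(zip(daily_good, daily_total)):
        good.append(g)
        total.append(t)
        if len(good) > window_days:  # drop days that rolled out of the window
            good.popleft()
            total.popleft()
        yield day, sum(good) / sum(total)
```

Note how a bad day leaving the window makes the ratio jump back up, which is exactly the "error budgets will suddenly go up" behavior discussed later in the article.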
Error Budgets
Just as we have financial budgets, we also have error budgets. A well-designed SLO defines the degree of acceptable failure, called an error budget. It quantifies how much failure a service can absorb from upgrades or changes.
An error budget defines the number of bad individual events allowed to occur during a set period. This helps in making many important decisions. For example, if your error budget is already exhausted before the end of a compliance period, publishing a new update does not make sense.
A depleted error budget indicates the service is not yet ready for new changes to be deployed. The error budget for a compliance period can be defined as (1 - SLO goal) x (eligible events in the compliance period).
For instance, if you design an SLO set at 85% accuracy, your error budget allows for 15% failure. Say you received a total of 100 requests in the compliance period, of which only 50 satisfied your SLO requirements; then you have exceeded your error budget, which means there is more work to be done.
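The formula and the example above can be sketched in a couple of helper functions; the 85% goal, 100 requests, and 50 failures are the article's illustrative numbers:

```python
# Sketch of the error-budget formula from the text:
# budget = (1 - SLO goal) x eligible events in the compliance period.

def error_budget(slo_goal, eligible_events):
    """Number of bad events the SLO tolerates in the compliance period."""
    return (1 - slo_goal) * eligible_events

def budget_remaining(slo_goal, eligible_events, failed_events):
    """Positive while within budget; negative once the budget is exceeded."""
    return error_budget(slo_goal, eligible_events) - failed_events

# An 85% SLO over 100 requests allows about 15 failures; with 50 failed
# requests (only 50 of 100 satisfied the SLO), the budget is overspent
# by roughly 35, so the service needs more work before new releases.
```

A release decision then reduces to a sign check: ship while `budget_remaining` is positive, hold and stabilize when it goes negative.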
Decision-Making Using Error Budgets and SLO
The error budget starts the compliance period at its maximum value and drops over time, and an SLO violation is flagged when the budget falls below zero. However, before you jump to any conclusions, keep in mind a few exceptions to this pattern:
- If your request-based SLO is measured over a calendar period and the service is seeing increased traffic, your error budget can grow as the new traffic adds eligible events.
- If measuring an SLO over a rolling period, you are always at the end of your compliance period. So when a period of poor compliance rolls out of the compliance window, and the present one is compliant, your error budgets will suddenly go up.
Establishing SLOs for your software services is an ideal way to measure performance and align your team members on what to focus on.