Businesses often want to balance feature development and service reliability while prioritizing customer happiness. With service level objectives (SLOs), stakeholders can set definite target levels for the reliability of their services.
One of the core duties of an SRE team is making the performance of your services measurable; if you can measure something, you can improve it.
Service level indicators (SLIs) are metrics that indicate the level of performance your users experience with your services. They help approximate the degree of happiness and satisfaction of your users. SLIs are the vital signals an organization uses to measure how well certain aspects of its services are meeting its service level objectives.
This guide will walk you through setting practical service level objectives and service level indicators for your site reliability engineering practices.
Key Terms for Service Level Objectives and Service Level Indicators
There are some key terms you should know before getting started with service level objectives and service level indicators.
Service reliability: Reliability is the probability that a service, product, or system will adequately do what it is supposed to do for a specific period. Your service reliability measures how well your system performs under a given set of conditions over a particular time.
Site reliability engineering: Ben Treynor Sloss, credited with spearheading SRE, once said that SRE is what happens when you ask a software engineer to design an operations team.
It's a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems.
Site reliability engineering leverages scripting, automation, and other software development techniques for IT operations to improve the reliability of your software products and the infrastructure that powers them.
Site reliability engineer: Site reliability engineers sit at the intersection of traditional IT/ops and software development. An SRE team would typically be responsible for the reliability of the entire stack of your services, from the front-end client-side applications to the backend, databases, and general infrastructure.
Service level agreement: A service level agreement (SLA) defines the level of service users can expect. It also specifies the penalties that apply if the agreement is violated.
Why Does Your Organization Need SLOs and SLIs?
Collecting metrics from your applications and infrastructure can be inadequate for getting a clear picture of how users are experiencing your service. Service level indicators are quantifiable measures of an aspect of service performance from your users' perspective. SLI metrics correlate with your users' journey when they use your application.
The following are some examples:
- Latency: How long does it take for your service to respond to a request?
- Errors: What percentage of your service responses are errors?
- Traffic: How many requests are your services receiving?
- Availability: What percentage of the time are your services available?
Service level indicators form the basis of service level objectives by providing data that helps you set appropriate reliability targets.
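As a rough illustration, the four example SLIs above could be computed from a window of request records. This is a minimal sketch rather than a prescribed implementation: the `Request` record, the `compute_slis` function, the choice of the 95th percentile for latency, and treating HTTP 5xx responses as errors are all assumptions made for the example. Availability is approximated here as the share of successful requests, not as measured uptime.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record of one request as seen at the edge of the service."""
    latency_ms: float  # time taken to respond
    status: int        # HTTP status code returned to the user


def compute_slis(requests, window_seconds):
    """Compute example SLIs for one measurement window."""
    total = len(requests)
    if total == 0:
        raise ValueError("no requests in the window")

    # Latency: nearest-rank 95th percentile of response times.
    latencies = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(0.95 * total))
    latency_p95 = latencies[rank - 1]

    # Errors: assumption that any 5xx response counts as an error.
    errors = sum(1 for r in requests if r.status >= 500)

    return {
        "latency_p95_ms": latency_p95,
        "error_rate": errors / total,
        "traffic_rps": total / window_seconds,
        # Request-based approximation of availability.
        "availability": (total - errors) / total,
    }
```

Calling `compute_slis(requests, window_seconds=60)` on one minute of traffic returns a dictionary with one value per SLI, which can then be compared against whatever targets you agree on.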
Prioritizing service reliability is one way organizations ensure user happiness. Measurable and realistic targets can help your organization better manage your services.
As mentioned, businesses often want to balance feature development and service stability; integrating service level objectives into your site reliability practices is an effective way to make these decisions more data-driven.
How Can SLOs and SLIs Help Your Organization?
When appropriately implemented, SLIs and SLOs can be an asset to your organization. Some of their benefits include the following:
Measuring the performance of your system: With SLOs and SLIs, you can answer important questions about how your system is performing (for example, how fast your service returns data or how often the data is inaccurate) and set realistic targets for your system's performance.
Measuring the reliability of your system: SLIs can help you quantitatively measure the reliability of your system. You can quantify things like how much downtime your users have experienced in a given period and which parts of your system are giving them the most trouble.
Measuring customer happiness: Because you design SLOs with user happiness in mind, it is easy to measure the level of satisfaction your users get from using your service. While black box monitoring can give you an insight into how your application and infrastructure are performing, SLOs and SLIs are better suited for reflecting your users' experience.
Setting Service Level Objectives and Service Level Indicators
Getting all the stakeholders of the various services in your organization to agree on your reliability targets is the first step to setting SLOs. When setting reliability targets, it is advisable to leave room for failure and violations of your SLOs before triggering any alerts.
Stakeholders negotiate reliability compromises by operating on the universal truth that 100 percent reliability is an unrealistic target. Error budgets are one way to quantify these compromises: an error budget is the amount of unreliability your target tolerates, that is, 100 percent minus the target. Once you have your error budget, you can begin setting SLIs and SLOs in your organization.
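To make the idea of an error budget concrete, here is a small worked sketch. It assumes a time-based availability target measured over a rolling 30-day window; the function names and the window length are illustrative choices, not part of any standard.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of unreliability allowed by a target over the window.

    slo_target is a fraction, e.g. 0.999 for "99.9 percent".
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Budget left after the downtime already observed (negative if overspent)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes
```

With these assumptions, a 99.9 percent target over 30 days allows roughly 43.2 minutes of downtime (0.1 percent of 43,200 minutes). If 30 minutes have already been spent, about 13.2 minutes remain before the target is violated.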
Identify System Boundaries
When your users are experiencing a slow service or your service is returning incorrect data, they will not know whether a database is lagging or a microservice is failing. The internal workings of your service architecture are irrelevant to your users. They will usually direct most of their concerns to the usability of your service.
Correctly identifying your system boundaries can help your organization pinpoint the parts of your services your users interact with directly. Collecting metrics that reflect your users' experience makes setting reliability targets easier. A system boundary is where your users interface with your services via one or more components.
To begin implementing your SLIs, you must start thinking about how your users interact with your service. For example, a streaming service will have users who are concerned with things such as the following:
- Their video taking too long to start playing
- Their streams getting interrupted by buffering
- The accuracy of the search results when they look for a video
- The quickness with which your service returns data
When implementing SLIs, identify how your users interact with your services and collect metrics from the components that comprise the service as a group. So even though your video search service may contain components such as load balancers, databases, and various microservices, it is advisable to measure their performance from the perspective of your users.
Differentiate between Service Types
Understanding your system's capabilities and how they achieve their goals can help you set better SLIs and SLOs.
You should group your services into types. In most organizations, service types often coincide with team boundaries. The following are service type examples:
Synchronous Services: These are services for which immediate response to a query is expected. When a client sends a query to this service, all other events, services, or queries dependent on this service must wait on its response to perform their tasks. Regarding the reliability of your synchronous services, you'll want to keep the latency and availability of your services in mind.
Asynchronous Services: These are services for which an immediate response is not expected. Other services are not reliant on the response from async services and may continue processing other tasks in the meantime. For async services, latency, service degradation, and task queues are some primary concerns when setting your reliability targets.
Stateful Services: These are services that keep track of sessions or transactions made by clients or other services. With stateful services, such as databases, processing of transactions will often rely on knowledge of previous transactions. This can put some restrictions on your infrastructure, as the same server must often be used to process all related transactions. Saturation, availability, and data correctness are some primary reliability concerns for stateful services.
Stateless Services: Stateless services do not need to keep track of their client sessions, nor do they require knowledge of previous transactions to perform their tasks sufficiently. For stateless services, like web servers, your organization can set reliability targets for availability, resilience, and latency.
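The concerns listed for each service type can be captured in a simple lookup, which is handy when a service falls into more than one category (for example, a web server that is both synchronous and stateless). The sketch below only restates the pairings described above; the dictionary and function names are made up for illustration.

```python
# Reliability concerns per service type, as described in the text above.
CONCERNS_BY_SERVICE_TYPE = {
    "synchronous": ["latency", "availability"],
    "asynchronous": ["latency", "service degradation", "task queues"],
    "stateful": ["saturation", "availability", "data correctness"],
    "stateless": ["availability", "resilience", "latency"],
}


def candidate_concerns(service_types):
    """Merge the concerns for every type a service belongs to, without duplicates."""
    merged = []
    for service_type in service_types:
        for concern in CONCERNS_BY_SERVICE_TYPE[service_type]:
            if concern not in merged:
                merged.append(concern)
    return merged
```

For instance, `candidate_concerns(["synchronous", "stateless"])` yields latency, availability, and resilience as the starting points for that service's SLIs.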
Define Your Services' Scope of Availability
Next, you want to define in plain terms what it means for your service to meet user happiness. Defining your service's expected performance in plain terms can help standardize your reliability targets and get everyone on board with the organization's goals.
For example, saying that you want your service to have *low latency*, although vague, might mean something to your engineers and software developers. However, you might be excluding business and product managers with this vocabulary. But when you say that you want your service to respond to requests quickly, your reliability goals are clearer across the entire organization.
Choose the Right SLI Based on Service Type
Now that you have carefully defined what it means for your services to be available in plain English, you can start the technical implementation of SLIs. When choosing your SLIs, always prioritize your users and how they interact with your services' different aspects. For this, user journeys are instrumental in illuminating how your users use your service.
Here are some recommended SLIs based on service types that your organization should consider:
For a user-facing system, availability (Are your services responding to requests?); latency (How long is it taking for your service to send responses?); and throughput (How many requests can your service handle in a given time frame?) are valuable metrics for measuring how happy your users are with your service.
For data pipelines, metrics such as correctness (Is the correct data being collected and returned?) and latency (How long is it taking for your pipelines to complete?) are practical metrics to consider.
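For data pipelines, both recommended SLIs can be expressed as a ratio of good outcomes to total outcomes. The following is a minimal sketch under stated assumptions: the `PipelineRun` record, its field names, and the idea of judging latency against a `max_duration_s` deadline are hypothetical choices for the example, not a fixed method.

```python
from dataclasses import dataclass


@dataclass
class PipelineRun:
    """Hypothetical summary of one pipeline run."""
    records_out: int      # records the run produced
    records_correct: int  # records that passed validation
    duration_s: float     # how long the run took to complete


def pipeline_slis(runs, max_duration_s):
    """Correctness and latency SLIs across a set of runs."""
    if not runs:
        raise ValueError("no pipeline runs to measure")
    total_out = sum(r.records_out for r in runs)
    if total_out == 0:
        raise ValueError("runs produced no records")
    total_correct = sum(r.records_correct for r in runs)
    on_time = sum(1 for r in runs if r.duration_s <= max_duration_s)
    return {
        # Correctness: share of output records that are correct.
        "correctness": total_correct / total_out,
        # Latency: share of runs that finished within the deadline.
        "on_time_ratio": on_time / len(runs),
    }
```

Expressing both indicators as ratios between 0 and 1 makes them easy to compare against a percentage target later.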
Define Realistic SLOs for Each Metric Based on the SLIs Provided
Now that you have implemented metrics that best reflect your users' happiness, use the data gathered to set your service level objectives. SLOs are target values, or ranges of values, for the level of reliability that your SLIs measure.
To set realistic SLOs, you should consider the baseline of what reliability looks like for your service. For example, there's no point setting a 99% threshold if your service availability is 85%. The data from your SLIs as well as feedback from your users should inform the baselines of your SLOs.
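One simple way to sanity-check a proposed target against your baseline is sketched below. It assumes the baseline is the mean of recent availability measurements and treats a target as realistic only if it does not exceed that baseline; both are simplifying assumptions for illustration, and `assess_target` is a hypothetical helper.

```python
from statistics import mean


def assess_target(target, observed):
    """Compare a proposed SLO target with the observed baseline.

    target and observed values are fractions, e.g. 0.99 for 99 percent.
    """
    if not observed:
        raise ValueError("need at least one observed measurement")
    baseline = mean(observed)
    return {
        "baseline": baseline,
        "gap": target - baseline,          # positive means the target is above the baseline
        "realistic": target <= baseline,   # simplistic rule chosen for this sketch
    }
```

With observed availability around 85 percent, a 99 percent target is flagged as unrealistic, matching the example above, while an 80 percent target is not. In practice you would also weigh user feedback, which a check like this cannot capture.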
Once you have successfully put these targets in place, it becomes easier for you to gauge how satisfied your users are with your services, thus balancing innovation and service stability.
Iterate the Process to Fine-Tune SLOs over Time
Bear in mind that your organization keeps evolving and your users' needs change over time, so your reliability targets should be dynamic too.
Lastly, feedback is vital when fine-tuning your SLOs over time. Always consult stakeholders, users, and historical data to inform your SLOs.
Setting adequate SLIs and SLOs can improve your services and systems. When deciding on your reliability targets, it is essential to pay attention to your user journeys and system boundaries. You should always set SLIs based on how users interact with your system, and don't be shy about setting different targets for different aspects of your system.
Begin your process by defining an error budget to make room for failure, thus avoiding unrealistic expectations for your service reliability. Remember that both your organization and your users' needs will evolve. Revisiting your SLOs and fine-tuning them over time is a best practice.
Last9 is a reliability platform that helps you manage your products' reliability at any scale. It helps DevOps engineers and SRE teams set SLOs faster by automatically measuring baselines and providing SLI and SLO suggestions. You can also catalog your services, map the relationships between them, and get change intelligence.
Want to know more about Last9 and our products? Check out last9.io; we're building reliability tools to make running systems at scale fun and embarrassingly easy.