Nov 8th, ‘22 / 7 min read

SLOs, SLIs, and SLAs: Understanding Key Service Metrics

A guide to setting practical Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for your Site Reliability Engineering practices.

Businesses often want to balance feature development and service reliability while prioritizing customer happiness. With service level objectives (SLOs), stakeholders can set definite target levels for the reliability of their services.

One of the core duties of an SRE team is making the performance of your services measurable; if you can measure something, you can improve it.

Service level indicators (SLIs) are metrics that indicate the level of performance your users experience with your services. They approximate your users' happiness and satisfaction, and they are the signals an organization uses to measure how well specific aspects of a service are meeting its objectives.

This guide will walk you through setting practical service level objectives and service level indicators for your site reliability engineering practices.

Key Terms for Service Level Objectives and Service Level Indicators

You should know some key terms before getting started with service-level objectives and indicators.

Service reliability: Reliability is the probability that a service, product, or system will adequately do what it is supposed to do for a specific period. Your service reliability measures how well your system performs under given conditions over time.

Site reliability engineering: Ben Treynor Sloss, credited with spearheading SRE, once said that SRE happens when you ask a software engineer to design an operations team.

It's a set of principles and practices that incorporates aspects of software engineering and applies them to infrastructure and operations problems. The main goals are to create scalable and highly reliable software systems.

Site reliability engineering leverages scripting, automation, and other software development techniques for IT operations to improve the reliability of your software products and the infrastructure that powers them.

Site reliability engineer: Site reliability engineers sit at the intersection of traditional IT/ops and software development. An SRE team would typically be responsible for the reliability of the entire stack of your services, from the front-end client-side applications to the backend, databases, and general infrastructure.

Service level agreement: A service level agreement (SLA) defines the level of service users can expect. It also includes penalties that apply if the agreement is violated.

Why Does Your Organization Need SLOs and SLIs?

Collecting metrics from your applications and infrastructure alone can be inadequate for getting a clear picture of how users are experiencing your service. Service level indicators are quantifiable measures of service performance from your users' perspective. SLI metrics correlate with your users' journey through your application.

The following are some examples:

- Latency: How long does it take for your service to respond to a request?
- Errors: What percentage of your service responses are errors?
- Traffic: How many requests are your services receiving?
- Availability: What percentage of the time are your services available?
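As a rough illustration, the four example SLIs above can be computed from a list of request records. This is a minimal sketch, not a prescribed implementation: the `Request` record, the function names, and the sample data are hypothetical, server errors are assumed to be status codes of 500 and above, and availability is approximated per request rather than per unit of time.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # how long the service took to respond
    status: int        # HTTP-style status code (assumption)


def error_rate(requests):
    """Percentage of responses that are server errors (status >= 500)."""
    errors = sum(1 for r in requests if r.status >= 500)
    return 100.0 * errors / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of response latency, in milliseconds."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(math.ceil(pct / 100 * len(ordered)), 1)
    return ordered[rank - 1]


def availability(requests):
    """Request-based availability: percentage of requests served without a server error."""
    return 100.0 - error_rate(requests)


def traffic(requests, window_seconds):
    """Average requests per second over the observation window."""
    return len(requests) / window_seconds


# Hypothetical sample: five requests observed over a 10-second window.
sample = [
    Request(120, 200),
    Request(95, 200),
    Request(480, 200),
    Request(30, 500),
    Request(210, 200),
]

print(latency_percentile(sample, 95))  # 480
print(error_rate(sample))              # 20.0
print(traffic(sample, 10))             # 0.5
print(availability(sample))            # 80.0
```

Percentiles are used for latency here instead of an average because a few very slow responses can hide behind a healthy-looking mean.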

Service level indicators form the basis of service level objectives by providing data that helps you set appropriate reliability targets.

Prioritizing service reliability is one way organizations ensure user happiness. Measurable and realistic targets can help your organization better manage your services.

As mentioned, businesses often want to balance feature development and service stability; integrating service level objectives into your site reliability practices makes these decisions more data-driven.

Head over to SLI vs SLO to better understand the differences between SLIs and SLOs.

How Can SLOs and SLIs Help Your Organization?

When appropriately implemented, SLIs and SLOs can be an asset to your organization. Some of their benefits include the following:

Measuring the performance of your system: With SLOs and SLIs, you can answer important questions about how your system is performing—for example, how fast your service returns data, how often the data is inaccurate—and set realistic targets for your system's performance.

Measuring the reliability of your system: SLIs can help you quantitatively measure the reliability of your system. You can easily quantify things like how much downtime your users have experienced in a given period and which parts of your system give your users the most trouble.

Measuring customer happiness: Because you design SLOs with user happiness in mind, it is easy to measure your users' satisfaction from using your service. While black box monitoring can give you an insight into how your application and infrastructure are performing, SLOs and SLIs are better suited for reflecting your users' experience.

Setting Service Level Objectives and Service Level Indicators

Getting all the stakeholders of the various services in your organization to agree on your reliability targets is the first step to setting SLOs. When setting reliability targets, it is advisable to leave room for failure and violations of your SLOs before triggering any alerts.

Stakeholders negotiate reliability compromises on the understanding that 100 percent reliability is unrealistic. Error budgets are one way to quantify these compromises: an error budget is the amount of unreliability your target permits, that is, 100 percent minus the target. Once you have your error budget, you can set SLIs and SLOs in your organization.
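To make the idea of an error budget concrete, here is a minimal sketch that turns an availability target into minutes of permitted downtime. The function names, the 30-day window, and the downtime figure are assumptions chosen for illustration.

```python
def error_budget_minutes(slo_percent, window_days):
    """Minutes of unavailability the target permits within the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def budget_remaining(slo_percent, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative once the budget is exhausted)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget


# A 99.9% availability target over a 30-day window (illustrative numbers).
print(round(error_budget_minutes(99.9, 30), 1))    # 43.2
print(round(budget_remaining(99.9, 30, 10.8), 2))  # 0.75
```

Under these assumptions, a 99.9 percent target leaves roughly 43 minutes of downtime per 30 days, which gives stakeholders a concrete number to negotiate around.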

Identify System Boundaries

When your users are experiencing a slow service or your service is returning incorrect data, they will not know whether a database is lagging or a microservice is failing. The internal workings of your service architecture are irrelevant to your users. They will usually direct most of their concerns to the usability of your service.

Identifying your system boundaries can help your organization pinpoint the parts of your services your users interact with directly. Collecting metrics that reflect your users' experience makes setting reliability targets easier. A system boundary is where users interface with your services via one or more components.

To begin implementing your SLIs, you must consider how your users interact with your service. For example, a streaming service will have users who are concerned with things such as the following:

- Their video taking too long to start playing
- Their streams getting interrupted by buffering
- The accuracy of the search results when they look for a video
- The quickness with which your service returns data

When implementing SLIs, identify how your users interact with your services and collect metrics from the components that comprise the service as a group. So even though your video search service may contain components such as load balancers, databases, and various microservices, it is advisable to measure their performance from the perspective of your users.
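The following sketch shows one way to measure at the system boundary instead of per component. It assumes, purely for illustration, that each search request passes through its components one after another, so the user-visible latency is the sum of the component latencies; the `ComponentCall` record, the component names, and the 300 ms threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ComponentCall:
    name: str
    latency_ms: float
    ok: bool


def user_view(calls, threshold_ms):
    """Collapse one request's component calls into what the user experienced."""
    total_latency = sum(c.latency_ms for c in calls)  # assumes sequential calls
    succeeded = all(c.ok for c in calls)              # any failed component fails the request
    return succeeded and total_latency <= threshold_ms


def boundary_sli(requests, threshold_ms):
    """Percentage of requests that were both successful and fast enough for the user."""
    good = sum(1 for calls in requests if user_view(calls, threshold_ms))
    return 100.0 * good / len(requests)


# Four hypothetical search requests, each traversing three components.
requests = [
    [ComponentCall("load_balancer", 5, True), ComponentCall("search", 80, True), ComponentCall("database", 40, True)],
    [ComponentCall("load_balancer", 5, True), ComponentCall("search", 90, True), ComponentCall("database", 400, True)],
    [ComponentCall("load_balancer", 5, True), ComponentCall("search", 70, True), ComponentCall("database", 30, False)],
    [ComponentCall("load_balancer", 4, True), ComponentCall("search", 60, True), ComponentCall("database", 36, True)],
]

print(boundary_sli(requests, threshold_ms=300))  # 50.0
```

In this sample, the load balancer and the search component look healthy on their own, yet half of the requests were slow or failed from the user's point of view, which is the number the SLI should report.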

Differentiate between Service Types

Understanding your system's capabilities and how they achieve their goals can help you set better SLIs and SLOs.

You should group your services into types. In most organizations, service types often coincide with team boundaries. The following are service-type examples:

Synchronous Services: These are services for which an immediate response to a query is expected. When a client sends a query to this service, all other events, services, or queries dependent on it must wait for its response to perform their tasks. Regarding the reliability of your synchronous services, you'll want to keep the latency and availability of your services in mind.

Asynchronous Services: These are services for which an immediate response is not expected. Other services do not rely on the response from async services and may continue processing other tasks. For async services, latency, service degradation, and task queues are some primary concerns when setting your reliability targets.

Stateful Services: These services track sessions, client transactions, or other services. With stateful services, such as databases, the processing of transactions will often rely on knowledge of previous transactions. This can restrict your infrastructure, as the same server must be used to process all related transactions. Saturation, availability, and data correctness are primary reliability concerns for stateful services.

Stateless Services: Stateless services do not need to keep track of their client sessions or require knowledge of previous transactions to perform their tasks sufficiently. For stateless services, like web servers, your organization can set reliability targets for availability, resilience, and latency.
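The reliability concerns listed for each service type can be kept as a small lookup table. The sketch below is only one possible way to record them; the `SLI_CANDIDATES` name and the `candidate_slis` helper are hypothetical, and the entries simply restate the concerns named above.

```python
# Primary reliability concerns per service type, as described above.
SLI_CANDIDATES = {
    "synchronous": ["latency", "availability"],
    "asynchronous": ["latency", "service degradation", "task queues"],
    "stateful": ["saturation", "availability", "data correctness"],
    "stateless": ["availability", "resilience", "latency"],
}


def candidate_slis(*service_types):
    """Merge the concerns for every type a service belongs to, without duplicates."""
    merged = []
    for service_type in service_types:
        for sli in SLI_CANDIDATES[service_type]:
            if sli not in merged:
                merged.append(sli)
    return merged


# A web server is typically both synchronous and stateless.
print(candidate_slis("synchronous", "stateless"))
# ['latency', 'availability', 'resilience']
```

A service often belongs to more than one type, which is why the helper accepts several types and merges their concerns.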

Define Your Services' Scope of Availability

Next, you want to define in plain terms what it means for your service to keep your users happy. Defining your service's expected performance in plain terms can help standardize your reliability targets and get everyone on board with the organization's goals.

For example, saying that you want your service to have low latency might mean something to your engineers and software developers, but this vocabulary can exclude business and product managers. When you say instead that you want your service to respond to requests quickly, your reliability goals are clearer across the entire organization.

Choose the Right SLI Based on the Service Type

Now that you have carefully defined what it means for your services to be available in plain English, you can start the technical implementation of SLIs. When choosing your SLIs, always prioritize your users and how they interact with the different aspects of your services. User journeys are useful here because they show how your users actually use your service.

Here are some recommended SLIs based on service types that your organization should consider:

For a user-facing system, availability (Are your services responding to requests?); latency (How long is it taking for your service to send responses?); and throughput (How many requests can your service handle in a given time frame?) are valuable metrics for measuring how happy your users are with your service.

For data pipelines, metrics such as correctness (Is the correct data collected and returned?) and latency (How long is it taking for your pipelines to complete?) are practical metrics to consider.
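For a data pipeline, these two SLIs could be sketched as follows. The approach assumes you have a reference set of known-correct records to compare against and a deadline for each run; both, along with the function names and sample values, are assumptions made for illustration.

```python
def correctness_sli(produced, expected):
    """Percentage of expected records that the pipeline produced with the right value."""
    correct = sum(1 for key, value in expected.items() if produced.get(key) == value)
    return 100.0 * correct / len(expected)


def on_time_sli(run_durations_s, deadline_s):
    """Percentage of pipeline runs that completed within the deadline."""
    on_time = sum(1 for duration in run_durations_s if duration <= deadline_s)
    return 100.0 * on_time / len(run_durations_s)


# Hypothetical reference data versus what the pipeline actually wrote.
expected = {"a": 1, "b": 2, "c": 3, "d": 4}
produced = {"a": 1, "b": 2, "c": 30}  # "c" is wrong and "d" is missing

print(correctness_sli(produced, expected))                  # 50.0
print(on_time_sli([540, 590, 580, 900], deadline_s=600))    # 75.0
```

Missing records count against correctness here in the same way as wrong values, since the user receives incorrect results in both cases.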

Define Realistic SLOs for Each Metric Based on the SLIs Provided

Now that you have implemented metrics that best reflect your users' happiness, use the data gathered to set your service level objectives. SLOs are target values, or ranges of values, for the level of service that your SLIs measure.

To set realistic SLOs, you should consider the reliability baseline for your service. For example, there's no point in setting a 99% target if your service's current availability is 85%. The data from your SLIs and your users' feedback should inform the baselines of your SLOs.
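A simple sanity check along these lines is sketched below. It treats the mean of recent availability readings as the baseline and accepts a target only if it sits within a small margin above that baseline; the function names, the one-point `max_gap` margin, and the sample history are assumptions, and your organization may prefer a different rule.

```python
from statistics import mean


def baseline(daily_availability):
    """Baseline reliability: the mean of recent availability readings, in percent."""
    return mean(daily_availability)


def is_realistic(target, daily_availability, max_gap=1.0):
    """Treat a target as realistic if it is at most max_gap points above the baseline."""
    return target <= baseline(daily_availability) + max_gap


# Hypothetical daily availability readings, in percent.
history = [84.0, 86.0, 85.5, 84.5]

print(baseline(history))            # 85.0
print(is_realistic(99.0, history))  # False
print(is_realistic(85.5, history))  # True
```

Starting from a target close to the measured baseline lets you raise it step by step as the service improves.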

Once you have successfully put these targets in place, it becomes easier to gauge how satisfied your users are with your services, thus balancing innovation and service stability.

Iterate the Process to Fine-Tune SLOs over Time

Remember that your organization keeps evolving and your users' needs change over time, so your reliability targets should be dynamic.

Lastly, feedback is vital when fine-tuning your SLOs over time. Always consult stakeholders, users, and historical data to inform your SLOs.

Conclusion

Setting adequate SLIs and SLOs can improve your services and systems. When deciding on your reliability targets, paying attention to your user journeys and system boundaries is essential. You should always set SLIs based on how users interact with your system, and don't be shy about setting different targets for different aspects of your system.

Begin your process by defining an error budget to make room for failure, thus avoiding unrealistic expectations for your service reliability. Remember that your organization will evolve, as will your users' needs. Revisiting your SLOs and fine-tuning them over time is a best practice.

To help manage your products' reliability at any scale, Last9 is a reliability platform that helps DevOps engineers and SRE teams set SLOs faster by automatically measuring baselines and providing SLI and SLO suggestions. You can also catalog your services, map their relationships, and get change intelligence.


Authors
Last9
