Last9

Dec 20th, ‘22 / 8 min read

The difference between DevOps, SRE, and Platform Engineering

In reliability engineering, three concepts come up again and again: DevOps, SRE, and Platform Engineering. How do they differ?

Over the past decade, software companies have increasingly turned to best practices like DevOps, CI/CD, containerization, and cloud-native systems as they switch from large teams to smaller autonomous units. Some in the industry use DevOps, SRE, and platform engineering interchangeably as part of their practices. But are these concepts the same?

  • The DevOps methodology more closely integrates the software development and operations teams to improve collaboration and shorten the time to market.
  • Site reliability engineering (SRE) practices apply software engineering principles to IT operations.
  • Platform engineering practices apply software engineering principles to infrastructure development.

How do these concepts differ, and what role do they play in the software-delivery pipeline? This article will examine each in more detail so that you can determine which is best for managing your deployments.

What Is DevOps?

DevOps is a combination of two areas, development and operations.

Previously, the software development lifecycle (SDLC) was rigid. Developers would create software and hand it over to the operations team to ensure service reliability. The separate teams were frequently out of sync. Developers wanted to release new features while the operations team demanded slower releases to ensure service stability.

The DevOps culture aims to solve disagreements between siloed teams. It enables a fast flow of development and operation processes, promoting continuous deployment and creating a cycle of close communication and high-level automation. The team that develops the code should be responsible for running and maintaining it.

DevOps mainly focuses on the following aspects:

  • Accelerating product delivery
  • Shortening the SDLC
  • Increasing the product’s adaptability to market needs

DevOps encourages the development and operations teams to work together to benefit both groups. It emphasizes continuous integration (CI) and continuous delivery (CD).

Why Use DevOps?

By more closely integrating the separate teams and increasing collaboration, DevOps improves software-delivery speed. Following are some reasons to adopt the DevOps culture:

  • Changes are incremental, and releases are frequent.
  • It’s easier to respond to feedback from the market.
  • Bug fixes and updates are delivered faster.
  • In case of issues, the changes are easy to roll back.
  • Teams can work to their fullest potential.

Using DevOps tools and practices, the operations team knows what the development team plans for future releases while developers understand production issues. It enables teams to understand customer needs better, build highly reliable applications, and achieve business objectives faster.

What Is SRE?

Previously, system administrators handled most systems manually. Scaling was only possible vertically, and it took months to scale.

Horizontal scaling changed the way software is operated. With the advent of the cloud, organizations can add capacity almost without limit and spin up servers with a single click. That means they need standard operating procedures (SOPs) to manage highly scalable infrastructure and keep it reliable.

The Google engineering team introduced the concept of site reliability engineering (SRE) in the early 2000s. SRE looks at services from the consumer’s perspective. It helps build a robust and dependable system by focusing on service reliability and performance, so site reliability engineers tend to have a strong grasp of how consumers experience the service.

SRE helps strike the right balance between releasing new features and keeping components reliable for consumers. It addresses the service-reliability problem by defining acceptable levels of availability and a plan of action in case of failure.

Since SRE is focused on creating reliable systems, teams need a way to measure the performance and reliability of service. SRE uses the following metrics:

Service-Level Indicators
A service-level indicator (SLI) is a quantitative measure of one aspect of service health. Common SLIs include:

  • Request latency
  • Batch throughput
  • Failures per request

An SRE team sets thresholds for these indicators to reason about a service’s availability, latency, performance, and efficiency, and to inform change management, monitoring, emergency response, and capacity planning.
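
To make the idea concrete, here is a minimal sketch of computing two of the SLIs listed above from raw request records. The `Request` type, its fields, and the choice of the nearest-rank percentile method are illustrative assumptions, not the API of any particular monitoring tool.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration)."""
    latency_ms: float
    ok: bool


def failure_ratio(requests: list[Request]) -> float:
    """Failures per request: failed requests divided by total requests."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)


def latency_percentile(requests: list[Request], pct: float) -> float:
    """Request latency at the given percentile (nearest-rank method)."""
    if not requests:
        raise ValueError("no requests to measure")
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sample data.
sample = [
    Request(latency_ms=120, ok=True),
    Request(latency_ms=80, ok=True),
    Request(latency_ms=950, ok=False),
    Request(latency_ms=200, ok=True),
]
print(failure_ratio(sample))           # 0.25
print(latency_percentile(sample, 50))  # 120
```

In practice these numbers would usually come from a metrics system rather than an in-memory list, but the definitions stay the same.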

Service-Level Objective
A service-level objective (SLO) is a reliability target for a service, expressed as a target value or range for an SLI and agreed on by stakeholders. SLIs are analyzed over a long period (usually quarterly, half-yearly, or yearly) to determine whether teams met the SLOs. For example, the following questions can help you assess performance in terms of SLOs:

  • How is the quality of service delivered to the customers?
  • How much time do customers lose during server downtime?
  • How many manual errors occurred?
  • How many escalations were requested by customers?
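
One way to reason about the downtime question is to turn an availability SLO into a downtime allowance, often called an error budget. The sketch below works through that arithmetic. The 99.9% target and the 30-day window are example values, and the function names are hypothetical.

```python
def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Downtime the SLO tolerates over the window, in minutes."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo_target)


def slo_met(downtime_minutes: float, slo_target: float, window_days: int) -> bool:
    """True if the measured downtime stayed within the SLO's allowance."""
    return downtime_minutes <= allowed_downtime_minutes(slo_target, window_days)


# A 99.9% availability target over 30 days (example values).
budget = allowed_downtime_minutes(slo_target=0.999, window_days=30)
print(round(budget, 1))        # 43.2
print(slo_met(30, 0.999, 30))  # True
print(slo_met(60, 0.999, 30))  # False
```

A 99.9% target over 30 days leaves roughly 43 minutes of downtime, so a single hour-long outage would miss the objective for that window.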

Service-Level Agreement

A service-level agreement (SLA) is an agreement between service providers and customers based on SLOs about what type of service the provider will offer and what will happen if that service is poor.

To sum up, SLIs drive SLOs, which inform SLAs. An SRE team usually monitors the performance of SLIs and SLOs.

What Is Platform Engineering?

As organizations migrate their environments to the cloud, they need experts who can bridge the gap between developers and cloud infrastructure to leverage cloud platforms.

Platform engineering is the process of creating a self-service deployment platform that developers can use to deliver new releases quickly and reliably. It enables developers to become self-reliant in dealing with software-delivery setup despite a potentially minimal knowledge of infrastructure deployment.

Platform engineers are responsible for ensuring that new releases reach production quickly. They design and maintain platforms that make the software-delivery pipeline as efficient, stable, and consistent as possible. To rapidly push code out, they rely on automation and Infrastructure-as-code (IaC) tools and practices.

Essentially, platform engineers are development enablers. They help developers by making production deployment reliable and secure.
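
As a rough illustration of the self-service idea, the sketch below expands a short, developer-facing request into a full deployment description by filling in defaults owned by the platform team. Every name here (`ServiceRequest`, `PLATFORM_DEFAULTS`, `render_deployment`, and the individual settings) is hypothetical and does not correspond to a real platform's API.

```python
from dataclasses import dataclass, field

# Defaults owned by the platform team (illustrative values only).
PLATFORM_DEFAULTS = {
    "replicas": 2,
    "cpu": "250m",
    "memory": "256Mi",
    "health_check_path": "/healthz",
}


@dataclass
class ServiceRequest:
    """The only details a developer has to supply."""
    name: str
    image: str
    overrides: dict = field(default_factory=dict)


def render_deployment(request: ServiceRequest) -> dict:
    """Merge the developer's request with the platform defaults."""
    unknown = set(request.overrides) - set(PLATFORM_DEFAULTS)
    if unknown:
        # Reject settings the platform does not expose to developers.
        raise ValueError(f"unsupported settings: {sorted(unknown)}")
    settings = {**PLATFORM_DEFAULTS, **request.overrides}
    return {"name": request.name, "image": request.image, **settings}


deployment = render_deployment(
    ServiceRequest(
        name="checkout",
        image="registry.example.com/checkout:1.4.2",
        overrides={"replicas": 4},
    )
)
print(deployment["replicas"])  # 4
print(deployment["memory"])    # 256Mi
```

The developer states what to run, and the platform decides how it runs. That split is what lets developers deploy without deep infrastructure knowledge.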

Platform engineering is a relatively new industry concept. It resembles SRE but concentrates on one specific problem: the software-delivery pipeline. The role of a platform engineer is expected to become pivotal for that pipeline as the industry evolves.

How SRE Implements DevOps Principles
Traditionally, development, testing, and operations teams were siloed, each specializing in very different skill sets. DevOps practices bring these teams together.

The CALMS acronym summarizes DevOps principles:

  • Culture
  • Automation
  • Lean
  • Measurement
  • Sharing

In the DevOps world, you improve processes by automating them, measuring the results obtained, and sharing that knowledge within the team. SRE is a practical implementation of DevOps principles. The following are examples of SRE responsibilities:

  • An SRE team uses software engineering to improve operations. It works in collaboration with the development team. A shared model of working is one of the foundational principles of SRE.
  • Improvements need to be made incrementally. That way, it is easier to identify and roll back any reliability issues in the service.
  • As with DevOps, it’s essential to measure change metrics often to meet SLOs.
  • An SRE team can adapt the deployment system to fit the codebase or project. Its goal is to reduce repetitive, mundane work because any time spent on those tasks is not spent on project work.
  • Working on production services comes with the risk of service outages. A root cause analysis (RCA) of an outage won’t seek blame, which reduces the time wasted on finger-pointing.
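
The points above about incremental change and measurement can be sketched as a simple release gate: send a small share of traffic to the new version, measure, and roll back as soon as the error rate exceeds what the SLO tolerates. The step sizes, the threshold, and the `measure_error_rate` callback are assumptions made for illustration.

```python
from typing import Callable

ROLLOUT_STEPS = [5, 25, 50, 100]  # percent of traffic (illustrative)


def progressive_rollout(
    measure_error_rate: Callable[[int], float],
    max_error_rate: float,
) -> tuple[str, int]:
    """Advance through the rollout steps; roll back on the first breach.

    Returns the outcome and the last traffic percentage that was attempted.
    """
    for percent in ROLLOUT_STEPS:
        observed = measure_error_rate(percent)
        if observed > max_error_rate:
            return ("rolled_back", percent)
    return ("released", ROLLOUT_STEPS[-1])


# Stand-in measurements: this hypothetical release degrades at 50% traffic.
fake_measurements = {5: 0.001, 25: 0.002, 50: 0.02, 100: 0.02}

print(progressive_rollout(fake_measurements.__getitem__, max_error_rate=0.01))
# ('rolled_back', 50)
print(progressive_rollout(lambda percent: 0.001, max_error_rate=0.01))
# ('released', 100)
```

Because each step is small, a breach points at a narrow set of changes and the rollback affects only part of the traffic.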

DevOps is a broad philosophy that encourages breaking down barriers between teams. SRE is narrower in scope: it adds reliability standards and automation for running highly scalable distributed systems. Essentially, one philosophy builds on the other to streamline operations, eliminate organizational silos, and deliver high-quality software faster.

SRE vs. Platform Engineering

SRE and platform engineering share the goal of reducing manual effort, but they differ on the priorities and tools involved.

Platform engineering combines an organization’s workflow, automation, and APIs into one platform for a smooth software-delivery experience. In contrast, SRE focuses on ensuring that customer service is always reliable.

The following breakdown shows where SRE and platform engineering (PE) differ:

Philosophy

SRE applies software engineering principles to improve service reliability.
PE applies software engineering principles to improve the software delivery pipeline.

Focus area

In SRE, the focus is on maintaining service performance with maximum availability. In PE, it’s about enabling quick code deployment using automation.

Measurement metrics

SREs use SLIs, SLOs, and SLAs to measure the changes made in the systems.
PE takes care of software delivery in a CI/CD pipeline to production.

Work area

In SRE culture, failure isn’t treated as something to blame on individuals. RCAs are carried out so that the lessons learned during troubleshooting improve the system's reliability.
In PE, workflows, APIs, and internal toolchains are brought onto a single internal platform to reduce the overhead involved in software development.

Collaborators

SRE involves working with developers and the operations team for maximum reliability of the software-delivery pipeline. PE works for developers to reduce the time for code to be pushed from source to production.

SRE and platform engineering can be seen as an evolution of DevOps, focusing on specific needs in the software-delivery pipeline.

Security: SRE and Platform Engineering

A lot of abstraction is introduced in the software-delivery methodology, and many businesses in the engineering ecosystem lack objective measurements of system security. Developers should build security controls and monitoring into all application layers from the beginning.

Google’s core SRE principle is constant practice. In the SRE ecosystem, you should be testing the reliability of your system by running various experiments (a.k.a. chaos engineering) from the security point of view.

According to Catchpoint’s SRE report, SREs spend most of their time troubleshooting security incidents. An overwhelming number of data-security incidents result from critical changes made under pressure to meet a deadline.

It’s imperative to conduct thorough RCAs and not just rely on workarounds and quick fixes. SREs should be part of security discussions and system-design meetings because they play an essential role in system design.

A platform engineering team needs to implement security controls for networking policies so that potential failures don’t interrupt business operations. One way to ensure this is to follow these SRE-friendly security principles from the Open Web Application Security Project:

  • Minimize the available attack surface
  • Establish secure defaults
  • Make use of the principle of least privilege
  • Keep security simple
  • Fix issues permanently and don’t rely on workarounds
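
Two of these principles, secure defaults and least privilege, can be illustrated with a deny-by-default permission check. The roles, the action names, and the `is_allowed` function are hypothetical and only meant to show the shape of the idea.

```python
# Each role is granted only the actions it needs (least privilege).
ROLE_PERMISSIONS: dict[str, frozenset[str]] = {
    "developer": frozenset({"deploy:staging", "logs:read"}),
    "sre": frozenset({"deploy:staging", "deploy:production", "logs:read"}),
}


def is_allowed(role: str, action: str) -> bool:
    """Deny unless the role explicitly lists the action (secure default)."""
    return action in ROLE_PERMISSIONS.get(role, frozenset())


print(is_allowed("developer", "deploy:staging"))     # True
print(is_allowed("developer", "deploy:production"))  # False
print(is_allowed("intern", "logs:read"))             # False (unknown role)
```

An unknown role or an unlisted action falls through to a denial, so forgetting to configure something fails closed rather than open.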

Security is one of the critical factors that will make your product successful. It’s vital to create a security team with the knowledge and tools to address these concerns.

When Do You Need DevOps, SRE, or Platform Engineering?

DevOps is an approach to software development that aims to introduce efficiency, collaboration, and innovation by breaking down the barriers between software developers and IT operations. The idea behind DevOps is that if you can enable developers to work closely with IT operations teams, you can increase the software delivery rate. That will allow them to achieve more goals faster, which will help your business grow.

Growing your business means growing your team, which will lead to an increased workload for your current software development process. You and your team need to formalize how knowledge is shared and standardized for service reliability. This is where SRE comes into play. SREs are responsible for making sure everything runs smoothly in production environments. They have access to all necessary tools and resources from development and operations teams to keep the service highly available.

As your development pipeline becomes more sophisticated, it also becomes more complex. Many processes and automation are running in the background, and platform engineering packages all of them into a single internal platform.

Platform engineering enables development teams to push code more smoothly and quickly. Your organization can scale up without your team getting lost in the process or losing track of how things were set up before. Team members can focus on building great products instead of worrying about how those products are built and shipped.

Most important for your organization, though, is understanding DevOps principles and how you can use them to support your software development goals. How you implement those principles is key.

Conclusion

Staying up-to-date with new and changing concepts in the software industry is an essential part of the job, but it can get frustrating for developers and team leads. DevOps, SRE, and platform engineering can all help you ensure faster code delivery, and they’re all based on the idea of working closely with the development team and breaking barriers between groups. However, each concept can address these goals in different ways.

In situations where downtimes aren’t critical, establishing a DevOps team is enough. If your systems are based on microservices and a cloud setup, a platform engineering team is essential. SRE, meanwhile, is crucial for critical systems.

Once you understand your needs, you can decide which of these concepts suits your organization and determine how to structure your team for the best results. On that note, if you're building a reliability charter for your org, or are constantly battling scaling challenges with your Prometheus + Grafana setup, give Last9 a try.


Want to know more about Last9 products? Check out how Last9 Levitate fares against all the popular time series databases: Prometheus, Influx, and M3DB.

Authors
Prathamesh Sonpatki

Prathamesh works as an evangelist at Last9, runs SRE stories - where SRE and DevOps folks share their stories, and maintains o11y.wiki - a glossary of all terms related to observability.