What does a nuclear accident in 1979 teach us about Site Reliability Engineering (SRE)?
Buckle up. I have a story.
Reliability engineering is measured in 9s because chasing 100% is an exercise in futility.
99
99.9
99.99
99.999
Hence, the name Last9 😉
The job of a Site Reliability Engineer is to improve these 9s. The more 9s one has, the more ‘reliable’ a system is. 99%, for example, is ‘two 9s’. Two nines over a year means up to 3.65 days of downtime. Imagine not being able to use WhatsApp for 3.65 days.
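The arithmetic behind that 3.65-day figure is worth seeing once. Here is a minimal sketch, assuming a 365-day year and treating availability as a simple percentage of total time; the function name `allowed_downtime_days` is made up for illustration and is not from any library.

```python
def allowed_downtime_days(availability_percent: float, period_days: float = 365.0) -> float:
    """Days of downtime permitted over `period_days` at the given availability."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    # The downtime budget is whatever fraction of the period is NOT covered.
    return period_days * (1 - availability_percent / 100)


# Print the budget for each of the 9s listed above, in days and in minutes.
for nines in (99, 99.9, 99.99, 99.999):
    days = allowed_downtime_days(nines)
    minutes = days * 24 * 60
    print(f"{nines}% -> {days:.4f} days ({minutes:.1f} minutes) per year")
```

Each extra 9 cuts the budget by a factor of ten: roughly 3.65 days at two 9s, about 8.8 hours at three, under an hour at four, and a little over five minutes at five.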
So, if you operate in a world of acceptable failures, you have to chalk out what is ‘allowed’ in an organization’s context. This is where the concept of Service Level Agreements (SLA) comes in. SLAs give us a framework of 9s that are assured commitments. With these commitments, an organization can enforce accountability across teams, people, external vendors, and so on. It's a basic premise under which software accountability is chalked out.
Despite these agreements, there’s a certain philosophy under which SREs operate:
Normal Accidents.
The ‘Normal Accident’ theory comes from Charles Perrow's book, Normal Accidents: Living with High-Risk Technologies. He gave us a sociological perspective on what it takes to deal with complex systems.
"Normal Accident Theory holds that, no matter what organizations do, accidents are inevitable in complex, tightly-coupled systems"
Read that carefully again. Some words stand out.
"...no matter what..."
"...accidents are inevitable"
"tightly-coupled systems"
What this tells us: There is no 100. Strive and fight for your 9s.
You see, all decision-making between humans and machines is interwoven with innumerable complexities. External and internal factors determine decisions, and orgs are complex organisms.
Mr. Perrow came up with a framework to understand ‘Normal Accidents’ after the Three Mile Island accident, a 1979 nuclear accident in the US that is considered the worst accident in US commercial nuclear power plant history.
The cleanup took 14 years and cost about $1 billion.
‘Normal Accidents’ are unpreventable because they can’t be anticipated, and the addition of oversight and (potentially complex) safety measures merely adds new failure modes. (Keeping it simple is underrated.😜)
A ‘Normal Accident’ can be characterized in 4 parts:
- Signals are noticed in retrospect
- Multiple design & equipment failures
- Operator errors are not considered errors until the accident is understood
- Negative Synergy - where the combined effect of equipment, design & operator errors is far greater than the sum of their individual consequences.
Source: High-Reliability Org
Now, because of the coupling of systems and how interdependent they are, errors cascade rather dramatically. The pace at which this occurs is beyond human comprehension, given how multifaceted systems are.
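To make the cascade concrete, here is a rough sketch of how a single failure spreads through tightly-coupled services. Everything in it is an assumption for illustration: the service names, the `DEPENDS_ON` map, and the `blast_radius` helper are hypothetical, and the model treats every dependency as a hard one with no fallback, which is the worst case of tight coupling.

```python
from collections import deque

# Hypothetical service map: each key depends on the services in its list.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["database"],
    "cart": ["cache", "database"],
    "search": ["cache"],
    "database": [],
    "cache": [],
}


def blast_radius(failed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Return every service transitively affected when `failed` goes down."""
    # Invert the edges so we can ask "who depends on this service?"
    dependents: dict[str, list[str]] = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(service)

    # Breadth-first walk outward from the failed service.
    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


print(sorted(blast_radius("database", DEPENDS_ON)))
print(sorted(blast_radius("cache", DEPENDS_ON)))
```

In this toy map, losing the database takes out payments, cart, and checkout, even though checkout never talks to the database directly. Real systems add retries, timeouts, and partial degradation, so treat this as a picture of the coupling, not a prediction.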
Perrow calls these accidents inevitable. In an organizational context, this makes sense. Funnily enough, he attributes these accidents not to the underlying technology but to the people operating it.
It's always about the people. A side note: Think of instrumentation in observability (o11y). It's not a technology problem. Seldom is. It's almost always a people one. A succinct explanation of this: https://last9.io/blog/observability-is-dead-long-live-observability/
A simpler observation from the nuclear fiasco was how complex the systems were. The more rigid things are, the more failure loops one creates. Because of equipment & design, there's also a certain inevitability in the recurrence of disasters.
I find this framework interesting, given how we’re operating distributed microservices with complex tooling. A simple way to put this theory into practice is something most of us consider mundane but is actually critical: documentation.
Is your organization's documentation distributed and accessible? Is it simple? How much chaos do your async platforms have by design to force documentation? (Think killing tribal knowledge by auto-deleting Slack messages.) How easy is it for a new joiner to understand? And so on.
If we accept the inevitability of 'Normal Accidents', our ability to control randomness and chaos is far higher. o11y is about the simple things and doing them right. And Perrow’s work from 1984 is as relevant now as it was then.
And while ‘Normal accidents’ are inevitable, how are you thinking of Observability in a deluge of data? I’d love to know. DM, or please leave a comment. ✌️