I work as a Site Reliability Engineer at @last9io.
We help customers improve their business reliability. This can be for engineering, product, customer support, or any team that needs visibility into their complex microservices. My job is to ensure our clients have a minimal business impact and help them build their reliability to the last 9s.
In that endeavor, we often document a range of issues caught by Last9 to build an incident report. Such reports get shared across the organization. Different teams read them to understand the root causes and the recommended steps to mitigate these failures.
After doing this multiple times, I thought I should share some lessons gleaned from writing such reports.
Before I start, I will share the obvious: most of this is common sense. Think of these as good guardrails and a reminder to stick to the basics. Pruning words, getting to the point, and keeping a concise flow of events as they transpired are what make these documents easy to read. Without further ado: 👇
- Every screenshot should be accompanied by a link and vice versa wherever possible. If you don’t make it easy for the customer to read the story, chances are that it won’t be read. Attributions are important, and pictorial representations are easier to understand.
- Given the volume of alerts customers receive, you have to separate alert-worthy scenarios from merely interesting correlations that do not warrant urgent action. That is how you reduce alert fatigue. There has to be a clear differentiator between an alarm and a notification. This inevitably means you have to understand the basics of your customer’s services, their incident priorities, and what really is a P0 vs a P1/P2 for them. This is crucial to building and winning trust. In short: Use your “this is broken” conversation currency effectively.
- Documentation is all about good storytelling. It’s about the arrangement of facts, capturing the sequence of events as they transpired. If you’re not able to narrate this story in the simplest fashion, it needs to be reworked. This is easier said than done. I usually write the entire doc and let it simmer in my mind. I revisit it after 1-2 hours and take a stab at it again. This is when I see some obvious mistakes. It always helps if you can get a colleague to read it, and walk them through how you perceived the whole situation as it unfolded. Their retelling of the events will help you analyze the gaps in your storytelling capabilities.
- There’s fun in laying out the Lego blocks to link them later, but there is frustration in beating about the bush without getting to the point. Time is scarce. Get to the point, and get to it soon. Having a tl;dr helps. Optimize for readability. Giving readers the choice to skip a section, or to read only a summary of it, will make sure your docs are read consistently.
- If you think a Slack conversation with a link and a screenshot is good enough, think again. Links will go stale. You will forget what your screenshots meant. Spend those 15 minutes. Write it down in a structured, linear manner. Give context, share action items.
- No tool or product can replace inquisitiveness. There will always be some missing connections. Don’t just do something, sit there and think about how the dots of the traffic flow are joined. These are the best moments to validate how you thought the traffic flowed against how it actually unfolded.
- While debugging the issue, if you feel the pinch of “If this had been observed, we would’ve resolved the issue faster”, then that is an important input to your coverage backlog. See where the blind spots are and try to eliminate them. You can never do this 100%. Share this knowledge with your customer and showcase a clear plan for the next steps.
- Take pride in your debugging and in sharing your work. Would you rather see a series of muted moving pictures or have a deep, thoughtful Morgan Freeman voiceover building the story with each frame? If you don’t enjoy it, it will always be a chore.
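To make the alarm vs notification point above concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Signal` class, the `ALARM_PRIORITIES` set, and the idea that only P0 pages someone are hypothetical, not how Last9 or any particular customer does it. The real mapping has to come from the customer's own incident priorities.

```python
from dataclasses import dataclass

# Hypothetical: priorities this customer treats as "drop everything".
# Agree on this set with the customer; do not guess it.
ALARM_PRIORITIES = {"P0"}


@dataclass
class Signal:
    service: str
    summary: str
    priority: str  # "P0", "P1" or "P2", as defined by the customer


def classify(signal: Signal) -> str:
    """Return "alarm" if the signal needs urgent action, else "notification"."""
    return "alarm" if signal.priority in ALARM_PRIORITIES else "notification"


signals = [
    Signal("checkout", "5xx rate above agreed threshold", "P0"),
    Signal("search", "latency correlates with a deploy", "P2"),
]

for s in signals:
    print(f"{classify(s)}: {s.service} - {s.summary}")
```

Keeping the alarm set small and explicit is the point: anything outside it still gets reported, but as a notification that can wait.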
You can find me on Twitter @sphirani. If you want to help your organization be reliable beyond blind dashboarding, you must talk to us. Book a demo and we can walk you through interesting things we’re building in the reliability space. ✌️