Software bugs can cause major issues, take up significant resources to fix, and even lead to fatalities in some cases. That’s why reviewing and reflecting on your overall project and processes is crucial for continuous improvement and avoiding future incidents.
A postmortem is a follow-up process completed at the end of a project to allow a team to reflect on the entire software development lifecycle (SDLC).
The team looks at both successes and failures, analyzing what can be learned and improved. It’s a collaborative effort that involves all stakeholders, encouraging contributions from everyone, and ultimately building trust within your organization.
Postmortems and Incident Management
Postmortems are key to incident management. They help identify the root cause of a problem and create an action plan to mitigate future issues.
Additionally, they can uncover patterns between seemingly unrelated incidents. This is especially true for incident postmortems, which are vital for understanding and improving your incident response.
The Value of Blameless Postmortems
Implementing blameless postmortems promotes a culture of growth, helping teams focus on solutions rather than placing blame.
Each postmortem reduces the chances of the same bug recurring and helps safeguard against future code issues.
When to Conduct an Incident Postmortem
Incident postmortems are particularly essential after an outage or significant failure, ensuring that key metrics are reviewed and lessons are learned.
Even if postmortems aren’t conducted regularly, they are essential when a bug should have been caught earlier or when a bug causes significant defects in your system.
Implementing a postmortem process ensures that your team can catch issues before they escalate while documenting areas for growth and improvement.
Best Practices to Improve Your Postmortem Process
Before implementing a postmortem process, create a postmortem template that includes key details to ensure consistency and clarity. This helps streamline your efforts and ensures all necessary points are covered. Your template should include:
- Who owns the postmortem (and who will do the analysis)
- When the incident happened
- Lessons learned
- A timeline of the incident
- Action items
A well-structured template ensures that your team learns from each incident, making the root cause analysis clear and actionable. It also helps prevent significant variations in the postmortem report when different team members are involved.
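The template fields listed above can also be captured as a small data structure, so that an incomplete report is easy to spot. The following is a minimal sketch, not a standard format: the class names, field names, and the `missing_sections` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TimelineEntry:
    at: datetime        # when the event happened
    description: str    # what happened at that moment


@dataclass
class Postmortem:
    # Field names are hypothetical; they mirror the template list above.
    owner: str                  # who owns the postmortem and does the analysis
    incident_time: datetime     # when the incident happened
    lessons_learned: list[str] = field(default_factory=list)
    timeline: list[TimelineEntry] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of template sections that are still empty."""
        missing = []
        if not self.owner.strip():
            missing.append("owner")
        for name in ("lessons_learned", "timeline", "action_items"):
            if not getattr(self, name):
                missing.append(name)
        return missing


draft = Postmortem(owner="on-call engineer",
                   incident_time=datetime(2024, 1, 15, 9, 30))
print(draft.missing_sections())
# ['lessons_learned', 'timeline', 'action_items']
```

A check like this could run before a postmortem is marked as done, so that every report covers the same sections regardless of who writes it.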
Postmortems Aren’t for Blaming
A successful postmortem should not focus on blaming individuals or teams. Instead, the goal is to understand how something happened, not who is responsible.
This approach encourages learning and improvement rather than defensiveness. During a postmortem meeting, ensure that everyone feels comfortable contributing.
- Focus on fact-finding and how the issue occurred.
- Put up safeguards to prevent the recurrence of the issue.
- Challenge your processes for continuous improvement.
Postmortems also provide an opportunity to revisit workflows. If team members follow outdated methods, this is a chance to innovate and improve. You can use real-time assessments to analyze the incident and uncover patterns. This is especially useful for on-call teams to handle high-pressure situations and for identifying follow-up actions.
To improve efficiency, consider using tools to automate parts of the process, allowing you to focus on key takeaways. Incorporating DevOps principles will help integrate this feedback loop into your continuous delivery cycle, further streamlining your postmortem process.
Involve the Author
A postmortem answers simple questions and notes where processes can be improved. Make sure each postmortem is written by people who were involved in the incident. If it is written by someone who wasn’t involved, you can lose important context and details.
Things like the ordering of events or the steps that were taken by multiple parties to resolve incidents can get mixed up and become confusing to learn from. The best people to explain and remediate are the individuals who were directly involved in the incident.
Postmortems Need Actionable Steps
If, at the end of a postmortem, your only actions are things like “improve code quality” and “ensure bad releases don’t go to prod,” you’re not diving deep enough into why the incident happened.
To gain meaningful insight you need to create actionable steps. An actionable step is one in which you address the issue head-on and start making changes immediately.
A good postmortem will include comments like:
- The code quality in the Omega codebase is making it incredibly difficult for developers to know their fix works. You need a week to write more robust unit tests.
- The automated checks before a release keep showing false positives and aren’t correctly highlighting when you have real failures in the code.
These comments can be actioned, and prove you’ve done the correct level of investigation.
For instance, Google has one action item for each of its core principles of incident handling (detection, prevention, and mitigation) in an attempt to ensure each incident won’t occur again.
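One way to make the detection, prevention, and mitigation split concrete is to tag each action item with a category and check that none of the three is left empty. This is a minimal sketch under that assumption, not a description of Google's tooling; `ActionItem` and `uncovered_categories` are hypothetical names.

```python
from collections import Counter
from dataclasses import dataclass

# The three categories named in the text above.
CATEGORIES = ("detection", "prevention", "mitigation")


@dataclass
class ActionItem:
    category: str       # expected to be one of CATEGORIES
    description: str    # a concrete, actionable step
    owner: str          # who is responsible for the step


def uncovered_categories(items: list[ActionItem]) -> list[str]:
    """Return the categories that have no action item yet."""
    counts = Counter(item.category for item in items)
    return [category for category in CATEGORIES if counts[category] == 0]


items = [
    ActionItem("prevention",
               "Write more robust unit tests for the Omega codebase",
               "dev team"),
    ActionItem("detection",
               "Fix false positives in the automated pre-release checks",
               "release team"),
]
print(uncovered_categories(items))
# ['mitigation']
```

Here the two example comments from earlier map to prevention and detection, which shows that the postmortem still lacks a mitigation step.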
Seek Feedback
It’s important to determine whether your postmortems are doing their job: identifying issues and helping you prevent or mitigate them in the future. If the same issues keep recurring, the postmortem format you’re using needs to be updated.
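If each postmortem records a short root-cause label, spotting recurring issues can be as simple as counting those labels across reports. The sketch below assumes such labels exist and are written consistently; the function name and the threshold of two are illustrative choices.

```python
from collections import Counter


def recurring_root_causes(root_causes: list[str],
                          threshold: int = 2) -> dict[str, int]:
    """Return root-cause labels seen at least `threshold` times."""
    # Normalize case and surrounding whitespace so near-identical labels match.
    counts = Counter(cause.strip().lower() for cause in root_causes)
    return {cause: n for cause, n in counts.items() if n >= threshold}


labels = [
    "Flaky release checks",
    "missing unit tests",
    "flaky release checks ",
]
print(recurring_root_causes(labels))
# {'flaky release checks': 2}
```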
You should ask for (and provide) feedback on the postmortems your team writes. It’s possible that there’s missing knowledge or tools that you’re not aware of that can help improve your team and its postmortem process.
You should continually seek feedback so that your processes continue to grow and improve and so that your team gets better and better at catching mistakes before they happen.
Who you should seek feedback from is an important question!
Feedback is often sought from the team responsible for the service, gleaning differing experiences from junior to principal developers.
A lot of big software teams (Microsoft, Google, etc.) publish their postmortems. This is normally not to gather feedback on how to improve their systems, but to communicate clearly and assure their customers that the incident likely won’t happen again.
Postmortem Examples from Around the World
To better understand how the postmortem process works, let’s look at some companies that have implemented them well.
Amazon
In February of 2017, the Amazon S3 team was debugging a minor issue, and in that process issued a command to remove a small number of servers. However, the command was issued with a typo and it removed a larger set of servers than intended. Because these servers supported critical systems, the dependent systems also required a full restart to function properly.
This caused a vast cascading failure since Amazon’s own services like EC2 and EBS rely on the servers. The failure ended up affecting hundreds of other companies in the process.
The incident was resolved roughly four hours later.
In this instance, Amazon found that the tool responsible for the removal of servers wasn’t strict enough. It allowed too many servers to be easily removed and so the company updated its processes to ensure servers would be removed more slowly in the future. Amazon also added safeguards to prevent servers from being removed if it would take any system/subsystem below its minimum required capacity level.
This limits the chances of this incident happening again, and ensures better safeguards for its customers.
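The two safeguards described above (removing servers more slowly and refusing any removal that drops a subsystem below its minimum capacity) can be sketched as a pre-flight check. This is an illustration only and not Amazon's actual tooling; `Subsystem`, `plan_removal`, and the batch size of two are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subsystem:
    name: str
    active_servers: int
    min_required: int   # minimum capacity the subsystem needs to function


class CapacityError(Exception):
    """Raised when a removal would take a subsystem below minimum capacity."""


def plan_removal(subsystem: Subsystem, requested: int,
                 max_per_step: int = 2) -> list[int]:
    """Return batch sizes for a staged removal, or raise if it is unsafe."""
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step > 0")

    # Safeguard 1: never go below the minimum required capacity.
    remaining = subsystem.active_servers - requested
    if remaining < subsystem.min_required:
        raise CapacityError(
            f"{subsystem.name}: removing {requested} servers leaves "
            f"{remaining}, below the minimum of {subsystem.min_required}"
        )

    # Safeguard 2: remove servers slowly, in small batches.
    batches = []
    left = requested
    while left > 0:
        step = min(max_per_step, left)
        batches.append(step)
        left -= step
    return batches


index = Subsystem(name="index", active_servers=10, min_required=6)
print(plan_removal(index, 3))
# [2, 1]
```

With the same subsystem, `plan_removal(index, 5)` raises `CapacityError`, because only five servers would remain when six are required. A mistyped, oversized request is rejected before any server is touched.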
keepthescore
keepthescore accidentally ran a database call that deleted a table on its production server. Thankfully, it uses a managed database from DigitalOcean that offers daily backups. However, about seven hours of data were permanently lost.
The database calls were hard-coded to run only on a local machine (not the production machine), but through human error they had been run on the wrong server.
Through its postmortem process, keepthescore was able to classify the database call that deleted the table as “too dangerous” since it wasn’t able to test it safely. It then made the technical decision to remove the code completely and test its backup system’s speed of recovery to prevent the same problem in the future.
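keepthescore chose to remove the dangerous code outright. Where a destructive call has to stay in a codebase, a common alternative (not what keepthescore did) is to make it refuse to run anywhere except an explicitly local environment. The sketch below assumes a hypothetical `APP_ENV` variable and a hypothetical `scores` table; neither comes from keepthescore's code.

```python
import os


class UnsafeEnvironmentError(RuntimeError):
    """Raised when a destructive operation is attempted outside local dev."""


def require_local(env=None) -> None:
    """Fail closed: only an explicit APP_ENV=local is allowed through."""
    env = os.environ if env is None else env
    current = env.get("APP_ENV", "")   # APP_ENV is an assumed variable name
    if current != "local":
        raise UnsafeEnvironmentError(
            f"refusing to run destructive call with APP_ENV={current!r}"
        )


def reset_database(execute, env=None) -> None:
    """Drop the (hypothetical) scores table, but only on a local machine."""
    require_local(env)
    execute("DROP TABLE IF EXISTS scores")


# Usage: `execute` stands in for a real database cursor's execute method.
statements = []
reset_database(statements.append, env={"APP_ENV": "local"})
print(statements)
# ['DROP TABLE IF EXISTS scores']
```

Because the check treats a missing or unknown environment as unsafe, running the same call with `APP_ENV` set to `production`, or not set at all, raises `UnsafeEnvironmentError` before any statement is executed.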
If you want to look at even more postmortems from other companies like Google, GitHub, Linux, and Spotify, you can explore other examples below!
Conclusion
In this article, you’ve learned about what a postmortem is and why you should conduct one. You also learned about best practices you should incorporate into your postmortem process.
A good postmortem will feel collaborative and build team unity while ensuring the continuous improvement of your organization. Take a look at other postmortem examples and then get started implementing your own postmortem process today.
(A big thank you to Kealan Parr for his contribution to this article)