Any organization (or individual, for that matter) that wants to improve needs to be able to do two things: have a frank discussion of what went wrong after a failure, and generate and execute on actionable steps that would prevent or mitigate the same failure in the future.
In software development, I've found success when we've had a process that ensures both that postmortems occur and that they are followed up on, as well as a culture whereby those postmortems are blameless, focusing on process and technology failures, rather than individual performance.
The goal of any postmortem is to produce improvements that prevent subsequent failures of the same or similar kind. If your postmortems are not producing improvements, or those improvements are not preventing recurring failures, why have them at all?
Given this goal, we can say a few things about what needs to happen as part of any postmortem:
- The discussion about any incident needs to be frank and truthful, or you may not get to the true root cause
- The process must produce actionable changes that are under your control; there's no use wishing the world were different
- There must be organizational buy-in, not only to spending time on postmortems, but also to spending time (and money) on the resulting action items
When is a Postmortem Appropriate?
I would suggest performing a postmortem after every major service interruption incident, though what "major" means can vary from organization to organization. Engineering Leadership should help to define this, though I would suggest you can get value out of holding postmortems more often than you may initially be comfortable with.
If you are unsure whether an incident warrants a postmortem, ask yourself, "Was the incident severe or frequent enough that we would be likely to prioritize whatever action items came out of it?"
Documenting the Incident
For any postmortem to be effective, everyone involved needs to have a clear understanding of what happened and when. Thus, a good portion of the success of a postmortem falls to the Incident Owner, who works to prepare a Postmortem Summary beforehand.
The Summary should include the following:
- A short summary of the incident
- A description of the impact of the incident
- As complete a timeline of the incident as can be constructed, including times of the initial failure, initial detection, response, any remediation steps taken, and final resolution. Pulling in chats (with timestamps) and relevant charts from the incident as supporting data is a good idea.
- An identification of the root cause, if possible
- A blank section for Action Items
Ideally, the first three sections of the document should be filled out before the postmortem, though it's okay to spend the first part of the meeting fleshing out the timeline in more detail if necessary. Note that the Incident Owner is responsible for making sure the document gets filled out, but should encourage others who were involved in the incident response to assist in filling out various sections.
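To make the shape of the Summary concrete, here is a minimal sketch of it as a data structure. The class names, field names, and event kinds (`failure`, `detection`, and so on) are my own assumptions for illustration, not part of any standard, and the example incident is made up. The point is that a timeline with real timestamps lets you read off numbers like time-to-detect directly.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class TimelineEvent:
    at: datetime
    kind: str       # hypothetical labels: "failure", "detection", "response", "remediation", "resolution"
    note: str = ""


@dataclass
class PostmortemSummary:
    summary: str
    impact: str
    timeline: List[TimelineEvent] = field(default_factory=list)
    root_cause: Optional[str] = None                        # may still be unknown before the meeting
    action_items: List[str] = field(default_factory=list)   # left blank until the meeting

    def first(self, kind: str) -> Optional[datetime]:
        """Earliest timestamp of the given kind, or None if no such event exists."""
        times = [e.at for e in self.timeline if e.kind == kind]
        return min(times) if times else None

    def elapsed(self, start_kind: str, end_kind: str) -> Optional[timedelta]:
        """Time between the first event of each kind, or None if either is missing."""
        start, end = self.first(start_kind), self.first(end_kind)
        if start is None or end is None:
            return None
        return end - start

    def ready_for_meeting(self) -> bool:
        """True once the summary, impact, and timeline sections have content."""
        return bool(self.summary.strip() and self.impact.strip() and self.timeline)


# Made-up incident, purely for illustration.
doc = PostmortemSummary(
    summary="Checkout requests failed for part of the morning",
    impact="Customers could not complete purchases",
    timeline=[
        TimelineEvent(datetime(2024, 1, 8, 9, 0), "failure", "bad config deployed"),
        TimelineEvent(datetime(2024, 1, 8, 9, 20), "detection", "error-rate alert fired"),
        TimelineEvent(datetime(2024, 1, 8, 9, 45), "remediation", "config rolled back"),
        TimelineEvent(datetime(2024, 1, 8, 10, 30), "resolution", "error rate back to normal"),
    ],
)

print("time to detect:", doc.elapsed("failure", "detection"))
print("time to resolve:", doc.elapsed("failure", "resolution"))
```

A gap like the 20 minutes between failure and detection in this example is exactly the kind of thing worth raising in the meeting.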
Scheduling the Postmortem
The postmortem meeting should be scheduled as soon as practically possible after an incident, ideally the next business day. However, if some of the team needs additional time to investigate the root cause of an incident, it's appropriate to delay a few days -- the discussion will be much more fruitful with such information at hand.
Meeting culture varies between organizations, but I think half an hour should be sufficient for most postmortems, and more straightforward incidents can be covered in 10-15 minutes. Attendees should include those involved in incident resolution, plus any relevant stakeholders.
The Blameless Postmortem
At all times, the Postmortem should strive to be blameless, focusing on faulty processes and technology. Where people are named in the Summary, it should be to indicate that someone noticed the issue, identified an important piece of information, or took some action to attempt remediation. The purpose of the meeting is not to point fingers; if people feel like they are potentially going to be blamed for an incident, they may hide or falsify information in an attempt to "cover their ass", which will only serve to undercut your efforts to identify appropriate action items to take.
The meeting should begin by reviewing the Postmortem Summary document and making sure everyone agrees with the impact, timeline, and root cause. Now is the time to fill in any missing information before moving on to generating action items.
Once we agree what happened, we can start to discuss what failed. Obviously, if we can identify the root cause of the incident, we can discuss what steps we should take to prevent that from happening again, but it's often the case with major incidents that there are multiple failures along the way. In particular, we can ask the following:
- Did it take longer to detect the incident or page an individual than it should have? Was sufficient monitoring in place?
- Was the first person to respond the right person/job function, or did it take several escalations to get to someone who could resolve the issue?
- Was the incident more impactful than it should have been because secondary systems were not resilient to the failure of the initial system?
- Did resolution take longer than it should have because relevant documentation was missing or out-of-date?
Often, the five whys exercise can be helpful in not only determining the root cause, but in sussing out other fundamental gaps in your processes. I would use it here whenever I have an incident that was sufficiently complex.
Hopefully by now we've not only agreed on the root cause of the incident, but we've also generated a solid list of other failures that occurred along the way. Now is the time to turn those failures into action items. For each failure, we ask, "What could we have reasonably done that would have either mitigated or prevented the failure entirely?"
The best action items are low-effort and high-impact -- they are relatively simple things that are likely to benefit you in the future, either because the triggering conditions are frequent, or because the impact of the failure mode is high. Regardless, it's best to start by listing all of the possible things you could do to prevent failure, and then whittle down the list to those items that are likely to get prioritized: low-effort, high-impact, or both. Everything else is likely a distraction: either you spend time on corner cases when you could be doing more impactful work, or you fill your backlog with work you're almost positive will never actually get done. Either outcome can hurt the perception within your organization that postmortems are a valuable exercise.
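The whittling-down step can be sketched in a few lines. This is only an illustration: the 1-to-5 scales, the thresholds, and the impact-over-effort ordering are assumptions I've picked for the example, and the candidate items are made up. In practice the scores are rough team estimates, not measurements.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    title: str
    effort: int   # 1 (low) to 5 (high), a rough team estimate
    impact: int   # 1 (low) to 5 (high), a rough team estimate


def shortlist(candidates: List[Candidate], max_effort: int = 2, min_impact: int = 4) -> List[Candidate]:
    """Keep items that are low-effort, high-impact, or both; best impact-per-effort first."""
    kept = [c for c in candidates if c.effort <= max_effort or c.impact >= min_impact]
    return sorted(kept, key=lambda c: (-(c.impact / c.effort), c.effort))


# Made-up brainstorm list, purely for illustration.
brainstorm = [
    Candidate("Rewrite the job scheduler", effort=5, impact=5),
    Candidate("Handle a rare clock-skew edge case", effort=4, impact=1),
    Candidate("Add an alert on queue depth", effort=1, impact=4),
    Candidate("Update the failover runbook", effort=1, impact=3),
]

for item in shortlist(brainstorm):
    print(f"{item.title} (effort {item.effort}, impact {item.impact})")
```

Here the edge-case item drops off the list entirely, and the expensive rewrite survives only because of its impact, landing at the bottom where it can be weighed against everything else in the backlog.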
Postmortem Follow Through
Now that you've completed the postmortem meeting, you have two main follow-up tasks:
- Publish your findings widely. Send out the Postmortem Summary doc, including Action Items, to your engineering email list. You may want to occasionally take more interesting, systemic failures and present the incident and resolution to the engineering team. The point here is education, as the more people who know and understand the story of this particular failure, the more who may incorporate your learnings into future development and processes.
- Make sure there is follow-through on Action Items. Make sure that each item has a person or a team assigned to it, and that the item is prioritized appropriately in a backlog. Generally these are things people know should get done, but perhaps don't always find the time, so gentle nags every few weeks may be appropriate. If you consistently find it difficult to get Action Items prioritized, either you need to take a hard look at your prioritization process to ensure that important technical debt does get priority, or perhaps you may be too freely committing to Action Items that aren't actually that urgent, and need to adjust accordingly.
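The "gentle nag" can be as simple as a periodic check over your open Action Items. The sketch below assumes a hypothetical `ActionItem` record and a three-week reminder interval (my reading of "every few weeks"); in a real setup the data would come from whatever backlog tool you already use, and the items shown are made up.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]    # person or team; None means nobody is assigned yet
    last_update: date
    done: bool = False


def reminders(items: List[ActionItem], today: date,
              every: timedelta = timedelta(weeks=3)) -> List[str]:
    """Return reminder lines for open items that are unowned or have gone quiet."""
    lines = []
    for item in items:
        if item.done:
            continue
        if item.owner is None:
            lines.append(f"{item.title}: needs an owner")
        elif today - item.last_update >= every:
            lines.append(
                f"{item.title}: no update from {item.owner} since {item.last_update.isoformat()}"
            )
    return lines


# Made-up backlog, purely for illustration.
backlog = [
    ActionItem("Add disk-space alert", "infra team", date(2024, 2, 1)),
    ActionItem("Update failover runbook", None, date(2024, 2, 26)),
    ActionItem("Fix retry loop", "payments team", date(2024, 2, 25)),
    ActionItem("Rotate stale credentials", "infra team", date(2024, 1, 5), done=True),
]

for line in reminders(backlog, today=date(2024, 3, 1)):
    print(line)
```

If the same items show up in this list month after month, that is the signal to revisit either your prioritization process or how freely you commit to Action Items.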
Like any process aimed at improvement, effective postmortems won't occur (certainly not with any regularity) without organization-wide support, and a little guidance. Fortunately, many of the things you could do are simple and low-effort, including:
- Provide a template for Postmortem Summary documents, perhaps along with a centralized repository for such documents.
- Ensure that postmortems happen. Task someone with following up after every incident to identify the Incident Owner and ensure that they schedule a postmortem as appropriate.
- Make sure teams know how to run a postmortem. There's lots written already on how to run an effective meeting -- the key here is that everyone understands how to approach the postmortem in a blameless manner.
- Ensure that Action Items are followed up on. For items that require significant effort/resources, push for impactful Action Items to be resourced and prioritized appropriately.
- Broadcast important learnings from postmortems, helping to demonstrate the value of the process while you're educating your staff.
No process should be cargo-culted wholesale into a new organization; each organization will have its own needs and culture, to which the process must adapt. The postmortem process I've outlined is no different, though I cannot stress enough how valuable a process like this can be in iteratively improving an organization, provided that the organization wants to improve. You may not see it at first, but I've found that week after week of small improvements like this add up to dramatically better stability and maintainability over time.