An Opinionated Guide to Post-mortems
There are many guides to post-mortems; this is mine.
Post-mortems are essential for any company that’s trying to do better over time, but many companies and teams do a mediocre job at running them. It even feels like some teams cargo-cult the practice, coming together for a ritualized meeting that they call a post-mortem in the hope that calling it that will invoke the Power of the Post-mortem and improve their organization.
This guide can’t be the right one for your organization. This guide won’t even be the right one for the next organization I join. A guide that you and your team collectively talk through and commit to will be the right one for your organization, but I hope this one helps.
Post-mortems are blameless
The only hard and fast rule about post-mortems is that they are blameless. Somebody might have pressed a button they shouldn’t have or pushed code they shouldn’t have, but the mindset we need is that a reasonable person did a reasonable thing based on the situation they found themselves in:
- Why didn’t our layers of safeguards catch that?
- Why did the person pushing the button think it was a reasonable idea?
- Why didn’t we have training / monitoring / logging / linting for this?
- Why was someone working late / stressed / in a hurry?
When you start blaming people, you do more than miss your chance to come up with ideas to make your system safer. The lesson it teaches people is that they will be punished for mistakes, that they should hide any culpability, and that every engineer should avoid work that involves risky changes. It’s poison for an engineering culture.
Blamelessness means being blameless with yourself too! The most common problem I see with post-mortems is action items that boil down to “Try Harder Next Time.” Those sorts of action items come from blaming yourself and thinking “I was an idiot. I should have done better.” Those thoughts make it impossible to figure out what actually went wrong and what realistic safeguards we can put in place. If another engineer was in your position, how would you make it easier for them to avoid making a mistake?
Anonymizing a post-mortem can help people get into the right blameless mindset. “Alice” didn’t make a mistake, an “engineer” did. I think anonymization is the right call for many teams, but it does lose useful signal. It’s no longer easy to see that “Bob,” who joined the company three days ago, was the one who pressed the scary button; should the group have a discussion about onboarding? And it’s no longer easy to see that “Charlie,” a frontend specialist, was the one to kick off the terrifying index build. Anonymization can make some discussions less likely, and it makes it harder to ask questions like “Eve, what was running through your mind when you saw the alert?” If you have a healthy-enough engineering culture where you can have blamelessness without anonymization, you should consider non-anonymized post-mortems, but most teams should default to anonymized ones.
Who should come to a post-mortem?
Invite at least one other person! If it wasn’t a major issue, you don’t necessarily need to invite the whole team, but having one other person there can drive much better conversations. Post-mortems are a spectrum, so you should figure out the right group of people for whatever went wrong (or came close to going wrong).
Experienced engineers, even those not on your team, can be great people to invite to post-mortems because they likely have seen similar issues in the past and may have a much larger toolbox of potential action items and ideas that they can draw on.
Why timelines?
One of the first things I like to do with a post-mortem is sketch out a rough timeline. If it took 24 hours for us to discover a problem and only 24 minutes to fix, then it’s helpful to focus on discovering things sooner! If it took us a long time to get hold of a person who could change a config, then that’s a spot that we can improve. If it took a long time to debug something, then it’s helpful to chat through ways that we could improve our observability.
Sketching out that rough timeline can help us figure out where we should focus our energy. If we can avoid having the incident in the first place, that’s fantastic, but we should also focus on failure recovery when incidents do happen. Despite our best efforts, things do go wrong – sometimes things that aren’t even our “fault”! – and it’s our job to build systems so that we can fix things quickly, communicate to our users, and just generally make our product dependable.
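As a rough illustration of how a timeline points at where to focus, here is a minimal sketch that computes how long each phase of an incident took and picks out the slowest one. The event names and timestamps are made up for the example (they mirror the "24 hours to discover, 24 minutes to fix" case above), and `phase_durations` and `slowest_phase` are hypothetical helpers, not part of any tool.

```python
from datetime import datetime, timedelta

# Hypothetical timeline for illustration; the events and timestamps are made up.
timeline = [
    ("started", datetime(2024, 1, 1, 9, 0)),
    ("detected", datetime(2024, 1, 2, 9, 0)),
    ("fixed", datetime(2024, 1, 2, 9, 24)),
]


def phase_durations(events):
    """Return the time elapsed between each pair of consecutive events."""
    ordered = sorted(events, key=lambda event: event[1])
    return {
        f"{a_name} -> {b_name}": b_time - a_time
        for (a_name, a_time), (b_name, b_time) in zip(ordered, ordered[1:])
    }


def slowest_phase(events):
    """Return the name of the phase that took the longest."""
    durations = phase_durations(events)
    return max(durations, key=durations.get)


durations = phase_durations(timeline)
for phase, duration in durations.items():
    print(f"{phase}: {duration}")
print("Focus the discussion on:", slowest_phase(timeline))
```

In practice the timeline is usually a few lines in a doc rather than code, but the arithmetic is the same: find the longest gap and start the conversation there.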
No incident has just a single cause
If an outage had a single cause, it wouldn’t have happened. The only reason that something ever goes wrong is that there is a confluence of problems that bypass multiple safeguards. If your post-mortem had a single cause, it was a bad post-mortem!
- Why didn’t linting catch it?
- Why wasn’t there a test?
- Why wasn’t there a polyfill?
- Why didn’t we read the library description before including the library?
- Why didn’t the code review catch it?
- Why didn’t the LLM code review catch it?
- Why did we write the code that way to begin with? What guided us or an LLM towards those particular patterns? Why did we use an LLM to write code for this problem?
- Why didn’t our monitoring see that it was broken before our users did?
- Why weren’t we able to prioritize fixes for this footgun in the past? What structural forces keep us from cleaning up things like this when we encounter them?
There are a lot of questions like this that you can go through for even a “simple” post-mortem! Focusing on one thing that went wrong at the expense of anything else means that you’re only fixing things at one “layer.”
In general, I think there are three main areas that we can focus on during a post-mortem:
- Avoiding the mistake: linting, documentation, training, testing, in-line comments, improving the type system, using dual-read/dual-write patterns for a migration
- Detecting the mistake sooner: good alerting, maintaining low error rates so that new issues are obvious, “Every time an alert triggers, we either fix the problem or adjust the alert”
- Fixing faster: Running “fire-fighting” drills, using feature-flags to be able to turn features off, improving observability, CI/CD stability improvements that let us deploy faster
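To make the "fixing faster" idea concrete, here is a minimal sketch of a feature flag used as a kill switch. Everything in it is an assumption for illustration: the `FLAGS` dictionary stands in for whatever flag service or config store a team actually uses, and the flag and function names are hypothetical.

```python
# Hypothetical in-memory flag store. A real system would read flags from a
# config service or database so they can be changed without a deploy.
FLAGS = {"new_checkout": True}


def is_enabled(flag_name, flags=FLAGS):
    # Unknown flags default to off, so a typo in a flag name fails safe.
    return flags.get(flag_name, False)


def new_checkout_flow():
    # Stand-in for the risky new code path.
    return "new checkout"


def legacy_checkout_flow():
    # Stand-in for the known-good code path we can fall back to.
    return "legacy checkout"


def checkout():
    if is_enabled("new_checkout"):
        return new_checkout_flow()
    return legacy_checkout_flow()


print(checkout())
FLAGS["new_checkout"] = False  # flipping the flag is the "fix" during an incident
print(checkout())
```

The point of the pattern is that turning a broken feature off becomes a config change instead of a revert, a review, and a deploy.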
“Try Harder” isn’t an action item
If you look at many post-mortems, you’ll often see action items like:
- Review code more carefully
- Always check things on multiple browsers
- Check the docs before writing code in a particular area
It’s good to try hard and do better! But these action items are expressions of the idea that the problem was that someone made a mistake, so we just need to “try harder” next time to avoid that pesky “human error.”
When I haven’t had coffee, I’m barely able to string sentences together, but I’ll still try to code – we need a system that is foolproof-enough to deal with Tired-Will. Our goal is to build a system that works for us when we’re at our worst.
Action items should be things that we can commit to
If an action item from a post-mortem is “rewrite this entire system,” that’s not particularly helpful. We want to focus our discussion on things that we can commit to.
Common action items will look like:
- Add an alert on “lack of success” in addition to errors. (This is such a common gotcha, btw! If you’re thinking about alerting, always make sure you set up an alert that confirms things are working, not just that they aren’t failing.)
- Add new lint rules!
- Change types to make this class of error impossible
- Add a test for this class of errors.
- Add shared fixtures for this situation that make it easy for us to test this across multiple systems
- Run training for the team on a topic that we feel like we don’t know well: navigating logs, creating monitors, navigating a dashboard, etc.
- Run a fire-fighting drill
- Add a calendar event to check on X every 3 months
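The “lack of success” alert from the list above is easy to describe but easy to forget, so here is a minimal sketch of the idea. The function name and the thresholds are assumptions chosen for the example, not recommendations, and a real setup would express this in whatever monitoring tool the team already uses.

```python
def evaluate_alerts(successes, errors, min_successes=1, max_error_rate=0.05):
    """Return the alerts that should fire for one monitoring window.

    The thresholds are illustrative assumptions, not recommendations.
    """
    alerts = []
    total = successes + errors
    if total > 0 and errors / total > max_error_rate:
        alerts.append("error rate too high")
    # The "lack of success" check: a job that silently stopped running
    # produces zero errors, so an error-only alert would stay quiet.
    if successes < min_successes:
        alerts.append("too few successes")
    return alerts


print(evaluate_alerts(successes=100, errors=1))  # healthy window
print(evaluate_alerts(successes=0, errors=0))    # nothing ran at all
```

The second call is the case that error-only alerting misses: nothing failed because nothing ran.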
Sometimes you do really need to rewrite a system! Or tackle some other larger project. In those cases, what I find helpful is action items like “Chat with $person about adding the rewrite for X system to the roadmap,” but even then, there are almost certainly things we can do today to make problems less likely. Having a plan for a long-term fix doesn’t preclude us adding short-term fixes too!
Sharing out post-mortems
One of the most important things you can do with a post-mortem is share it! It helps other people be aware of edge-cases and failure points in our systems, and it’s a great way to get more ideas and build an engineering culture that celebrates improvement over time.
None of this necessarily needs to be super formal! Even a quick slack message describing what went wrong and what you’re doing to make it less likely in the future can help other people avoid similar mistakes.
Book Recommendations!
(These are affiliate links because I love seeing when book recommendations land with people!)
- The Scout Mindset is one of my favorite books about thinking better because Julia Galef focuses so much on the emotional reasons that you might not want to see things clearly.
- The Logic of Failure is drily written, but I still think it’s fun! It dives into why complex systems – like the ones we build! – are hard to predict/understand and why failure is so common.
- The Google SRE Book (physical book) is a good read on Site Reliability Engineering practices in general, but I think their post-mortem essay is relatively weak. Despite that, many of the other essays are chock-full of ideas for action items to make systems safer.
- Weird rec: Seeing Like a State is a relatively academic critique of top-down social planning, but I really enjoyed it, and I feel like the ideas/terms end up being pretty useful in software.
- Seeing like a software company is a great read.
- Reading a Book review for it is probably good enough to get most of the value!
- The Score tackles some similar ideas, but mostly through the lens of games/play. I think it’s a little weaker than Seeing Like a State in terms of ideas, but it’s much more approachable.
Post Mortem Template
Summary
What happened? Why?
Timeline
This doesn’t need to be super detailed! We’re mostly trying to figure out where to focus our discussion.
Causes
When thinking about causes, there are three main areas we should be thinking about:
- How can we avoid this happening at all?
- If it did happen, how can we detect it sooner?
- When we do detect it, how do we make it quick to fix? How do we make the experience for users better while it’s broken?
There should always be multiple causes for each of these different areas.
Action Items
These action items should have an owner who’s committing to make them happen.
“Try Harder” isn’t an action item!