An Opinionated Guide to Post-mortems

There are many guides to post-mortems; this is mine.

Post-mortems are essential for any company that’s trying to do better over time, but many companies and teams do a mediocre job at running them. It even feels like some teams cargo-cult the practice, coming together for a ritualized meeting that they call a post-mortem in the hope that calling it that will invoke the Power of the Post-mortem and improve their organization.

This guide can’t be the right one for your organization. This guide won’t even be the right one for the next organization I join. A guide that you and your team collectively talk through and commit to will be the right one for your organization, but I hope this one helps.

Post-mortems are blameless

The only hard and fast rule about post-mortems is that they are blameless. Somebody might have pressed a button they shouldn’t have or pushed code they shouldn’t have, but the mindset we need is that a reasonable person did a reasonable thing based on the situation they found themselves in.

When you start blaming people, you do more than miss your chance to come up with ideas to make your system safer. The lesson it teaches people is that they will be punished for mistakes, that they should hide any culpability, and that every engineer should avoid work that involves risky changes. It’s poison for an engineering culture.

Blamelessness means being blameless with yourself too! The most common problem I see with post-mortems is action items that boil down to “Try Harder Next Time.” Those sorts of action items come from blaming yourself and thinking “I was an idiot. I should have done better.” Those thoughts make it impossible to figure out what actually went wrong and what realistic safeguards we can put in place. If another engineer was in your position, how would you make it easier for them to avoid making a mistake?

Anonymizing a post-mortem can help people get into the right blameless mindset. “Alice” didn’t make a mistake, an “engineer” did. I think anonymization is the right call for many teams, but it does lose useful signal. It’s no longer easy to see that “Bob,” who joined the company three days ago, was the one who pressed the scary button; should the group have a discussion about onboarding? And it’s no longer easy to see that “Charlie,” a frontend specialist, was the one to kick off the terrifying index build. Anonymization can make some discussions less likely, and it makes it harder to ask questions like “Eve, what was running through your mind when you saw the alert?” If you have a healthy-enough engineering culture where you can have blamelessness without anonymization, you should consider non-anonymized post-mortems, but most teams should default to anonymized ones.

Who should come to a post-mortem?

Invite at least one other person! If it wasn’t a major issue, you don’t necessarily need to invite the whole team, but having one other person there can drive much better conversations. Post-mortems are a spectrum, so you should figure out the right group of people for whatever went wrong (or came close to going wrong).

Experienced engineers, even those not on your team, can be great people to invite to post-mortems because they likely have seen similar issues in the past and may have a much larger toolbox of potential action items and ideas that they can draw on.

Why timelines?

One of the first things I like to do with a post-mortem is sketch out a rough timeline. If it took 24 hours for us to discover a problem and only 24 minutes to fix, then it’s helpful to focus on discovering things sooner! If it took us a long time to get hold of a person who could change a config, then that’s a spot that we can improve. If it took a long time to debug something, then it’s helpful to chat through ways that we could improve our observability.
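Here’s a rough sketch of that arithmetic in Python. The timestamps and event names are made up for illustration (they mirror the 24-hours-to-discover, 24-minutes-to-fix example, not a real incident), and `phase_durations` is a hypothetical helper, not part of any tool.

```python
from datetime import datetime

# Hypothetical incident timeline: (timestamp, event) pairs, invented for illustration.
timeline = [
    (datetime(2024, 3, 4, 9, 0), "started"),     # the problem begins
    (datetime(2024, 3, 5, 9, 0), "detected"),    # someone notices
    (datetime(2024, 3, 5, 9, 10), "responder"),  # a person who can fix it is reached
    (datetime(2024, 3, 5, 9, 24), "resolved"),   # the fix is live
]

def phase_durations(timeline):
    """Return the time spent between each pair of consecutive events."""
    ordered = sorted(timeline)  # sort by timestamp
    return {
        f"{a_name} -> {b_name}": b_time - a_time
        for (a_time, a_name), (b_time, b_name) in zip(ordered, ordered[1:])
    }

durations = phase_durations(timeline)
longest = max(durations, key=durations.get)

for phase, spent in durations.items():
    print(f"{phase}: {spent}")
print(f"Focus the discussion on: {longest}")
```

With these numbers, the gap between the problem starting and someone noticing dwarfs everything else, so detection is where the conversation should spend its time.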

Sketching out that rough timeline can help us figure out where we should focus our energy. If we can avoid having the incident in the first place, that’s fantastic, but we should also focus on failure recovery when incidents do happen. Despite our best efforts, things do go wrong – sometimes things that aren’t even our “fault”! – and it’s our job to build systems so that we can fix things quickly, communicate to our users, and just generally make our product dependable.

No incident has just a single cause

If an outage had a single cause, it wouldn’t have happened. The only reason that something ever goes wrong is that there is a confluence of problems that bypass multiple safeguards. If your post-mortem had a single cause, it was a bad post-mortem!

There are a lot of questions like this that you can go through for even a “simple” post-mortem! Focusing on one thing that went wrong at the expense of anything else means that you’re only fixing things at one “layer.”

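In general, I think there are three main areas that we can focus on during a post-mortem:

  1. How can we avoid this happening at all?
  2. If it did happen, how can we detect it sooner?
  3. When we do detect it, how do we make it quick to fix?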

“Try Harder” isn’t an action item

If you look at many post-mortems, you’ll often see action items that amount to “try harder next time.”

It’s good to try hard and do better! But these action items are expressions of the idea that the problem was that someone made a mistake, so we just need to “try harder” next time to avoid that pesky “human error.”

When I haven’t had coffee, I’m barely able to string sentences together, but I’ll still try to code – we need a system that is foolproof-enough to deal with Tired-Will. Our goal is to build a system that works for us when we’re at our worst.

Action items should be things that we can commit to.

If an action item from a post-mortem is “rewrite this entire system,” that’s not particularly helpful. We want to focus our discussion on things that we can commit to.

Common action items will look like:

Sometimes you do really need to rewrite a system! Or tackle some other larger project. In those cases, what I find helpful is action items like “Chat with $person about adding the rewrite for X system to the roadmap,” but even then, there are almost certainly things we can do today to make problems less likely. Having a plan for a long-term fix doesn’t preclude us adding short-term fixes too!

Sharing out post-mortems

One of the most important things you can do with a post-mortem is share it! It helps other people be aware of edge-cases and failure points in our systems, and it’s a great way to get more ideas and build an engineering culture that celebrates improvement over time.

None of this necessarily needs to be super formal! Even a quick Slack message describing what went wrong and what you’re doing to make it less likely in the future can help other people avoid similar mistakes.

Book Recommendations!

(These are affiliate links because I love seeing when book recommendations land with people!)


Post-mortem Template

Summary

What happened? Why?

Timeline

This doesn’t need to be super detailed! We’re mostly trying to figure out where to focus our discussion.

Causes

When thinking about causes, there are three main areas we should be thinking about:

  1. How can we avoid this happening at all?
  2. If it did happen, how can we detect it sooner?
  3. When we do detect it, how do we make it quick to fix? How do we make the experience for users better while it’s broken?

There should always be multiple causes for each of these different areas.

Action Items

These action items should have an owner who’s committing to make them happen.

“Try Harder” isn’t an action item!
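
If you keep post-mortems somewhere structured, the template can be sketched as a small data structure that flags the two gaps it warns about: areas with too few causes, and action items without an owner. This is only one possible shape. The class names, the area labels, the `minimum=2` threshold, and the example incident are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Short labels (assumed) for the three areas in the "Causes" section.
AREAS = ("prevent", "detect", "recover")

@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # who is committing to make it happen

@dataclass
class PostMortem:
    summary: str
    causes: dict = field(default_factory=dict)        # area -> list of causes
    action_items: list = field(default_factory=list)  # list of ActionItem

    def thin_areas(self, minimum=2):
        """Areas with fewer than `minimum` causes written down."""
        return [a for a in AREAS if len(self.causes.get(a, [])) < minimum]

    def unowned_items(self):
        """Descriptions of action items nobody has committed to yet."""
        return [i.description for i in self.action_items if not i.owner]

# A made-up draft, loosely based on the index-build example.
draft = PostMortem(
    summary="Index build locked a busy table",
    causes={
        "prevent": ["No review step for schema changes", "Build ran at peak traffic"],
        "detect": ["No alert on lock wait time"],
    },
    action_items=[
        ActionItem("Add a lock-wait alert", owner="Eve"),
        ActionItem("Document an off-peak window for index builds"),
    ],
)

print("Needs more causes:", draft.thin_areas())
print("Needs an owner:", draft.unowned_items())
```

A check like this can’t tell you whether an action item is really “Try Harder” in disguise; that still takes a person reading it.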