The Post-mortem Spectrum

I make mistakes. Lots of them. And when I make a mistake, I try to fix things so that I can’t make that same mistake again.

When I was just starting out, the way I tried to avoid repeating mistakes was action items like “code better,” “think more,” “remember the edge cases,” and just generally “try harder.” These are poor action items.

As I gained experience, I started to try to understand the reasons that made a particular mistake likely. Perhaps a function had a confusing name and our fixtures weren’t comprehensive enough to make it easy to write a good test. Perhaps we were missing some logging in dev that would have made debugging easy. Perhaps our default options weren’t safe ones and our engineering onboarding lacked some important topics.

If you squint, this looks like the world’s tiniest post-mortem. The questions are the same: How could I have discovered this sooner? How can I reduce the impact if something goes wrong? Why was this confusing? How could I make this safer and easier to work on? And the mindset is the same too: when there’s a mistake, you never just fix the mistake – you fix the reasons the mistake was possible to begin with.

But… so what? Why does it matter that the day-to-day thinking we do to leave the codebases we work in better than we found them is the same thought process that we go through when running a “real” post-mortem?

I think this matters because some teams make the mistake of treating post-mortems as something that only ever happens when something “big” goes wrong. Post-mortems, the thinking goes, are a relatively time-intensive structured activity that the team hopefully only needs to do once or twice a year (if that).

This type of post-mortem – the super-serious one after something important has gone wrong – is useful! But I think there is a spectrum of post-mortem-like activities that engineers should be doing regularly to make the systems and code that they work on more resilient:

Jotting down notes about something that went wrong for people to engage with asynchronously
Grabbing one other engineer to chat through root causes of a problem and potential fixes
Having a few engineers stick around after a standup to talk through how we could avoid a particular bug
Depending on how heavy the post-mortem process is at your company, even scheduling a small meeting to talk through an issue might still be much lighter than a full post-mortem

All of these things are (hopefully!) regular practices at most companies, but thinking of them as points on a spectrum of post-mortem activities makes post-mortems as a whole more approachable and more likely to happen. I believe that a lot of an engineering team’s ability to move fast comes from reflecting on the system and improving it. If you can create an engineering culture that encourages this sort of reflection for small problems, I think you can make big problems much less likely. And when those big problems do happen, you’ll have an engineering team that’s had more practice at understanding the causes behind a problem and coming up with good action items to make those causes less likely to recur.