You’re already running canary servers
If a deploy caused a spike in 5xx errors, you’d cancel it before it completed. If you did a blue-green rollout and the database was suddenly pegged, you’d roll back. If you pushed a change to a website and started getting tons of reports of user crashes, you’d undo that change as quickly as you could.
In all of these cases, you’re validating your build with production traffic, but in an ad-hoc way. Your success criteria are unclear, and the impact on users is determined by how quickly you’re able to manually roll back. I’d argue that in a real sense these are all implicitly canary deploys, deploys that validate code against production traffic, and that it’s worth being rigorous about your validation and rollback stages.
There are two key pieces to any canary setup: validation and rollback. Validation is the more interesting piece; how do you tell that your new code is good? You have 100 different metrics that all vary over the course of the day, so how do you choose the right ones to focus on and figure out appropriate cutoffs?
My experience has been that a lot of metrics are too noisy to be useful for automated validation. If you have a strict limit on a lot of different metrics, then you run into a ton of false positives. If you have a loose limit, then you don’t catch some of the things that you wanted to catch. Putting a few loose limits in place is helpful - you probably want to stop a deploy if it takes 2x the CPU or 2x the time that it used to - but anomaly detection is a tricky problem unless you have incredibly good control over the exact number (and not percentage) of requests that are making it to your canary containers.
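The "loose limit" idea above can be sketched as a simple guard. Everything here is a hypothetical shape, not a real monitoring API: the metric names and the dict-of-averages format are assumptions, and the 2x multiplier is just the kind of deliberately coarse threshold described above, chosen to catch only unambiguous regressions rather than noise.

```python
# Hypothetical sketch of the "loose limit" checks: block a deploy only on
# coarse, unambiguous regressions (2x CPU, 2x latency), not on small drifts
# that would generate false positives.

LOOSE_LIMIT_MULTIPLIER = 2.0  # deliberately generous to avoid noise

def passes_loose_limits(baseline: dict, canary: dict) -> bool:
    """Return True if the canary stays under 2x the baseline on coarse metrics.

    `baseline` and `canary` are assumed to be dicts of metric name ->
    average value, e.g. {"cpu_seconds_per_request": 0.02, ...}.
    """
    for metric in ("cpu_seconds_per_request", "latency_seconds"):
        if canary[metric] > LOOSE_LIMIT_MULTIPLIER * baseline[metric]:
            return False  # unambiguously worse: stop the deploy
    return True
```

The point of keeping the multiplier large is the trade-off described above: a tight limit on many metrics drowns you in false positives, while a couple of generous limits still catch the deploys that are obviously broken.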
For backend deploys, the types of validation criteria I like are more concrete: 50,000 successful requests without a single server error. 50,000 requests doesn’t guarantee that you’ll discover every possible regression, but it gives you reasonable certainty that things are working as expected. Depending on your service, you might want to see many more or many fewer successful requests before calling a canary validated.
This depends on having a clear definition of a server error: things that are mistakes on the server that must be fixed there. Most server errors probably cause 500s, but some of them may be things that a server can recover from but still should be fixed. If you don’t want to fix something, then it’s not a server error. Correctly categorizing errors is what allows the validation step to work. If you’re not doing that, it will be much harder to validate whether a deployment is good or not.1
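Taken together, the two ideas above — a clear definition of "server error" and a zero-tolerance success count — can be sketched as follows. The outcome shape (a status code plus a recovered-from-fault flag) and the threshold constant are illustrative assumptions, not a real API:

```python
# A sketch of "50,000 successful requests without a single server error."
# The (status_code, recoverable_fault) outcome shape is a hypothetical
# simplification of whatever your request pipeline actually records.

REQUIRED_SUCCESSES = 50_000

def is_server_error(status_code: int, recoverable_fault: bool) -> bool:
    """A server error is a mistake on the server that must be fixed there."""
    if 500 <= status_code < 600:
        return True           # most server errors surface as 5xx responses
    return recoverable_fault  # recovered-from faults still count if they need a fix

def canary_passed(outcomes) -> bool:
    """Validate once the canary has seen enough traffic; any server error fails it."""
    successes = 0
    for status_code, recoverable_fault in outcomes:
        if is_server_error(status_code, recoverable_fault):
            return False  # zero tolerance: one server error fails the canary
        if 200 <= status_code < 400:
            successes += 1
        if successes >= REQUIRED_SUCCESSES:
            return True
    return False  # not enough traffic yet to call the build validated
```

Note that the classification step is where the judgment lives: if a fault was recovered from but still needs a server-side fix, it counts against the canary, exactly because "if you don’t want to fix something, then it’s not a server error."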
This zero-tolerance policy for errors might seem a bit extreme, and it’s not necessarily where you’ll want to start, but basing your criteria on unambiguous counts like this can make validating a build much easier. There are other counts you might look at to create cutoffs:
- How many requests took more than 1s?
- How many database requests took more than 100ms?
- How many client crashes were reported that included the build number that we were testing out?
- How many times did a GC pause exceed 250ms?
It’s easier to create a bright-line rule from simple counts than to wonder whether a 2.4% increase in CPU utilization is noise or an actual performance regression.
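Rules like the ones in the list above reduce to a table of counters and cutoffs. The counter names and the specific limits below are hypothetical, but the shape is the point: every rule is a count compared against a bright line, so a failed validation needs no interpretation.

```python
# A minimal sketch of count-based validation. The counter names and limits
# are made up for illustration; what matters is that each rule is a simple
# count with a bright-line cutoff.

CUTOFFS = {
    "requests_over_1s": 10,
    "db_queries_over_100ms": 50,
    "client_crashes_on_this_build": 0,
    "gc_pauses_over_250ms": 5,
}

def validate_canary(counters: dict) -> list:
    """Return the rules the canary violated; an empty list means it passed."""
    return [
        name for name, limit in CUTOFFS.items()
        if counters.get(name, 0) > limit
    ]
```

Returning the violated rules by name, rather than a bare pass/fail, also helps with the cultural problem discussed below: whoever picks up a failed build can see exactly which bright line was crossed.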
The trickiest part of a system like this is cultural, not technical. You need an engineering culture of shared ownership where if someone’s build fails the canary stage but it’s not their “fault,” they’ll still try to fix it. If you don’t have that culture, it’s easy for the canary step to become flaky when people retry deploys without fixing whatever issue caused the validation stage to fail in the first place. You also need an engineering culture that prioritizes those sorts of fixes in general, and one where engineers have enough autonomy and trust that they can just go fix things that failed the build.
Canary vs. Blue/Green Deploys
One of the benefits that I’ve seen people point to for canary deploys over blue/green deploys is that because you’re gradually ramping up traffic, you’ll be able to detect load problems. In practice, I’ve found that this is less effective than you’d expect: many queries that aren’t a problem when run on 1% of your servers can suddenly turn into huge problems when run on 100%. You might not hit any deadlocks until you have enough servers running the same problematic code. You might not hit temporary table limits on the database until you have enough servers running queries that create temporary tables.
In terms of validation and ability to roll back, I don’t think there’s much real difference between canary and blue/green strategies. If you want to see N requests before validating a build, sending 100% of traffic to the new servers will let you validate your code much faster than sending 1%. Both strategies make it easy to roll back, either by stopping the canary servers or by going back to the other set of servers.
Is it wrong to test out code on users?
When people discuss canary deploys, they’ll sometimes mention that it seems wrong to test code on real users, and I think that’s the wrong way to think about it. The difference between having and lacking an explicit canary setup is how rigorous you are with your validation stage and how easy it is to roll back. Rather than a canary stage being “wrong” because it’s testing code on real users, I think it’s wrong that so many companies deploy without going through a validation stage of some sort and without a plan in place to quickly roll back code. It means you’re testing out potentially problematic code on 100% of your users without a swift rollback strategy in place.
This sort of canary system is normally associated with engineering cultures that practice continuous delivery, but I think the same lessons are even more applicable to teams or projects that deploy less often. If you’re deploying less often, each deploy has many more changes, so your ability to validate that the huge set of changes isn’t causing problems is essential.
I love my canary containers
In coal mines, some miners would apparently keep their canaries in special cages with little oxygen tanks that they could use to revive the birds after they alerted the miners to dangerous gases.2 I imagine that you would care deeply about the little creature that might have saved your life multiple times.
To a much lesser extent, that’s how I feel about canary containers. They have saved me so much time over the years by catching mistakes and times when my local environment doesn’t perfectly match production.
A few days ago, when I started writing this post, I was reflecting that it had been months since canary containers had caught anything for me personally. Yesterday, I pushed some code to add compression to MongoDB network connections for a database. This was code that only broke in production, and the canary containers worked perfectly to stop the deploy before any users saw any issues at all.
They’re wonderful little containers.