Culling Containers with a Leaky Bucket

ClassDojo occasionally has containers get into bad states that they can't recover from. This normally happens when a database connection gets into a bad state; we've seen it with Redis, MySQL, MongoDB, and RabbitMQ connections. We do our best to fix these problems, but we also want our containers to have a chance of recovering on their own without manual intervention. We don't want to wake people up at night if we don't need to! Our main strategy for making that happen is having each container decide whether it should try restarting itself.

The algorithm we use for this is straightforward: every ten seconds, the container checks whether it's seen an excessive number of errors. If it has, it tries to claim a token from our shutdown bucket. If it's able to claim a token, it starts reporting that it's down to our load balancer and container manager (in this case, Nomad). Our container manager then takes care of shutting down the container and bringing up a new one.

On every container, we keep a record of how many errors we've seen over the past minute. Here's a simplified version of what we're doing:

let recentErrorTimes: number[] = [];
export function serverError(...args: unknown[]) {
  // record when each error happened; the error details themselves are logged elsewhere
  recentErrorTimes.push(Date.now());
}

export function getPastMinuteErrorCount () {
  // drop timestamps older than a minute so the array doesn't grow without bound
  const cutoff = Date.now() - 60_000;
  recentErrorTimes = recentErrorTimes.filter((t) => t >= cutoff);
  return recentErrorTimes.length;
}

Check out "ERROR, WARN, and INFO aren't actionable logging levels" for more details on ClassDojo's approach to logging and counting errors.
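As a rough sketch of how that counter gets fed, anything logged at the error level also records a timestamp. The logError wrapper, logger module, and errorWatcher module name below are illustrative assumptions, not our actual logging code:

import { logger } from "./logger"; // assumption: whatever logging library is in use
import { serverError } from "./errorWatcher"; // the counter from the snippet above

// hypothetical wrapper: every error-level log also bumps the error counter
// that getPastMinuteErrorCount reads from
export function logError(message: string, ...args: unknown[]) {
  serverError(message, ...args);
  logger.error(message, ...args);
}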

After tracking our errors, we check on an interval whether we've seen an excessive number of them. If we have, we use a leaky token bucket to decide whether or not we should shut down. The leaky token bucket is essential: without it, a widespread issue impacting all of our containers would cause ALL of our containers to shut down and bring the entire site down. We only want to cull a container when we're sure we're leaving enough other containers to handle the load. For us, that means we're comfortable letting up to 10 containers shut themselves down without any manual intervention. Past that point, something is going seriously wrong, and we want an engineer in the loop.

let isUp = true;
const EXCESSIVE_ERROR_COUNT = 5;
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function check () {
  if (!isUp) return;
  if (getPastMinuteErrorCount() >= EXCESSIVE_ERROR_COUNT && (await canHaveShutdownToken())) {
    isUp = false;
    return;
  }

  await delay(10_000);
  // schedule the next check; intentionally not awaited
  void check();
}

export function getIsUp () {
  return isUp;
}
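We kick the loop off once during server startup (a sketch; the real bootstrap wiring is elided):

// start the background check loop once the server has booted;
// it's intentionally not awaited and runs for the container's lifetime
void check();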

At this point, we can use getIsUp to start reporting that we're down to our load balancer and our container manager. We'll go through our regular graceful server shutdown logic, and when our container manager brings up a new container, starting from scratch should give it a good chance of avoiding whatever issue caused the problem in the first place.

// health-check endpoint polled by the load balancer: a non-200 response
// tells it (and our container manager) to stop sending traffic here
router.get("/api/haproxy", () => {
  if (getIsUp()) return 200;
  return 400;
});
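Once that endpoint starts returning 400, the load balancer drains traffic and Nomad eventually stops the container. Here's a minimal sketch of the graceful shutdown side, assuming a plain Node HTTP server and a SIGTERM from the container manager; this is illustrative, not our exact shutdown code:

// `server` is the http.Server instance created at startup
process.on("SIGTERM", () => {
  // stop accepting new connections and let in-flight requests finish
  server.close(() => process.exit(0));
  // assumption: a hard deadline in case connections never drain
  setTimeout(() => process.exit(1), 30_000).unref();
});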

We use Redis for our leaky token bucket. If something goes wrong with the connection to that Redis database, our culling algorithm won't work, and we're OK with that. We don't need our algorithm to be perfect; we just want it to be good enough to increase the chance that a container is able to recover from a problem on its own.

For our leaky token bucket, we decided to do the bare minimum: we wanted to have something simple to understand and test. For our use case, it's OK to have the leaky token bucket fully refill every ten minutes.

/**
 * returns errorWatcher:0, errorWatcher:1, ... errorWatcher:5
 * based on which ten-minute window of the hour we're in
 */
export function makeKey(now: Date) {
  const minutes = Math.floor(now.getMinutes() / 10);
  return `errorWatcher:${minutes}`;
}

const TEN_MINUTES_IN_SECONDS = 10 * 60;
const BUCKET_CAPACITY = 10;
export async function canHaveShutdownToken(now = new Date()): Promise<boolean> {
  const key = makeKey(now);
  const multi = client.multi();
  multi.incr(key);
  multi.expire(key, TEN_MINUTES_IN_SECONDS);
  try {
    const results = await multi.execAsync<[number, number]>();
    return results[0] <= BUCKET_CAPACITY;
  } catch (err) {
    // if we fail here, we want to know about it
    // but we don't want our error watcher to cause more errors
    sampleLog("errorWatcher.token_fetch_error", err);
    return false;
  }
}
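A quick sketch of how this behaves (the expected values in the comments are illustrative):

// any two times in the same ten-minute window map to the same bucket key
makeKey(new Date(2023, 4, 1, 12, 34)); // => "errorWatcher:3"
makeKey(new Date(2023, 4, 1, 12, 41)); // => "errorWatcher:4"

// the first BUCKET_CAPACITY claims in a window succeed; after that,
// containers keep running (and erroring) until an engineer steps in
for (let i = 0; i < BUCKET_CAPACITY; i++) {
  await canHaveShutdownToken(); // => true each time
}
await canHaveShutdownToken(); // => false until the window rolls over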

See "Even better rate-limiting" for a description of how to set up a leaky token bucket that incorporates data from the previous time period to avoid sharp discontinuities between time periods.
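The rough idea there is to weight the previous window's count by how much of the current window is left. Here's a sketch of that variant on top of the same Redis client; it's an approximation of the technique, not the code from that post:

export async function canHaveShutdownTokenSliding(now = new Date()): Promise<boolean> {
  const TEN_MINUTES_IN_MS = 10 * 60_000;
  // how far we are into the current ten-minute window, as a fraction of the window
  const msIntoWindow = (now.getMinutes() % 10) * 60_000 + now.getSeconds() * 1_000;
  const fractionElapsed = msIntoWindow / TEN_MINUTES_IN_MS;

  const currentKey = makeKey(now);
  const previousKey = makeKey(new Date(now.getTime() - TEN_MINUTES_IN_MS));

  const multi = client.multi();
  multi.incr(currentKey);
  // keep the counter around long enough to be read as "previous" during the next window
  multi.expire(currentKey, 2 * TEN_MINUTES_IN_SECONDS);
  multi.get(previousKey);
  const [currentCount, , previousCount] = await multi.execAsync<[number, number, string | null]>();

  // blend the two windows to avoid a sharp reset at the window boundary
  const estimatedCount = currentCount + Number(previousCount ?? 0) * (1 - fractionElapsed);
  return estimatedCount <= BUCKET_CAPACITY;
}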

Our container culling code has been running in production for several months now, and it's been working quite well! Over the past two weeks, it successfully shut down 14 containers that weren't going to be able to recover on their own and saved a few engineers from needing to intervene manually. The one drawback is that it makes it easier to ignore the underlying issues that cause containers to get into these bad states in the first place, but that's a tradeoff we're happy to make.