Using NodeJS's AsyncLocalStorage to Instrument a Webserver

AsyncLocalStorage Makes the Commons Legible talks a bit about how important this style of instrumentation is in avoiding the performance tragedy of the commons.

In the dark times before AsyncLocalStorage, it could be hard to tell why a request would occasionally time out. Were there multiple relatively slow queries somewhere in that route's code? Was another request on the same container saturating a database connection pool? Did another request block the event loop? It was possible to use tracing, profiling, and logs to track down problems like these, but it could be tricky; setting up per route metrics using AsyncLocalStorage makes it a ton easier!

When ClassDojo set up our AsyncLocalStorage-based per-route instrumentation we found things like:

I wrote about this a bit more in AsyncLocalStorage Makes the Commons Legible. If you're not familiar with ClassDojo, it's a parent-teacher communication platform. Our monolithic NodeJS web-server backend API normally serves ~10,000 requests per second.

I'd like to go through some of the details of how we set up this per-request instrumentation. For this post, we'll be starting with a relatively standard NodeJS web-server with pre-router middleware, and a router that finds an appropriate route to handle a request. It should look something like this:

app.use(({ req, res }, next) => {
  const start = Date.now();
  onFinished(res, () => afterResponse(req, res, start));
  next();
});
app.use(rateLimitingMiddleware);
app.use(bodyParserMiddleware);
app.use(allOfTheRestOfOurMiddlware);

app.use(router);
app.use(notFoundMiddleware);

To add instrumentation to this setup, we'll want to do the following:

  1. Create a per-request async store in our first middleware
  2. Store details about the database request caused by our request in our request's async store
  3. Send the request's database request details to our data lake.
  4. If any of the database request details violate our per-request limits, we log it as a server-error so that a team can see it & take action

Starting our pre-request async store

In a NodeJS web server, each middleware calls the next, so if we start an async local storage context in our very first middleware, every subsequent middleware should have access to the same storage context. (I had a lot of trouble understanding why this worked, so I wrote up a simplified gist that hopefully demonstrates what's going on.)

import { AsyncLocalStorage } from "async_hooks";

export const requestTrackingAsyncLocalStore = new AsyncLocalStorage();

// requestTrackingAsyncLocalStoreMiddleware wraps the downstream koa middlewares inside an async local storage context
export function requestTrackingAsyncLocalStoreMiddleware({ req, res }, next) {
  const store = {
    requestCost,
    req,
    res,
  };
  // running the next middleware in the chain in the context of this 'run' makes sure that all calls
  // to getStore() in the scope of this requests are bound to the correct store instance
  return requestTrackingAsyncLocalStore.run(store, next);
}

// add this to the router! (this would be in a different file)
app.use(requestTrackingAsyncLocalStoreMiddleware);
app.use(rateLimitingMiddleware);
app.use(....);

Store details about request about behavior in our pre-request async store

Now that we have a pre-request async local store, we can grab it and start using it! We'll want to learn:

  1. How many database requests do we make over the course of an HTTP request? Are we running into the N+1 query problem on any of our routes?
  2. How long do those database requests take in total? Requests that take a long time can indicate spots where we're doing a lot of expensive work.
  3. How many documents are these requests returning? If we're processing 10,000s of documents in NodeJS, that can slow down a server quite a bit, and we may want to move that work to our database instead.
export function increment(type: "request_count" | "duration" | "document_count", table: string, n: number = 1) {
  const store = requestTrackingAsyncLocalStore.getStore();
   // we'll probably want to track this to see if we're losing async context over the course of a request
  if (!store) return;
  _.set(store, ["requestCost", type], _.get(store, ["requestCost", type], 0) + n);
  _.set(store, ["requestCost", "byTable", table, type], _.get(store, ["requestCost", "byTable", table,], 0) + n);
}

If we add code that wraps our database client's request, it should hopefully be easy to add these increment calls at an appropriate point.

Handle the request report

Once we have this request report, we can do whatever we'd like with it! At ClassDojo, we log a server-error whenever a route is doing anything particularly egregious: that way, we get quick feedback when we've made a mistake. We also use a firehose to send this data to redshift (our data lake) so that we can easily query it. Either way, this is something that we can do after we're done sending our response to the client:

app.use(requestTrackingAsyncLocalStoreMiddleware);
app.use(({ req, res }, next) => {
  // this use of new AsyncResource will preserve the async context
  res.on("finished", new AsyncResource("requestTrackingLogging").bind(() => {
      const store = requestTrackingAsyncLocalStore.getStore();
      if (!store) throw new Error(`Something has gone awry with our async tracking!`);
      if (isEgregiouslyBad(store.requestCost)) logOutBadRequest(store);
      requestCostFirehose.write(store);
  }))
  next();
});

Tracking down places where we lose async context

While the async store might feel like magic, it's not, and sommon common situations will cause you to lose async context:

  1. using callbacks rather than promises. In those situations, you'll need to create an AsyncResource to bind the current async context
setTimeout(new AsyncResource("timeout").bind(() => doRequestTrackingThings()), 1);
redisClient.get("key", new AsyncResource("timeout").bind(() => doRequestTrackingThings()))
  1. Some promise libraries might not support async-hooks. Bluebird does, but requires setting asyncHooks to true: Bluebird.config({ asyncHooks: true });.

It may take a bit of work to track down and fix all of the places where you're losing async context. Setting up your increment calls to log out details about those situations can help!

export function increment(type: "request_count" | "duration" | "document_count", table: string, n: number = 1) {
  const store = requestTrackingAsyncLocalStore.getStore();
  if (!store) {
    logServerError(`We lack async context for a call to increment ${type} ${table} by ${n}`, new Error().stack);
    return;
  }
  ...
}

Increased Observability is great!

Putting effort into increasing the observability of a system can make that system much easier to manage. For a NodeJS web-server, we've found a lot of benefits in using AsyncLocalStorage to improve per-request visibility: it has let us improve latency on a few routes, reduced our event-loop blocking, and given us a better view of opportunities to improve performance.