Store Every Request

Storing details about every request that hits your backend is useful for debugging and analysis. Your company probably already has a system like this, but if you don’t, I hope these notes are helpful.

Say you have a site that does 10,000 requests / second and you want to store details about every single one. How much will it cost to ingest and store a year’s worth of requests?

Assuming you’re on AWS and you want to do this the plug-and-play way without building any custom infra, it’s decently expensive: I’d estimate ~$10k/month for Firehose ingestion + ~$10k/year for storage and queries.1 That’s a lot, but depending on what the 10,000 requests / second actually are, it might still seem cheap for the value.

But our system doesn’t need to be that expensive. Firehose ingestion is the most expensive part of this, so if you work around its minimum-record-size-is-5kb and we-charge-extortionate-rates-by-the-kb pricing model or self-host something like fluentbit, you can build a much cheaper system that’s closer to the $15k/year (or less) end of things.
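To see why that minimum record size dominates the bill, here’s a rough sketch of the billed-vs-sent amplification, assuming ~100-byte rows (the estimate from later in the post) and records rounded up to a 5KB increment:

```python
import math

RECORD_SIZE_BYTES = 100       # rough per-row estimate from later in the post
BILLING_INCREMENT = 5 * 1024  # assumption: each record rounds up to the next 5KB

# Bytes you actually send vs. bytes you're billed for, per record.
billed = math.ceil(RECORD_SIZE_BYTES / BILLING_INCREMENT) * BILLING_INCREMENT
amplification = billed / RECORD_SIZE_BYTES

print(f"billed {billed} bytes for {RECORD_SIZE_BYTES} bytes sent "
      f"(~{amplification:.0f}x amplification)")
```

With tiny rows, you’re paying for roughly 50x the data you actually send, which is why batching records together (or self-hosting ingestion) changes the economics so much.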

We also don’t necessarily need to store every request to build a valuable system. Perhaps it makes sense to focus on the ~1 in 10 requests that mutate data, or some other interesting subset of requests. Even limiting yourself to only 4xx and 5xx errors may still be useful, though I’d rather have everything.
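A subsetting filter like that is a one-liner in your logging middleware. This sketch uses a hypothetical (method, status) shape; adapt it to whatever your framework exposes:

```python
# Keep only "interesting" requests (mutations and errors) before logging.
MUTATING_METHODS = {"POST", "PUT", "PATCH", "DELETE"}

def should_log(method: str, status_code: int) -> bool:
    # Log anything that changed data, plus anything that failed.
    return method in MUTATING_METHODS or status_code >= 400

requests = [
    ("GET", 200), ("POST", 201), ("GET", 404), ("DELETE", 204), ("GET", 200),
]
kept = [r for r in requests if should_log(*r)]
print(kept)
```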

Columns

I don’t want to focus on how to build the infrastructure for a system like this because it will depend so much on your scale, constraints, and expertise. Instead, I want to chat through columns that you might want to include if you do decide to build a system to store a bunch of requests because experience has taught me a few painful lessons.

I think some columns are relatively straightforward: http_method, full_url, user_id, started_at, client_platform, client_app_version, backend_app_version. These give us the basics:

Who: user_id2
What: http_method, full_url
Where: client_platform, client_app_version, backend_app_version
When: started_at
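As a sketch, those basics could look something like the row type below. The request_uuid field comes from the footnotes; the exact types are assumptions, so pick whatever fits your warehouse:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# One row per request, using the columns discussed above.
@dataclass
class RequestLogRow:
    request_uuid: str
    http_method: str
    full_url: str
    user_id: str
    started_at: datetime
    client_platform: str
    client_app_version: str
    backend_app_version: str

# Hypothetical example row.
row = RequestLogRow(
    request_uuid="hypothetical-uuid",
    http_method="POST",
    full_url="/api/group/42/thing/7",
    user_id="user-123",
    started_at=datetime.now(timezone.utc),
    client_platform="ios",
    client_app_version="8.2.0",
    backend_app_version="2024-05-01",
)
```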

Let’s chat through some of the more interesting columns.

Compression

If we’re storing 10,000 requests per second or more, that data is going to add up pretty quickly. We’ll want to put some serious thought into how it’s encoded. Aside from reducing storage cost, encoding also affects how quick it will be to query: it takes less time to scan fewer bytes.

Thankfully, this data compresses quite well. I’m going to use Redshift encodings as an example, but the same properties mean that encodings in other systems (like Parquet) will be able to substantially compress things.

How well your data compresses is going to depend on exactly what you’re storing, but for these example columns, I’d estimate under 100 bytes per row.4 Going back to our example backend that serves 10,000 reqs/second gives us ~30 terabytes of data a year, but you certainly don’t need to retain this data for a year or more if it’s not worth the storage cost (and things like S3 Glacier make long-term storage pretty darn cheap).
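The ~30 TB/year figure falls straight out of the per-row estimate:

```python
BYTES_PER_ROW = 100          # upper-bound estimate from above
REQUESTS_PER_SECOND = 10_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

total_bytes = BYTES_PER_ROW * REQUESTS_PER_SECOND * SECONDS_PER_YEAR
terabytes = total_bytes / 1e12
print(f"~{terabytes:.1f} TB per year")  # ~31.5 TB
```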

Real-world Experience

ClassDojo had this sort of request-logging system for the entire time I worked there, and it was so helpful for debugging and analyses that I can’t picture not having something like it available. Data analysts loved it because it meant that there was always a fallback option when client logging wasn’t sufficient. Security folks loved it because it let them inspect a malicious session’s activity. Customer Experience folks loved it because it was a key piece in tracing back which user made any particular change in our system. And engineers loved it because it helped to debug problems, and it could sometimes even help with things like generating the list of users affected by a particular bug so that we could run more targeted backfills.


  1. This will scale based off of how many years of data you want to store and how often you’re querying it. This is super rough back-of-the-envelope math meant to give an order-of-magnitude estimate for a system like this. ↩︎

  2. Caution is warranted anytime you’re storing anything that can tie back to a user. Depending on your data processing requirements, you may need to add a level of indirection here or store this in a separate table keyed by request_uuid with different retention and access rules. ↩︎

  3. My experience has been that you can’t trust client clocks. They’re mostly accurate, but if you’re trying to rely on them, you’ll likely want some amount of client-side code to take timestamps from the server to adjust the times that the client then reports to the backend. Time deltas from the client over small timespans are generally trustworthy as well, but you may need to throw away some extreme values.

    There might be better ways of getting client time! I’m sure there are good libraries out there that solve this problem well. ↩︎

  4. I’d guesstimate under 100 bytes per row. Assuming the average route is something like /api/groupId/:groupId/thingId/:thingId (under 50 characters long), if route + url is encoded with zstd, the non-parameterized parts of it should compress down substantially. Most of the other fields are things that can be encoded in only a byte or two with bytedicts or zstd. ↩︎