Store Every Request
Storing details about every request that hits your backend is useful for debugging and analysis. Your company probably already has a system like this, but if you don’t, I hope these notes are helpful.
Say you have a site that does 10,000 requests / second and you want to store details about every single one. How much will it cost to ingest and store a year’s worth of requests?
Assuming you’re on AWS and you want to do this the plug-and-play way without building any custom infra, it’s decently expensive: I’d estimate ~$10k/month for Firehose ingestion + ~$10k/year for storage and queries.1 That’s a lot, but depending on what the 10,000 requests / second actually are, it might still seem cheap for the value.
But our system doesn’t need to be that expensive. Firehose ingestion is the most expensive part of this, so if you work around its minimum-record-size-is-5kb and we-charge-extortionate-rates-by-the-kb pricing model or self-host something like fluentbit, you can build a much cheaper system that’s closer to the $15k/year (or less) end of things.
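To see why the minimum billable record size matters so much at this scale, here’s a rough sketch. The 500-byte actual record size is an assumption for illustration, and no dollar figures are computed — check current AWS pricing for those:

```python
# Rough sketch of why a 5 KB minimum billable record size hurts at
# 10,000 requests/second. The actual record size below is an assumption
# for illustration; real request log lines vary.
REQUESTS_PER_SECOND = 10_000
ACTUAL_RECORD_BYTES = 500        # assumption: a request log line is well under 5 KB
BILLED_RECORD_BYTES = 5 * 1024   # each record is rounded up to 5 KB for billing

SECONDS_PER_MONTH = 60 * 60 * 24 * 30

records_per_month = REQUESTS_PER_SECOND * SECONDS_PER_MONTH
billed_gb = records_per_month * BILLED_RECORD_BYTES / 1024**3
actual_gb = records_per_month * ACTUAL_RECORD_BYTES / 1024**3

print(f"billed:  {billed_gb:,.0f} GB/month")   # ~123,597 GB/month
print(f"actual:  {actual_gb:,.0f} GB/month")
print(f"overbilling factor: {billed_gb / actual_gb:.1f}x")  # 10.2x
```

Paying per-GB rates on a volume that’s inflated ~10x by the minimum record size is where the “extortionate” feeling comes from, and why batching many small records into one Firehose record (or self-hosting ingestion) changes the economics so much.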
We also don’t necessarily need to store every request to build a valuable system. Perhaps it makes sense to focus on the ~1 in 10 requests that mutate data, or some other interesting subset of requests. Even limiting yourself to only 4xx and 5xx errors may still be useful… though I’d rather have everything.
Columns
I don’t want to focus on how to build the infrastructure for a system like this because it will depend so much on your scale, constraints, and expertise. Instead, I want to chat through columns that you might want to include if you do decide to build a system to store a bunch of requests, because experience has taught me a few painful lessons.
I think some columns are relatively straightforward: `http_method`, `full_url`, `user_id`, `started_at`, `client_platform`, `client_app_version`, `backend_app_version`. These give us the basics:
| | |
| --- | --- |
| Who: | `user_id`2 |
| What: | `http_method`, `full_url` |
| Where: | `client_platform`, `client_app_version`, `backend_app_version` |
| When: | `started_at` |
Let’s chat through some of the more interesting columns.
- `route`: When doing analyses, it will be more ergonomic to search for `route = '/api/thing/:thingId'` rather than `url ILIKE '/api/thing/%' AND url NOT ILIKE '/api/thing/%/%'`.
- `query_params` and `url_params`: in a similar vein to `route`, you might want to extract the query and url parameters for your route to make them easier to query. In practice, I don’t end up using these columns very often.
- `request_uuid`: Having a unique identifier per request lets you easily match up logs from your clients and your backend, and even lets you do things like storing the `request_uuid` of the request that last modified any row in any table.
- `date_hour` or `date`: If you’re storing this data in an Athena table or something similar, you’ll need a partition key. Partitioning data by `YYYY/MM/DD/HH` is a reasonable choice. Similarly, if you’re in a more traditional columnar database, you’ll likely need a sort key for your cluster like `(date, route)`. One common indexing mistake I see in both OLAP and OLTP databases is putting a too-precise timestamp in an index, which makes anything that follows that timestamp useless for sorting.
- `response_error_code`: Many routes might return a 400, 403, 404, or even a 429 for multiple reasons, so it can often be helpful to return an “error code” like `SCHEMA_ERROR` or `INVALID_EMAIL` to clients as well. Not only does this enable good error handling on the client, it also gives the backend a good way to start distinguishing the client errors it expects (like a POST to /api/login with a mistyped password) from the ones it doesn’t (like a POST to /api/login without a password field included).
- `duration_milliseconds`: If you think engineers will want to use this table to debug backend performance, storing `duration_milliseconds` might be useful. Client round-trip time would be even more useful if it’s possible to get, but that starts getting tricky.3
- `ip_address`: IP can be useful when investigating security incidents, but it immediately opens up data retention and access issues. In many jurisdictions (including the EU, UK, and Canada), an IP address by itself can be considered personally identifiable information, so you will likely need to store it in a separate table keyed by `request_uuid` and have clear lifecycle and access rules for it. (Note: as soon as you start adding extra tables like this, costs will explode a bit if you’re using Firehoses or anything similar that has a high cost per record processed.)
- `request_body`: Similar to IP, storing the body of the request might be tempting. With it, you can fully replay a day’s worth of requests, which might be super useful for debugging investigations and load testing, but this is even more fraught than `ip_address` in terms of data retention, access rules, and even storage size. I wouldn’t do it.
- `returned_ids`: If you have audit requirements about who saw what item when, storing the IDs returned by a route can be doable, assuming that routes aren’t regularly returning 1,000s of items.
- `schema_shape`: Some routes support multiple distinct schema shapes. Knowing which “kind” of request something was may be helpful.
- `user_type`: If your site supports multiple different user types, you’ll probably want to store the type of user on the request. Once encoded, this only adds a single byte per row, so saving yourself from extra joins back to your user/other tables will pay for itself.
- `domain`, `browser`, `browser_version`, and more: there are a lot more potential fields that you might want to add, but most of them will depend on what you’re planning on using a table like this for.
Compression
If we’re storing 10,000 requests per second or more, that data is going to add up pretty quickly. We’ll want to put some serious thought into how it’s encoded. Aside from reducing storage cost, encoding also affects how quick it will be to query: it takes less time to scan fewer bytes.
Thankfully, this data compresses quite well. I’m going to use Redshift encodings as an example, but the same properties mean that encodings in other systems (like Parquet) will be able to substantially compress things:
- Many fields, like `http_method`, `client`, `client_app_version`, `http_status_code`, and `route`, can likely be stored in a bytedict and only take up a single byte per row per block. (If you’re coming from an OLTP background, think of bytedicts as automatically set up `enum`s.)
- Rows will likely be sorted by time, so you can use `delta` encodings for `started_at`.
- For varchar columns like `url`, zstd encodes data per block (group of records), so its per-block compression dictionary should be quite effective. (I’d be curious to test out a column encoded with zstd vs. one encoded with a bytedict to see whether there’s any practical difference for low-cardinality fields.)

```sql
create table request (
    http_method           varchar(10)   encode bytedict,
    url                   varchar(1024) encode zstd,
    user_id               int,
    user_type             varchar(20)   encode bytedict,
    started_at            timestamp     encode delta,
    date_hour             char(13)      encode raw, -- the encoding here will be system-dependent
    client                varchar(128)  encode bytedict,
    client_app_version    varchar(128)  encode bytedict,
    backend_version       varchar(128)  encode bytedict,
    -- bytedict/zstd/raw: this one will depend on whether it's used in the
    -- table's sort key and/or how many distinct routes you have
    route                 varchar(1024) encode zstd,
    http_status_code      smallint      encode bytedict,
    response_error_code   varchar(128)  encode bytedict,
    duration_milliseconds int           encode az64 -- or mostly16
);
```
How well your data compresses is going to depend on exactly what you’re storing, but for these example columns, I’d estimate under 100 bytes per row.4 Going back to our example backend that serves 10,000 reqs/second, that gives us ~30 terabytes of data a year, but you certainly don’t need to retain this data for a year or more if it’s not worth the storage cost (and things like S3 Glacier make long-term storage pretty darn cheap).
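The ~30 TB/year figure falls out of simple arithmetic, assuming the rough ~100 compressed bytes per row estimated above:

```python
# Back-of-the-envelope yearly storage for 10,000 req/s at ~100 bytes/row
# (the rough compressed-size estimate from the encodings above).
REQUESTS_PER_SECOND = 10_000
BYTES_PER_ROW = 100  # rough compressed estimate; yours will vary
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

rows_per_year = REQUESTS_PER_SECOND * SECONDS_PER_YEAR
terabytes = rows_per_year * BYTES_PER_ROW / 1e12

print(f"{rows_per_year:,} rows/year")  # 315,360,000,000 rows/year
print(f"~{terabytes:.1f} TB/year")     # ~31.5 TB/year
```

Note how sensitive this is to the per-row estimate: a sloppy, uncompressed 1 KB per row would be ~315 TB/year instead.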
Real-world Experience
ClassDojo had this sort of request-logging system for the entire time I worked there, and it was so helpful for debugging and analyses that I can’t picture not having something like it available. Data analysts loved it because it meant that there was always a fallback option when client logging wasn’t sufficient. Security folks loved it because it let them inspect a malicious session’s activity. Customer Experience folks loved it because it was a key piece in tracing back which user made any particular change in our system. And engineers loved it because it helped to debug problems, and it could sometimes even help with things like generating the list of users affected by a particular bug so that we could run more targeted backfills.
1. This will scale based off of how many years of data you want to store and how often you’re querying it. This is super rough back-of-the-envelope math meant to give an order-of-magnitude estimate for a system like this. ↩︎
2. Caution is warranted anytime you’re storing anything that can tie back to a user. Depending on your data processing requirements, you may need to add a level of indirection here or store this in a separate table keyed by `request_uuid` with different retention and access rules. ↩︎
3. My experience has been that you can’t trust client clocks. They’re mostly accurate, but if you’re trying to rely on them, you’ll likely want some amount of client-side code to take timestamps from the server to adjust the times that the client then reports to the backend. Time deltas from the client over small timespans are generally trustworthy as well, but you may need to throw away some extreme values. There might be better ways of getting client time! I’m sure there are good libraries out there that solve this problem well. ↩︎
4. I’d guesstimate under 100 bytes per row. Assuming the average route is something like `/api/groupId/:groupId/thingId/:thingId` (under 50 characters long), if `route` + `url` is encoded with zstd, the non-parametrized parts should compress down substantially. Most of the other fields are things that can be encoded in only a byte or two with bytedicts or zstd. ↩︎