Backfills should fail in minutes, not days

I’ve waited days for a backfill script to complete, turned on dual reads to confirm that everything was working properly, and then immediately realized that I would need to make a few small fixes before re-running the whole dang thing again. And again. If iteration speed is measured in days or weeks rather than minutes, backfills and data migrations can turn into the slowest and most painful work that a team ever tackles.

And if backfills and data migrations are slow, it creates a self-reinforcing cycle: they’re slow, so the team doesn’t do them, so the team doesn’t get better at them and invest in tooling to make them easier, so the team builds a mental model where re-architecting how and where data is stored is expensive and hard.

I don’t want to say that backfills and data migrations are easy – there are a ton of potential things to get wrong! – but I do believe they can be straightforward:

Validate with dual reads and dual writes. Observability lets you avoid painful and time-consuming rollbacks.
- Validate with dual reads
- Don’t assume writes will work
If data is large, do the migration/backfill on only a subset of your data so that it’s quick to iterate.
- Export a subset of data for validation
- Take it one step at a time
Make sure you have the ability to dynamically configure dual reads, dual writes, and how quickly backfills/migrations process data. This matters a bunch for both iteration speed and avoiding performance problems.
- Be careful with performance

Learn from my suffering.

Note: for the purposes of this post, I’m treating backfills and data migrations interchangeably because most of the engineering concerns are exactly the same.

💬

Jacqueline Xu’s Online Migrations at Scale from Stripe’s blog is worth a read too!

Validate with dual reads

One simple data migration pattern looks something like this:

Use a tool (like AWS DMS) to move data from one database to another.
When the migrator is caught up, stop all writes to the old system.
Once the new database is fully caught up, point all reads to the new system.
Start writing to the new system (and hope that the writes work properly).

This pattern can be workable, particularly if the old and new databases are similar enough to make you confident that reads and writes from the new database will work. It works perfectly until it doesn’t, when the new database can’t handle the load or there’s a read mismatch, and your team needs to scramble to turn everything off again and get your code pointed back to the old database.

Aside from solid tests, one way to mitigate this risk is with dual reads. Once things have caught up and read-replication lag is low, run code that reads from both databases and compares the values to make sure that they’re exactly the same. The code to set up something like this isn’t terribly complicated, but there are a few important things to get right:

Dynamic Configuration: Having some way of turning reads off is crucial. One of the major benefits of dual reads is that it gives you load-testing on the new database for free. Ramping up slowly lets you find missing indexes and performance problems ahead of time.
Compare Everything: Aside from checking whether order, length, and values are the same, it’s also important to compare performance between the new and old systems for each query pattern that you’re hitting. When comparing the return values, think about the best way to surface the information to avoid a diff that’s hard to understand when [A, B, C] != [A, C, B] or when [A, B, C, D] != [B, C, D, E]. You’ll want to be able to categorize the diff into "length mismatch" (with extras/missing logged), "order mismatch", or "value mismatch" (where a row from one database doesn’t exactly match a row from the other). When logging out mismatches, think about the minimal-viable shape that shows the mismatch and be careful of PII.
Rate-limited Comparison Retries: There will be lag between the databases, and the queries will reach each database at different times. Under load, this will create false mismatches between the systems. Delaying and then retrying can reduce (but not eliminate) how often this happens, but these retries have the danger of multiplying the number of queries against a database and overwhelming it. I like an in-memory fixed window retry allowance; diffs from lag should hopefully be rare, so retrying a few per minute won’t meaningfully affect load.
```
let retriedComparisonCount = 0;
const ALLOWED_RETRIES_PER_MINUTE = 10;
function attemptToGetRetryToken () {
    return retriedComparisonCount++ < ALLOWED_RETRIES_PER_MINUTE;
}
setInterval(() => retriedComparisonCount = 0, ONE_MINUTE_IN_MS).unref();
```
Automatic kill-switches: It’s relatively simple to add in-memory per-container kill-switches on errors with the new system. If things are erroring with the new system, the immediate next step will always be turning reads to the new system off temporarily to debug. Before an engineer has a chance to do that, you can look at error-count from the new system to turn it off for that container automatically. (You can share state across containers to do this globally as well, but I think the benefits are marginal).
Return the old value early: Because there can be performance issues with the new database, it may make sense to return the old rows to callers before doing the comparison against the new rows. This is less important if things are well instrumented and you’re on top of turning things off.

If it’s helpful, I put some pseudo-code for what a simple dual-read system might look like in an appendix.

Doing dual-reads doesn’t make a migration or backfill risk-free, but it reduces the risk substantially.

Don’t assume writes will work

If you’re migrating from one database provider to another (say from Aurora to PlanetScale), pointing the exact same writes to the new system will likely work. But many migrations and backfills need to transform the data in some way. Perhaps you’re moving from a non-relational database to a relational one? Or doing a backfill that changes how the data is structured? In those cases, assuming that your code will create the exact same data after the switchover can be risky.

For these cases, I like to control writes to the new system so that I’m testing that my writes work properly before I do the switchover.

Start dual-writing to the new table. For updates, track times when there’s a mismatch in rows updated, but don’t do anything with those mismatch counts until the backfill is done.
Run a backfill that iterates through the old table to insert data into the new table. This backfill must handle rows that already exist in that table that have been inserted since we started the dual-writes (generally by ignoring them). If you have hard deletes, you’ll want a tombstone marker so they don’t get resurrected by the backfill script.
Once the backfill has completed, start dual-reading from both tables to validate that data matches up. It’s often useful to log when updates and deletes touch different numbers of rows in old/new tables.
Once everything matches up, point reads to the new table and stop reading from the old table. At this point, writes still need to go to the old table!
Once reads are pointing to the new table, turn off writes to the old table.

Similar to dual-reads, the ability to turn dual-writes off if something goes terribly awry is essential, and tracking the performance of the new system is important to make sure there aren’t major performance regressions.

This might seem slower than YOLOing writes over to a new system, but when the costs of messing up data in a new system can be high, moving deliberately will be faster over the long run than taking risky bets.

Export a subset of data for validation

Your migrations and backfills might go perfectly the first time.

Mine… don’t.

If you have any appreciable amount of data, that might mean that you need to restart a full migration or backfill and wait several hours (or days!) before you’re able to tell that your migration worked perfectly. I normally like to export a subset of data filtered by the main way that it will be queried so that I can do dual-read comparisons on just that subset. There might be hidden surprises when I go back to export or backfill everything, but it makes iteration/validation much faster. You can feature flag data migrations the same way you feature flag anything else!

Exporting all the data is probably the mistake that I’ve made most often. I’ll fool myself into thinking that this will finally be the time that I do it perfectly the first time, so I don’t need to worry about a partial validation step. A lot of the time, the reason I’m doing a migration or backfill like this is because the data is in a legacy system of some sort, which means there are almost always footguns and lurking data surprises.

💬

Tests that exercise the database layer are a crucial tool as well! I like to run the existing tests in two environments:

Reading/writing to the old database/table (with dual writes enabled)
Running the backfill in a beforeEach block and then turning dual reads on. My comparison code will throw in the test environment if there’s a mismatch between the results from the old database/table.

Take it one step at a time

When you’re backfilling data or moving from one database type to another, it can be tempting to jump all the way to your desired end state: you want data in the new system, in a new format, with different table relationships, so you write your code to create that desired end state. That’s a mistake. Working in this way dramatically increases the number of things that can go wrong, and it can greatly complicate debugging: Is the field that we fetched for our clients incorrect because we have the wrong data stored in the tables? Because our new queries are incorrect? Because we lost something when we did a transformation?

I think about changes like these the same way I think about refactoring code. Most of the time, it’s easier to take small, testable steps towards your destination rather than doing a full rewrite. When something goes wrong, you know it’s a problem with your current step, and you limit the pieces of the system that you need to keep in your head. Martin Fowler’s book on refactoring frames refactoring code as a series of discrete, almost mechanical steps that you can take to transform your codebase rather than a full re-write, and similar ideas apply to how we change the architecture of our systems.

Working in small steps is also more predictable and organizationally legible. If a team is tackling a migration and transformation project with an all-or-nothing approach, it can be hard to judge how far along they are because the substeps might all need to be complete to have things running in a meaningful way, and when the new system is turned on, you might discover performance problems.

In a way, this echoes my advice to only export/migrate a subset of the data first: deliver a subset of the value first so that you can tell that the process is working.

Be careful with performance

Backfills and migrations are both times when you may be writing a substantial amount of data, and it means that you can take down a database if you attempt to write as fast as you can read. Even if the primary database is able to keep up, the replicas might not be able to, especially if there are any differences in the way that the replicas apply writes. If you take the whole site offline by killing a database, that will slow down your migration.

At the same time, there’s also the danger of setting up a migration that will take weeks to complete. These operations need to make changes at a pace that’s safe for the database and that will complete in a reasonable amount of time.

A few quick pieces of advice:

Track how fast you’re going! You want to be able to estimate when your job is going to complete. If it looks like it will take 3 weeks to finish, you’ll want to rework it.
It can be helpful to set up dynamic configuration for batch size, delays between inserts/updates, and parallelization so that you don’t need to restart the job to speed it up or slow it down.
Never use offset for queries because it needs to scan through all of the results, and it will be inaccurate if inserts or deletes sneak in.
In general, iterating through by the primary key will make backfills/migrations simple. I avoid large queries that look for the precise rows to update because those large queries can themselves cause performance problems. Iterating through a table by its primary key also lends itself to easy-to-reuse code.
Track and alert on read-replica lag, and build in a mechanism to add configurable delays to slow down a migration that’s writing faster than the replicas can keep up.¹

Conclusion

In general, the "trick" for migrations and backfills is to pursue the same best-practices that make any other task fast:

Iterate quickly: export a subset of your data, set up dynamic configuration, and assume that there will be mistakes and unforeseen problems.
Create good observability: do dual reads/writes with comparisons and send clear metrics/logging during backfills.

One final piece of advice: be sure to ask why someone wants to migrate data! Sometimes the real goal will be improving performance or enabling a feature, and there might be a better way of accomplishing the same goal without shifting a ton of data around. Not doing a project is even faster than doing one well.

Appendix: dual-read pseudocode

GitHub’s scientist library is a well-known Ruby implementation of this pattern.

async function dualReadAndCompare (queryName, { 
    getOldData,
    getNewData,
    getMatcherKey,
    attemptCount,
}) {

    attemptCount ??= 0;

    if (!DualReadConfiguration.getRollout("dualReadAndCompare", queryName)) {
        return getOldData();
    }

    const start = performance.now();
    const [
        oldData,
        newData,
    ] = await Promise.all([
        getOldData()
            .then((oldData) => {
                metrics.distribution("dual_read_and_compare.duration", performance.now() - start, {
                    queryName,
                    version: "old",
                });
                return oldData;
            }),
        getNewData()
            .then((newData) => {
                metrics.distribution("dual_read_and_compare.duration", performance.now() - start, {
                    queryName,
                    version: "new",
                });
                return newData;
            });
            .catch((err) => {
                logger.serverError("dual_read_compare.new_read_error", err, queryName);
                return [];
            }),
    ]);

    await compareAndLog({ oldData, newData, getOldData, getNewData, attemptCount, queryName, getMatcherKey });
    // for the sake of example simplicity, I'm not returning the old value early
    return oldData;
}

let retriedComparisonCount = 0;
setInterval(() => retriedComparisonCount = 0, ONE_MINUTE_IN_MS).unref();
const ALLOWED_RETRIES_PER_MINUTE = 10;
async function compareAndLog({ oldData, newData, getOldData, getNewData, getMatcherKey, queryName, attemptCount }) {
    // for this pseudocode, we'll assume that oldData and newData are always arrays
    const differenceCategories = [];

    // [A, C, B] will not be the same as [A, B, C, D] in a few different ways.
    // you might want to enumerate all of those ways in your metrics/logs
    // or you might just want to choose the first one and call it good! Honestly, not too much of a difference one way or the other
    if (oldData.length !== newData.length) {
        differenceCategories.push("length");
    }

    const oldKeys = new Set(oldData.map(getMatcherKey));
    const newKeys = new Set(newData.map(getMatcherKey));
    const missingInNew = oldKeys.difference(newKeys);
    if (missingInNew.size) {...

    ...lots more comparison code

    if (differenceCategories.length && attemptCount < 1 && retriedComparisonCount < ALLOWED_RETRIES_PER_MINUTE) {
        retriedComparisonCount+=1
        await new Promise(resolve => setTimeout(resolve, 1000));
        // do the retry here
        // for a real system, I'd refactor things in here a bit! Trying to keep things simple and mostly readable
        return;
    }

    if (differenceCategories.length) {
        logger.log("dual_read_and_compare.mismatch", {
            mismatchReasons: differenceCategories,
            diff,
            exampleMismatchedRows,
        })
        metrics.increment("dual_read_and_compare.comparison", {
            exactMatch: "false",
            mismatchReasons: differenceCategories.join("-"),
            queryName,
        })
    } else {
        metrics.increment("dual_read_and_compare.comparison", {
            exactMatch: "true",
            queryName,
        });
    }
}

One potential source of read replica lag in MySQL is maintaining an index on a last_updated_at column for CDC exports because it can create write contention on the right edge of the b-tree. Sharding the index (i.e., indexing on (pk % N, last_updated_at)) may reduce performance problems associated with a monotonic leading column in an index, but I haven’t actually explored this because it was easier to just slow down exports / add read-replica-lag-aware rate-limiting rather than reworking how CDC exports work. ↩︎