Ratcheting SQL Queries: reducing a database’s Load by 1/4th with a simple change

🐎

You have 25 horses in your stables, and you want to find the 3 fastest. You can only race 5 horses at a time. How many races do you need to run?

SQL query planners are fantastic. They’re able to turn almost any combination of JOINs, ORDER BYs, GROUP BYs, and subqueries into a plan to efficiently look up data. They create these plans based on simple heuristics like table size, index cardinality, and expected matching rows.

Most of the time, their plans work well. Occasionally, a query planner will make a poor assumption about the order that it should do joins or which index it should use for part of a query. In those cases, index hints can point the planner in a better direction.¹ In other cases, rewriting the query can give the database the clues it needs to come up with a more optimal plan. You know much more about your application’s data and the tradeoffs that you’re willing to make than the query planner does, so you’ll always be able to come up with a query plan that is as good or better than the one that the query planner comes up with.²

I was doing some database optimization work recently, and the first thing I noticed is that the most common query the database was running was to check whether a single class_post existed for a user to determine whether or not to show part of the UI:

// DANGEROUS CODE ALERT. ONLY FOR ILLUSTRATION
const results = await Promise.all(classIds.map((classId) => {
  return runQuery(`SELECT 1 FROM class_post WHERE classId = ? LIMIT 1`, [classId]);
});
const hasAnyEntries = results.some((rows) => rows.length > 0);
return hasAnyEntries;

❗

The actual code used something similar to Bluebird’s Promise.map to limit concurrency and avoid saturating the server’s database connection pool. As written, this code is pretty dangerous! Even with concurrency control, any N + 1 query pattern can easily turn into a performance problem.

The first thing you might wonder about is the Promise.all. Why wasn’t it a single WHERE classId IN (1, 2, 3, 4, 5) query? In practice, the query plan that the database came up with was far slower and caused more load than doing a query per classId.³

How can we do better than the MySQL query planner? There are four key facts we can take advantage of to reduce our load:

We only need one class’s posts to satisfy this check.
Most classes satisfy this check. We can early exit most of the time!
Newer classes are more likely than older classes to meet the match criteria because user behavior changed over time.³
We’re okay with slightly increased latency if it reduces load substantially

We simply iterate serially through our IDs and bail as soon as we find a single match.

const sortedClassIds = _.orderBy(classIds, _.identity, ["desc"]); // this orders by creation date
for (const classId of sortedClassIds) {
  const result = await runQuery(
    `SELECT 1 FROM class_post WHERE classId = ? LIMIT 1`,
    [classId],
  );
  if (result.length) {
    return true;
  }
}
return false;

This is the best case scenario for this technique. The original code was totally unoptimized and was being hit all the time. Even a concurrency-controlled Promise.any-like utility without any sorting would have had a massive impact on performance. A simple cache storing whether any classId had a post would also have worked well. You hopefully don’t have any code that is this unoptimized that is adding unnecessary database load!

I wish I’d taken a picture of the sharp downwards line on all of the database charts when this change went in. Peak load went from 17.2 active sessions (on 16 vCPUs) to 12.4 active sessions with the same amount of load. User-facing performance improved as well, despite that not being a goal of this work.

Let’s apply the same ideas to a more complicated situation. Picture if you will the world of 2006. We’re setting up a user-specific feed without any fancy machine-learning or advertisements or anything like that. Just posts from your friends in the order that they posted them in. We haven’t set up a per-user feed yet, and our query to find all posts from all of our friends is regularly taking the database offline because we didn’t expect people to have more than 30 friends. Nobody on our team had more than 3 friends, so the performance characteristics were a shock.

SELECT *
FROM friend
JOIN post ON post.posterId = friend.toFriend
WHERE fromFriend = ?
ORDER BY post.createdAt DESC
LIMIT 30

💡

As written, this query is likely to scan all posts for all of your friends rather than just the 20 most recent posts for each of your friends. The standard technique to avoid scanning every post for every friend is to rewrite it to be a series of UNIONed queries that are then sorted.

We can apply the same ideas to ratchet down the set of posts we need to look at and come up with a smarter way to get the same data. Imagine that we know the most recent time that any of your friends posted: if someone hasn’t posted in 2 years and we already found the 20 posts we want from the past week, we don’t need to check that friend’s post.

const sortedFriends = _.orderBy(friends, (friend) => friend.lastPosted, [
  "desc",
]);
const LIMIT = 20;
let results = [];
for (const friend of sortedFriends) {
  // any post that's after the current friend's last post will always be returned to a user
  // if we were streaming responses to clients, we could already start to return these results
  const winningPosts = results.filter(
    (post) => post.postedAt > friend.lastPosted,
  );

  // If we have a full array of posts, we can already return
  // alternative: if (results.at(-1).postedAt > friend.lastPosted && results.length === 20)
  if (winningPosts.length === LIMIT) {
    return results;
  }

  let after = new Date(0); // time started in 1970

  // if we have 20 posts, any new post we find must beat the last post's postedAt to be included in our final results
  if (results.length === LIMIT) {
    after = new Date(results.at(-1).postedAt);
  }

  // if we have 11 "winning" posts that we know can't be beaten by this friend
  // we only need to look at their most recent 9 posts
  const ratchetedLimit = LIMIT - winningPosts.length;

  const posts = await Posts.findByPoster(friend.friendId, {
    limit: ratchedLimit,
    after,
  });
  results = merge(results, posts, (post) => post.postedAt); // these arrays are already sorted, so no need to sort them
}
return results;

This is a relatively unsophisticated implementation. We will almost certainly want to parallelize some of these lookups, and these queries will still be expensive for us even with this ratcheting behavior. Without the ratchet, the database needs to look up 50 friends * 20 posts. With the ratchet, it might look up 20 posts from your first friend, 18 from the second, 14 for the third, and not need to do lookups for the last 25 of your friends. This is still an N + 1 query problem, but it’s a less painful one.

I’ve tried this sort of ratcheting query pattern a few times, and it’s only been worth it some of the time. The increased query efficiency for the database may be offset by the extra round trips, so depending on the actual characteristics of your data and the tradeoffs you want to make, this might make your performance worse for little to no database benefit.

Or it might reduce your database’s total load by a 1/4th after a simple coding change.

If you’re running on Postgres, pg_hint_plan looks like it can extend Postgres with query hint support. Most of my experience has been with MySQL and MongoDB which don’t need query plans very often, but when I’ve needed them, they’ve been invaluable. MySQL will sometimes choose pretty degenerate query plans, and the ability to give hints to the planner has made queries 1,000x more performant. ↩︎
Before the era of modern SQL and NoSQL databases, software engineers would need to explicitly script these lookups. Things are far easier nowadays.

On the other schema, I do wish databases made it possible to write query plans for the very small list of queries that would benefit from precise instructions and slightly more complex logic. The query planner is good enough 99% of the time, but I’d like more control for that other 1% because I might want to make different performance tradeoffs. ↩︎

From testing locally, it looks like two items in a WHERE x in (1, 2) clause will cause the query planner to use an (x = 1) OR (x = 2) query plan, but three items will change to a less efficient scanning plan.

explain analyze select * from widgets where gid in (1) limit 1;
-> Limit: 1 row(s)  (cost=3588 rows=1) (actual time=1.03..1.03 rows=1 loops=1)
  -> Covering index lookup on widgets using gid_id (gid=1)  (cost=3588 rows=35740) (actual time=1..1 rows=1 loops=1)

mysql> explain analyze select * from widgets where gid in (1, 2) limit 1;
-> Limit: 1 row(s)  (cost=13870 rows=1) (actual time=1.15..1.15 rows=1 loops=1)
  -> Filter: (widgets.gid in (1,2))  (cost=13870 rows=69214) (actual time=1.14..1.14 rows=1 loops=1)
      -> Covering index range scan on widgets using gid_id over (gid = 1) OR (gid = 2)  (cost=13870 rows=69214) (actual time=1.12..1.12 rows=1 loops=1)

mysql> explain analyze select * from widgets where gid in (1, 2, 3) limit 1;
-> Limit: 1 row(s)  (cost=10103 rows=1) (actual time=7.85..7.85 rows=1 loops=1)
  -> Filter: (widgets.gid in (1,2,3))  (cost=10103 rows=100464) (actual time=7.85..7.85 rows=1 loops=1)
      -> Covering index scan on widgets using gid_id  (cost=10103 rows=100464) (actual time=1.55..5.95 rows=20001 loops=1)

I’m sure the actual cutoff varies! But I was surprised to see MySQL’s cost-estimates were far enough off that it chose to scan 20,001 rows rather than just grabbing the single one I needed.

My local tests are super unrealistic, so make sure you’re testing on your actual data! ↩︎ ↩︎