Indexes are sorted arrays (sort of)

Most indexes for databases like PostgreSQL, MySQL, and MongoDB are B-trees or B+-trees. These indexes support O(log(n)) look-ups because B-trees are a self-balancing binary search tree. Sorted arrays can partially represent a binary search tree, so thinking about indexes in terms of a sorted array can help you think about performance and index optimization. There's nothing magic about indexes, and they're easier to grok if you think about them in terms of what you already know!

Let's start off with drawings that students post to a class portfolio—hopefully they're full of stick figures, dinosaurs, sparkles, and spaceships. Each of these drawings will be associated with a student and a class, and have a publication date.

type Drawing = {
    "drawingId": number, // the drawing's unique identifier within the system
    "studentId": number, // the student who made the drawing
    "classId": number, // the class to which this drawing belongs
    "publicationDate": Date, // when this was published to the student's portfolio
    "content": Blob, // however we chose to represent the drawing within our system. Maybe this is a url that links to a file on S3 instead!
}
const drawings: Drawing[] = [
    { drawingId: 0, studentId: 10, classId: 3, publicationDate: new Date("2023-01-05"), .... }, // a t-rex eating a car
    { drawingId: 1, studentId: 8, classId: 3, publicationDate: new Date("2022-12-25"), ... }, // princess unicorn trying on a sombrero
    ....,
    { drawingId: 10987, studentId: 257,...}, // a pink and purple house
    ....,
];

If we order our drawing in the array by their drawingId, it makes looking up a single drawing simple—we can do a binary search on the array to find our dinosaur-y masterwork. Similarly, we could search for ranges of drawings between a starting drawingId and ending drawingId. But if we wanted to search for all drawings for a single class, we'd need to scan the entire table—drawings.filter(({ classId }) => classId === searchClassId). Because our drawingIds are assigned sequentially, the drawings we find for the class will be sorted after we've filtered for them, but it will become an expensive query as our array fills up with artwork.[1]

Our array will eventually get too large for O(n) scans to return artwork quickly for our students. To solve this, we add an index—which is just another sorted array! This sorted array contains objects with two fields, classId and drawingId, and it's sorted by classId. When someone wants to query for drawings for a class (select * from drawings where classId = X), we now have an efficient way of returning those drawings. First, we look up the drawingIds that we want to fetch by binary-searching our sorted classIndex array. We then go back to our initial drawings array and use a binary search to find all of the drawings that our index told us about. Pretty slick! One thing to note is that we always store the primary key in any index. The SQL-equivalent of this would be create index class_index on drawing (classId)—the data-structure that that command creates includes drawingId!

type ClassIndex = Array<{
  classId: number;
  drawingId: number;
}>;

const classIndex: ClassIndex = drawings
  .map(({ classId, drawingId }) => ({ classId, drawingId }))
  .sort((a, b) => b.classId - a.classId);

function findDrawingsByClassId(classId: number) {
  const start = binarySearch(classIndex, classId); // code for a rough binary search in addendum

  const drawingIds = [];
  for (let i = start; i >= 0 && classIndex[i].classId === classId; i--) {
    drawingIds.push(classIndex[i].drawingId);
  }
  drawingIds.reverse();
  for (
    let i = start + 1;
    i < threadIndex.length && threadIndex[i].classId === classId;
    i++
  ) {
    drawingIds.push(threadIndex[i].drawingId);
  }

  return drawingIds.map((drawingId) => binarySearch(drawings, drawingId));
}

Now, you're probably looking at this and wondering why we've chosen a structure that requires an O(log(n)) binary look-up when we could instead use a dictionary and get O(1) look-ups. O(1) is faster than O(log(n))! Hash-based indexes have two major drawbacks: they don't allow range queries (where key_col between X and Y), and they don't allow querying by parts of an index (an index on (classId, studentId) will support queries with only classId and no studentId) or prefixes (where key_col LIKE 'prefix%). Most major databases do support creating hash-based indexes if you actually want to create a key-value store. (JS hash-index pseudocode shows what this might look like in JS).

Our classId-sorted array of { classId, drawingId } supports queries that filter for a particular classroom's drawings, but if we wanted to filter for a particular student's drawings so that they can show off their portfolio to their classmates, we're out of luck. If we know what classes the student is in, we could speed up our work by first filtering for those classes and then filtering by the studentId: classIds.flatMap(findDrawingsByClassId).filter((drawing) => drawing.studentId === studentId).[2] But if we want to optimize a search by studentId, we'll want to create another sorted array, this one with studentId.

Now, what if we have a query that restricts by both classId and studentId? select * from drawings where classId = X and studentId = Y. There's a common misunderstanding that we can use both the classIdIndex and studentIdIndex to respond to that query, but if you give it a try with our sorted arrays, you'll quickly see that that we can only use one of our sorted index arrays for the initial part of the lookup and then need to do a filter scan on the data that we've returned. We'll choose one of our possible query plans (probably first querying using the studentIdIndex sorted array and then filtering by classId)[3], but there's no magic that lets us use both indexes.[4]

So, how do we write a more efficient query? Say we want to look up the most recent 10 drawings for a student in a particular class. What sorted array (or index) would you build? We don't want to create an array sorted on classId, one on studentId, and one on publicationDate. Instead, we want to create an array that's sorted by classId, studentId, and publicationDate (or one sorted by studentId, classId, and publicationDate). When things are sorted that way, we can efficiently find where the drawings for our classId and studentId pair start and then move through the array backwards until we've grabbed 10 drawings.

Why is publicationDate important as the last item that we're sorting by? What happens if we don't have it? Without it, we'll need to first find all of our drawingIds that match classId and studentId, then we need to look up all of the drawings for that combo of IDS, and finally we sort that array of drawings by publicationDate before we can return our data.[5] Similarly, what happens if we sort our array by classId, publicationDate, studentId? Our query-plans can still be reasonably effective—we can start scanning back through our array index, looking for our studentId. If we put publicationDate first though, it's going to be hard to come up with an efficient query-plan.

Now, what happens with our classId, studentId, publicationDate sorted array when we have a query for the most recent 10 drawings for the class? How would we write code to use our classId, studentId, publicationDate sorted array? We might end up writing code where we look up the most recent 10 drawings for each student in the class using the classId, studentId, publicationDate sorted array, merge those lists of 10 drawings together, and then return the most recent 10 drawings from that list. If we have 20 students in the class, we might be doing 20x the amount of work!

To avoid this extra work, we might want to create yet another sorted array index on studentId, publicationDate. That solves our problem, but with all of this talk about creating sorted arrays willy-nilly, keep in mind that whenever we add another sorted array, we need to do work to update that sorted array whenever we insert or update data, and we're increasing the amount of data that we're storing. That can often be the right trade-off, but as magical as our sorted arrays are, they're not free. At this point, we've created arrays sorted by the following fields.

But if an array is sorted by ["classId", "studentId", "publicationDate"], that array can do everything that an array sorted by ["classId"] or ["classId", "studentId"] can do—those other arrays are now redundant. We only need the arrays on ["classId", "studentId", "publicationDate"] and ["studentId", "publicationDate"] to efficiently serve the queries that we've discussed.

Using sorted arrays to reason about indexes isn't perfect, particularly when we start talking about writes, but I hope it's a useful tool you can use to reason about indexes!

Addenda

Rough Binary Search

function findRange<T, K extends keyof T>(
  arr: T[],
  key: K,
  start: T[K],
  end: T[K]
) {
  const leftIndex = binarySearch(arr, key, start);
  const rightIndex = binarySearch(arr, key, end, leftIndex);
  return [leftIndex, rightIndex];
}

function binarySearch<T>(
  arr: T,
  search: T,
  left: number = 0,
  right: number = arr.length - 1
): number {
  if (search < arr[left]) return left - 1;
  if (search > arr[right]) return right + 1;

  const halfwayIndex = Math.floor((left + right) / 2);
  if (arr[halfwayIndex] === search) {
    return halfwayIndex;
  }

  const [l, r] =
    search < arr[halfwayIndex]
      ? [left, halfwayIndex - 1]
      : [halfwayIndex + 1, right];
  return binarySearch(arr, search, l, r);
}

JS hash-index pseudocode

const classHashIndex = organizeBy(
  drawings,
  ({ classId }) => classId,
  ({ drawingId }) => drawingId
);
function organizeBy<T>(
  arr: T[],
  getKey: (o: T) => string,
  getValue: (o: T) => infer R = (o) => o
) {
  const hashIndex: Record<string, R> = {};
  for (const o of arr) {
    const key = getKey(o);
    const value = getValue(o);
    hashIndex[key] ??= [];
    hashIndex[key].push(value);
  }
  return hashIndex;
}

function findDrawingsByClassId_hashedIndex(classId: number) {
  const drawingIds = classHashIndex[classId] ?? [];
  return drawingIds.map((drawingId) => binarySearch(drawings, drawingId));
}

Code to find the most recent drawings by class

// https://lodash.com/docs/4.17.15#orderBy
const classIdStudentIdPublicationDateIndex = _.orderBy(
  drawings.map(({ classId, studentId, publicationDate, drawingId }) => ({ classId, studentId, publicationDate, drawingId}),
  ["classId", "studentId", "publicationDate"],
);

function findMostRecentNDrawingsForClass(classId: number, n: number = 10) {
  // in a real tree structure, this shouldn't require scanning all for the drawings. This sorted array comparison isn't perfect!
  const left = findLeftmostIndex(classIdStudentIdPublicationDateIndex, x => x.classId, classId); // binary search to find the point where it goes from < classId to classId
  const studentIds = new Set();
  for (let i = left; i < classIdStudentIdPublicationDateIndex.length && classIdStudentIdPublicationDateIndex[i].classId === classId; i++) {
    studentIds.add(classidStudentIdPublicationDateIndex.studentId);
  }

  // we could obviously optimize this step a bit rather than doing this full in-memory sort because we know the arrays are pre-sorted
  return _.orderBy(
    Array.from(studentIds.values()).flatMap((studentId) => findMostRecentNDrawingsForClassStudent(classId, studentId, n)),
    ["publicationDate"],
  ).slice(-10);
}

// rather than looking up each student's most recent posts, we could try a different query plan where we scan through all of the posts for the class
function findMostRecentNDrawingsForClass_noStudentDrawingMerge(
  classId: number,
  n: number = 10
) {
  const priorityQueue = new PriorityQueue((p) => p.publicationDate); // https://en.wikipedia.org/wiki/Priority_queue
  const drawingsForClass = findDrawingsByClassId(classId);
  for (const {
    classId,
    studentId,
    drawingId,
    publicationDate,
  } of drawingsForClass) {
    const oldest = priorityQueue.peek();
    if (publicationDate < oldest.publicationDate) continue;
    priorityQueue.add({ drawingId, publicationDate });
    if (priorityQueue.length > n) priorityQueue.pop();
  }

  return priorityQueue
    .values()
    .reverse()
    .map(({ drawingId }) => binarySearch(drawings, drawingId));
}

  1. Scanning over the entire table works longer than you'd expect! Many workloads fit entirely in memory, so the database may not even need to go to disk. But even if we do, SSDs are pretty quick—wikipedia tells me that 5 GB/s SSDs exist, but even 0.5 GB/s might be fast enough for some uses. Databases are fast! ↩︎

  2. This is one reason why a query like select * from drawings where classId in (1, 2, 3) and studentId = 4 may be faster than querying for select * from drawings where studentId = 4. We may end up with the exact same drawings, but the extra information about which classes to look in lets can let the query-planner come up with a better plan! ↩︎

  3. A query-planner will look at heuristics and previous attempts to figure out which query plan (querying using the classIdIndex or the studentIdIndex) will be faster. It might look at a heursitic like "which index is more selective?" and choose to use the studentId index because it's probably more restrictive. We say that the studentIdIndex has a "higher cardinality." ↩︎

  4. In rare situations in some databases, I think you can actually end up using two indexes in a single look-up like this. For a large query, you might want to rule out certain disk blocks that are impossible because they're not in the intersection of disk blocks that two indexes share. This isn't something to worry about, and I've probably bungled the explanation. ↩︎

  5. In MySQL, lacking an appropriate index for sorts means that it will do a filesort. Despite the name, this doesn't mean that it's writing a file—the sort might still be in-memory. Even in-memory sorts aren't a great sign—it can mean that you're pulling a lot of documents before sorting to figure out which ones that you want to return. ↩︎