A one-table SQL query will often use a single index

Say you have a query like SELECT * FROM widgets WHERE product_line = X. How many indexes will it use? If we have an index on (product_line), it will use that index. Easy-peasy.

But what if our query changes to SELECT * FROM widgets WHERE product_line = X and price = Y. If we have two indexes, one on (product_line) and one on (price), which will it use? Both?

One of the indexing mistakes that I see folks make most often is thinking that the query will always be able to efficiently combine both indexes, regardless of the selectivity of each. A lot of the time, a better query plan will be to use the more selective index (let's assume it's product_line), and then filter the widgets for ones that match the price:

Use the (product_line) index to look up the set of primary keys (widgetId) that match product_line = X.
Use the widgetIds to look up the widgets.
Filter those widgets to only include those where price = Y.

What would it look like if the query planner did decide to use both indexes? An index is a sorted array that lets you look up the primary key, so it might look something like this:

Use the (product_line) index to look up the set of widgetIds that match product_line = X.
Use the (price) index to look up the set of widgetIds that match price = Y.
Take the intersection of these two sets of widgetIds.
Use the widgetIds to look up the widget.*

In MySQL, combining indexes like this is called index merge optimization. PostgresSQL calls it Combining Multiple Indexes. In MongoDB, it's called index intersection.

The query planner will sometimes decide to combine indexes like this to make queries more performant, but only if it detects that both indexes have high enough cardinality (selectivity) that it will make the query faster. Even then, a compound index will often perform better than using two indexes like this, and from a mental model perspective, if you're not regularly doing database work, I think it's best to think about queries as only ever being satisfiable by one index.

If we think about our query (SELECT * FROM widgets WHERE product_line = X and price = Y) as needing only a single index, what index would we want to build? Either (product_line, price) or (price, product_line) depending on the cardinality of product_line and price. With an index on (product_line, price), here's what things look like:

Use the (product_line, price) index to look up the widgetIds that match product_line = X and price = Y.
Use the widgetIds to look up the widgets.

I hope all of this at least sounds reasonable, but sounding reasonable isn't good enough. Let's set up our widget table, insert some example data, and then play with our indexes a little bit. I'm going to use MySQL for this because it's what I'm most familiar with, but the behavior should be similar across any flavor of OLTP database.

First off, let's set up some example data:

drop table if exists widgets;
CREATE TABLE widgets (
  widgetId BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  price DECIMAL(10,2) NOT NULL,
  product_line VARCHAR(100) NOT NULL,
  details TEXT,
  INDEX idx_product_line (product_line),
  INDEX idx_price (price)
);
insert into widgets (price, product_line, details)
select
    num % 5 as price,
    cast(num % 196 as char) as product_line,
    cast(num as char) as details
from numbers -- see "setting up a view with numbers"
where num between 0 and 99999;
analyze table widgets; -- this should happen automatically

At this point, we have a table of widgets, an index on (price) and an index on (product_line). Let's see what our initial query does. After running explain analyze select * from widgets where price = 2 and product_line = "3";, we'll get the following output:

| -> Filter: ((widgets.product_line = '3') and (widgets.price = 2.00))  (cost=75.5 rows=167) (actual time=0.888..3.66 rows=102 loops=1)
    -> Intersect rows sorted by row ID  (cost=75.5 rows=168) (actual time=0.883..3.63 rows=102 loops=1)
        -> Index range scan on widgets using idx_product_line over (product_line = '3')  (cost=6.62 rows=511) (actual time=0.0376..0.117 rows=486 loops=1)
        -> Index range scan on widgets using idx_price over (price = 2.00)  (cost=13.3 rows=32850) (actual time=0.00633..2.76 rows=20000 loops=1)

Not used to reading these query plans? Start from the bottom. The first (cost=13.3 rows=32850) is the query planner's estimate for the cost and the number of rows it will find. The second (actual time=0.00633..2.76 rows=20000 loops=1)is what actually happened when we ran the query. It fetched the first row after 0.00633ms and the last after 2.76ms. Parent stages take children as input.

This is using the index intersection optimization, and it's a great example of why that's not always a good idea. Take a look at that rows=20000; it's looking up 20,000 widgetIds from the idx_price before intersecting them with the 486 rows it looked up from idx_product_line. This isn't a great query plan!

Let's see what happens when we give it a hint—explain analyze select * from widgets IGNORE INDEX (idx_price) where price = 2 and product_line = "3":^[1]

| -> Filter: (widgets.price = 2.00)  (cost=133 rows=51.1) (actual time=0.196..0.71 rows=102 loops=1)
    -> Index lookup on widgets using idx_product_line (product_line='3')  (cost=133 rows=511) (actual time=0.0244..0.672 rows=511 loops=1)

This query is faster—it starts returning in 0.196ms and finishes returning in 0.71ms, while the other query takes 3.5ms to finish returning. By telling MySQL to ignore the index, we guided it to choose a query plan where it looks up 520 widgetIds from idx_product_line and then looks up the widgets to see which ones have the price that we want.

But what happens if we give it a an index on (product_line, price)? In theory, this should let it do its filtering on the index before looking up a single widget.

> alter table widgets add index idx_product_line_price (product_line, price);
> explain analyze select * from widgets where price = 2 and product_line = "3";

| -> Index lookup on widgets using idx_product_line_price (product_line='3', price=2.00)  (cost=35 rows=100) (actual time=0.0184..0.102 rows=100 loops=1)

With the index, we need to look up ~1/20th of the rows that the initial index merge optimization query does, which ends up being a lot faster. ou can see that this last query is doing the least amount of work: it just looks up the 100 rows that match our query predicate and returns them to us.

Running these simple queries a few times is the furthest thing in the world from a good performance test, and our little query on widgets isn't a particularly representative of a production workload. It's not doing any joins, any sorts, and its data isn't realistic. But I think it's good enough to get a feel for how a query using multiple indexes actually behaves in practice, and why I tend to guide people away from relying on index-merge optimization to make their queries perform well.

I'm sure there are database loads where it makes sense to rely on index merges! I can imagine that they'd perform well in situations where you have multiple reasonably selective indexes that might be combined and sorted in arbitrary ways—e.g., for querying a store—but the database loads that I've worked with haven't benefitted from this optimization, and the idea that a query can use multiple indexes in a lookup has occasionally led folks astray.

Addendum: setting up a view with numbers

FizzBuzz in MySQL 5.7 has a few notes on how this and similar techniques to generate lists of numbers in SQL work.

create or replace view digits as
    select 0 as d
    union select 1
    union select 2
    union select 3
    union select 4
    union select 5
    union select 6
    union select 7
    union select 8
    union select 9;
create or replace view numbers as
    select
        ten_thou * 10000 + thou * 1000 + hun * 100 + ten * 10 + one as num
    from (
        select ten_thou.d as ten_thou, thou.d as thou, hun.d as hun, ten.d as ten, one.d as one
        FROM
            digits as ten_thou,
            digits as thou,
            digits as hun,
            digits as ten,
            digits as one
    ) nums
    ORDER BY 1 ASC;

In MySQL, the index merge optimization is a switchable optimization that you can turn off: docs. Rather than ignoring an index for a query, you can tell the query planner to instead ignore that query optimization for the query. ↩︎

A one-table SQL query will often use a single index

Addendum: setting up a view with numbers

Do you like fantasy?