An Elegy for my Beloved Slow Query Logs
I have an unreasonable fondness for database slow query logs. Each new one can be a delightful little problem. With the exact slow query, you can immediately repro what went wrong with EXPLAIN ANALYZE ${QUERY}. With Rows_examined and Rows_sent, you get an instant sense of whether the query was missing an index or choosing a poor query plan.1 Building a new index or rewriting a query and then seeing a latency graph drop off a cliff is incredibly satisfying.
# Time: 2026-03-04T19:06:09.160Z
# User@Host: my_user[my_user] @ [10.1.0.0] Id: 123456
# Query_time: 1.000419 Lock_time: 0.000002 Rows_sent: 1123 Rows_examined: 219307
SET timestamp=1772651157;
SELECT * FROM...
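If you're scripting against these logs, the stats line is easy to pick apart with a quick regex. Here's a minimal sketch using only the standard library (the field names mirror the log format; none of this is from any particular tool):

```python
import re

# Matches the "# Query_time: ..." stats line that precedes each slow query.
STATS_RE = re.compile(
    r"# Query_time: (?P<query_time>[\d.]+)\s+Lock_time: (?P<lock_time>[\d.]+)"
    r"\s+Rows_sent: (?P<rows_sent>\d+)\s+Rows_examined: (?P<rows_examined>\d+)"
)

def parse_stats(line):
    """Return the numeric fields from a slow-log stats line, or None."""
    m = STATS_RE.match(line)
    if not m:
        return None
    d = m.groupdict()
    return {
        "query_time": float(d["query_time"]),
        "lock_time": float(d["lock_time"]),
        "rows_sent": int(d["rows_sent"]),
        "rows_examined": int(d["rows_examined"]),
    }

stats = parse_stats(
    "# Query_time: 1.000419 Lock_time: 0.000002 Rows_sent: 1123 Rows_examined: 219307"
)
# ~195 rows examined per row sent suggests a missing index or a bad plan.
print(round(stats["rows_examined"] / stats["rows_sent"], 2))  # 195.29
```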
Optimizing a query like the one above might be fun, but one thing that’s impossible to see is whether this query is even necessary at all. Why did the application need these particular 1,123 rows? Tracking a query back to the spot that triggered it might be next to impossible, and even if we track it down to model code, multiple routes might use that exact same code path.
The most impactful optimizations I’ve done over the years have relied on understanding why a particular query is running and finding ways to change it to a simpler one or even avoiding it entirely. Updating a client to avoid hitting an expensive route as often can have a far larger impact than building an index or three.
In an ideal world, we’d be able to tell exactly where any slow query came from, and we already have a good spot to store this information: in the query itself! When we generate the query, we can include a comment containing all of the debugging information we might want later, and that information will flow through to the slow query logs:
...
/* name:Widget.findByGroup route:/group/:groupId/widget/summary request_uuid:UUID client:iOS */ SELECT * FROM...
The specifics of how to create these query comments and what information you’ll want to include will vary based on how you’re constructing these queries and what your server looks like.2 Regardless of your setup, there should be some way to include at least the name of the query so you can tie it back to its code source. With query names and route information, we’ll have a much easier time tracing through why our application is running any particular slow query.
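As one illustration of the idea, a small helper can prepend the comment at query-construction time. The function and parameter names here are hypothetical, not any particular library's API:

```python
def with_query_comment(sql, name, route="unknown_route", request_uuid=None, client=None):
    """Prepend an origin-metadata comment so the query carries its source
    into the slow query log. Field names are illustrative, not a standard."""
    fields = {"name": name, "route": route, "request_uuid": request_uuid, "client": client}
    # Drop unset fields, and strip '*/' so values can't terminate the comment early.
    parts = [f"{k}:{str(v).replace('*/', '')}" for k, v in fields.items() if v is not None]
    return f"/* {' '.join(parts)} */ {sql}"

print(with_query_comment(
    "SELECT * FROM widgets WHERE group_id = %s",
    name="Widget.findByGroup",
    route="/group/:groupId/widget/summary",
    request_uuid="req-123",  # illustrative value
    client="iOS",
))
```

In a real server you'd pull the route and request id from request-scoped context rather than passing them by hand.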
Do your queries not have names? Script Your Code Migrations covers how to automate a migration that does some minimal file parsing to come up with good-enough query names.
But that leads to our next problem: most of the time, the reason we’re looking at slow queries is that we want to reduce load on our database. Speeding up the slowest queries might help if the database is relatively unoptimized and many queries lack appropriate indexes, but many of these slow queries might be relatively rare: exporting data, generating reports, or any other use-case where a rarely run slow query isn’t actually something we need to fix.
From a database load perspective, the most expensive queries are normally the ones that are run often and require a decent amount of work. How do we identify those queries?
In MySQL, you can use a tool that queries the performance_schema.events_statements_summary_by_digest (docs) to see database duration by its query fingerprint – a normalized version of the query without query parameters. These tools are incredibly helpful, but they have the same problem that our initial slow query logs did: they don’t give us an application-aware view of why these queries are running, and they can only group by query fingerprint rather than application-specific keys like route.
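If you want to poke at that table directly, a starting point might look like the following (column names are from the performance_schema table; the timer columns are in picoseconds):

```sql
SELECT
  DIGEST_TEXT,
  COUNT_STAR,
  SUM_TIMER_WAIT / 1e12 AS total_time_s,  -- timers are measured in picoseconds
  SUM_ROWS_EXAMINED,
  SUM_ROWS_SENT
FROM performance_schema.events_statements_summary_by_digest
ORDER BY SUM_TIMER_WAIT DESC
LIMIT 20;
```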
One powerful technique covered in High Performance MySQL (non-affiliate link) is setting the long_query_time to 0 to generate a sample of EVERY query over a brief window. If we have a 1-minute window of slow-query logs for every single query our database handled, we can use our query comments to generate a report that displays our total database load cut by whatever fields we included in our query comment.
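In MySQL that toggle is just a couple of statements; note that changing the global long_query_time only affects sessions opened after the change, so long-lived connections keep their old threshold:

```sql
-- Capture every query for a brief window, then restore the old threshold.
-- (On a busy primary this log grows fast; keep the window short.)
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 0;
-- ... wait ~60 seconds ...
SET GLOBAL long_query_time = 1;  -- or whatever your normal threshold is
```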
My favorite report cuts by route and then breaks timing down by query name:3
GET /api/groupId/:groupId/widgets/page query_time=857.69s (28.88% of total time), rows_sent=6836282, rows_examined=24070134, rows_examined/rows_sent=3.52
- Widget.findWidgetsByGroupId query_time=166.72s (19.44% of total time), rows_sent=669974, rows_examined=121498, rows_examined/rows_sent=0.18
- WidgetOwner.findByWidgets query_time=81.95s (9.55% of total time), rows_sent=794280, rows_examined=5012095, rows_examined/rows_sent=6.31
- Group.findByGroupId query_time=56.78s (6.62% of total time), rows_sent=993009, rows_examined=3968167, rows_examined/rows_sent=4.00
...
GET /api/expensiveReport query_time=331.11s (11.15% of total time), rows_sent=2475091, rows_examined=110990560, rows_examined/rows_sent=44.84
- Widgets.generateReport query_time=79.19s (23.92% of total time), rows_sent=102, rows_examined=1785683, rows_examined/rows_sent=17506.70
...
...
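Producing a report like this is mostly bookkeeping once each log entry has been parsed into its comment fields and stats. A compressed sketch of the roll-up step (entry parsing is assumed to have happened already; the full script is in the footnotes):

```python
from collections import defaultdict

def report(entries):
    """entries: iterable of dicts with route, name, and query_time keys,
    as parsed from the comment + stats lines of each slow-log entry."""
    by_route = defaultdict(float)
    by_route_and_name = defaultdict(lambda: defaultdict(float))
    total = 0.0
    for e in entries:
        by_route[e["route"]] += e["query_time"]
        by_route_and_name[e["route"]][e["name"]] += e["query_time"]
        total += e["query_time"]
    lines = []
    # Most expensive routes first, with a per-query breakdown under each.
    for route, t in sorted(by_route.items(), key=lambda kv: kv[1], reverse=True):
        lines.append(f"{route} query_time={t:.2f}s ({100 * t / total:.2f}% of total time)")
        for name, nt in sorted(by_route_and_name[route].items(), key=lambda kv: kv[1], reverse=True):
            lines.append(f"  - {name} query_time={nt:.2f}s ({100 * nt / t:.2f}% of route time)")
    return "\n".join(lines)

print(report([
    {"route": "GET /widgets", "name": "Widget.find", "query_time": 3.0},
    {"route": "GET /widgets", "name": "Group.find", "query_time": 1.0},
    {"route": "GET /report", "name": "Widgets.generateReport", "query_time": 1.0},
]))
```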
This view of what’s expensive for the database can give you a clear picture of where to focus optimization efforts compared to breaking things down solely by query fingerprint or name. This sort of report can even let you break down load by owning team, which can be crucial for avoiding a database tragedy of the commons.
Over the years, I started to take this sort of tooling for granted. Of course it was trivial to see a slow query and understand why it was run and even see the exact request that triggered it. Of course we could look at reports of database load grouped by route, team, or query name. Of course every query had a name, so we could compare charts of query performance before and after we adjusted something. I was only reminded of how valuable it was when I ran onboarding sessions for new engineers and showed them how it all worked.
Recently, I’ve been doing some work on a database hosted by PlanetScale. Overall, it’s been nice – migrations are fast and easy, their use of sparklines to create an information-dense UI is fantastic, performance is solid, and I like having the built-in VTGate proxy for connection management – but suddenly not having access to the slow query log comments has been incredibly painful, and I’ve had to spend much more time trawling through the codebase to understand why various queries are happening before optimizing them. (The team values query names enough that we’ve been talking about using unneeded optimizer hints to add them to the query fingerprints that PlanetScale uses.)
If you do have access to slow query logs that you haven’t decorated with your own metadata yet, it’s a practice I’d highly recommend! In general, the more you can augment all of your logs with the information that you’ll want when debugging something, the happier you’ll be.
As for me, I’ll be pouring out a libation of coffee to my dearly departed slow query logs and hoping that this post serves as a fitting elegy.
1.
I’m going to use MySQL’s slow query logs for the examples in this post, but things don’t vary much by database. With MySQL’s slow query log, you can also turn on log_slow_extra to get even more information about your slow queries.
I’m less familiar with Postgres, and it looks like it gives you less actionable info overall, but it does include a nice app field in addition to user.
I rag on MongoDB pretty often, but they have my favorite slow query logs:
- They’re JSON, so you don’t need to write parsing code to pull out fields
- They include a planSummary field that shows which indexes were used for the query
- In addition to millis, they include a timeReadingDiskMillis so you can tell how much disk i/o contributed to the slowness of the query. (MongoDB Performance is Disk-graceful talks about why MongoDB deployments will often need more disk i/o)
- They break down keysExamined, docsExamined, and nreturned, so you can compare index reads vs. document reads.
MongoDB delenda est ↩︎
2.
At ClassDojo, we had a QueryBuilder abstraction that we updated to include a query name whenever it was instantiated and grabbed request information from AsyncLocalStorage:
- AsyncLocalStorage makes the commons legible
- Using NodeJS’s AsyncLocalStorage to Instrument a Webserver
3.
If it’s helpful, here’s what the parsing code looks like. It’s hacky and quick, but it works well enough to be valuable.
```python
from parse import *
from collections import defaultdict
import textwrap


def main():
    queries_by_route = defaultdict(lambda: defaultdict(float))
    queries_by_route_and_name = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    query_totals = defaultdict(float)
    totals_by_team = defaultdict(lambda: defaultdict(float))
    queries_by_name = defaultdict(lambda: defaultdict(float))

    filename = "./mysql-slowquery.log"
    with open(filename, "r") as f:
        i = 0
        curr = None
        deets = None
        should_log_queries = False
        for line in f:
            line = line.strip()
            if line.startswith("# Query_time"):
                deets = parse(
                    "# Query_time: {query_time:F} Lock_time: {lock_time:F} Rows_sent: {rows_sent:d} Rows_examined: {rows_examined:d}",
                    line,
                )
                continue
            curr = line.startswith("/* name") and search(
                "/* name: {name} {method} {route} {context} {client_info} {team:w}:{product_area:w}",
                line,
            )
            if not curr:
                continue
            # once we see the first sub-1s query, we know that we've set long_query_time=0
            if deets["query_time"] < 1:
                should_log_queries = True
            if not should_log_queries:
                continue
            if i % 10000 == 0 and i:
                # this is pretty slow, so seeing how far along we've gotten is useful
                print("Done with", i)
            i += 1

            if curr["route"] != "unknown_route":
                key = curr["method"] + " " + curr["route"]
            elif curr["context"] != "regular_server":
                key = curr["context"]
            else:
                key = "pre-route middleware"

            for k, v in deets.named.items():
                queries_by_route[key][k] += float(v)
                query_totals[k] += float(v)
                queries_by_route_and_name[key][curr["name"]][k] += float(v)
                queries_by_name[curr["name"]][k] += float(v)
                totals_by_team[curr["team"]][k] += float(v)

    with open(filename + ".top_queries", "w") as top_query_file, \
         open(filename + ".top_queries_breakdown", "w") as top_query_breakdown_file, \
         open(filename + ".by_name", "w") as top_queries_by_name, \
         open(filename + ".by_team", "w") as query_load_by_team:
        rendered_queries = [render(k, queries_by_route[k], query_totals) for k in queries_by_route.keys()]
        query_deets = sorted(rendered_queries, reverse=True)
        for (t, line) in query_deets:
            top_query_file.write(line + "\n")

        rows = sorted(
            [render_with_name_breakdown(k, queries_by_route[k], queries_by_route_and_name[k], query_totals)
             for k in queries_by_route.keys()],
            reverse=True,
        )
        rows = [x[1] for x in rows]
        for line in rows:
            top_query_breakdown_file.write(line + "\n")

        team_costs = sorted(totals_by_team.items(), key=lambda x: x[1]["query_time"], reverse=True)
        for k, v in team_costs:
            query_load_by_team.write(render(k, v, query_totals)[1] + "\n")

        query_costs = sorted(queries_by_name.items(), key=lambda x: x[1]["query_time"], reverse=True)
        for k, v in query_costs:
            top_queries_by_name.write(render(k, v, query_totals)[1] + "\n")


def render(name, route_queries, query_totals):
    query_time = route_queries["query_time"]
    query_percent = 100.0 * query_time / query_totals["query_time"]
    return (query_time, f"""{name} query_time={query_time:.2f}s ({query_percent:.2f}% of total time), rows_sent={int(route_queries["rows_sent"])}, rows_examined={int(route_queries["rows_examined"])}, rows_examined/rows_sent={route_queries["rows_examined"] / (route_queries["rows_sent"] or 1):.2f}""")


def render_with_name_breakdown(name, route_queries, queries_by_name, query_totals):
    top_line = render(name, route_queries, query_totals)[1]
    lines = sorted(queries_by_name.items(), key=lambda x: x[1]["query_time"], reverse=True)
    rendered_lines = "\n".join(["- " + render(name, deets, route_queries)[1] for name, deets in lines])
    return (route_queries["query_time"], f"""{top_line}
{textwrap.indent(rendered_lines, " ")}
""")


main()
```
↩︎