EXPLAIN in SQL: how to read a query plan
Contents:
Why EXPLAIN is the single best debugging tool
A PM at Stripe pings you at 4:55 PM. The cohort retention dashboard you shipped last week is timing out, the board meeting is tomorrow, and "can you just make it faster" is the entire ask. You open the query: 90 lines, four CTEs, a join across a billion-row events table. Where do you start?
You start with EXPLAIN. The query plan is the database's own answer to "what are you actually going to do when I press run?" Every analyst eventually hits the wall where writing more SQL is not the answer — the answer is reading what the optimizer chose and noticing where it went wrong. Airbnb, Snowflake, and Databricks expect mid-level analysts to do this fluently, and the topic shows up in interviews dressed as "your query is slow, walk me through your debugging".
EXPLAIN output looks intimidating for about a day, then becomes routine. There are roughly four scan types, three join algorithms, and a handful of failure modes you will see over and over. Once you can map the tree of nodes to what the database is physically doing, fixing performance becomes mechanical: add a covering index, refresh statistics, rewrite a correlated subquery, raise work_mem.
EXPLAIN vs EXPLAIN ANALYZE
EXPLAIN prints the plan the optimizer has chosen. It does not execute the query. The numbers it prints are estimates derived from table statistics — row counts, distinct values, histograms.
EXPLAIN
SELECT * FROM orders WHERE user_id = 42;EXPLAIN ANALYZE actually runs the query and prints both the estimates and the real measurements: actual time per node, actual rows, number of loops. Reach for this when you need to know whether the optimizer's guess matched reality.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE user_id = 42;One safety note: EXPLAIN ANALYZE executes the statement for real. If the statement is a DELETE or UPDATE, the rows will be changed. The standard workaround is to wrap it in an explicit transaction and roll back.
BEGIN;
EXPLAIN ANALYZE DELETE FROM orders WHERE id = 1;
ROLLBACK;For SELECT-only analytics the wrapper is not required, but the habit is worth building.
Reading the output, line by line
Here is a minimal Seq Scan plan from EXPLAIN ANALYZE:
Seq Scan on orders (cost=0.00..1250.00 rows=50 width=72)
(actual time=0.015..12.340 rows=48 loops=1)
Filter: (user_id = 42)
Rows Removed by Filter: 99952The node label Seq Scan on orders is the physical operation — the database is reading the table top to bottom rather than using any index. The pair cost=0.00..1250.00 is the optimizer's estimate in arbitrary units: startup cost first, total cost second. These units are not seconds — they only compare alternatives for the same query.
The rows=50 value is the optimizer's estimate of how many rows this node will produce. Compare it with actual ... rows=48, the truth measured at runtime. A small gap is fine; a 1,000x gap is a red flag that statistics are stale or the data is skewed. The actual time=0.015..12.340 value is real milliseconds: time to first row, then time to last row, averaged per loop.
Filter: (user_id = 42) shows the predicate applied at this node. Rows Removed by Filter: 99952 is the number of rows read and thrown away. When that number dwarfs what the node returns, you are paying for I/O you do not need — the signal an index would help.
Plans are trees. Indentation marks children. Children execute first, then hand their rows to their parent. Read from the most-indented node upward.
Scan types you will actually see
There are four scans that cover 95% of analytics workloads.
Seq Scan reads the table from start to finish. It is fast when the table is small, when the predicate matches a large fraction of rows, or when no usable index exists. Beginners assume Seq Scan is always bad — on a 100-row dimension table, Seq Scan beats Index Scan because the index lookup adds overhead that exceeds the cost of reading the whole table.
Index Scan uses an index to find candidate rows, then visits the table heap to fetch each match. It wins when the predicate is selective — pulling 1% or fewer of the table. Index Only Scan is the same except every column the query needs is already in the index, so the heap visit is skipped. This is what covering indexes give you.
Bitmap Index Scan and Bitmap Heap Scan come as a pair: walk one or more indexes, build a bitmap of matching pages, then read each marked page once. This pattern wins for OR conditions, moderate-selectivity range scans, and queries with multiple indexed predicates.
When a Seq Scan sits under a node that produces only a small fraction of the table's rows, that is your first place to look for a missing index.
Join algorithms in the plan
Nested Loop does the simplest thing: for each row from the outer input, scan the inner input for matches. Cost is roughly outer_rows × inner_cost_per_lookup. It wins when the outer side is small and the inner side has an index on the join key. It is catastrophic when both sides are large and the inner side has no index — an O(n × m) disaster.
Hash Join builds an in-memory hash table from the smaller input on the join key, then streams the larger input through and probes the hash. It is the workhorse for analytics: large tables, equality joins, no indexes required. The hash table needs to fit in work_mem; if it does not, Postgres spills batches to disk.
Merge Join requires both inputs to be sorted on the join keys, then walks the two streams in parallel. The optimizer picks it when the inputs are already sorted (often because of an index that produces them in order) and volumes are large enough that streaming beats hashing.
In day-to-day analytics, Hash Join and Nested Loop cover almost everything.
A worked example: from 2,500 cost to 15
A common query counts events per day for a single user:
EXPLAIN ANALYZE
SELECT
DATE_TRUNC('day', created_at) AS dt,
COUNT(*) AS answers
FROM training_events
WHERE user_id = 42
AND created_at >= '2025-01-01'
GROUP BY dt
ORDER BY dt;With no relevant index:
Sort (cost=2500.00..2500.05 rows=20 width=16)
-> HashAggregate (cost=2490.00..2495.00 rows=20 width=16)
-> Seq Scan on training_events (cost=0.00..2480.00 rows=50 width=8)
Filter: (user_id = 42 AND created_at >= '2025-01-01')
Rows Removed by Filter: 1999950The Seq Scan read 2 million rows and threw away 99.99%. The fix is a composite index that supports both predicates:
CREATE INDEX idx_events_user_date
ON training_events (user_id, created_at);Re-run EXPLAIN ANALYZE:
Sort (cost=15.20..15.25 rows=20 width=16)
-> HashAggregate (cost=14.00..14.50 rows=20 width=16)
-> Index Scan using idx_events_user_date on training_events
(cost=0.43..13.50 rows=48 width=8)
Index Cond: (user_id = 42 AND created_at >= '2025-01-01')Cost fell from roughly 2,500 to 15. The Seq Scan is gone, and the predicates moved from a Filter (applied after reading) to an Index Cond (applied during index traversal). This is the exact pattern interviewers want to see when they ask "your dashboard is slow, what do you do".
Common pitfalls
When estimated rows diverge sharply from actual rows, the plan downstream of that node is suspect. The optimizer chooses join algorithms and sort strategies from those estimates, so a 100x miss can flip a Hash Join into a Nested Loop and turn a 200ms query into a 200-second one. The fix is almost always ANALYZE table_name; to refresh statistics, or CREATE STATISTICS on skewed columns to capture cross-column correlations.
A Seq Scan on a large fact table with high Rows Removed by Filter is a textbook missing-index signal. A subtler version is a predicate the index cannot use: WHERE DATE_TRUNC('day', created_at) = '2025-01-01' will not use a plain (created_at) index because the function wraps the column. Rewrite as a range — created_at >= '2025-01-01' AND created_at < '2025-01-02' — or build a functional index.
Sort nodes that spill to disk are easy to miss. The tell is a line saying Sort Method: external merge Disk: NkB. For an interactive query this kills latency. The fixes, cheapest first, are to add an index that already returns rows in the required order, raise work_mem for the session, or push the sort upstream so fewer rows reach it.
A Nested Loop where the inner side is a Seq Scan over millions of rows is one of the worst plans you can see, and it almost always means the optimizer was wrong about outer cardinality. Find the estimate that is wildly off, then refresh statistics or rewrite the join shape so the planner has stable cardinalities to reason about.
Reading only the top node is another classic trap. The total cost at the root sums everything below, so a Sort showing cost=50000 is not the problem — the problem is the Seq Scan inside it. Scan downward, find the most-indented node with surprising numbers, and start there.
Optimization tips
Composite index column order matters and is not symmetric. An index on (user_id, created_at) supports queries filtering on user_id alone or both columns, but not a query filtering only on created_at. Lead with equality predicates and put range predicates last.
Covering indexes turn Index Scan into Index Only Scan by including non-key columns in the payload: CREATE INDEX ... ON tbl (key_cols) INCLUDE (extra_cols);. The payoff is largest for narrow read-mostly tables.
For analytic batch jobs over hundreds of millions of rows, raising work_mem at session scope keeps hashes and sorts in memory and eliminates disk spills. Do it per session, not globally — every connection multiplies the cost.
Partitioning helps once a fact table outgrows what cache and VACUUM can comfortably hold. Range-partition by month on created_at and the planner will prune partitions automatically when the WHERE clause has a date filter.
Related reading
- SQL window functions interview questions
- CTE vs subquery in SQL
- CTE vs temp table in SQL
- DISTINCT vs GROUP BY in SQL
If you want to drill query-plan reading on real interview prompts every day, NAILDD is launching with 500+ SQL problems built around the exact patterns hiring managers ask about.
FAQ
Is it safe to run EXPLAIN ANALYZE in production?
For a SELECT, yes — it just executes the statement, which a regular dashboard would do anyway. The caveat is that a heavy query will run fully, so EXPLAIN ANALYZE on a 30-minute query costs you 30 minutes plus a load spike. For INSERT, UPDATE, or DELETE, wrap in BEGIN ... ROLLBACK. Plain EXPLAIN (without ANALYZE) is always safe because nothing executes — only planning.
How often do I actually need to run ANALYZE?
Postgres ships with autovacuum, which periodically refreshes statistics. Cases where you need ANALYZE manually: after bulk loads (COPY of millions of rows), after large DELETE or UPDATE batches, and right after creating a new table you intend to query immediately. Stale statistics are a top-three cause of bad plans, so when a query suddenly slows down, an explicit ANALYZE table_name; is a fast first move.
Do analysts really need this on an interview?
For junior roles, plan-reading is a bonus. At mid-level and above — an L4 at Google or an equivalent band at Meta, Stripe, or Airbnb — it is squarely in scope. The expected behaviour is: "I would run EXPLAIN ANALYZE, look for nodes where estimated and actual rows diverge or where Seq Scan reads millions of rows it then throws away, and decide whether to add an index, refresh statistics, or rewrite the predicate." Demonstrating this signals that you understand storage and execution, not just syntax.
Does this work the same way in BigQuery, Snowflake, ClickHouse, MySQL?
The principles transfer: every modern engine has a planner, and "fix the part that reads too many rows" is universal advice. The syntax differs. BigQuery exposes execution details as a graph with bytes-read and slot-ms. Snowflake's query profile is graphical and surfaces partitions pruned and spillage. ClickHouse uses EXPLAIN with pipeline notation. MySQL's EXPLAIN returns a tabular row-per-step format. The mental model — read children first, look for full scans and bad row estimates — carries everywhere.
What is the difference between cost and actual time?
cost is a unit-free estimate the planner uses to compare alternatives. The absolute number is not meaningful on its own. actual time is the real wall-clock measurement in milliseconds collected at execution. When you tune queries, you mostly care about actual time per node and how it relates to the estimated rows. If cost says a plan is twice as expensive but actual time is identical, the cost model was wrong about your workload — interesting, but not by itself a problem.