SQL anti-patterns on a DE interview

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why anti-patterns come up on a DE interview

Every senior data engineering loop has the same hidden question: can you make a slow query fast without breaking it. Anti-patterns are the shortest way to test that, because they are concrete and require you to reason about the planner instead of regurgitating syntax. At Stripe, Snowflake, Databricks, Airbnb, and Uber the screen often opens with a working query and "the on-call woke me up — what is wrong." The candidate who pattern-matches SELECT *, a wrapped column, or an implicit cast in fifteen seconds and explains why the index is unused is the one who gets the onsite.

Anti-patterns map well to signal because each one combines three skills the team needs on day one. You read the query like a compiler — what columns travel over the wire, what predicates are sargable, what types collide. You predict the plan — index versus sequential scan, hash versus nested loop. And you suggest a fix that does not regress a different workload.

SELECT * and projection bloat

The first anti-pattern almost every loop will show you is SELECT *. It looks harmless on a small table, which is why it is the favourite trap. The cost is not in the WHERE clause — it is everything downstream: wider rows over the wire, more buffer pages read, broken covering-index plans, and a schema-coupling bomb that detonates the day a column is added upstream in dbt.

-- bad: pulls every column even if the caller only needs three
SELECT * FROM users WHERE id = 42;

-- good: stable projection, plays nicely with covering indexes
SELECT id, name, email FROM users WHERE id = 42;

The interviewer is listening for three points together. Network and memory cost — wide rows blow out the result buffer, especially with JSONB or TEXT columns the caller never reads. Plan stability — a covering index on (id, name, email) enables an Index Only Scan with the explicit projection, while SELECT * forces a heap fetch per row. And schema coupling — SELECT * in an ETL job is a future incident, because the next column added to users silently flows into the warehouse and breaks the downstream contract.

Functions on indexed columns

The classic "the index is not used" problem is almost always a function wrapping the indexed column. The B-tree is sorted on the raw value; wrapping it in LOWER, DATE, CAST, or COALESCE makes the predicate non-sargable. The planner does not know how to map your predicate back to the index keys, so it falls through to a sequential scan and an outraged Slack message.

-- bad: LOWER prevents the planner from using the email index
SELECT * FROM users WHERE LOWER(email) = 'a@b.com';

-- good: functional index or normalize at insert time
CREATE INDEX idx_users_email_lower ON users (LOWER(email));
SELECT * FROM users WHERE LOWER(email) = 'a@b.com';

The same trap shows up with dates. WHERE date(created_at) = '2026-05-18' looks fine but discards an index on created_at; rewrite it as a half-open range — created_at >= '2026-05-18' AND created_at < '2026-05-19' — and the planner snaps back to an index range scan. Bonus answer for Snowflake or BigQuery interviewers: if the table is partitioned on created_at, the wrapped form also defeats partition pruning, so a five-second query becomes a full table scan and a non-trivial bill.

Implicit type casts

Implicit casts are the same family of bug as functions on indexed columns, but they hide better because no function is visible in the SQL text. The column is one type, the literal is another, and the engine inserts a cast on the indexed side to make the comparison work. That cast wraps the column the same way LOWER does, and the index is silently skipped.

-- bad: id is INT, literal is text; MySQL casts the column to TEXT
SELECT * FROM users WHERE id = '42';

-- good: types match, B-tree on id is used directly
SELECT * FROM users WHERE id = 42;

The textbook example is MySQL comparing an INT column to a string parameter — the engine widens the column rather than the literal, which is the worst possible direction. Postgres is stricter and usually picks the right side, but it still bites when a VARCHAR column meets a numeric parameter from an ORM. Name the engine-specific rule and check EXPLAIN not just for index usage but for hidden (::int) or CAST(... AS varchar) annotations on the indexed expression.

N+1 queries

N+1 is the only application-layer anti-pattern that DE interviews still drill, because every pipeline glues SQL to a host language somewhere — Python orchestrators, dbt macros, custom backfill scripts. The shape is always the same: one query to fetch the parents, then one query per parent to fetch the children. With 10,000 users that is 10,001 round-trips and a flatlined warehouse.

# bad: 1 + N round-trips, latency dominated by network
users = db.query("SELECT id FROM users")
for user in users:
    orders = db.query("SELECT * FROM orders WHERE user_id = ?", user.id)
-- good: a single join, the planner picks the right strategy
SELECT u.id, o.order_id, o.total
FROM users u
LEFT JOIN orders o ON o.user_id = u.id;

The fix is either a join or a batched IN (...) lookup driven by an array of parent IDs. Which one wins depends on cardinality — for small parent sets and indexed child tables the IN batch is often faster and avoids row duplication when there are many children per parent. Note that N+1 is rarely visible in EXPLAIN because each individual query looks fine; you catch it in tracing, pg_stat_statements, or the latency histogram.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

OR vs UNION across columns

OR predicates across different columns are the textbook case where rewriting to UNION or UNION ALL unlocks a plan the optimizer otherwise refuses to consider. With WHERE x = 1 OR y = 2, the planner must choose a single scan strategy that covers both columns, and unless you happen to have a multicolumn index that fits, it punts and does a sequential scan. Splitting the OR into two queries lets each branch use its own narrow index.

-- often slow: single scan must satisfy both conjuncts
SELECT id FROM events WHERE x = 1 OR y = 2;

-- often fast: two index scans, combined and deduped
SELECT id FROM events WHERE x = 1
UNION
SELECT id FROM events WHERE y = 2;

Use UNION ALL when you can prove the two branches do not overlap, because UNION adds a dedup sort that can dominate runtime on large results. The signal is recognising that "OR is slow" is not universal — OR on the same column is essentially IN, which the planner handles well. The pathology is specifically OR across different columns with separate indexes, where the optimizer cannot fold the predicates into one access path.

Common pitfalls

The first pitfall is treating anti-patterns as universal rules. SELECT * inside a CTE that the planner inlines is often free; functions on indexed columns are fine if a matching functional index exists; OR can beat UNION when the two branches share most rows because UNION pays for a big dedup. The interview move is to caveat every rule with "unless the plan says otherwise" and look at EXPLAIN ANALYZE.

A second pitfall is fixing the symptom without measuring the regression. You rewrite WHERE LOWER(email) to use a functional index, but the index doubles the write cost on a hot path that takes 200 inserts per second, and ingestion lags. Anti-pattern fixes are trade-offs between read and write paths. State the trade-off and explain how you would measure it after deploy with pg_stat_user_indexes.

A third pitfall is over-rewriting. Junior engineers see SELECT * and reflexively list every column they can find, missing the point — what matters is whether the projection is narrow and aligned with the indexes the query actually touches. Listing thirty columns is no better than SELECT * if the consumer reads three.

A fourth pitfall is forgetting that warehouses behave differently. Snowflake has no B-tree indexes; its micro-partition pruning is the rough equivalent, and a function on a clustering key has different semantics than a function on a Postgres index. BigQuery is billed on bytes scanned, so SELECT * is literally billed.

A fifth pitfall is missing join-shape anti-patterns inside otherwise clean SQL. A correlated subquery in SELECT that runs once per outer row is structurally an N+1 in declarative clothes; rewrite it as a LEFT JOIN with GROUP BY or a window function.

Optimization tips

The cheapest optimization is to read the plan before touching the SQL. EXPLAIN ANALYZE tells you whether the index is used, whether the row estimate is off by 100x, and whether a sort spills to disk. Most anti-pattern rewrites fail because the engineer rewrote the wrong half. Make plan-reading the first move and the fix is usually smaller than expected.

When you need an index to fix a wrapped predicate, prefer a functional index over a denormalized column. A functional index on LOWER(email) is one line of DDL and stays in sync automatically; a lower_email column needs triggers or app code and drifts. The trade-off is that functional indexes only help queries written with the same expression — document it so the next engineer does not write a different form.

For N+1 in batch pipelines, push the loop into SQL. Replace for user in users: db.query(...) with one join, or a CTE that builds the parent list and a single child query keyed on it. If you cannot, batch with WHERE id = ANY($1) — that turns 10,000 round-trips into 10 batches of 1,000, a two-orders-of-magnitude latency win.

For OR-versus-UNION, do not rewrite blindly. Postgres has BitmapOr that combines two index scans on different columns when cardinalities are right. If you see a sequential scan and expect a small result, UNION ALL is the surgical fix; if the OR already gets a BitmapOr plan, leave it alone.

Treat anti-pattern hunting as routine. Run pg_stat_statements weekly, sort by total time, and read the top ten. Fixing that list once a week beats a six-month optimization sprint.

If you want to drill these exact anti-pattern questions daily, NAILDD is launching with 500+ SQL problems built around the patterns DE interviewers actually probe.

FAQ

Is SELECT * always bad?

No. Inside a CTE that the planner inlines, SELECT * is essentially free because the optimizer prunes the unused columns. It is also fine in throwaway analytics queries where the join dominates the cost. What matters in production is whether the projection is stable and aligned with the indexes — list the columns the consumer needs, and you avoid both the network cost and the silent schema-coupling failure when someone adds a column upstream.

How do I know if a predicate is sargable?

Sargable means the planner can use an index range to satisfy it. The cleanest check is EXPLAIN — Index Scan or Index Only Scan means yes, Seq Scan with the predicate listed as a Filter means no. The structural rule is that the indexed column must appear bare on one side of the comparison, with no wrapping function or implicit cast. Rewrite LOWER(col) = x as col = LOWER(x) only if the column is already stored lowercased; otherwise add a functional index.

Should I always replace OR with UNION?

No. OR on the same column is just IN, handled natively. OR across columns sometimes triggers a BitmapOr plan that already combines two index scans — in that case the SQL is fine. The rewrite to UNION ALL is only worth it when the planner picks a sequential scan and EXPLAIN ANALYZE shows two index scans would be cheaper.

What is the fastest way to detect N+1 in a pipeline?

Tracing and statement counts, not EXPLAIN. Each individual query in an N+1 looks correct on its own, so plan-reading misses it. Look at pg_stat_statements.calls — if a child statement is in the millions while the parent is in the hundreds, you have an N+1. In Python ETL, log the executed statement count per batch; in dbt, watch the run log for one model emitting many small queries.

How do warehouse engines change which anti-patterns matter?

Snowflake, BigQuery and Databricks do not use B-tree indexes the way Postgres does, so "function on indexed column" becomes "function on clustering key blocks micro-partition pruning" — same fix. SELECT * is worse on columnar warehouses because they are billed on bytes scanned. OR-versus-UNION matters less because warehouses are massively parallel and the optimizer is more aggressive about pushdown.

How should I phrase these on a panel interview?

Lead with the symptom, the diagnostic step, then the fix. "If this is slow I would suspect LOWER on email defeats the index — I would check EXPLAIN ANALYZE for a Seq Scan, and if confirmed add a functional index or normalize at write time." That shows you debug with evidence and you know the trade-off. Same pattern wins on EXPLAIN and indexing rounds.