SQL antipatterns — 15 mistakes analysts make
Contents:
Why antipatterns matter more than syntax
The difference between a junior and a senior analyst is not the queries they write — it is the queries they refuse to write. Seniors recognize the shape of a mistake before the query runs, which is the difference between a clean dashboard and a Slack thread at 11 p.m. asking why MRR jumped 40 percent overnight.
Interview loops at Snowflake, Databricks, Airbnb, and DoorDash often include a SQL screen where the question is "what is wrong with this query." Spot the integer division on a conversion rate in fifteen seconds and you move on; optimize the join order before noticing the NULL trap and the loop ends politely. This guide collects fifteen antipatterns from production and interview screens, with a broken version, a fix, and the reason the optimizer betrayed you.
The fifteen antipatterns
1. SELECT * in production
-- Bad
SELECT * FROM orders WHERE user_id = 42;
-- Good
SELECT order_id, amount, order_date, status
FROM orders
WHERE user_id = 42;SELECT * pulls every column — wide JSON, TEXT fields, everything — inflating traffic, defeating covering indexes, and breaking when a column is added upstream. Use it for exploration, never in a dashboard view or scheduled job.
2. Integer division on conversion rates
-- Bad: 5 / 2 = 2 in most engines, not 2.5
SELECT
COUNT(CASE WHEN converted THEN 1 END) / COUNT(*) AS conversion
FROM users;
-- Good
SELECT
COUNT(CASE WHEN converted THEN 1 END)::NUMERIC / COUNT(*) AS conversion
FROM users;Dividing two integers gives an integer in PostgreSQL, Snowflake, BigQuery INT64, and MySQL. A conversion rate of 0.4 prints as 0. Cast one operand to NUMERIC or FLOAT.
3. Aggregate function inside WHERE
-- Bad: syntax error
SELECT user_id, COUNT(*) AS orders
FROM orders
WHERE COUNT(*) >= 5
GROUP BY user_id;
-- Good: HAVING
SELECT user_id, COUNT(*) AS orders
FROM orders
GROUP BY user_id
HAVING COUNT(*) >= 5;WHERE filters rows before grouping, so aggregates are not allowed there. To filter on aggregate results use HAVING, which runs after grouping. WHERE is row-level; HAVING is group-level.
4. NOT IN with NULL
-- Bad: returns 0 rows if the subquery contains any NULL
SELECT * FROM users
WHERE user_id NOT IN (SELECT referred_by FROM users);
-- Good: NOT EXISTS
SELECT * FROM users u
WHERE NOT EXISTS (
SELECT 1 FROM users u2 WHERE u2.referred_by = u.user_id
);
-- Or filter NULL out of the IN list
SELECT * FROM users
WHERE user_id NOT IN (
SELECT referred_by FROM users WHERE referred_by IS NOT NULL
);NOT IN against a list with any NULL yields zero matches because x != NULL is UNKNOWN and AND UNKNOWN collapses the predicate. The most common silent bug in analyst SQL. NOT EXISTS handles NULL correctly. See NULL in SQL.
5. Implicit cross join
-- Bad: 1000 x 500 = 500,000 rows
SELECT * FROM orders, users;
-- Good
SELECT * FROM orders o
JOIN users u ON o.user_id = u.user_id;Comma-separated tables without ON produce a Cartesian product. On a 200M-row events table joined to 50M users, the warehouse bill arrives the next morning. Always use explicit JOIN with ON.
6. JOIN fan-out on duplicate keys
-- Rows multiply on one-to-many JOINs
SELECT
u.user_id,
u.name,
SUM(o.amount) AS total
FROM users u
JOIN orders o ON u.user_id = o.user_id
JOIN payments p ON u.user_id = p.user_id -- multiplies ROWS
GROUP BY u.user_id, u.name;
-- Good: aggregate BEFORE the join
WITH user_orders AS (
SELECT user_id, SUM(amount) AS total_orders FROM orders GROUP BY user_id
),
user_payments AS (
SELECT user_id, SUM(amount) AS total_payments FROM payments GROUP BY user_id
)
SELECT u.name, uo.total_orders, up.total_payments
FROM users u
LEFT JOIN user_orders uo ON u.user_id = uo.user_id
LEFT JOIN user_payments up ON u.user_id = up.user_id;When you join two one-to-many child tables through the same parent, every row of one child is duplicated by every row of the other. Sums double or triple. The fix is to aggregate each child inside a CTE first, then join. If a metric jumps exactly 2x overnight, suspect fan-out first.
7. COUNT(column) when you meant COUNT(*)
SELECT COUNT(email) FROM users; -- 950 (50 users have NULL email)
SELECT COUNT(*) FROM users; -- 1000COUNT(column) skips NULL; COUNT(*) counts rows. Mix them up and your "active users" number quietly omits everyone who signed in with SSO and never set an email.
8. ORDER BY inside a subquery
-- Useless: ORDER BY in subquery is not guaranteed to survive
SELECT * FROM (
SELECT * FROM orders ORDER BY amount DESC
) sub
LIMIT 10;
-- Good
SELECT * FROM orders
ORDER BY amount DESC
LIMIT 10;The outer query is free to reorder rows; the SQL standard guarantees order only on the final SELECT. Snowflake and BigQuery do not preserve subquery order. Move ORDER BY to the outer query.
9. DISTINCT as a band-aid for a broken JOIN
-- Bad: DISTINCT hides the duplicate problem
SELECT DISTINCT u.user_id, u.name
FROM users u
JOIN orders o ON u.user_id = o.user_id;
-- Good: figure out why duplicates appear and fix the JOIN
SELECT u.user_id, u.name
FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.user_id);DISTINCT is expensive — it sorts and deduplicates the whole result. When you use it to cover up fan-out, you are paying for a problem you should have prevented. Use EXISTS for "does any matching row exist." See DISTINCT vs GROUP BY.
10. LIKE with a leading wildcard
-- Slow: leading % forces a full table scan
SELECT * FROM products WHERE name LIKE '%laptop%';
-- Faster: a prefix LIKE can use a B-tree index
SELECT * FROM products WHERE name LIKE 'laptop%';B-tree indexes are sorted alphabetically, so 'laptop%' becomes a range scan. A leading wildcard cannot use that ordering. For substring search you need a trigram index (pg_trgm) or a dedicated search engine.
11. Wrong GROUP BY
-- Bad: name is neither aggregated nor in GROUP BY
SELECT user_id, name, COUNT(*) AS orders
FROM orders
GROUP BY user_id;
-- Good
SELECT user_id, MAX(name) AS name, COUNT(*) AS orders
FROM orders
GROUP BY user_id;Standard SQL requires every non-aggregated SELECT column to appear in GROUP BY. MySQL with ONLY_FULL_GROUP_BY off returns an arbitrary value per group — reproducible until the day it is not. Include the column in GROUP BY or wrap it in MIN, MAX, or ANY_VALUE.
12. Wrapping an indexed column in a function
-- Bad: index on created_at is unusable
SELECT * FROM orders WHERE EXTRACT(YEAR FROM created_at) = 2025;
-- Good: range scan uses the index
SELECT * FROM orders
WHERE created_at >= '2025-01-01' AND created_at < '2026-01-01';Calling EXTRACT, DATE, or LOWER on an indexed column disables the index — the optimizer would need to evaluate the function for every row. Rewrite as an inequality range, or create a function-based index. Inspect with EXPLAIN.
13. UPDATE or DELETE without WHERE
-- Disaster
UPDATE users SET role = 'admin';
DELETE FROM orders;
-- Safe: WHERE plus an explicit transaction
BEGIN;
UPDATE users SET role = 'admin' WHERE user_id = 42;
-- inspect, then COMMIT or ROLLBACKObvious until you ship it on a Friday. Build the habit: every UPDATE or DELETE starts as a SELECT with the same WHERE clause, then you rewrite the keyword. Wrap destructive statements in a transaction so a typo can be rolled back.
14. Chained OR instead of IN
-- Verbose
SELECT * FROM orders
WHERE status = 'pending' OR status = 'processing' OR status = 'shipped';
-- Concise
SELECT * FROM orders
WHERE status IN ('pending', 'processing', 'shipped');IN reads cleaner and the optimizer can sometimes convert it to a hash lookup. With long OR chains the planner may fall back to a sequential scan.
15. Correlated subquery in SELECT
-- Bad: subquery re-runs for every row in users
SELECT
u.user_id,
u.name,
(SELECT COUNT(*) FROM orders o WHERE o.user_id = u.user_id) AS order_count
FROM users u;
-- Good: LEFT JOIN with GROUP BY
SELECT u.user_id, u.name, COUNT(o.order_id) AS order_count
FROM users u
LEFT JOIN orders o ON u.user_id = o.user_id
GROUP BY u.user_id, u.name;A correlated subquery in SELECT runs once per outer row — O(n x m). A LEFT JOIN plus GROUP BY lets the optimizer pick a hash or merge join, usually orders of magnitude faster.
Common pitfalls
The most common trap when auditing SQL is fixing the visible symptom instead of the cause. A dashboard shows a duplicated count, someone slaps DISTINCT on top, the number returns to plausible, and the fan-out hides until a different metric breaks on the same join. Investigate every DISTINCT in a draft query and confirm the join cardinality is truly many-to-many.
Another trap is treating NULL as a normal value. NULL is not equal to NULL, not unequal to NULL, and any NULL in a boolean expression produces UNKNOWN. Every comparison must use IS NULL or IS NOT NULL. Ask "what happens if this column is NULL" before merging any query.
The third pitfall is trusting that "it ran fast last week" means "it runs fast tomorrow." A query scanning 10M rows today scans 100M next quarter. Run EXPLAIN ANALYZE on every query touching a growing fact table; a Seq Scan with high cost, a Nested Loop on hundreds of inner rows, or a Sort spilling to disk all mean redesign before shipping.
Finally, watch type coercion. Comparing a string column to an integer forces a per-row cast and disables the index. Comparing timestamps in different time zones produces wrong results that look right until daylight savings.
Optimization tips
After eliminating antipatterns, the next layer of speed is physical design. Add covering indexes on the columns dashboards filter and group by — a composite index on (user_id, order_date) lets "orders for user 42 in March" skip the table entirely. Avoid more than five or six indexes per OLTP table; each slows inserts.
For Snowflake, BigQuery, or Databricks, partition large fact tables on your most common filter, usually a date. Cluster on the second most common filter — typically user_id — so the engine prunes within a partition. Cache hot aggregates with materialized views when the same query runs many times per hour. Always confirm with EXPLAIN that the optimizer picked the plan you expected.
Interview-style questions
Why does NOT IN with a NULL in the subquery return zero rows?
x NOT IN (1, NULL) expands to x != 1 AND x != NULL. The second comparison is UNKNOWN because NULL is not equal to anything. UNKNOWN combined with TRUE under AND collapses to UNKNOWN, which the engine treats as not-true, so every row fails the filter. NOT EXISTS uses different semantics and handles NULL correctly.
How do you prevent fan-out when joining two one-to-many tables to the same parent? Aggregate each child table inside a separate CTE keyed on the parent ID, then LEFT JOIN the parent to each aggregated CTE. The aggregation collapses the many side to one row per parent, so the joins become one-to-one and rows do not multiply.
Why is SELECT * dangerous in a dbt model or dashboard view? It binds your downstream contract to the current physical schema. When someone adds a column upstream — a JSON blob, a PII field, or a name that collides with an alias — the model either picks it up silently or breaks on the next type mismatch. Explicit column lists make the dependency visible to lineage tools and the next reader.
Related reading
- CTE with SQL guide
- GROUP BY and HAVING
- DISTINCT vs GROUP BY
- NULL in SQL guide
- How to read EXPLAIN ANALYZE
- SQL window functions interview questions
If you want to drill antipattern-spotting on real interview snippets, NAILDD ships with 500+ SQL problems including dedicated "find the bug" challenges modeled on screens used at Stripe, Snowflake, and DoorDash.
FAQ
Which antipattern shows up most often in interviews?
Integer division on a conversion rate and NULL in a NOT IN subquery tie for first. Both look correct at a glance, both produce a plausible-but-wrong result, and both can be caught in thirty seconds if you have seen them before. Interviewers love them because they reward pattern recognition over memorized syntax.
Is DISTINCT ever the right answer?
Yes, when source data genuinely contains duplicates you cannot eliminate upstream — a raw events log where a flaky client writes the same event twice. DISTINCT or ROW_NUMBER() OVER (PARTITION BY ...) = 1 on the event ID is correct. The antipattern is using DISTINCT to mask fan-out from a JOIN you wrote.
How do I get faster at spotting antipatterns?
Read other people's SQL. Pair on query reviews with a senior and ask them to think aloud. When you find a bug in production, write a short note about the shape that gave it away. After a few months the patterns become muscle memory.
Do these apply across PostgreSQL, MySQL, Snowflake, and BigQuery?
The semantic ones — NULL behavior, GROUP BY rules, integer division — are SQL standard, with MySQL most likely to let bad queries run silently. Performance traps depend on engine internals; leading-wildcard LIKE is slow on every B-tree engine and suboptimal on columnar engines too. Always run EXPLAIN on your target engine.