Anti and semi joins for Data Engineer interviews

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why interviewers ask this

Anti and semi joins are the bedrock of correct filtering by another table. Every Data Engineer loop at Stripe, Snowflake, Databricks, Airbnb and Uber pokes at this — usually phrased as "find users with no orders", "what's the difference between NOT IN and NOT EXISTS", and "why does NOT IN silently return zero rows". Senior screens push further into physical plans, broadcast vs sort-merge decisions, and skew behavior on distributed engines.

The biggest production failure mode is dull and very common: an engineer writes NOT IN, the subquery returns a single NULL, and the whole result set collapses to zero rows. The pipeline keeps running. Downstream marketing emails go out to no one. Analytics catches it a week later — usually after a Monday metric review where the chart looks weirdly flat. This is the kind of bug that doesn't throw, doesn't page, and lives in your warehouse for as long as you let it.

If a candidate can articulate the three-valued logic of NULL and pick NOT EXISTS without hesitation, that's a strong signal. If they reach for LEFT JOIN ... IS NULL and explain when it's faster than NOT EXISTS, that's senior-shaped.

What is a semi join

A semi join answers: "give me rows of A for which at least one matching row exists in B". It returns only A's columns and never duplicates A by the number of matches in B.

Standard SQL has no SEMI JOIN keyword. The behavior is expressed through EXISTS or IN:

-- users who have at least one order
SELECT u.*
FROM users u
WHERE EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.id
);

-- equivalent via IN
SELECT u.*
FROM users u
WHERE u.id IN (SELECT user_id FROM orders);

Spark SQL and Hive expose it directly: SELECT * FROM users LEFT SEMI JOIN orders ON users.id = orders.user_id. Trino, Flink and Databricks SQL accept the same syntax.

Why not INNER JOIN? An inner join against orders blows users up by the order count — one row per (user, order) pair — and you'd need a DISTINCT to dedupe. A semi join guarantees exactly one row per user without paying the dedup cost.

What is an anti join

An anti join is the mirror: "give me rows of A for which no matching row exists in B".

-- users with no orders
SELECT u.*
FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.id
);

Spark and Hive: LEFT ANTI JOIN. Databricks, Trino, Snowflake and BigQuery accept the same. Two valid alternatives in vanilla SQL:

-- LEFT JOIN + IS NULL (sometimes called "outer anti")
SELECT u.*
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.user_id IS NULL;

-- NOT IN (dangerous with NULL — see below)
SELECT u.*
FROM users u
WHERE u.id NOT IN (SELECT user_id FROM orders);

Load-bearing rule: for anti join, default to NOT EXISTS. Use LEFT JOIN ... IS NULL only when you've benchmarked it. Use NOT IN only when the subquery column is provably NOT NULL at the schema level.

EXISTS vs IN vs JOIN

Three patterns, three different shapes:

Pattern NULL-safe Stops at first match Risk of row blow-up Idiomatic for
EXISTS (correlated) Yes Yes None Semi/anti filters
IN (subquery) Safe for IN, unsafe for NOT IN Yes None Small lookup lists
JOIN ... DISTINCT Yes No High (cartesian-ish blow-up before dedup) Almost never for semi joins

EXISTS short-circuits — the engine stops scanning B as soon as one match is found. It handles NULL correctly because it's a boolean existence check, not a comparison.

IN with a subquery is semantically equivalent to EXISTS in modern planners. Postgres 12+, Snowflake, BigQuery and Spark 3+ typically rewrite it to the same physical plan. The footgun is only in the negated form.

JOIN + DISTINCT is the antipattern. It materializes the cartesian-style product first, then dedups — a waste on billions of rows.

-- bad: blow up users by order count, then dedupe
SELECT DISTINCT u.*
FROM users u
JOIN orders o ON o.user_id = u.id;

-- good
SELECT u.*
FROM users u
WHERE EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id);

On modern optimizers the runtime gap often disappears, but writing EXISTS signals intent. Code review at Snowflake or Databricks will flag the DISTINCT version.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

NOT EXISTS vs NOT IN vs LEFT JOIN IS NULL

This is the textbook interview trap.

NOT IN with NULL — silently returns zero rows.

SELECT u.*
FROM users u
WHERE u.id NOT IN (SELECT user_id FROM orders);
-- If orders.user_id contains a single NULL → empty result set

The reason is three-valued logic. 1 NOT IN (2, NULL) evaluates to 1 != 2 AND 1 != NULL = TRUE AND UNKNOWN = UNKNOWN. SQL's WHERE clause keeps only rows where the predicate is TRUE, so the whole query returns nothing. No error, no warning — just an empty set.

NOT EXISTS — always safe:

SELECT u.*
FROM users u
WHERE NOT EXISTS (
  SELECT 1 FROM orders o WHERE o.user_id = u.id
);
-- Correct even if orders.user_id has NULLs

LEFT JOIN ... IS NULL — also safe, sometimes faster:

SELECT u.*
FROM users u
LEFT JOIN orders o ON o.user_id = u.id
WHERE o.user_id IS NULL;

The rule of thumb across every dialect:

Sanity check: for anti joins, always use NOT EXISTS or LEFT JOIN ... IS NULL. Reach for NOT IN only when the subquery column has a NOT NULL constraint enforced at the schema level — and even then, document the assumption inline.

Performance in Postgres, Spark and Snowflake

Postgres. The planner converts NOT EXISTS into a Hash Anti Join or Nested Loop Anti Join automatically. The EXPLAIN looks like:

->  Hash Anti Join
      Hash Cond: (u.id = o.user_id)
      ->  Seq Scan on users u
      ->  Hash
            ->  Seq Scan on orders o

With an index on orders.user_id and a small users table, you'll see Nested Loop Anti Join using the index — generally the fastest plan when the outer relation is under ~10k rows.

Spark. The physical operator is BroadcastHashJoin (LeftAnti) when the right side fits in driver memory (default broadcast threshold 10 MB, tunable up to a few hundred MB), or SortMergeJoin (LeftAnti) otherwise. The classic failure mode is skewed keys — say one user with a million orders — which slams a single executor. Salting the join key or using spark.sql.adaptive.skewJoin.enabled=true is the standard fix.

Snowflake. The optimizer treats NOT EXISTS and LEFT ANTI JOIN identically and pushes them down through micro-partitions. Clustering on the join key matters more than which syntax you pick. For tables above ~100M rows, the win comes from clustering, not from the join keyword.

BigQuery. Hash anti joins are the default. Beware of NOT IN rewrites — BigQuery's planner is good but still inherits the NULL semantics, so the syntax choice changes correctness, not just speed.

ClickHouse. Supports ANTI JOIN and SEMI JOIN natively since version 22+. For large anti filters, LEFT ANTI JOIN consistently beats emulation via NOT IN, sometimes by 3-5x wall-clock on the billion-row range.

Common pitfalls

The most common pitfall, and the one that ships to production silently, is NOT IN over a nullable subquery. The fix is two-fold: rewrite as NOT EXISTS, or explicitly filter WHERE user_id IS NOT NULL inside the subquery. Reviewing the data contract upstream is the durable fix — if orders.user_id should never be null, make it NOT NULL at the schema level and stop relying on hope.

Another trap is NOT EXISTS without a correlated predicate. Writing NOT EXISTS (SELECT 1 FROM orders) checks whether the orders table is empty at all — it's not tied to the outer row. The query will either return every user (if orders is empty) or none. Always include WHERE o.user_id = u.id to correlate to the outer query.

A third is using LEFT JOIN without IS NULL when the intent was anti-join. Without the IS NULL filter you get a regular left outer join — every user, plus matched orders, plus nulls for unmatched. Engineers under interview pressure occasionally drop the filter and produce a result with the wrong cardinality, then add DISTINCT to "fix" it.

Treating DISTINCT after JOIN as free is another classic. On a billion-row orders table, the join inflates the working set into the tens of billions before the dedup, blowing through shuffle memory. EXISTS is almost always cheaper because it short-circuits.

The large literal IN list problem catches people too — WHERE id IN (1, 2, ..., 100000) chokes the query parser in Postgres and most engines. The fix is to load the list into a temporary table and join against it, or use WHERE id = ANY (ARRAY[...]) in Postgres which is parsed more efficiently.

Finally, ignoring the index on the right-side anti join key in Postgres. Without an index on orders.user_id, a nested-loop anti join does a sequential scan of orders per outer row. Hash anti join sidesteps this, but the index is still useful for selectivity estimates and other queries.

If you want to drill DE interview SQL patterns daily, NAILDD is launching with hundreds of problems built around exactly this kind of join taxonomy.

FAQ

Is EXCEPT an anti join?

Close, but not the same. SELECT id FROM users EXCEPT SELECT user_id FROM orders returns ids in users that don't appear in orders — but EXCEPT operates on all columns of the projected set and removes duplicates as part of its semantics. Anti joins are more flexible: they let you filter on complex predicates beyond equality (date ranges, status filters, multi-column conditions), and they don't force the deduplication. For a simple "id not in" check EXCEPT is fine; for anything more nuanced, reach for NOT EXISTS.

Does EXISTS actually stop at the first match?

Yes. The optimizer treats EXISTS as a boolean existence check, so once a single matching row is found in the subquery, evaluation stops for that outer row. On an indexed right-side column this is extremely fast — typically a single index seek per outer row in the nested-loop case, and a single hash probe in the hash anti join case.

What is a correlated subquery?

A subquery that references a column from the outer query. EXISTS (SELECT 1 FROM orders o WHERE o.user_id = u.id) correlates with u from the outer query. Modern planners (Postgres 12+, Snowflake, Spark 3+, BigQuery, Databricks) decorrelate these into proper anti or semi join operators during query rewriting, so there's no per-row execution cost in practice.

Can NOT IN ever be safe?

Yes, two ways. Filter nulls inside the subquery — NOT IN (SELECT user_id FROM orders WHERE user_id IS NOT NULL) — or enforce a NOT NULL constraint on the column at the schema level so the planner can prove no nulls exist. But the team-wide recommendation in most engineering orgs is to default to NOT EXISTS and not rely on remembering the rule.

Spark LEFT ANTI vs Postgres NOT EXISTS — same thing?

Semantically identical. Spark exposes anti join as an explicit JOIN keyword because it's a distributed engine where the physical plan choice matters; Postgres lets the planner do the rewrite automatically. The query result is the same, the difference is just where you spell the intent — in the JOIN clause or in the WHERE EXISTS predicate.

How do I benchmark these patterns?

For Postgres, EXPLAIN (ANALYZE, BUFFERS) to see the actual operator and rows. For Spark, the SQL tab in the Spark UI shows the physical plan and the per-stage data movement. For Snowflake, the query profile tab in the UI shows the operator tree and bytes scanned. Run all three variants — NOT EXISTS, LEFT JOIN ... IS NULL, NOT IN — on representative data sizes and pick the plan with the fewest rows materialized in the shuffle.

Is this official?

No. The patterns and gotchas are based on the SQL:1992 standard, Postgres 14+ documentation, Spark 3.x and ClickHouse 23+ behavior. Always test against the dialect and version you're shipping to.