SQL for the data engineer interview
Contents:
Why DE-SQL is not analytics-SQL
Analytics SQL is judged on whether the number you ship answers the business question. Data engineering SQL is judged on whether the same query still finishes when the table doubles, survives an hourly schedule, and can re-run yesterday without producing duplicates. Same dialect, very different bar.
The phrasing of the question is your first hint. An analyst at Stripe gets asked "write thirty-day retention". A data engineer at Stripe gets asked "write thirty-day retention so it runs hourly on a hundred-billion-row events table without scanning the whole partition history". The hiring manager wants to see incrementality, an awareness of cost, and a defensible execution plan before they care whether the join is left or inner.
Across DoorDash, Snowflake, Databricks, Airbnb, and Uber DE loops the blocks are nearly identical: optimization and EXPLAIN reading, partitioning under filters, window edge cases like RANGE BETWEEN and LAG defaults, transactions and isolation for long batches, and a few sharp antipatterns thrown in — Cartesian joins, dependent subqueries, integer division, and ROW_NUMBER in the WHERE clause.
Reading EXPLAIN and the query plan
The most common live-coding moment in a DE loop is "here is a query that runs in five minutes, get it under thirty seconds". The interviewer drops an EXPLAIN ANALYZE plan on the shared screen and watches where you look first. The wrong move is to start tuning indexes; the right move is to read the plan top-down and find the operator burning the most actual time.
What to flag when you scan a Postgres plan. A Seq Scan on a billion-row table almost always means the index was skipped, the filter was non-sargable, or no useful index exists. A large Sort node is a candidate for an index on the ORDER BY columns or for materializing a pre-sorted intermediate. A Nested Loop on two large inputs is often the planner trusting bad statistics; rerun ANALYZE and check whether Hash Join returns. A gap between estimated rows and actual rows is the universal symptom of stale statistics, and on partitioned tables it usually means the planner cannot tell which partitions to prune. The Buffers line — shared hit versus shared read — tells you whether you are bound by cache or by disk; mostly read means cold I/O and points to a covering index or materialized aggregate.
A clean "make it faster" answer walks the interviewer through the loop in order: run EXPLAIN, find the most expensive operator by actual time, classify it as a scan, join, or sort problem, eliminate unnecessary joins and unnest correlated subqueries, verify that pruning is happening, add a covering or partial index only when the plan calls for it, and materialize an intermediate result as a last resort. The candidate who jumps to "add an index" without reading the plan does not get the offer.
Advanced window functions
A senior DE loop will not ask for ROW_NUMBER. It will ask for edge cases — the parts that break under partitioning, nulls, or sparse data.
Rolling windows with RANGE BETWEEN
The distinction between ROWS BETWEEN and RANGE BETWEEN is a favorite trap. ROWS counts physical rows; RANGE counts logical distance in the ORDER BY value. If event_date has gaps — and on real data it always does — ROWS BETWEEN 6 PRECEDING AND CURRENT ROW quietly includes rows from further back than seven days.
SELECT
user_id,
event_date,
amount,
SUM(amount) OVER (
PARTITION BY user_id
ORDER BY event_date
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW
) AS rolling_7d_sum
FROM events;Gap-fill for missing days
Standard DE task: "the event log has missing days, fill them with zeros so the chart in Looker does not lie". Pattern: generate_series for the calendar spine, LEFT JOIN, COALESCE.
WITH dates AS (
SELECT generate_series(
'2026-01-01'::DATE,
'2026-01-31'::DATE,
INTERVAL '1 day'
)::DATE AS day
),
daily_events AS (
SELECT day, COUNT(*) AS cnt
FROM event_log
WHERE day BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY day
)
SELECT
d.day,
COALESCE(e.cnt, 0) AS cnt
FROM dates d
LEFT JOIN daily_events e USING (day)
ORDER BY d.day;Deduplication with ROW_NUMBER in a CTE
Window functions are not allowed in WHERE. The interviewer wants to see that you know this and reach for a CTE without prompting.
-- broken: window functions can't appear in WHERE
SELECT *
FROM users
WHERE ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) = 1;
-- correct: rank in a CTE, then filter
WITH ranked AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY email ORDER BY created_at DESC) AS rn
FROM users
)
SELECT *
FROM ranked
WHERE rn = 1;Partitioning and pruning
DEs own the schema, so the interviewer will probe whether you can design partitioning that pays off. Expect questions about the partition key (almost always the event date or a derived bucket like event_month), range versus list partitioning, what partition pruning is, and how it changes the plan from the previous section.
The trap question sounds like an analyst complaint. "We have a fifty-billion-row events table partitioned by event_date. Dashboards are slow. What do you check?" The answer the interviewer is listening for is pruning. A filter like event_date = '2026-05-01' lets the planner skip every other partition; DATE_TRUNC('day', event_date) = '2026-05-01' does not — wrapping the partition key in a function defeats pruning, and the planner scans every partition before discarding ninety-nine percent of rows. Same with implicit casts, event_date::text LIKE '2026-05-%', and UDF-wrapped comparisons. Fix the predicate, do not "add an index".
Transactions and isolation
DE batches run for an hour and must not corrupt anything along the way. Interviewers ask about isolation levels because they distinguish people who have shipped a backfill from people who have only read the Postgres docs.
The short tour: Read Uncommitted barely exists in Postgres; skip it. Read Committed is the default; each query sees only committed data, but two reads in the same transaction can return different results — the classic non-repeatable read. Repeatable Read takes a snapshot at transaction start, implemented in Postgres as Snapshot Isolation, which also prevents phantom reads in practice. Serializable can abort transactions on conflicts, so any batch using it must be safe to retry.
The follow-up is a scenario: "you start a one-hour batch on the orders table. In parallel, a service is running UPDATEs on the same rows. What does your batch see under Read Committed versus Repeatable Read?" Under RC the batch sees a moving target — different SELECTs may return different values as the service commits. Under RR your batch is pinned to the snapshot it saw on BEGIN, which is what you want for reproducibility, at the cost of holding the snapshot open and pressuring vacuum.
Interview tasks you will actually see
Optimize a slow query
The shared screen shows either the SQL or its EXPLAIN. Walk the plan, name the operator burning time, then propose a fix specific to that operator — not a generic "add an index". If the bottleneck is a sort and the ORDER BY column already has an index, the fix may be to reorganize the query so the sort is avoided entirely.
Find duplicates and explain their origin
A staple: "the events table has duplicates on (user_id, event_id). Find them, quantify the rate, explain where they came from."
SELECT
user_id,
event_id,
COUNT(*) AS dup_count
FROM events
GROUP BY user_id, event_id
HAVING COUNT(*) > 1;The senior signal is in the second half — naming root causes. Producer retries without idempotency keys, at-least-once delivery from Kafka without dedup downstream, and backfill re-runs that lack ON CONFLICT handling are the three most common culprits.
Incremental aggregate pipeline
"The events table is append-only. Every hour, advance the hourly aggregate table with the new rows. Write the SQL."
INSERT INTO events_hourly_agg (hour, user_id, cnt)
SELECT
DATE_TRUNC('hour', event_time) AS hour,
user_id,
COUNT(*) AS cnt
FROM events
WHERE event_time >= (SELECT MAX(hour) FROM events_hourly_agg)
GROUP BY 1, 2
ON CONFLICT (hour, user_id) DO UPDATE
SET cnt = EXCLUDED.cnt;The interviewer watches for three things: a watermark from the target table so you do not reprocess the whole history, an explicit ON CONFLICT clause so late-arriving rows update rather than duplicate, and awareness that DATE_TRUNC on the partition key is fine in SELECT but would break pruning in the WHERE.
Slowly changing dimensions (SCD type 2)
"customer_history has valid_from and valid_to. Show the version of each customer active on 2026-01-15."
SELECT *
FROM customer_history
WHERE valid_from <= '2026-01-15'
AND (valid_to > '2026-01-15' OR valid_to IS NULL);The trap is the NULL on the open-ended row — the current version usually has valid_to IS NULL, and forgetting that branch silently excludes every active customer.
Common pitfalls
The most damaging habit in a production pipeline is reaching for SELECT *. The first time the upstream team adds a column, your downstream consumer either breaks or ingests a column it does not know how to handle. Always materialize an explicit column list; the cost of writing it once is far smaller than debugging a silent schema drift at 3am.
A second pitfall is the LEFT JOIN that quietly becomes an INNER JOIN. The moment you write WHERE right_table.column = 'something', you have filtered out every left row that had no match — the NULLs are gone, and the LEFT is now an INNER. If the intent is to preserve unmatched left rows, the filter belongs in the ON clause, not the WHERE. This is the single most common DE bug in code review.
COUNT(DISTINCT) inside an OVER clause is a frequent trap on Postgres because it is not supported. Candidates write it, the query errors, and they freeze. The workaround is a correlated subquery, a CTE that pre-aggregates the distinct count per window key, or APPROX_COUNT_DISTINCT on engines that offer it — Snowflake, BigQuery, and ClickHouse all do.
Integer division is the silent killer of conversion-rate metrics. 5 / 20 * 100 evaluates to zero in Postgres because both operands are integers and the result is truncated before the multiplication. Cast on the way in: 5::NUMERIC / 20 * 100 = 25. Pair with NULLIF on every denominator that could legitimately be zero — value / NULLIF(denom, 0) is the standard idiom and prevents one bad row from aborting the whole batch.
Finally, EXPLAIN without ANALYZE only shows the planner's estimate, not actual runtime. Estimates can be off by orders of magnitude with stale statistics; always rerun with EXPLAIN (ANALYZE, BUFFERS) to get real numbers and cache behavior. A plan that looks cheap but reads twenty gigabytes from disk is still slow.
Related reading
- How to read EXPLAIN ANALYZE
- SQL antipatterns for data engineering interviews
- Window functions for data engineering interviews
- Table partitioning in SQL
- Materialized views in SQL
To drill the exact tasks above every day, the NAILDD app is launching with 500+ SQL problems tagged for data engineering, including EXPLAIN drills and incremental-pipeline scenarios.
FAQ
Which SQL dialect should I prepare for a data engineer interview?
Postgres is the safe baseline and the dialect most generic DE loops use for shared whiteboarding. If the team runs a specific stack, prepare that too — Snowflake and BigQuery for warehouse-heavy teams, Spark SQL and Trino for lakehouses, ClickHouse for real-time analytics, and Hive only if you are told the codebase is legacy. Syntax differences are minor; what matters is knowing the cost model of the engine you will write against.
How many SQL problems should I solve before the loop?
Fifty to a hundred problems across difficulty bands is the working number for a senior loop. Volume on easy problems is not the bottleneck — most candidates can write a GROUP BY in their sleep. The drills that move the needle are medium-to-hard problems with windows, CTEs, ON CONFLICT semantics, and unfamiliar EXPLAIN plans. Tag your practice so the last week targets your weakest band.
Do interviewers care about the difference between RANGE and ROWS?
Yes, but not as trivia. It is a proxy for whether you have written rolling-window code on sparse data and seen it break. Explain the failure mode in one sentence — "ROWS counts physical rows, so gaps in the date column silently widen the window" — and you have signaled production experience. Stumble on it and the interviewer assumes you only used windows on synthetic, dense data.
How deep do they go on transactions and isolation?
Deep enough to distinguish levels by failure mode, not definition. Reciting "Read Committed sees committed data" is not enough; you need to predict what a long batch sees when a service updates rows underneath it, and argue for or against Repeatable Read in a specific scenario. Roles owning a transactional warehouse load can expect a follow-up on Serializable conflicts and idempotent retries.
Are dialect-specific functions fair game?
Mostly yes, scoped to the team's stack. If the JD mentions Snowflake, expect QUALIFY, LATERAL FLATTEN, and the STREAM/TASK model. If it mentions ClickHouse, expect ANY joins, ARRAY JOIN, and the materialized view ingest pattern. Ask the recruiter which dialect the live coding uses — preparing for the right engine signals more than generic ANSI fluency.
Is this content official?
No. It reflects publicly reported interview patterns from DE candidates at Snowflake, Databricks, Airbnb, Uber, and DoorDash, plus open-source docs from Postgres, ClickHouse, and BigQuery.