SQL for data analysts in 2026
Contents:
Why SQL is the analyst's number one skill
If you are interviewing for a data analyst role at Stripe, DoorDash, Airbnb, or Notion, the recruiter screen will mention SQL within the first two minutes. The technical loop leans on it for at least one full round, and the take-home will be a SQL exercise. Python and statistics matter, but they do not get you past the first gate. SQL does.
The reason is structural. Company data lives in relational warehouses: Snowflake, BigQuery, Redshift, Databricks, Postgres for product backends, ClickHouse for analytics. Every dashboard and weekly business review starts with somebody writing a query. SQL is asked in roughly 95% of analyst loops; Python in 40 to 60%. If prep time is tight, SQL is not a tie with Python — it is the priority.
Topics by level: junior, mid, senior
Junior
To pass an entry-level loop you need to write basic queries without hesitating. Concretely: SELECT and WHERE for filtering, the four JOIN flavors — INNER, LEFT, RIGHT, FULL — and a clear understanding of when each one is correct. You also need GROUP BY with HAVING, ORDER BY and LIMIT, and the aggregate functions COUNT, SUM, AVG, MIN, MAX.
The concept that catches juniors most often is the logical execution order. The engine runs FROM, then JOIN, then WHERE, then GROUP BY, then HAVING, then SELECT, then DISTINCT, then ORDER BY, then LIMIT. This is why you cannot reference a SELECT alias inside WHERE, but you can in ORDER BY. Expect to be asked.
-- Junior-level pattern: filter, join, aggregate
SELECT
u.country,
COUNT(DISTINCT o.user_id) AS buyers,
SUM(o.amount_usd) AS revenue
FROM orders o
JOIN users u ON u.user_id = o.user_id
WHERE o.created_at >= DATE '2026-01-01'
AND o.status = 'paid'
GROUP BY u.country
HAVING COUNT(DISTINCT o.user_id) >= 50
ORDER BY revenue DESC
LIMIT 20;If you can write that query without looking anything up, and explain what each clause does and in what order it runs, you are ready for a junior loop.
Mid
Mid-level is where the interview goes from "can you query a table" to "can you reason about a dataset". The headline topic is window functions: ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, SUM OVER, AVG OVER, plus frame clauses like ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Window functions are the single most popular topic on mid-level analyst interviews — see the SQL window functions interview questions drill set for the exact patterns that come up.
The other mid skills are CTEs with WITH for readable multi-step queries, the trade-offs between CTE and subquery, CASE WHEN for conditional logic, date-and-time functions (DATE_TRUNC, DATE_PART, EXTRACT, interval arithmetic), and NULL handling with COALESCE and NULLIF. You should also be able to articulate the difference between GROUP BY and PARTITION BY without thinking about it — the former collapses rows, the latter keeps them.
-- Mid-level pattern: window function for "rank within group"
WITH ranked_orders AS (
SELECT
user_id,
order_id,
amount_usd,
created_at,
ROW_NUMBER() OVER (
PARTITION BY user_id
ORDER BY created_at
) AS order_seq
FROM orders
WHERE status = 'paid'
)
SELECT user_id, order_id, amount_usd
FROM ranked_orders
WHERE order_seq = 1; -- each user's first paid ORDERSenior
Senior-level expectations are about judgment, not new syntax. Can you read an EXPLAIN ANALYZE plan and tell which JOIN strategy the planner picked and why it was wrong? Do you know when a B-tree index helps and when a query skips it because of an implicit cast? Have you decomposed revenue across cohort, frequency, and AOV and shipped the result to a PM by Monday morning?
Interview questions at this level are open-ended: "design a query for weekly active creators with a 14-day rolling window, and tell me how to make it cheap on a billion-row events table". The answer involves partition pruning, pre-aggregation in a CTE, a careful window frame, and a back-of-the-envelope cost estimate. None of that is in a cheat sheet. It comes from reps on real warehouse data.
What analysts do not need to learn
It is tempting to read every chapter of every textbook. Resist it. Analysts do not need DDL beyond a passing read of CREATE TABLE — you work on tables that already exist. You do not need stored procedures, triggers, replication, sharding, or backup configuration. None of that comes up in analyst interviews and none of it helps you ship a dashboard. That time should go to window functions and complex aggregations on real datasets.
Which database engine to study
PostgreSQL is the safe default. The vast majority of analyst interviews run on Postgres-style syntax, and almost every other engine speaks a close dialect. Learn Postgres first and you can read 90% of queries in any warehouse.
On the job, you will encounter Snowflake at growth-stage and enterprise companies, BigQuery at GCP shops, Redshift at AWS-heavy stacks, Databricks SQL where the data lake is the warehouse, and ClickHouse for high-cardinality event analytics at Uber and large media properties. The syntax differs in details — Snowflake's QUALIFY, BigQuery's array functions, ClickHouse's array joins — but the core is the same. Pick up the dialect of your employer in your first week.
Real tasks analysts solve in SQL
The same five or six problem shapes come up over and over, which is why interviewers test the same primitives. You will compute D1, D7, D30 retention. You will build conversion funnels from signup to first purchase. You will run cohort analysis to see whether last quarter's users behave differently from this quarter's. You will decompose a revenue drop into "fewer users, lower AOV, lower frequency" and tell the PM which one moved. You will compute A/B test deltas by variant. You will answer ad-hoc questions like "how many users from California bought a subscription in the last 28 days".
Every one of those tasks reduces to JOIN plus GROUP BY plus a window function plus, sometimes, a CTE. That is exactly why interviews test those four primitives — they cover the working analyst's day.
Common pitfalls
The trap that catches the most candidates is studying syntax instead of execution. It is easy to write SELECT * FROM users WHERE active = TRUE without understanding that the engine evaluated FROM before WHERE. The cost shows up when a query returns the wrong answer and you cannot reason about why. Learn the logical evaluation order once and rehearse it whenever you debug — read clauses in execution order, not writing order, and the bug usually surfaces inside a minute.
A close second is treating NULL like zero or empty string. NULL is neither. NULL = NULL returns NULL, not TRUE. A WHERE clause on a nullable column silently drops rows, and a JOIN on nullable keys will not match them. The fix is two habits: use IS NULL and IS NOT NULL when you mean it, and wrap nullable columns in COALESCE when the business rule says "treat missing as something specific".
The third pitfall is mixing up WHERE and HAVING. WHERE filters rows before aggregation, HAVING filters groups after aggregation, and you cannot use aggregate functions in WHERE. Candidates who write WHERE COUNT(*) > 5 and get confused when the database errors are tipping their hand. WHERE is row-level, HAVING is group-level.
The fourth is avoiding CTEs out of habit. Many self-taught analysts write deeply nested subqueries because that is what they saw first. A four-level subquery is unreadable to your future self and to your reviewer. The same logic as four named CTEs reads top to bottom like prose, and the performance difference on modern planners is negligible at analyst scale.
The fifth is fear of window functions. They look intimidating; they are not. Start with ROW_NUMBER and LAG, then SUM OVER, then frames. After ten well-chosen practice problems they stop being mysterious. Candidates who postpone window functions "until later" fail mid-level loops.
An 8-week study plan
Weeks 1 and 2: SELECT, WHERE, the four JOINs, GROUP BY and HAVING. Drill the JOIN cheat sheet and the GROUP BY guide until they are automatic. Solve 5 to 10 short queries per day. By week 2's end you should not be looking up JOIN syntax.
Weeks 3 and 4: subqueries, CTEs, CASE WHEN, date functions, NULL handling. Read the CTE guide, then refactor every messy subquery from weeks 1 and 2 into CTEs.
Weeks 5 and 6: window functions. Start with ROW_NUMBER and LAG, then SUM OVER and frame clauses. Aim for 30 to 40 window-function problems by week 6's end.
Weeks 7 and 8: end-to-end problems. Build a retention curve. Build a funnel. Decompose a revenue change. Compute A/B test deltas with confidence intervals. By the end of week 8 you can sit a mid-level interview without anxiety.
If you want a curated daily problem queue across this exact curriculum, NAILDD ships 500+ SQL problems organized by topic and difficulty — built to match what actually shows up in analyst loops.
Interview questions and short answers
Difference between WHERE and HAVING? WHERE filters rows before GROUP BY runs; HAVING filters groups after. You cannot use aggregates in WHERE because the groups do not exist yet. HAVING COUNT(*) > 5 keeps only groups with more than five rows.
LEFT JOIN vs INNER JOIN? INNER returns rows with a match in both tables. LEFT returns all rows from the left table plus matches from the right, filling unmatched right-side columns with NULL. Use LEFT when you need to keep every row from one side even if no match exists — for example, finding users with zero orders.
Window function vs GROUP BY? GROUP BY collapses rows into one per group. A window function computes a value over a group but keeps every row in the output. That is what lets you place a running total or per-user rank next to each row of detail.
How do you find duplicates? GROUP BY the columns that should be unique and filter HAVING COUNT(*) > 1. Or ROW_NUMBER() OVER (PARTITION BY those columns ORDER BY anything) and treat rows with row number > 1 as duplicates.
Execution order? FROM, JOIN, WHERE, GROUP BY, HAVING, SELECT, DISTINCT, ORDER BY, LIMIT. A SELECT alias is invisible to WHERE but visible to ORDER BY.
Related reading
- SQL window functions interview questions
- How to calculate cohort retention in SQL
- How to calculate D1, D7, D30 retention in SQL
- Statistics for data analysts
- Group by vs partition by in SQL
- CTE vs subquery in SQL
FAQ
How long does it take to learn SQL well enough for an analyst role?
Four to eight weeks of daily 30-to-60-minute sessions for a candidate with no prior database experience. The first two weeks cover SELECT, JOIN, GROUP BY; the next two add subqueries, CTEs, and date logic; the last two-to-four weeks are window functions and end-to-end problems. Daily reps matter far more than weekend cramming — interview pressure feels manageable only when the syntax is automatic.
Is SQL or Python more important for a data analyst?
SQL, by a large margin. Roughly 95% of analyst loops include a live SQL round; Python shows up in 40 to 60%. SQL also covers about 80% of on-the-job work: every dashboard, cohort report, and ad-hoc business question starts there. Add Python after you can write window-function-heavy SQL on a whiteboard without hesitating.
Are online SQL practice platforms enough to prepare?
They are excellent for drilling syntax and common patterns. But platforms alone leave gaps in messy real-world skills: reading an existing 200-line query, picking the right window frame for a business question, debugging "almost right" answers. Combine platform drills with real exercises on a sample dataset — compute retention, build a funnel, decompose revenue — and you will outclass candidates who only grind exercises.
Which SQL dialect should I learn first?
PostgreSQL. Interview questions are usually framed in Postgres syntax, and every other warehouse dialect — Snowflake, BigQuery, Redshift, Databricks SQL — is close enough to read after learning Postgres. Pick up the on-the-job dialect when you start at a company; the core skill transfers immediately.
What is the fastest way to improve my SQL interview performance?
Solve one window-function problem every day for 30 days, out loud, against a timer. Window functions are the highest-leverage topic on mid-level loops and the one candidates most often skip. Thirty problems make ROW_NUMBER, LAG, LEAD, and frame clauses feel routine — which is exactly the bar interviewers test for.