DISTINCT vs GROUP BY in SQL
Contents:
- Why this question keeps showing up
- What DISTINCT actually does
- What GROUP BY actually does
- Side-by-side comparison
- When DISTINCT is the right tool
- When GROUP BY is the right tool
- When both produce the same result
- Performance and execution plans
- Common pitfalls
- Optimization tips
- Interview answers
- Related reading
- FAQ
Why this question keeps showing up
A recruiter at Stripe once told me that roughly a third of her SQL screens for analyst roles boil down to one prompt: "Tell me whether DISTINCT or GROUP BY is the right choice here." It separates candidates who memorized syntax from candidates who think in result shape. Picture a Monday at Airbnb: a PM pings you asking "how many unique users booked in March?" That answer and the answer to "what's the booking count by city in March?" use different SQL primitives, even when both queries touch the same table.
DISTINCT and GROUP BY both deduplicate, both reduce row counts, and both sit near the top of a SELECT clause. That overlap is what makes the topic interview gold. The interviewer is not really asking which keyword is faster — they are checking whether you can name the intent of a query out loud before typing. A senior analyst at Snowflake picks DISTINCT for "I want a deduplicated list" and GROUP BY for "I want a metric per category", and says so before writing.
What DISTINCT actually does
DISTINCT is a modifier on SELECT. It removes complete row duplicates, where "row" means the combination of every column listed in SELECT. If two output rows are identical across every projected column, DISTINCT collapses them into one.
SELECT DISTINCT city
FROM users;That returns the unique cities in users. If a million rows reference only fifty cities, the result is fifty rows.
A common misconception is that DISTINCT applies to one column. It does not — DISTINCT applies to the entire projected tuple, so listing two columns deduplicates on the combination:
SELECT DISTINCT city, department
FROM employees;New York + Engineering and New York + Marketing are two different rows. This trips up junior candidates who expect DISTINCT to ignore the second column.
DISTINCT also works inside aggregates. COUNT(DISTINCT user_id) counts unique non-null values — the right tool for "how many unique users did X". Standalone SELECT DISTINCT and embedded COUNT(DISTINCT ...) are related but not interchangeable; one reduces rows, the other produces a number.
What GROUP BY actually does
GROUP BY collapses rows into groups based on the columns you name, and lets aggregate functions compute one value per group. On its own, without an aggregate, GROUP BY is a code smell — the engine accepts it, but a reader assumes you forgot COUNT or SUM.
SELECT city, COUNT(*) AS user_count
FROM users
GROUP BY city;One row per city, carrying the row count. The shape is one-row-per-group, with the group key as the GROUP BY tuple.
The combination with multiple aggregates is what makes GROUP BY analytical:
SELECT department,
AVG(salary) AS avg_salary,
MAX(salary) AS max_salary,
COUNT(*) AS headcount
FROM employees
GROUP BY department;Without GROUP BY, those aggregates would compute over the whole table and produce a single row of company-wide averages. With GROUP BY, the same aggregates compute per department — the "metric sliced by dimension" pattern analysts at Meta or Linear write a dozen times a day.
GROUP BY also unlocks HAVING, which filters at the group level after aggregation. WHERE filters input rows before grouping; HAVING filters output rows after. DISTINCT has no notion of groups and cannot replicate that.
Side-by-side comparison
| Question | DISTINCT | GROUP BY |
|---|---|---|
| What is it for? | Remove duplicate output rows | Form groups so aggregates can compute |
| Aggregate functions allowed? | No (only inside COUNT(DISTINCT ...)) |
Yes, this is the whole point |
| Output shape | Unique tuples across SELECT | One row per group key |
| Filtering after the op? | No HAVING equivalent |
HAVING filters groups post-aggregation |
| Reads as | "Give me unique values" | "Give me a metric per category" |
Memorize the "reads as" row. In a code review or interview, that line is what you say first.
When DISTINCT is the right tool
Reach for DISTINCT when the deliverable is a deduplicated list with no metric attached — feeding a dashboard filter dropdown, populating an enum, listing existing categories.
SELECT DISTINCT utm_source
FROM signups
WHERE utm_source IS NOT NULL
ORDER BY utm_source;Notice the explicit NULL filter — DISTINCT treats NULL as its own value and will happily include it, which is rarely what a frontend dropdown wants.
DISTINCT inside COUNT is the other heavy-use case:
SELECT COUNT(DISTINCT user_id) AS unique_users
FROM events
WHERE event_name = 'checkout_completed';Without DISTINCT you would count events, not users — and a power user who checked out three times would inflate the metric. The difference between "how many checkouts happened" (COUNT(*)) and "how many users checked out" (COUNT(DISTINCT user_id)) is how dashboards get quietly inflated for a quarter before anyone notices.
When GROUP BY is the right tool
Reach for GROUP BY when the deliverable is a per-category metric. The shape is "X by Y": revenue by country, signups by week, orders by customer.
SELECT topic, COUNT(*) AS answers
FROM training_events
GROUP BY topic
ORDER BY answers DESC;GROUP BY also enables HAVING for group-level filtering — only topics studied by more than 100 distinct users:
SELECT topic,
COUNT(DISTINCT user_id) AS users
FROM training_events
GROUP BY topic
HAVING COUNT(DISTINCT user_id) > 100;You cannot rewrite that with DISTINCT alone. The metric is per-group, the filter is per-group, the entire shape is per-group. GROUP BY territory.
When both produce the same result
One scenario has DISTINCT and GROUP BY returning the same rows: a single column, no aggregates, no HAVING.
SELECT DISTINCT city FROM users;
SELECT city FROM users GROUP BY city;Both return the same cities. In PostgreSQL the planner usually picks the same execution path — typically a HashAggregate. Identical plan, identical runtime, identical result.
The right call is not "use the one that runs faster" — it is "use the one that reads correctly". DISTINCT signals "I want unique values, full stop". GROUP BY without an aggregate signals "I forgot the aggregate". Spend the seven characters and write DISTINCT.
Performance and execution plans
In modern engines — PostgreSQL 14+, BigQuery, Snowflake, Databricks SQL, ClickHouse — the planner reduces both queries to the same logical operation when they are equivalent: produce unique tuples. The physical operator is usually a hash aggregate, or a streaming distinct when input is already sorted on the group key. EXPLAIN ANALYZE on the two variants above shows the same plan node and the same estimated cost.
Differences do appear in older engines: MySQL 5.x without loose-index-scan picks different strategies on compound indexes; older Hive/Trino routed DISTINCT through a separate operator. None of this is large enough to justify rewriting for speed. The bigger wins come from filtering earlier, projecting fewer columns, and joining on indexed keys.
One real concern: SELECT DISTINCT * over a wide table is almost always wasted work. Every extra column widens the tuple to hash, and the chance of any two rows being identical across forty columns is essentially zero. Step back and ask what you actually want unique — usually one or two columns.
Common pitfalls
The most painful pitfall is mixing up "how many events" with "how many users". SELECT COUNT(*) FROM page_views WHERE page = '/pricing' answers "how many pricing-page views happened". A PM who asked "how many users saw the pricing page" needs COUNT(DISTINCT user_id). Both numbers look plausible on a dashboard but tell different stories and can differ by an order of magnitude when power users dominate. The fix is to read the question out loud, name the noun being counted, and pick the SQL primitive that matches.
A related trap is using GROUP BY for deduplication. Queries like SELECT customer_id FROM orders GROUP BY customer_id appear in legacy codebases. They work, but they obscure intent — a teammate scanning the query looks for an aggregate function, finds none, and assumes the query is broken or that an aggregate was deleted. Replace with SELECT DISTINCT customer_id FROM orders. The execution plan does not change; the readability does.
A third pitfall is forgetting that DISTINCT applies to the entire SELECT list. SELECT DISTINCT user_id, event_ts FROM events does not give you distinct user IDs — it gives you distinct (user_id, event_ts) pairs, which is almost certainly every row in the table because event timestamps are rarely identical. If the intent was "one row per user", you want SELECT DISTINCT user_id or a window function with a row-number filter.
A fourth pitfall is the NULL surprise. Both DISTINCT and GROUP BY treat NULL as equal to itself, the opposite of how NULL behaves in WHERE (NULL = NULL is NULL, not TRUE). SELECT DISTINCT email FROM users returns one row for "NULL email" if any user has a missing email. To exclude that NULL row, filter with WHERE email IS NOT NULL before deduplicating.
The fifth pitfall is performance panic — rewriting a clear DISTINCT into GROUP BY because someone read that GROUP BY is faster. On modern planners the two are equivalent. If EXPLAIN shows the same plan, use the keyword that reads better.
Optimization tips
Push filters above the deduplication. A WHERE clause that runs before DISTINCT or GROUP BY reduces input size and shrinks the hash table the engine has to build — the single biggest lever for these operations.
Project fewer columns. Every extra column in SELECT DISTINCT or in the GROUP BY tuple grows the comparison key and the memory needed to hold the hash table.
Approximate when exact does not matter. For dashboards over billions of rows, APPROX_COUNT_DISTINCT(user_id) in BigQuery, HLL_COUNT.DISTINCT in Snowflake, or uniq() in ClickHouse computes a close estimate via HyperLogLog at a fraction of the cost. A 2% error on monthly active users is invisible to a stakeholder and saves hours of compute.
Interview answers
"DISTINCT and GROUP BY can give the same result. When would you pick one?" DISTINCT for deduplication with no metric; GROUP BY when there is a metric per category. If a junior writes GROUP BY without an aggregate, rewrite with DISTINCT to make intent obvious.
"Which is faster?" On modern engines like PostgreSQL 14+, Snowflake, or BigQuery, they reduce to the same physical plan and run identically. The better question is which reads correctly.
"Can DISTINCT work with multiple columns?" Yes. DISTINCT applies to the full SELECT list — the deduplication key is the tuple of every projected column.
"What is the difference between COUNT(col) and COUNT(DISTINCT col)?" COUNT(col) counts non-null values including duplicates. COUNT(DISTINCT col) counts unique non-null values. The first answers "how many rows have a value", the second answers "how many different values appear".
Related reading
- SQL window functions interview questions — when neither DISTINCT nor GROUP BY is the right tool.
- CASE WHEN in SQL: a guide — conditional logic that pairs well with GROUP BY.
- COALESCE in SQL: a guide — handling NULL before deduplication.
- NULL in SQL: a guide — why NULL behaves differently in WHERE vs GROUP BY.
If you want to drill SQL patterns like this every day, NAILDD launches with 500+ analyst-style SQL problems built around this kind of decision.
FAQ
Can I use DISTINCT and GROUP BY in the same query?
Yes, and it is a common combination. The typical form is COUNT(DISTINCT col) inside a GROUP BY, which asks "how many unique values of X per group of Y". For example, SELECT department, COUNT(DISTINCT city) AS cities FROM employees GROUP BY department returns the number of distinct cities each department operates in. The DISTINCT lives inside the aggregate, the GROUP BY forms the groups, and they do not conflict.
Does DISTINCT work on only one column?
No. DISTINCT applies to the entire SELECT list. If you write SELECT DISTINCT a, b, c, the engine deduplicates on the tuple (a, b, c). Analysts add an extra column to a SELECT DISTINCT query and are surprised when the row count balloons, because the new column makes more tuples unique.
When is GROUP BY genuinely more useful than DISTINCT?
Whenever you need an aggregate or a HAVING filter. DISTINCT has no notion of computing a metric per category, so anything of the form "X by Y with a count, sum, or average" requires GROUP BY. HAVING extends that to filter on the computed metric, which DISTINCT cannot replicate.
Is there a difference in how each handles NULL?
Both treat NULL as equal to itself for deduplication. A column with three NULL rows produces one NULL row after SELECT DISTINCT col or after GROUP BY col. This contrasts with WHERE col = NULL, which is never true. To exclude NULL from the output, filter with WHERE col IS NOT NULL before deduplicating.
Will the query optimizer rewrite one into the other?
In modern engines, the optimizer treats the two equivalent forms as the same logical operation and chooses the same physical plan — typically a hash aggregate. Verify with EXPLAIN ANALYZE on PostgreSQL or by looking at the query profile on Snowflake or BigQuery. Because the plans match, picking between DISTINCT and GROUP BY for performance reasons is rarely worth the cognitive cost.