May 18, 2026·11 min read

GROUP BY SQL cheat sheet

Q: What is the difference between GROUP BY and DISTINCT?

`DISTINCT` deduplicates result rows. `GROUP BY` does the same and lets you compute aggregates per group. For unique values with no aggregation, `DISTINCT` is more readable and the optimizer typically produces an identical plan. For per-group metrics, `GROUP BY` is the only option. Performance is usually equivalent — both go through a hash- or sort-based deduplication step.

Q: How do I count distinct values across two columns at once?

Use a tuple: `COUNT(DISTINCT (a, b))` works on PostgreSQL and Snowflake. On engines without tuple support, concatenate with an unambiguous delimiter — `COUNT(DISTINCT a || '|' || b)` — picking one that cannot appear in the values. For very large tables prefer `APPROX_COUNT_DISTINCT`.

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Join the waitlist

Contents:

Basic syntax
Column rule
Aggregate reference
WHERE vs HAVING
Execution order
Multi-column grouping
NULL behavior
GROUPING SETS, ROLLUP, CUBE
With window functions
12 interview tasks
Common pitfalls
Optimization tips
Related reading
FAQ

Why GROUP BY matters for analysts

GROUP BY does the one thing every analytical query exists for: it collapses raw rows into aggregates. Revenue per customer, average session duration per day, count per acquisition channel — they all flow through GROUP BY. In an analyst loop at Stripe, Airbnb, or DoorDash, you will hit a GROUP BY question in roughly every second SQL screen.

The bar is whether you can predict the row count of a multi-column grouping, explain why an aggregate is illegal in WHERE, and reach for ROLLUP without flipping to docs. This cheat sheet walks syntax, the WHERE vs HAVING trap, NULL behavior, and OLAP extensions, then closes with 12 interview tasks.

Basic syntax

SELECT
    grouping_column,
    aggregate_function(column)
FROM TABLE
WHERE pre_group_filter
GROUP BY grouping_column
HAVING post_group_filter
ORDER BY column_or_aggregate;

WHERE filters rows before grouping. GROUP BY folds the surviving rows into groups, one per unique combination of grouping columns. HAVING filters the resulting groups and can reference aggregates such as COUNT(*) or SUM(amount). ORDER BY runs last on the grouped result. The sequence is: row filter, fold, group filter, sort.

The "every column must be in GROUP BY or an aggregate" rule

The most frequent mistake is selecting a column that is neither grouped nor aggregated:

-- Wrong: user_id is not in GROUP BY and not in an aggregate
SELECT user_id, city, SUM(amount)
FROM orders
GROUP BY city;

The fix is to drop the loose column or include it in the grouping list:

SELECT city, SUM(amount)
FROM orders
GROUP BY city;

Interviewers ask: "Can you put a column in SELECT that is not in GROUP BY?" No, unless it sits inside an aggregate. PostgreSQL and MySQL in non-strict mode silently pick an arbitrary row's value, which ships wrong dashboards. Snowflake, BigQuery, Redshift, and SQL Server reject the query. Always assume strict mode in an interview.

Aggregate functions reference

Function	What it returns
`COUNT(*)`	All rows in the group, NULLs included
`COUNT(col)`	Non-NULL values of `col`
`COUNT(DISTINCT col)`	Distinct non-NULL values of `col`
`SUM(col)`	Sum, ignoring NULL
`AVG(col)`	Mean, ignoring NULL
`MIN(col)` / `MAX(col)`	Smallest / largest value
`STRING_AGG(col, ',')`	Concatenates values with a separator
`ARRAY_AGG(col)`	Collects values into an array

Classic phone-screen question: "What is the difference between COUNT(*) and COUNT(column)?" COUNT(*) counts every row regardless of NULLs. COUNT(column) counts only non-NULL rows. If the column has NULLs the two numbers diverge.

WHERE vs HAVING

The famous trap is putting an aggregate in WHERE:

-- WHERE runs before grouping, on raw rows
SELECT city, COUNT(*) AS users
FROM users
WHERE created_at >= '2026-01-01'
GROUP BY city;

-- HAVING runs after grouping, on aggregates
SELECT city, COUNT(*) AS users
FROM users
GROUP BY city
HAVING COUNT(*) > 100;

WHERE COUNT(*) > 100 is a syntax error — at WHERE time the aggregate does not exist yet. Use both clauses together: WHERE narrows the rows that feed the grouping, HAVING narrows the groups that come out. Push everything you can into WHERE — the optimizer uses indexes on row filters, and the grouping then operates on a smaller intermediate result.

Logical query execution order

SQL is written in one order and executed in another. The logical order:

FROM plus any JOIN
WHERE
GROUP BY
HAVING
SELECT
DISTINCT
ORDER BY
LIMIT

This is why aggregates are legal in HAVING and SELECT but illegal in WHERE, and why SELECT aliases cannot generally be referenced in WHERE or GROUP BY. Reciting this order in an interview signals you understand the engine.

GROUP BY across multiple columns

SELECT city, device, COUNT(*) AS sessions
FROM events
GROUP BY city, device;

Groups form from each unique combination of grouping-column values. With 100 cities and 3 device types the result has at most 300 rows. The interview question: "How many rows will GROUP BY across three columns return?" Answer: as many as there are distinct combinations in the data, bounded above by the product of their cardinalities.

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Join the waitlist

NULL behavior in GROUP BY

NULL gets its own group. Every NULL in a grouping column folds into the same bucket:

SELECT category, COUNT(*)
FROM products
GROUP BY category;
-- Returns one row with category = NULL if any product is uncategorized

This violates NULL <> NULL strict semantics, but a separate group per NULL would be useless. To map NULL to a named bucket: GROUP BY COALESCE(category, 'unknown').

GROUPING SETS, ROLLUP, and CUBE

The OLAP extensions show up on mid-level and senior screens, especially at reporting-heavy teams.

GROUPING SETS

Multiple grouping levels in one pass:

SELECT city, device, COUNT(*)
FROM events
GROUP BY GROUPING SETS (
    (city, device),  -- BY city AND device
    (city),          -- BY city only
    (device),        -- BY device only
    ()               -- grand total
);

One query, four aggregation levels, no UNION ALL. NULL placeholders mark columns absent from a given grouping set.

ROLLUP

Hierarchical subtotals, from most detailed to grand total:

SELECT year, quarter, month, SUM(revenue)
FROM sales
GROUP BY ROLLUP (year, quarter, month);

You get a row per month, per quarter, per year, and a grand total — the shape a finance dashboard wants.

CUBE

Every possible combination of the listed columns:

SELECT city, device, SUM(revenue)
FROM sales
GROUP BY CUBE (city, device);

CUBE (city, device) equals GROUPING SETS ((city, device), (city), (device), ()). Useful for pivot-style reports with a small column count. Cardinality grows as 2^n, so avoid it on wide groupings.

Combining GROUP BY with window functions

Window functions and aggregates coexist in one statement. A common task is "show each city's revenue and its share of total":

SELECT
    city,
    SUM(amount) AS city_revenue,
    ROUND(100.0 * SUM(amount) / SUM(SUM(amount)) OVER (), 2) AS pct_of_total
FROM orders
GROUP BY city;

SUM(SUM(amount)) OVER () adds the per-city revenues into the overall total. The outer SUM is a window function over the grouped rows; the inner SUM defines those rows. A common follow-up after the basic GROUP BY question.

12 interview tasks on GROUP BY

Task 1. Orders per user

SELECT user_id, COUNT(*) AS orders_count
FROM orders
GROUP BY user_id;

Task 2. Average order value per city, cities with more than 10 orders

SELECT city, AVG(amount) AS avg_check
FROM orders
GROUP BY city
HAVING COUNT(*) > 10;

Task 3. Top 5 users by revenue

SELECT user_id, SUM(amount) AS total
FROM orders
GROUP BY user_id
ORDER BY total DESC
LIMIT 5;

Task 4. Unique buyers per day

SELECT created_at::DATE AS day, COUNT(DISTINCT user_id) AS dau
FROM orders
GROUP BY created_at::DATE;

Task 5. Categories with more than one million in revenue

SELECT category, SUM(amount) AS revenue
FROM orders JOIN products USING (product_id)
GROUP BY category
HAVING SUM(amount) > 1000000;

Task 6. Users whose first order is in March 2026

SELECT user_id, MIN(created_at) AS first_order
FROM orders
GROUP BY user_id
HAVING MIN(created_at) >= '2026-03-01'
   AND MIN(created_at) <  '2026-04-01';

SELECT
    category,
    COUNT(*) AS orders,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM orders JOIN products USING (product_id)
GROUP BY category;

Task 8. Mean, median, and max order value per city

SELECT city,
    AVG(amount) AS avg_amount,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY amount) AS median,
    MAX(amount) AS max_amount
FROM orders
GROUP BY city;

Task 9. Bestseller per category

WITH sales AS (
    SELECT category, product_id, SUM(quantity) AS sold
    FROM order_items JOIN products USING (product_id)
    GROUP BY category, product_id
)
SELECT DISTINCT ON (category) category, product_id, sold
FROM sales
ORDER BY category, sold DESC;

Task 10. Customers who bought all of A, B, and C

SELECT user_id
FROM orders
WHERE product_id IN ('A', 'B', 'C')
GROUP BY user_id
HAVING COUNT(DISTINCT product_id) = 3;

A COUNT(DISTINCT) in HAVING verifies full coverage of a target set — common in audience and segment questions.

Task 11. Active and inactive users in one row

SELECT
    COUNT(*) FILTER (WHERE last_active >= CURRENT_DATE - INTERVAL '30 day') AS active,
    COUNT(*) FILTER (WHERE last_active <  CURRENT_DATE - INTERVAL '30 day') AS inactive
FROM users;

FILTER (WHERE ...) does conditional aggregates without a CASE WHEN wrapper.

Task 12. Hierarchical revenue report by year, quarter, with grand total

SELECT
    EXTRACT(YEAR    FROM created_at) AS yr,
    EXTRACT(QUARTER FROM created_at) AS qtr,
    SUM(amount)
FROM orders
GROUP BY ROLLUP (EXTRACT(YEAR FROM created_at), EXTRACT(QUARTER FROM created_at));

ROLLUP adds subtotal rows per year and a grand total row, with NULLs marking rolled-up dimensions.

Common pitfalls

The first trap is selecting a column that is neither grouped nor aggregated. Strict engines reject it; permissive ones return an arbitrary value, corrupting dashboards in ways that take days to detect. Drop the column, wrap it in an aggregate, or add it to GROUP BY. Never rely on permissive mode — the query fails when ported to Snowflake.

The second trap is mixing up WHERE and HAVING. An aggregate in WHERE is a syntax error; a row predicate in HAVING works but discards the index optimizer. Most slow-query tickets at growth-stage companies are predicates in the wrong clause. Row predicates go in WHERE, aggregate predicates go in HAVING.

The third trap is forgetting that NULL forms its own group. Group by an uncategorized column and the dashboard shows a blank label readers misinterpret as missing data. Wrap with COALESCE(col, 'unknown') for human-facing output.

The fourth trap is COUNT(DISTINCT) on large tables. It parallelizes worse than COUNT(*) and blows slot budgets on Redshift, BigQuery, and Snowflake. Use APPROX_COUNT_DISTINCT when exactness is not required — about 1 percent error, runtime an order of magnitude faster.

The fifth trap is misreading multi-column cardinality. Candidates answer "product of cardinalities" instead of "distinct combinations actually present." The two diverge for correlated columns. State the bound, then the reality.

Optimization tips

For grouping queries on large tables, the biggest lever is indexing or clustering by the GROUP BY columns. On PostgreSQL, a covering index on (group_col, agg_col) lets the planner pick an index-only scan. On Snowflake and BigQuery, the equivalent is a clustering key for pre-aggregated partitions.

If you repeatedly compute the same aggregate at the same granularity, materialize it. PostgreSQL materialized views, Snowflake dynamic tables, BigQuery materialized views, Databricks Delta Live Tables all serve this purpose. Dashboards read pre-computed rows and the SUM never re-runs at query time.

For ROLLUP or GROUPING SETS, prefer a single grouped statement over four UNIONed selects. The optimizer shares scans and hash tables across grouping levels. The difference is 4x or more on multi-billion-row event tables.

When COUNT(DISTINCT) is the bottleneck and exactness is not required, use the approximate counter. Typical error is 1 to 2 percent and runtime is 5 to 20 times faster.

If you want to drill GROUP BY and aggregate patterns the way they actually appear in analyst loops, NAILDD is launching with hundreds of curated SQL problems organized by exactly this taxonomy.

FAQ

Can I use a SELECT alias inside GROUP BY?

PostgreSQL, MySQL, and BigQuery accept SELECT aliases inside GROUP BY as a convenience extension. Snowflake, SQL Server, and Oracle require the full expression repeated. The portable habit is to write the expression in both places — it works everywhere and makes the grouping target obvious to a future reader.

What is the difference between GROUP BY and DISTINCT?

DISTINCT deduplicates result rows. GROUP BY does the same and lets you compute aggregates per group. For unique values with no aggregation, DISTINCT is more readable and the optimizer typically produces an identical plan. For per-group metrics, GROUP BY is the only option. Performance is usually equivalent — both go through a hash- or sort-based deduplication step.

How does GROUP BY behave with NULL?

All NULLs in a grouping column collapse into one group. This contradicts strict SQL semantics but is the unanimous convention. For a visible label, wrap the column in COALESCE(col, 'unknown'). To exclude NULLs, add WHERE col IS NOT NULL before the grouping.

When should I use GROUPING SETS instead of multiple UNION ALL queries?

Whenever the underlying scan is expensive, which is almost always for event-level fact tables. GROUPING SETS scans the source once and reuses the hash table across grouping levels. Four separate UNION ALL selects scan the source four times. The runtime difference is 4x or more on multi-billion-row tables.

Can I reference a SELECT alias in HAVING?

PostgreSQL, MySQL, BigQuery, and Snowflake allow SELECT aliases inside HAVING as a convenience. SQL Server and Oracle do not. For fully portable SQL, repeat the aggregate expression in HAVING.

How do I count distinct values across two columns at once?

Use a tuple: COUNT(DISTINCT (a, b)) works on PostgreSQL and Snowflake. On engines without tuple support, concatenate with an unambiguous delimiter — COUNT(DISTINCT a || '|' || b) — picking one that cannot appear in the values. For very large tables prefer APPROX_COUNT_DISTINCT.