GROUP BY in SQL: a working analyst's guide
Contents:
Why GROUP BY matters
GROUP BY is the workhorse of analytics SQL. It collapses many rows into fewer rows by combining records that share the same value in one or more columns, then runs an aggregate — COUNT, SUM, AVG, MIN, MAX — across each group. Without it, you cannot answer the questions that pay analyst salaries: revenue by month, signups by channel, conversion by experiment arm.
On a data analyst loop at Stripe, Airbnb, DoorDash, or Snowflake, GROUP BY shows up in almost every SQL question. The phrasing is rarely "use GROUP BY" — it is "show me revenue per cohort", "rank channels by activation rate", "find customers with five or more orders this quarter". This guide walks the syntax, where GROUP BY earns its keep, and the traps that show up in screens.
Syntax and execution order
The skeleton looks deceptively simple. WHERE filters individual rows, GROUP BY forms the buckets, HAVING filters the buckets, and ORDER BY sorts whatever survives:
SELECT column, AGG(other_column)
FROM TABLE
WHERE row_filter
GROUP BY column
HAVING aggregate_filter
ORDER BY column;The logical execution order is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT. That order explains every confusing error message you will hit. WHERE cannot reference aggregates because GROUP BY has not run yet. HAVING can. Aliases from SELECT are unavailable in WHERE or GROUP BY in strict Postgres but available in ORDER BY because ORDER BY runs last.
A minimal worked example — counting orders per category:
SELECT
category,
COUNT(*) AS order_count
FROM orders
GROUP BY category;Each distinct value of category becomes one bucket. COUNT(*) returns the number of rows in that bucket. The result is one row per category, no matter how many millions of orders were behind them.
GROUP BY with aggregate functions
GROUP BY without an aggregate is almost always pointless — it reduces to SELECT DISTINCT.
COUNT is the most common. COUNT(*) counts every row in the group. COUNT(column) counts non-NULL values. COUNT(DISTINCT column) counts unique non-NULL values. Getting them wrong is a classic interview trap:
SELECT
city,
COUNT(*) AS rows_in_group,
COUNT(user_id) AS non_null_users,
COUNT(DISTINCT user_id) AS unique_users
FROM events
GROUP BY city
ORDER BY unique_users DESC;If events has duplicates after a join, COUNT(*) will lie about user counts. Use COUNT(DISTINCT user_id) when the question is "how many people".
SUM adds numeric values; NULLs are ignored:
SELECT
product_id,
SUM(quantity) AS units_sold,
SUM(amount) AS revenue
FROM order_items
GROUP BY product_id
ORDER BY revenue DESC;AVG computes the mean over non-NULL values. To treat NULLs as zero, wrap with COALESCE — otherwise the denominator changes silently:
SELECT
department,
ROUND(AVG(salary), 2) AS avg_salary_excl_nulls,
ROUND(AVG(COALESCE(salary, 0)), 2) AS avg_salary_incl_nulls,
COUNT(*) AS headcount
FROM employees
GROUP BY department;MIN and MAX work on numbers, dates, and strings. They are also handy for picking a representative non-aggregate column: MAX(name) returns one name from the group without errors when you do not care about ties.
Multi-column grouping
You can group on two or more columns. Each unique combination becomes its own bucket. The result set grows multiplicatively with the cardinality of each column:
SELECT
city,
segment,
COUNT(*) AS user_count,
SUM(total_spent) AS revenue
FROM users
GROUP BY city, segment
ORDER BY city, revenue DESC;If there are 5 cities and 3 segments, the upper bound is 15 groups. In practice you will see fewer rows because not every combination exists — also a debugging clue if you expected 15 and got 12.
HAVING: filtering after aggregation
WHERE filters rows before grouping. HAVING filters groups after. The split exists because at the WHERE step, groups do not exist yet, so aggregates have nothing to evaluate:
SELECT
user_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
GROUP BY user_id
HAVING COUNT(*) >= 5
ORDER BY total_spent DESC;WHERE COUNT(*) >= 5 throws a syntax error in Postgres and Snowflake. HAVING is the right clause for aggregates. You can combine the two — WHERE narrows the input rows, HAVING narrows the output groups:
SELECT
category,
SUM(amount) AS revenue
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY category
HAVING SUM(amount) > 100000;WHERE drops orders from before 2026. GROUP BY collapses what remains by category. HAVING removes categories whose total revenue is at or below 100,000. Mental model: WHERE is "which records belong in scope", HAVING is "which buckets are big enough to care about".
GROUP BY with expressions
GROUP BY accepts expressions, not just bare columns. The most common case is rolling timestamps to a coarser grain with DATE_TRUNC:
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue,
COUNT(DISTINCT user_id) AS buyers
FROM orders
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;In strict Postgres the expression in GROUP BY must match the SELECT expression character for character. GROUP BY month after aliasing in SELECT fails because the alias does not exist at the grouping step. Workarounds: repeat the expression, or use positional notation GROUP BY 1. Positional notation is concise but fragile because reordering SELECT silently changes the grouping.
Another common expression is bucketing with CASE WHEN — see CASE WHEN in SQL: full guide for conditional patterns that pair with GROUP BY.
Worked examples for analyst interviews
Three templates show up in almost every SQL screen. Worth memorizing.
Conversion by acquisition channel — one column per stage, division at the end, NULLIF to dodge divide-by-zero:
SELECT
utm_source,
COUNT(DISTINCT user_id) AS visitors,
COUNT(DISTINCT CASE WHEN has_purchase THEN user_id END) AS buyers,
ROUND(
COUNT(DISTINCT CASE WHEN has_purchase THEN user_id END)::numeric
/ NULLIF(COUNT(DISTINCT user_id), 0) * 100,
2
) AS conversion_pct
FROM user_visits
GROUP BY utm_source
ORDER BY conversion_pct DESC;Revenue and order count by month — the shape every product analyst writes in their first week:
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS monthly_revenue,
COUNT(*) AS orders
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;Top cities by average order value with a minimum sample size. HAVING earns its keep here — without the floor, cities with two orders dominate the leaderboard:
SELECT
u.city,
ROUND(AVG(o.amount), 2) AS avg_check,
COUNT(*) AS order_count
FROM orders o
JOIN users u ON o.user_id = u.id
GROUP BY u.city
HAVING COUNT(*) >= 30
ORDER BY avg_check DESC
LIMIT 10;The HAVING COUNT(*) >= 30 line is the trick most candidates miss. Without it the highest-AOV city is one with a single luxury order and no signal.
GROUP BY vs PARTITION BY
The two clauses sound similar and trip up candidates constantly. GROUP BY collapses a group of rows into one row. PARTITION BY, inside a window function, computes an aggregate across a group while keeping every original row intact:
-- GROUP BY: one row per category
SELECT
category,
SUM(amount) AS revenue
FROM orders
GROUP BY category;
-- PARTITION BY: every order is kept, category revenue is added as a column
SELECT
order_id,
category,
amount,
SUM(amount) OVER (PARTITION BY category) AS category_revenue
FROM orders;Rule of thumb: if the answer is "one row per group", use GROUP BY. If you need row-level detail with the aggregate next to it — "show every order with its share of category revenue" — use a window function with PARTITION BY. Window-function patterns are covered in depth in SQL window functions interview questions. GROUP BY also overlaps with SELECT DISTINCT; differences are in DISTINCT vs GROUP BY in SQL.
Common pitfalls
The first trap is putting a non-aggregated column in SELECT that is not in GROUP BY. A query like SELECT user_id, name, COUNT(*) FROM orders GROUP BY user_id fails in Postgres and standard SQL with a "column must appear in GROUP BY clause or be used in an aggregate" error. The fix is to add name to GROUP BY or wrap it with MAX(name). MySQL with default sql_mode silently picks an arbitrary value, which is a debugging nightmare.
The second trap is reaching for WHERE when you meant HAVING. WHERE SUM(amount) > 10000 is a category error — WHERE runs row by row before grouping, so SUM has nothing to sum. The fix is HAVING SUM(amount) > 10000. If the filter does not involve an aggregate, WHERE is correct and faster because it cuts the row count before aggregation.
The third trap is COUNT(*) when you needed COUNT(DISTINCT user_id) — the most common bug in analytics queries with joins. After a one-to-many join, every user row is duplicated by the joined table's cardinality, and COUNT(*) reports the join cardinality, not the user count. For "how many unique users", COUNT(DISTINCT user_id) is the only safe answer.
The fourth trap is grouping by an alias in strict Postgres. SELECT DATE_TRUNC('month', dt) AS month FROM orders GROUP BY month fails because the alias does not exist at the GROUP BY stage. Repeat the expression or use positional notation GROUP BY 1. MySQL allows aliases, which is convenient until you port the query to Postgres.
The fifth trap is forgetting NULL handling. SQL treats NULL as a value — all NULL rows collapse into one NULL group. To drop them, add WHERE column IS NOT NULL. To label them, group by COALESCE(channel, 'unknown').
Interview questions
What does GROUP BY do? It collects rows that share the same value in the listed columns into one group, then exposes the group to aggregates like COUNT, SUM, AVG, MIN, MAX. The output has one row per distinct combination of grouping values.
Which columns are valid in SELECT when GROUP BY is present? Only columns listed in GROUP BY and the outputs of aggregate functions. Anything else must be added to GROUP BY or wrapped in an aggregate. Postgres and Snowflake enforce this strictly; MySQL with default sql_mode does not.
How does GROUP BY differ from PARTITION BY? GROUP BY collapses groups into one row per group. PARTITION BY inside window functions computes the aggregate over the partition but keeps every original row.
Write a query for monthly revenue in 2026. The standard answer:
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(amount) AS revenue
FROM orders
WHERE order_date >= '2026-01-01'
AND order_date < '2027-01-01'
GROUP BY DATE_TRUNC('month', order_date)
ORDER BY month;WHERE narrows rows to 2026, GROUP BY rolls them into months, SUM totals the amount.
Can you GROUP BY an expression? Yes — DATE_TRUNC('day', dt), EXTRACT(YEAR FROM dt), CASE WHEN amount > 100 THEN 'high' ELSE 'low' END, anything. In strict Postgres the expression must match SELECT exactly.
Related reading
- DISTINCT vs GROUP BY in SQL
- CASE WHEN in SQL: full guide
- SQL window functions interview questions
- COALESCE in SQL: full guide
- CTE WITH in SQL: full guide
If you want to drill questions like this every day, NAILDD is launching with 500+ SQL problems built around exactly these patterns.
FAQ
In what order do GROUP BY, WHERE, and HAVING run?
The logical order is FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT. WHERE filters rows before grouping. GROUP BY forms groups from the surviving rows. HAVING filters those groups by aggregates. SELECT projects the final columns, ORDER BY sorts, LIMIT trims. This explains why aggregates work in HAVING but not WHERE, and why aliases work in ORDER BY but not GROUP BY in strict Postgres — the alias does not exist until SELECT has run.
Can I use GROUP BY without any aggregate function?
Technically yes — GROUP BY without an aggregate behaves like SELECT DISTINCT. In practice this is a code smell. If you want unique combinations, use DISTINCT and signal the intent. GROUP BY is shorthand for "I am about to aggregate".
How does GROUP BY handle NULL values?
NULL is treated as a regular value for grouping. All rows where the grouping column is NULL fall into one bucket. This is standard behavior across Postgres, MySQL, Snowflake, and BigQuery. To exclude them, add WHERE column IS NOT NULL. To keep them with a meaningful label, group by COALESCE(column, 'unknown').
Why does my GROUP BY query return more rows than expected?
Two suspects: an extra column in GROUP BY that you forgot about, and a join that introduced duplicates. If you grouped by city, segment, plan_tier but only meant city, segment, you get every plan_tier combination back. If you joined a one-to-many table without aggregating first, every parent row is duplicated by child cardinality. Fix: drop the extra grouping column, or aggregate the joined table to one row per join key in a CTE before joining.
Can I use a column alias inside GROUP BY?
Depends on the database. MySQL and SQLite let you reference SELECT aliases inside GROUP BY. Postgres and standard SQL do not — SELECT logically runs after GROUP BY. The portable workarounds are to repeat the expression in full or use positional notation GROUP BY 1, 2. Positional notation is shorter but fragile because reordering SELECT silently changes the grouping.
Does GROUP BY guarantee any ordering of the output?
No. Some databases produce sorted output because the optimizer used a sorted aggregate, but you cannot rely on it. Always add ORDER BY when you need a specific order. Treating GROUP BY output as ordered is a painful bug because the query "worked" yesterday and started returning rows differently today after the planner switched to a hash aggregate.