GROUP BY and HAVING in SQL
Contents:
- Why GROUP BY shows up in every analyst interview
- Aggregate functions you will use every day
- GROUP BY across multiple columns
- HAVING: filtering groups after aggregation
- Logical execution order of a SQL query
- Worked interview examples
- How NULL behaves under GROUP BY
- Common pitfalls
- Related reading
- FAQ
Why GROUP BY shows up in every analyst interview
GROUP BY is the first SQL construct that separates someone who can read a query from someone who can answer business questions. Every aggregated metric an analyst at Stripe, Airbnb, or DoorDash produces — revenue by category, orders per user, weekly active accounts by acquisition channel — comes out of a GROUP BY block. If you write SELECT and WHERE confidently but stumble on grouping, the technical loop is going to feel rougher than it should.
The mental model is short: GROUP BY collapses many input rows into one output row per unique value of the grouping key. Every other column in SELECT must either be one of those keys or wrapped in an aggregate function. Once that rule clicks, half the GROUP BY questions in an interview become typing exercises.
A PM pings you on Slack at 5pm: "Can you give me orders per city for last month, top 20 by volume, by standup tomorrow?" That is GROUP BY + COUNT + ORDER BY + LIMIT, nothing more.
SELECT
department,
COUNT(*) AS employee_count
FROM employees
GROUP BY department;One row per department, with the headcount next to it. That is the entire mechanic.
Aggregate functions you will use every day
GROUP BY without an aggregate function is almost never useful. Five functions cover roughly 95% of analyst work.
COUNT(*) returns the number of rows in each group, including rows where every column is NULL. It is the count of physical records.
SELECT city, COUNT(*) AS order_count
FROM orders
GROUP BY city;SUM() adds up a numeric column inside each group — revenue, units sold, minutes watched.
SELECT category, SUM(amount) AS total_revenue
FROM sales
GROUP BY category;AVG() is the arithmetic mean. It silently ignores NULLs in both the numerator and the denominator, which surprises people in code review.
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department;MIN() and MAX() return the smallest and largest value in the group. They work on dates, strings, and numbers, which lets you grab first and last touch timestamps in one query.
SELECT
user_id,
MIN(order_date) AS first_order,
MAX(order_date) AS last_order
FROM orders
GROUP BY user_id;COUNT(DISTINCT) returns the number of unique non-NULL values in the column. Interviewers single it out because the gap between COUNT(*) and COUNT(DISTINCT user_id) is the gap between an event volume metric and a user metric.
SELECT
event_date,
COUNT(DISTINCT user_id) AS dau
FROM user_activity
GROUP BY event_date;One sentence to memorize: COUNT(*) counts rows, COUNT(column) counts rows where the column is not NULL, and COUNT(DISTINCT column) counts unique non-NULL values. The interviewer at Meta or Snowflake will ask you to articulate that.
GROUP BY across multiple columns
You can group by more than one key. Every unique combination of those keys becomes its own output row.
SELECT
category,
region,
SUM(amount) AS total_revenue
FROM sales
GROUP BY category, region
ORDER BY category, total_revenue DESC;Five categories and three regions produce up to fifteen rows — fewer if some combinations have no sales. A frequent prompt is "give me order counts by month and city." Same shape, with DATE_TRUNC doing the bucketing.
SELECT
DATE_TRUNC('month', order_date) AS month,
city,
COUNT(*) AS order_count
FROM orders
GROUP BY DATE_TRUNC('month', order_date), city
ORDER BY month, order_count DESC;Many dialects, including PostgreSQL, let you write GROUP BY 1, 2 referencing SELECT positions. That works, but it stops being readable past three columns and breaks when someone reorders SELECT. Spell out the expressions in production code.
HAVING: filtering groups after aggregation
WHERE filters rows before they are grouped. HAVING filters groups after aggregation. That single sentence is the most useful thing to memorize on this topic.
WHERE operates on raw input rows:
SELECT
category,
SUM(amount) AS total_revenue
FROM sales
WHERE amount > 0
GROUP BY category;The condition amount > 0 removes refund and zero rows before SUM runs. HAVING operates on the aggregated result:
SELECT
category,
SUM(amount) AS total_revenue
FROM sales
GROUP BY category
HAVING SUM(amount) > 100000;Only categories with total revenue above 100,000 USD survive. You cannot express that in WHERE — at the WHERE stage, no aggregate has been computed yet. WHERE doesn't know the categories exist as groups.
You combine them constantly:
SELECT
user_id,
COUNT(*) AS order_count
FROM orders
WHERE order_date >= '2026-01-01'
GROUP BY user_id
HAVING COUNT(*) >= 3;WHERE narrows the table to 2026 orders. GROUP BY collapses by user. HAVING keeps users with three or more orders. One job per clause.
Logical execution order of a SQL query
This is the single most asked theoretical SQL interview question. Memorize it.
- FROM — pick the source table or join graph.
- WHERE — filter raw rows.
- GROUP BY — collapse rows into groups.
- HAVING — filter groups.
- SELECT — evaluate output expressions, including aggregates.
- ORDER BY — sort the result.
- LIMIT — truncate.
Several rules fall out of this order. You cannot use an aggregate inside WHERE — grouping hasn't happened yet. You cannot reference a SELECT alias inside HAVING in standard SQL — the alias is invented during SELECT, which runs later (PostgreSQL bends this rule; don't rely on it at a whiteboard). You can reference SELECT aliases inside ORDER BY. And under GROUP BY, every non-aggregated column in SELECT must appear in the GROUP BY list.
That last rule is the most common compile error juniors hit. SELECT user_id, name, COUNT(*) FROM orders GROUP BY user_id fails because name is neither in GROUP BY nor wrapped in an aggregate. Fix: add name to GROUP BY (assumes one name per user_id) or wrap it as MAX(name).
Worked interview examples
Revenue by category, last 30 days
SELECT
p.category,
COUNT(DISTINCT o.order_id) AS orders,
SUM(o.amount) AS revenue,
ROUND(AVG(o.amount), 2) AS avg_order_value
FROM orders o
JOIN products p ON o.product_id = p.product_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY p.category
ORDER BY revenue DESC;Notice COUNT(DISTINCT o.order_id). If the join to products ever fans out — and interviewers love to set that trap — plain COUNT(*) will overcount orders. Defensive aggregation is a small habit that catches a lot of bugs.
Repeat buyers with three or more orders
SELECT
user_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
GROUP BY user_id
HAVING COUNT(*) >= 3
ORDER BY total_spent DESC;GROUP BY + HAVING shows up in roughly every other analyst loop. Wording varies — "users who purchased more than N times", "cities with revenue above X", "days where DAU exceeded Y" — the skeleton is identical.
Monthly Active Users
SELECT
DATE_TRUNC('month', event_date) AS month,
COUNT(DISTINCT user_id) AS mau
FROM user_activity
GROUP BY DATE_TRUNC('month', event_date)
ORDER BY month;COUNT(DISTINCT user_id) is non-negotiable. COUNT(*) would return event volume, often ten times higher than the user count. That is the textbook mistake interviewers hope you make.
Average order value by day of week
SELECT
EXTRACT(DOW FROM order_date) AS day_of_week,
COUNT(*) AS orders,
ROUND(AVG(amount), 2) AS avg_order_value
FROM orders
GROUP BY EXTRACT(DOW FROM order_date)
ORDER BY day_of_week;The grouping expression must repeat verbatim in SELECT and GROUP BY in standard SQL. PostgreSQL accepts a SELECT alias here, but the portable form is the one above.
How NULL behaves under GROUP BY
GROUP BY treats all NULL values as a single group. This is a deliberate exception to the rule "NULL is never equal to NULL" you learn elsewhere in SQL.
SELECT
region,
COUNT(*) AS user_count
FROM users
GROUP BY region;If 4,000 users have region = NULL, you get one extra row with region as NULL and user_count = 4000. BI tools render that as a blank cell, which is how stakeholders end up asking "what's the empty row?" Add COALESCE(region, 'unknown') to SELECT and GROUP BY when reporting to non-technical audiences.
Aggregates other than COUNT(*) skip NULL inputs. AVG(salary) ignores rows where salary IS NULL — they don't appear in the numerator or denominator. COUNT(salary) returns the number of non-NULL salary values. The classic trick question: "the table has 100 rows, 20 of them with salary = NULL; what does AVG(salary) divide by?" Answer: 80.
Common pitfalls
The first trap is using COUNT(*) when the column name was the right answer. The two look identical until the table has NULLs, and then the numbers diverge. To count "users with a phone number on file", you want COUNT(phone), not COUNT(*). Spell out the column whenever the question is about a specific attribute being populated.
A close second is forgetting that joins fan out before aggregation. Joining orders to order_items multiplies row counts, so a downstream SUM(orders.amount) will overcount revenue by items-per-order. Aggregate order_items in a subquery before joining, or pre-aggregate one side in a CTE — defensive subqueries are clearer than tricky distinct-sums.
A third pitfall is putting an aggregate inside WHERE. WHERE COUNT(*) > 5 is a compile error in every dialect because WHERE runs before grouping. The fix is HAVING, not a hack. Walk through the logical order once more if this feels fuzzy.
A fourth pitfall shows up in dashboards: grouping by order_date directly instead of DATE_TRUNC('day', order_date). If the column is a timestamp, every row lands in its own group and you get millions of single-event "groups" rather than daily aggregates. Always truncate timestamps to the reporting resolution.
A fifth, subtler one is reusing a SELECT alias inside HAVING. PostgreSQL and MySQL allow HAVING total > 100 if total is an alias from SELECT, but Snowflake, BigQuery, and Trino tend to be stricter. Production code that runs across warehouses should repeat the aggregate expression — verbose, but portable.
Related reading
- DISTINCT vs GROUP BY in SQL
- SQL window functions interview questions
- CTE vs subquery in SQL
- NULL in SQL: the complete guide
If you want to drill aggregation patterns until they are automatic, NAILDD is shipping with 500+ SQL problems sorted by topic — GROUP BY, HAVING, window functions, joins, NULL handling — each with the kind of detailed solution you would walk a junior teammate through.
FAQ
Can you use GROUP BY without an aggregate function?
Technically yes. SELECT category FROM products GROUP BY category returns the unique categories, behaving identically to SELECT DISTINCT category FROM products. In practice this is bad style: it signals "I am aggregating" while doing nothing of the sort. Use DISTINCT when the intent is deduplication and GROUP BY when the intent is aggregation.
Difference between HAVING COUNT(*) > 1 and a subquery doing the same thing?
They produce the same result, but the execution profile differs. HAVING runs in the same logical pass — group, then filter. A subquery with a self-join or correlated lookup typically forces the engine to scan twice or build an intermediate result set. For simple conditions HAVING is faster and shorter. For conditions that compare each group to a global statistic — "users whose average order value is above the overall average" — a CTE or subquery makes the math explicit and reads cleaner.
How does GROUP BY interact with JOIN?
The join runs first, producing a wider intermediate table, and GROUP BY collapses the result. The pitfall is unintended fan-out: if one row on the left joins to many rows on the right, your aggregates will silently overcount. Run COUNT(*) vs COUNT(DISTINCT primary_key) on the join result before adding final aggregation; if they disagree, the join is duplicating. Either tighten the join condition or pre-aggregate one input in a CTE.
Why does my SELECT clause throw an error when I add a column that isn't in GROUP BY?
The rule: every non-aggregated column in SELECT must appear in GROUP BY. After grouping, each output row corresponds to many input rows, and the database has no way to pick which input value to display. Older MySQL with ONLY_FULL_GROUP_BY disabled would silently pick an arbitrary row — the exact nondeterminism that ships wrong dashboards. Wrap the offending column in MAX(), MIN(), or STRING_AGG().
When should HAVING be replaced by a window function?
HAVING filters entire groups out of the result. A window function annotates each row with a group-level statistic while keeping every row. For "which orders belong to a user with three or more purchases", you want a window function followed by a filter, not GROUP BY + HAVING — the latter collapses you to one row per user and throws away the order detail you needed.