Aggregate functions in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Why aggregates matter

Aggregate functions collapse a set of rows into a single value. They count, sum, average, or find the smallest and largest entries — and without them you cannot compute a single business metric. DAU, revenue, average order value, conversion rate, retention — every dashboard a Stripe or Airbnb analyst opens on Monday is powered by some combination of COUNT, SUM, and AVG. The functions look trivial in isolation, but the way they interact with NULL, GROUP BY, and DISTINCT is where most candidates lose points on a SQL screen.

If you only remember one rule walking into an interview, remember this: all aggregates ignore NULL except COUNT(*). That single sentence explains 80% of the trick questions on a junior or mid-level SQL round at Meta, DoorDash, Snowflake, or Linear. The other 20% are about filtering before vs after the grouping, and about when to reach for COUNT(DISTINCT ...) instead of plain COUNT(*).

The five core functions

COUNT — counting rows

COUNT has three forms, and the difference between them is one of the most reliably asked SQL interview questions.

COUNT(*) — counts all rows, including rows that contain NULL in every column.

SELECT COUNT(*) AS total_orders
FROM orders;

COUNT(column) — counts rows where column IS NOT NULL.

SELECT COUNT(email) AS users_with_email
FROM users;

COUNT(DISTINCT column) — counts unique non-NULL values.

SELECT COUNT(DISTINCT user_id) AS unique_buyers
FROM orders;

Suppose users has 1,000 rows, 50 of them have email IS NULL, and 200 of the remaining 950 are duplicates. Then COUNT(*) = 1000, COUNT(email) = 950, and COUNT(DISTINCT email) = 750. Interviewers love this setup because it tests three concepts in one breath.

SUM — totaling values

SELECT SUM(amount) AS total_revenue
FROM orders
WHERE order_date >= '2025-01-01';

SUM ignores NULL. If a few rows have amount IS NULL they simply do not contribute — they are not treated as zero and they do not break the calculation. But if every value in the group is NULL, SUM returns NULL, not 0. Wrap the result in COALESCE(SUM(amount), 0) whenever a downstream chart or division would otherwise blow up.

AVG — arithmetic mean

SELECT AVG(salary) AS avg_salary
FROM employees
WHERE department = 'analytics';

AVG ignores NULL in both the numerator and the denominator. If 3 of 10 rows have salary IS NULL, the average is computed over the other 7. This is why AVG(salary) and SUM(salary) / COUNT(*) produce different numbers whenever the column contains nulls — the second expression divides by 10, not 7.

MIN and MAX — extremes

MIN and MAX work on numbers, dates, and strings.

SELECT
    user_id,
    MIN(order_date) AS first_order,
    MAX(order_date) AS last_order
FROM orders
GROUP BY user_id;

For strings, MIN returns the lexicographically smallest value and MAX the largest. For dates, MIN is the earliest moment and MAX the latest. Both ignore NULL. A common interview trick: ask for "the most recent login per user" — that is just MAX(login_at) GROUP BY user_id, no window function required.

GROUP BY: aggregation by group

An aggregate without GROUP BY collapses the entire table into a single row. With GROUP BY, it collapses each group separately.

SELECT
    category,
    COUNT(*) AS order_count,
    SUM(amount) AS revenue,
    ROUND(AVG(amount), 2) AS avg_check
FROM orders
GROUP BY category
ORDER BY revenue DESC;

When GROUP BY is present, the SELECT clause may only reference columns listed in GROUP BY or wrapped in an aggregate. SELECT category, product_name, COUNT(*) FROM orders GROUP BY category is a syntax error in standard SQL because product_name is neither grouped nor aggregated. Postgres, Snowflake, and BigQuery all enforce this; MySQL historically allowed it under ONLY_FULL_GROUP_BY = OFF, a known footgun.

Group by multiple columns and every unique combination becomes its own group:

SELECT
    DATE_TRUNC('month', order_date) AS month,
    city,
    COUNT(DISTINCT user_id) AS buyers
FROM orders
GROUP BY DATE_TRUNC('month', order_date), city;

Load-bearing trick: if your monthly numbers look too high after a JOIN, you probably need COUNT(DISTINCT user_id) instead of COUNT(*). A user with three orders becomes three rows after joining orders to users, and COUNT(*) will triple-count them.

HAVING: filtering after aggregation

WHERE filters rows before the grouping happens and cannot contain aggregate functions. HAVING filters groups after the aggregation has been computed, and aggregates are perfectly legal there.

SELECT
    user_id,
    COUNT(*) AS order_count,
    SUM(amount) AS total_spent
FROM orders
WHERE order_date >= '2025-01-01'
GROUP BY user_id
HAVING COUNT(*) >= 5 AND SUM(amount) > 10000
ORDER BY total_spent DESC;

Read it top to bottom in execution order: WHERE keeps only the 2025 orders. GROUP BY collapses what remains into one row per user. HAVING keeps only those users with five or more orders and more than $10,000 spent. ORDER BY sorts the survivors. Putting the aggregate filter in WHERE is a guaranteed syntax error in every dialect.

NULL and aggregate functions

This is the table that wins or loses interview points. Memorize it.

Function NULL behavior Returns when all values are NULL
COUNT(*) Counts every row, including all-NULL rows The row count (often non-zero)
COUNT(column) Skips rows where column IS NULL 0
SUM(column) Ignores NULL values NULL
AVG(column) Ignores NULL in both numerator and denominator NULL
MIN(column) Ignores NULL NULL
MAX(column) Ignores NULL NULL

The consequence: AVG(column) != SUM(column) / COUNT(*) whenever column contains nulls, because the right-hand side divides by every row while AVG divides only by the non-null ones. If you want a "treat null as zero" average, write AVG(COALESCE(column, 0)) explicitly. This single distinction shows up on at least one screen per loop at most data-driven companies.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Conditional aggregation with CASE

Wrapping CASE inside COUNT or SUM lets you compute several conditional metrics in a single pass over the table — far cheaper than running multiple queries and joining them.

Conversion rate in one query:

SELECT
    DATE_TRUNC('day', visit_date) AS day,
    COUNT(DISTINCT user_id) AS visitors,
    COUNT(DISTINCT CASE WHEN has_purchase THEN user_id END) AS buyers,
    ROUND(
        COUNT(DISTINCT CASE WHEN has_purchase THEN user_id END)::numeric
        / NULLIF(COUNT(DISTINCT user_id), 0) * 100, 2
    ) AS conversion_pct
FROM user_visits
GROUP BY DATE_TRUNC('day', visit_date)
ORDER BY day;

Revenue split by segment in one pass:

SELECT
    SUM(CASE WHEN segment = 'new' THEN amount ELSE 0 END) AS new_revenue,
    SUM(CASE WHEN segment = 'returning' THEN amount ELSE 0 END) AS returning_revenue,
    SUM(amount) AS total_revenue
FROM orders;

The pattern: COUNT(CASE WHEN condition THEN 1 END) counts rows that satisfy the condition (because CASE returns NULL otherwise, and COUNT(column) skips nulls). SUM(CASE WHEN condition THEN amount ELSE 0 END) totals values when the condition holds. Drill this before any analyst interview — at Netflix and Uber, expect to see it on the SQL homework.

Worked examples

DAU — daily active users

SELECT
    DATE_TRUNC('day', event_time) AS day,
    COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_time >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY DATE_TRUNC('day', event_time)
ORDER BY day;

The non-negotiable detail is COUNT(DISTINCT user_id). Without DISTINCT you are counting events, not users, and a single power user with 500 sessions will dominate the chart. Half of all junior candidates fail this version of the question by writing COUNT(*) and calling it DAU.

Revenue and buyers by segment, current month

SELECT
    u.segment,
    COUNT(DISTINCT o.user_id) AS buyers,
    SUM(o.amount) AS revenue,
    ROUND(AVG(o.amount), 2) AS avg_check
FROM orders o
JOIN users u ON o.user_id = u.id
WHERE o.order_date >= DATE_TRUNC('month', CURRENT_DATE)
GROUP BY u.segment
ORDER BY revenue DESC;

Three aggregates in one query, grouped by a column from the joined table. Note the COUNT(DISTINCT o.user_id) — after a JOIN, the same user may appear on multiple rows, and a naive COUNT(*) would inflate the buyer count by the average orders-per-user factor.

Common pitfalls

The most common mistake on real interviews is a column in SELECT that is neither aggregated nor in GROUP BY. SELECT user_id, name, COUNT(*) FROM orders GROUP BY user_id is invalid: name is dangling. Either add name to GROUP BY (fine if every user_id has exactly one name) or wrap it in MAX(name). This trips juniors because some MySQL configurations silently accept the query and return whichever name the engine picks first — a non-deterministic result that explodes in production.

A close second is putting an aggregate in WHERE. WHERE COUNT(*) > 5 is illegal because grouping has not happened yet at the row-filtering stage. The aggregate filter belongs in HAVING. The mental model: WHERE operates on rows, HAVING on groups. The execution order is FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY, and any aggregate has to wait until GROUP BY finishes.

A third trap is using COUNT(*) after a JOIN when you wanted unique entities. If users joins to orders one-to-many, SELECT COUNT(*) FROM users JOIN orders ON ... returns the count of orders, not users with at least one order. Use COUNT(DISTINCT user_id) whenever the join can multiply rows. This causes half the "our DAU jumped 3x overnight" Slack messages in any analytics channel.

Pitfall four: AVG with nulls when you wanted a zero-imputed average. If a survey column has nulls for non-respondents and you want their score to count as zero, AVG(score) silently excludes them and inflates the result. Write AVG(COALESCE(score, 0)) when imputation is the intent, and document it — the next reader will not know which interpretation you chose.

Pitfall five: SUM returning NULL instead of 0. If every row in a group is null, SUM returns NULL, and downstream arithmetic also becomes NULL. This is how dashboard cells go blank for one segment and not others. Wrap aggregate sums in COALESCE(SUM(column), 0) when the column is nullable and a missing total should display as zero.

Optimization tips

For large tables, the biggest win is a covering index on the GROUP BY column when the table is filtered by something selective in WHERE first. In Postgres, CREATE INDEX ON orders (order_date, category, amount) lets the planner do an index-only scan for daily category breakdowns. On Snowflake or BigQuery, the equivalent is clustering on the high-cardinality GROUP BY key.

COUNT(DISTINCT) is the expensive aggregate. Warehouse APPROX_COUNT_DISTINCT (BigQuery, Snowflake, Redshift) drops the cost by an order of magnitude with an error budget around 2% — fine for trend dashboards. Reserve exact COUNT(DISTINCT) for finance and compliance numbers.

Materialize repeated aggregates. If the same GROUP BY category, day query runs hourly off a 100M-row events table, build a daily rollup table and query that. A 100x query speedup for a one-time write is the easiest performance win in analytics engineering.

If you want to drill aggregate-function SQL questions every day until they feel automatic, NAILDD is launching with hundreds of SQL problems built around exactly these patterns — COUNT variants, conditional aggregation, HAVING traps, and the NULL edge cases interviewers love.

FAQ

How are aggregate functions different from window functions?

Aggregate functions with GROUP BY collapse rows into a single row per group. Window functions compute a value for each row without collapsing the result. SUM(amount) ... GROUP BY user_id returns one row per user. SUM(amount) OVER (PARTITION BY user_id) keeps every row and adds a per-user total to each. Use aggregates for summary tables, windows for detail rows with extra context.

What does SUM return when all input values are NULL?

NULL, not 0. This is a common source of blank dashboard cells in production. If a downstream chart expects a numeric value, wrap the aggregate in COALESCE(SUM(column), 0). The same rule applies to AVG, MIN, and MAX — they all return NULL when every input is null.

Can I combine multiple aggregate functions in one SELECT?

Yes, and you should. SELECT category, COUNT(*), SUM(amount), AVG(amount), MIN(amount), MAX(amount) FROM orders GROUP BY category runs as a single pass. There is no penalty for stacking aggregates — five in one query is cheaper than five separate queries because the engine scans rows once.

When should I use COUNT(DISTINCT) vs COUNT(*)?

Use COUNT(*) for number of rows — total orders, events, sessions. Use COUNT(DISTINCT column) for unique entities — unique users, products, sessions per user. After any JOIN that can multiply rows, default to COUNT(DISTINCT) unless the join is strictly one-to-one.

Why does AVG behave differently from SUM divided by COUNT?

AVG(column) divides the sum of non-null values by the count of non-null values. SUM(column) / COUNT(*) divides the sum by all rows, null or not. With nulls present, the second expression returns a smaller number. Be explicit: AVG(column) for "average over respondents", AVG(COALESCE(column, 0)) for "treat missing as zero".

How do I practice aggregate functions for interviews?

Write them by hand. Build a synthetic table with deliberate nulls and duplicates, then write every variant: COUNT(*) vs COUNT(column) vs COUNT(DISTINCT column), AVG vs SUM/COUNT, conditional aggregation with CASE, HAVING with multiple conditions. Muscle memory of typing these correctly under interview pressure is what separates a passing screen from a borderline one.