May 18, 2026·12 min read

How to calculate cumulative sum in SQL

Q: Why does cumulative DISTINCT not work in PostgreSQL?

The SQL standard does allow `COUNT(DISTINCT)` as a window function, but Postgres, MySQL, and several older Hive versions never implemented it. The reason is performance: distinct counts are not incrementally computable the way sums and counts are, so each row in the window would need to rescan the entire prior history. The first-seen CTE pattern in this post is the standard workaround across every major engine.

Q: When should I write UNBOUNDED PRECEDING explicitly?

Always, in interview answers. The default frame with `ORDER BY` is `RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`, which is almost but not quite the same as `ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW`. The `RANGE` default collapses tied rows into the same cumulative value, while `ROWS` is strict row-by-row. Interviewers notice when you write the frame explicitly because it signals you understand the trap.

Q: How do I compute cumulative sum across BigQuery, Snowflake, and Postgres?

The syntax is identical for the basic case: `SUM(x) OVER (ORDER BY date)`. BigQuery and Snowflake add `QUALIFY` for filtering on window results, and Snowflake supports `COUNT(DISTINCT)` inside windows on smaller tables. Postgres requires the first-seen rewrite for cumulative distinct. Everything else — partitioning, frames, ordering — works the same way on every engine.

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Join the waitlist

Contents:

What cumulative sum is and why analysts care
The basic SQL formula
Cumulative sum by group
Lifetime revenue with milestones
Cumulative distinct users
Cumulative percent and Pareto
Common pitfalls
Optimization tips
Related reading
FAQ

What cumulative sum is and why analysts care

Picture Monday morning at Stripe. A PM pings you on Slack: "How much has the new self-serve plan made since launch? Send me a chart by 11am." Daily revenue is noisy and spiky, and a bar chart of it tells a confusing story. What the PM actually wants is a smooth, monotonically increasing line — total revenue from launch day up to each calendar day. That is a cumulative sum, also called a running total.

Cumulative sum is the most common window function pattern in data analyst interviews at Meta, DoorDash, Airbnb, and Snowflake. It shows whether you understand the OVER clause, frames, and the difference between aggregating across a partition and aggregating along an ordered sequence. Once you internalize the pattern, the same shape solves dozens of downstream problems — lifetime revenue, cumulative active users, Pareto curves, balance reconciliation, and milestone tracking.

This post walks through a worked SQL recipe with six variations, the traps that catch even mid-level candidates, and an FAQ that mirrors the follow-up questions Amazon and Microsoft loops ask after you hand in your first draft.

The basic SQL formula

The canonical form uses SUM() as a window function with ORDER BY and an explicit frame:

SELECT
    DATE,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY DATE
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue
FROM daily_revenue
ORDER BY DATE;

Read it left to right. SUM(daily_revenue) is the aggregate. OVER (...) turns it into a window function instead of a GROUP BY aggregate, so every input row still appears in the output. ORDER BY date sorts the rows inside the window. ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW says: for each row, look at every row from the start of the partition up to and including this one, then sum their daily_revenue values.

The shorter form is identical in Postgres, Snowflake, BigQuery, and Databricks — the default frame for OVER with ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW:

SELECT
    DATE,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY DATE) AS cumulative_revenue
FROM daily_revenue
ORDER BY DATE;

The two forms differ in one subtle way that bites in interviews: the explicit ROWS version uses physical row positions, while the implicit default uses RANGE, which groups ties together. If you have two rows with the same date, RANGE lumps them into the same cumulative value, while ROWS increments strictly one row at a time. The fix is to always write the frame explicitly when you have potential ties or to disambiguate the order with a tiebreaker column.

Cumulative sum by group

The PM asks the next question: "Break it down by acquisition channel." You do not want a single global running total — you want one running total per channel, each restarting at zero on its own first day. That is what PARTITION BY is for:

SELECT
    DATE,
    channel,
    daily_revenue,
    SUM(daily_revenue) OVER (
        PARTITION BY channel
        ORDER BY DATE
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue_channel
FROM channel_revenue
ORDER BY channel, DATE;

PARTITION BY channel slices the data into independent buckets — one for paid search, one for organic, one for referral — and the cumulative sum restarts inside each bucket. This is the same mental model as GROUP BY, except the original rows survive instead of collapsing into one summary row per group. If you later want to compare channels on the same chart, the per-channel running totals plot cleanly without any post-processing.

A common follow-up at Uber and Linear is "show me cumulative revenue per channel as a share of the all-channel cumulative." That requires two window functions in the same query — one partitioned and one not — and is covered in the cumulative percent section below.

Lifetime revenue with milestones

Founders and CEOs care about milestone moments: first $10k, first $100k, first $1M. You can compute the running total and tag the milestone in one pass using a CASE over the same window expression:

WITH daily AS (
    SELECT
        DATE_TRUNC('day', created_at) AS day,
        SUM(amount) AS daily_revenue
    FROM transactions
    WHERE status = 'paid'
    GROUP BY 1
)
SELECT
    day,
    daily_revenue,
    SUM(daily_revenue) OVER (ORDER BY day) AS lifetime_revenue,
    CASE
        WHEN SUM(daily_revenue) OVER (ORDER BY day) >= 1000000 THEN 'past $1M'
        WHEN SUM(daily_revenue) OVER (ORDER BY day) >= 100000  THEN 'past $100k'
        WHEN SUM(daily_revenue) OVER (ORDER BY day) >= 10000   THEN 'past $10k'
        ELSE 'starting'
    END AS milestone
FROM daily
ORDER BY day;

Two things to notice. First, the CASE re-evaluates the window function on every branch — that is wasted compute on huge tables, but the query planner in Snowflake, BigQuery, and Postgres 15+ will common-subexpression-eliminate it for you. Second, if you want the exact day each milestone was crossed, wrap the result in another CTE and filter for the row where the previous row's lifetime_revenue was below the threshold and the current row crosses it — a classic use of LAG().

Cumulative distinct users

The trap question. The interviewer at Airbnb asks: "I want cumulative distinct active users — total people who have ever logged in by each calendar day." Your instinct is COUNT(DISTINCT user_id) OVER (ORDER BY day), but most databases (Postgres, MySQL, older Redshift) do not allow DISTINCT inside a window function. Snowflake and BigQuery do, but it is slow on large tables.

The clean workaround uses a first-seen CTE: compute each user's first activity date once, then count how many users were first seen on or before each calendar day:

WITH first_seen AS (
    SELECT
        user_id,
        MIN(created_at::DATE) AS first_date
    FROM events
    GROUP BY user_id
),
calendar AS (
    SELECT generate_series(
        (SELECT MIN(first_date) FROM first_seen),
        CURRENT_DATE,
        '1 day'::INTERVAL
    )::DATE AS day
)
SELECT
    c.day,
    COUNT(f.user_id) FILTER (WHERE f.first_date = c.day) AS new_users,
    COUNT(f.user_id)                                     AS cumulative_users
FROM calendar c
LEFT JOIN first_seen f ON f.first_date <= c.day
GROUP BY c.day
ORDER BY c.day;

This pattern shows up so often in interview loops at Notion, Vercel, and Figma that it has its own nickname: "first-seen join." It also generalizes to cumulative distinct sessions, cumulative distinct paying customers, and cumulative distinct creators — swap the entity, keep the shape.

Sharpen SQL for your next interview

500+ SQL problems with worked solutions — joins, window functions, CTEs.

Join the waitlist

Cumulative percent and Pareto

Pareto analysis answers "what share of revenue comes from the top N customers?" The recipe is one window function for the running total of revenue (sorted descending) and one window function for the grand total (no ORDER BY), then divide:

SELECT
    customer_id,
    revenue,
    SUM(revenue) OVER ()                       AS total_revenue,
    SUM(revenue) OVER (ORDER BY revenue DESC)  AS cum_revenue,
    100.0 * SUM(revenue) OVER (ORDER BY revenue DESC)
          / SUM(revenue) OVER ()               AS cum_pct
FROM customers
ORDER BY revenue DESC;

Scroll down the result until cum_pct crosses 80. The customer on that row is the boundary of your "top 80% of revenue" set. At most consumer marketplaces — DoorDash, Airbnb, Uber Eats — that boundary sits around the top 15-25% of customers, which is where the colloquial "80/20 rule" comes from. The same shape works for SKUs, content creators, sales reps, and ad campaigns.

Common pitfalls

The first pitfall is forgetting ORDER BY inside the window. SUM(x) OVER () with no ORDER BY and no PARTITION BY is a grand total replicated across every row, not a running total. Candidates routinely paste this in a screen share, see the same number on every row, and panic. The fix is mechanical: cumulative sums always need an ordering inside the window, otherwise SQL has no notion of "rows so far."

The second pitfall is COUNT(DISTINCT x) OVER (...) for a running distinct count. Most engines reject it outright, and the ones that accept it execute it at quadratic cost. Use the first-seen pattern instead. If your interviewer pushes back with "but Snowflake supports it" — that is true, and you can use it on small tables, but production-scale running distincts almost always go through the first-seen rewrite for performance reasons.

The third pitfall is gaps in the date series. If your daily_revenue table only has rows for days with non-zero sales, the running total skips the silent days entirely. That looks fine on a line chart but breaks any join against a fact table that does have those dates. The fix is to left-join against a generated calendar table (generate_series in Postgres, UNNEST(GENERATE_DATE_ARRAY(...)) in BigQuery, sequence views in Snowflake) before applying the window.

The fourth pitfall is the RANGE versus ROWS confusion. The default frame when you write ORDER BY without an explicit frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, and RANGE collapses ties — two rows with the same date will end up with the same cumulative value, which is rarely what you want for transactional data. Always write ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW explicitly, or break ties in the ORDER BY with a unique column like transaction_id.

The fifth pitfall is expecting PARTITION BY to behave like GROUP BY and leak across partitions. It does not. Each partition is independent, the running total restarts at zero on the first row of each partition, and there is no built-in way to ask "running total per channel but also continue from the previous channel's last value." If that is what you need, you are looking at a single global window with a channel label, not a partitioned one.

Optimization tips

On tables larger than ~100 million rows, the cost of a cumulative sum is dominated by the sort the window requires. Make sure the ORDER BY column is the leading column of a clustering key in Snowflake, a partition key in BigQuery, or a btree index in Postgres. Without that, the engine materializes the entire table and sorts it — easily a 10x slowdown.

For per-partition running totals, choose PARTITION BY carefully: high-cardinality keys (millions of distinct values) hurt more than help because each partition triggers its own sort. With 5-20 partitions, the partitioned plan is usually faster than the global one.

If you compute the same running total in many queries, materialize it once into a daily snapshot table and append a single new row per day with an incremental job. The cumulative value of each new day is just the previous day's cumulative plus today's daily. This trick converts a window function on the full history into a tiny insert and is how analytics teams at Databricks and Snowflake themselves run their own internal metrics.

If you want to drill SQL patterns like cumulative sums every day until they are muscle memory, NAILDD is launching with 500+ analyst interview problems built around exactly this shape.

FAQ

What does SUM OVER without ORDER BY actually return?

It returns a grand total replicated on every row. The window covers the entire partition (or the entire table if there is no PARTITION BY) with no ordering, so the engine has no notion of "rows so far" — it just sums everything. This is useful for computing percentages-of-total in the same query, but it is never a running total.

Why does cumulative DISTINCT not work in PostgreSQL?

The SQL standard does allow COUNT(DISTINCT) as a window function, but Postgres, MySQL, and several older Hive versions never implemented it. The reason is performance: distinct counts are not incrementally computable the way sums and counts are, so each row in the window would need to rescan the entire prior history. The first-seen CTE pattern in this post is the standard workaround across every major engine.

When should I write UNBOUNDED PRECEDING explicitly?

Always, in interview answers. The default frame with ORDER BY is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which is almost but not quite the same as ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The RANGE default collapses tied rows into the same cumulative value, while ROWS is strict row-by-row. Interviewers notice when you write the frame explicitly because it signals you understand the trap.

Does PARTITION BY reset the cumulative sum?

Yes, exactly at the first row of each new partition. If you wanted the running total to continue across partitions but still be labeled by channel, do not use PARTITION BY — use a single global window and project the channel column. If you wanted independent running totals per channel that restart at zero, use PARTITION BY channel and you are done.

How is cumulative DAU different from a running total of DAU?

Cumulative DAU usually means "total distinct users ever active by day X" — once a user shows up they are counted forever. A running total of DAU is the sum of the daily-active-user counts, which double-counts the same user every day they return. The first metric is monotonically increasing and answers "how big has our audience grown." The second is a meaningless ever-growing number that confuses dashboards. Always ask the PM which one they actually want.

How do I compute cumulative sum across BigQuery, Snowflake, and Postgres?

The syntax is identical for the basic case: SUM(x) OVER (ORDER BY date). BigQuery and Snowflake add QUALIFY for filtering on window results, and Snowflake supports COUNT(DISTINCT) inside windows on smaller tables. Postgres requires the first-seen rewrite for cumulative distinct. Everything else — partitioning, frames, ordering — works the same way on every engine.