How to calculate Active Cohort in SQL

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

What Active Cohort actually answers

Your PM pings you on a Wednesday: "How many of the users we acquired in January are still actively using the product three months in?" That is what Active Cohort answers, and most analysts get it subtly wrong on the first pass. They count signups, they count logins, they confuse "registered" with "doing the thing that matters". By Friday standup the dashboard disagrees with the growth team, and trust evaporates.

Active Cohort is the absolute count of unique users from acquisition cohort m who performed a key action during reporting period p. It is not retention: retention is a ratio; Active Cohort is a count. Retention tells you what fraction of a cohort survived; Active Cohort tells you the raw number of humans you can put a revenue-per-active or notification-volume forecast against.

This metric shows up in product reviews, board decks, and senior data analyst interviews at Stripe, Notion, DoorDash, and Linear. If you cannot write the SQL in under fifteen minutes you will struggle in the take-home. If you write it but skip data lag, bots, and immature cohorts, you will stumble on the follow-up. This post fixes both halves.

The SQL formula

The plain-English formula is short. Active Cohort for acquisition month m in reporting period p equals the count of distinct users who signed up in m and performed at least one key action during p. The key action is product-specific: for Linear "created or commented on an issue", for DoorDash "placed an order", for Notion "edited a page". The biggest argument your team will have is what counts as a key action — write that definition down before writing SQL.

Once the definition is locked, the SQL is two CTEs and a join. The first assigns every user to their acquisition cohort. The second collapses raw events into "this user did a key action in this month".

WITH cohort AS (
    -- one row per user, tagged with their signup month
    SELECT
        user_id,
        DATE_TRUNC('month', signup_date) AS cohort_month
    FROM users
),
activity AS (
    -- one row per user-month with at least one key action;
    -- the GROUP BY collapses many events into a single flag
    SELECT
        user_id,
        DATE_TRUNC('month', event_date) AS active_month
    FROM events
    WHERE event_type IN ('key_action_1', 'key_action_2')
    GROUP BY user_id, DATE_TRUNC('month', event_date)
)
SELECT
    c.cohort_month,
    a.active_month,
    COUNT(DISTINCT c.user_id) AS active_users
FROM cohort c
JOIN activity a ON a.user_id = c.user_id
GROUP BY c.cohort_month, a.active_month
ORDER BY c.cohort_month, a.active_month;

Read it top to bottom. The cohort CTE has one row per user tagged with their signup month. The activity CTE has one row per user-month where that user did a key action — the GROUP BY collapses many events into one "active this month" flag. The final SELECT joins on user_id and COUNT(DISTINCT) rolls up to cohort counts.

A common interview follow-up is "what if a user does fifty key actions in one month?" The answer is COUNT(DISTINCT user_id) plus the inner GROUP BY — both deduplicate. Say that out loud in a live SQL round; interviewers want to hear you reason about double-counting before they ask.

Worked example: cohort comparison

The PM question is comparative: "Is the March 2026 cohort doing better or worse than January at month three?" To answer it you need cohort size as a denominator and an m_offset column that aligns cohorts on "months since signup".

WITH cohort_size AS (
    SELECT
        DATE_TRUNC('month', signup_date) AS cohort_month,
        COUNT(*) AS total_signups
    FROM users
    GROUP BY 1
),
cohort_active AS (
    SELECT
        c.cohort_month,
        -- whole months between signup month and activity month (Postgres AGE arithmetic)
        EXTRACT(YEAR FROM AGE(DATE_TRUNC('month', e.event_date), c.cohort_month)) * 12
            + EXTRACT(MONTH FROM AGE(DATE_TRUNC('month', e.event_date), c.cohort_month)) AS m_offset,
        COUNT(DISTINCT c.user_id) AS active
    FROM (
        SELECT user_id, DATE_TRUNC('month', signup_date) AS cohort_month
        FROM users
    ) c
    JOIN events e ON e.user_id = c.user_id
    WHERE e.event_type IN ('key_action_1', 'key_action_2')
    GROUP BY 1, 2
)
SELECT
    ca.cohort_month,
    ca.m_offset,
    ca.active,
    cs.total_signups,
    ROUND(ca.active::NUMERIC * 100 / NULLIF(cs.total_signups, 0), 2) AS active_rate_pct
FROM cohort_active ca
JOIN cohort_size cs ON cs.cohort_month = ca.cohort_month
WHERE ca.m_offset BETWEEN 0 AND 12
ORDER BY ca.cohort_month, ca.m_offset;

The m_offset arithmetic computes "whole months between two dates", and the spelling is dialect-specific. The EXTRACT(YEAR FROM AGE(...)) * 12 + EXTRACT(MONTH FROM AGE(...)) pattern is Postgres-only — AGE does not exist on Snowflake or BigQuery. On Snowflake use DATEDIFF('month', cohort_month, active_month); on BigQuery use DATE_DIFF(active_month, cohort_month, MONTH). Memorize the variants — interviewers will specify the dialect.
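Side by side, the same offset in the three dialects you are most likely to face — a sketch assuming cohort_month and active_month are already month-truncated dates:

```sql
-- Postgres: AGE returns an interval; convert it to whole months
EXTRACT(YEAR FROM AGE(active_month, cohort_month)) * 12
    + EXTRACT(MONTH FROM AGE(active_month, cohort_month)) AS m_offset

-- Snowflake: DATEDIFF counts month boundaries crossed
DATEDIFF('month', cohort_month, active_month) AS m_offset

-- BigQuery: DATE_DIFF, note the later date comes first
DATE_DIFF(active_month, cohort_month, MONTH) AS m_offset
```

Because the inputs are month-truncated, all three agree: 0 for same-month activity, 3 for a January signup active in April, with no day-of-month edge cases.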

Read the output as a classic cohort table: rows are acquisition cohorts, m_offset is age, active_rate_pct is comparable across cohorts. If active_rate_pct for the 2026-03 cohort at M3 is materially below 2026-01 at M3, something changed in activation — onboarding, pricing, an ad-channel shift, a release that quietly broke the new-user funnel. That is the conversation Active Cohort exists to trigger.

The next ask is usually a pivot — one row per cohort, columns for active rate at M1, M3, M6. It fits on a slide and a growth lead can scan it in five seconds.

-- cohort_active_with_rate: the output of the comparison query above,
-- saved as a view or wrapped in a CTE
SELECT
    cohort_month,
    MAX(CASE WHEN m_offset = 1 THEN active_rate_pct END) AS m1_active,
    MAX(CASE WHEN m_offset = 3 THEN active_rate_pct END) AS m3_active,
    MAX(CASE WHEN m_offset = 6 THEN active_rate_pct END) AS m6_active
FROM cohort_active_with_rate
GROUP BY cohort_month
ORDER BY cohort_month;

The MAX(CASE WHEN ...) pattern is the canonical SQL pivot. Each (cohort_month, m_offset) pair is unique, so MAX picks that single value. Window functions here overcomplicate it.

When you ship to a dashboard, cohorts that have not aged enough show NULL in the right columns — the 2026-04 cohort cannot have an M6 value in May 2026. Filter those rows out or render as "—" so nobody mistakes a NULL for "0% active".
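One way to drop immature cells before they reach a dashboard — a Postgres-flavored sketch, assuming the comparison query's output is saved under the hypothetical name cohort_report:

```sql
-- Keep only (cohort, offset) cells whose reporting month has fully elapsed.
SELECT *
FROM cohort_report
WHERE cohort_month + MAKE_INTERVAL(months => m_offset::INT)
      < DATE_TRUNC('month', CURRENT_DATE)
ORDER BY cohort_month, m_offset;
```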


Common pitfalls

The most damaging pitfall is leaving the definition of "active" vague. "Opened the app" and "completed a purchase" are wildly different bars, and your number will be off by an order of magnitude depending on which you pick. Lock the definition with PM and growth before writing any SQL, document it in the metric README, and reference that README in every report. Skip this and you burn a sprint arguing why your dashboard disagrees with the marketing team's spreadsheet.

The second trap is ignoring data lag. Many event pipelines have a 24- to 72-hour ingestion delay, longer for mobile clients that batch uploads. If your cohort-2026-03 row shows a sudden drop at M2, check the lag before calling it a regression. The standard fix is WHERE active_month < DATE_TRUNC('month', CURRENT_DATE) so you never report a partially loaded current month.

The third pitfall is comparing immature cohorts. Your January cohort has four months of post-signup data by May; your April cohort has zero full months. On the same chart the April cohort will look catastrophically bad. The fix is the M-offset framing above — compare each cohort at the same age, not the same calendar date.

The fourth pitfall is bot and crawler activity. Scrapers, headless test accounts, and your own QA team register and "use" the product. They inflate Active Cohort a few percent on mature products and double digits on early-stage ones. Filter known bot IPs, exclude internal email domains, and apply a heuristic on session length or action diversity.
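A sketch of those filters layered onto the cohort CTE — bot_ips, signup_ip, and is_test_account are assumed names; substitute whatever your warehouse actually has:

```sql
WITH cohort AS (
    SELECT
        u.user_id,
        DATE_TRUNC('month', u.signup_date) AS cohort_month
    FROM users u
    LEFT JOIN bot_ips b ON b.ip_address = u.signup_ip
    WHERE b.ip_address IS NULL                        -- drop known bot IPs
      AND u.email NOT LIKE '%@yourcompany.com'        -- drop internal accounts
      AND COALESCE(u.is_test_account, FALSE) = FALSE  -- drop QA test users
)
SELECT cohort_month, COUNT(*) AS cohort_size
FROM cohort
GROUP BY cohort_month;
```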

The fifth pitfall is cross-device under-deduplication. A user who opens iOS, Android, and web in the same month is one human. If user_id is unified across devices via login state, COUNT(DISTINCT user_id) handles this. If you still use device-level identifiers for anonymous sessions, you will overcount. Sanity-check by joining to the identity-resolution table before publishing.
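A sketch of the sanity check, assuming a hypothetical identity_map table that maps device-level IDs to a canonical person:

```sql
-- Count humans, not devices: resolve each event's ID to a canonical user
-- before deduplicating. identity_map(device_id, canonical_user_id) is assumed.
SELECT
    DATE_TRUNC('month', e.event_date) AS active_month,
    COUNT(DISTINCT COALESCE(im.canonical_user_id, e.user_id)) AS active_humans,
    COUNT(DISTINCT e.user_id) AS active_device_ids  -- compare: inflated count
FROM events e
LEFT JOIN identity_map im ON im.device_id = e.user_id
WHERE e.event_type IN ('key_action_1', 'key_action_2')
GROUP BY 1;
```

If active_device_ids materially exceeds active_humans, publish the resolved number.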

Optimization tips

For large event tables, the events scan inside the activity CTE is the bottleneck. The first lever is partitioning on event_date. Snowflake micro-partitions tables automatically, BigQuery supports date-partitioned tables, and on Postgres you declare monthly range partitions yourself. With partitioning, a query for "active in 2026-03" reads one partition — on a billion-row log, the difference between four seconds and four minutes.
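On Postgres, declarative range partitioning looks like this — a sketch with one partition per month:

```sql
CREATE TABLE events (
    user_id    BIGINT,
    event_type TEXT,
    event_date DATE
) PARTITION BY RANGE (event_date);

CREATE TABLE events_2026_03 PARTITION OF events
    FOR VALUES FROM ('2026-03-01') TO ('2026-04-01');

-- A range predicate on event_date now prunes to a single partition:
-- ... WHERE event_date >= DATE '2026-03-01' AND event_date < DATE '2026-04-01'
```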

The second lever is a pre-aggregated monthly_active_users table refreshed by a dbt model overnight, shaped (user_id, active_month, key_action_count). Active Cohort then becomes a join between cohort and this pre-aggregate — two orders of magnitude smaller than the raw event stream. This is the standard Snowflake and Databricks pattern.
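The pre-aggregate itself is one GROUP BY — a sketch of what the overnight model behind monthly_active_users would select:

```sql
-- Shape: (user_id, active_month, key_action_count), one row per user-month.
SELECT
    user_id,
    DATE_TRUNC('month', event_date) AS active_month,
    COUNT(*) AS key_action_count
FROM events
WHERE event_type IN ('key_action_1', 'key_action_2')
GROUP BY user_id, DATE_TRUNC('month', event_date)
```

Downstream, the activity CTE from the first query collapses to a plain read of monthly_active_users — no raw event scan at query time.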

The third lever is materialized views for the comparison query itself. If ten people refresh a dashboard every morning, materialize the result on daily refresh, not on-demand compute. The trade-off is freshness — confirm overnight is acceptable before caching.
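On Postgres the caching looks like this — a sketch; the view body stands in for the full comparison query from the worked example:

```sql
CREATE MATERIALIZED VIEW cohort_active_rate AS
SELECT ...;  -- the comparison query goes here verbatim

-- run from the overnight scheduler, after the event pipeline lands:
REFRESH MATERIALIZED VIEW cohort_active_rate;
```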

If you want to drill cohort SQL questions like this daily, NAILDD is launching with 500+ SQL problems on exactly this pattern.

FAQ

What's the difference between Active Cohort and retention?

Retention is a ratio — the fraction of a cohort still active at a given age — and Active Cohort is the absolute count of distinct users. You derive retention by dividing Active Cohort by cohort size, which is exactly what active_rate_pct does in the worked example. Both live on the same dashboard because PMs want the "how many humans" and the "what percent" framings.

How should I define "active" for a SaaS product?

Pick one or two key actions that correlate with paid conversion and renewal, not just login. A login event is too easy to trigger — push notifications, email pixel loads, and automated logins all fire it. The Stripe and Notion playbook is login plus one product-creating action per period: a paid charge for Stripe, an edited page for Notion. Version the definition and treat changes with the rigor of a metric migration.

Active Cohort is rising month over month — is that good news?

Not necessarily. Active Cohort scales mechanically with cohort size, so if you acquired twice as many users in March, Active Cohort at M1 will be higher even if the product got worse. Pair the absolute count with active_rate_pct. If both rise, the product is healthier. If only the count rises, you are paying more for the same engagement.

Can I compare cohorts of different ages directly?

No — this is the most common rookie mistake. Compare every cohort at the same m_offset, never the same calendar month. The 2026-01 cohort at M3 versus 2026-03 at M3 is fair. The 2026-01 cohort in May versus 2026-03 in May is not — one has aged five months, the other two.

A user was active in M1, inactive in M2, active again in M3 — how do they show up?

That user is counted in M1 and M3 but not M2. They are simply "active in that period" per period. To track re-engagement separately, build a "resurrected users" metric that flags users inactive for at least one full period who then returned. Resurrection is a useful diagnostic when Active Cohort recovers after a dip — it tells you whether recovery came from new users or returning ones.
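One way to build that resurrection flag — a sketch reusing the activity shape from the first query:

```sql
WITH activity AS (
    SELECT DISTINCT
        user_id,
        DATE_TRUNC('month', event_date) AS active_month
    FROM events
    WHERE event_type IN ('key_action_1', 'key_action_2')
)
SELECT
    a.active_month,
    COUNT(*) AS resurrected_users
FROM activity a
WHERE NOT EXISTS (                 -- inactive the immediately prior month
    SELECT 1 FROM activity p
    WHERE p.user_id = a.user_id
      AND p.active_month = a.active_month - INTERVAL '1 month'
)
AND EXISTS (                       -- but active at some earlier point
    SELECT 1 FROM activity q
    WHERE q.user_id = a.user_id
      AND q.active_month < a.active_month - INTERVAL '1 month'
)
GROUP BY a.active_month;
```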

How long should the cohort window be before I report?

For consumer apps with weekly engagement, two to four weeks of post-signup data is enough to stabilize M1. For B2B SaaS with monthly or quarterly rhythms, wait at least one full M1 period plus the data-lag window. Reporting before maturity forces you to walk numbers back in front of leadership — far more expensive than a short delay.