How to calculate Cohort Retention in SQL
Contents:
What Cohort Retention actually answers
Your PM walks into Monday standup with a single line: "Retention is thirty percent — is that good?" Headline retention is a number that hides everything interesting. The January cohort might be holding at forty-five percent, the February cohort at thirty-five, and the March cohort already collapsed to twenty-two. The average looks stable; the trend is on fire. Cohort Retention is the cut that exposes which acquisition month, which channel, or which onboarding revision actually broke.
A cohort is a group of users who share an entry event — almost always signup date, sometimes first purchase or activation. Cohort Retention is the fraction of that group still active N periods after entry. The whole point is to compare cohorts at the same age, not the same calendar date. The 2026-01 cohort at month three versus the 2026-03 cohort at month three is fair; comparing them on the same May dashboard is not.
This metric is a daily-driver at Stripe, Notion, DoorDash, Linear, Airbnb, and every consumer subscription company you can name. It shows up in product reviews, board decks, and senior data analyst interviews — usually phrased as "write the SQL on the whiteboard, then tell me what you would investigate if the curve dropped." If you cannot reach a runnable cohort retention query in fifteen minutes and explain why each CTE exists, you will struggle in the take-home. This post fixes both halves.
The SQL formula
Start with two tables that every analytics warehouse has in some shape: users(user_id, signup_date) and events(user_id, event_date). The first CTE assigns each user to their acquisition cohort by truncating signup to month. The second collapses raw events into one row per user-month — the inner aggregation makes "active in M" a binary fact regardless of how many events the user generated. The outer SELECT joins the two and computes the months-since-signup offset.
WITH cohort AS (
SELECT
user_id,
DATE_TRUNC('month', signup_date) AS cohort_month
FROM users
WHERE signup_date >= '2026-01-01'
),
activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('month', event_date) AS active_month
FROM events
)
SELECT
c.cohort_month,
a.active_month,
EXTRACT(YEAR FROM AGE(a.active_month, c.cohort_month)) * 12
+ EXTRACT(MONTH FROM AGE(a.active_month, c.cohort_month)) AS month_offset,
COUNT(DISTINCT c.user_id) AS active_users
FROM cohort c
LEFT JOIN activity a ON a.user_id = c.user_id
GROUP BY c.cohort_month, a.active_month
ORDER BY c.cohort_month, a.active_month;The EXTRACT(YEAR FROM AGE(...)) * 12 + EXTRACT(MONTH FROM AGE(...)) arithmetic is the safest cross-database way to compute the gap between two month-truncated dates on Postgres and Snowflake. Skipping the YEAR * 12 term is the single most common mistake — EXTRACT(MONTH FROM AGE(...)) alone returns zero through eleven, so a user active eighteen months after signup looks like a six-month-old retention point. On BigQuery use DATE_DIFF(active_month, cohort_month, MONTH) and on Redshift use DATEDIFF(month, cohort_month, active_month). Memorize all three — interviewers will pin you to a dialect.
The LEFT JOIN matters. With an INNER JOIN, any cohort whose users have zero activity vanishes from the result, and you cannot tell the difference between "no users" and "users churned hard". With LEFT JOIN, dead cohorts surface as rows with NULL active_month and zero active users — which is exactly the signal you want on a dashboard.
Retention curves and the pivot view
Once the long-format result above is correct, the next ask is always a pivot — one row per cohort with retention percentages at fixed offsets like M1, M3, and M6. That shape fits on a slide, a growth lead can scan it in five seconds, and the curve plots in any BI tool with zero post-processing.
WITH cohort AS (
SELECT user_id, DATE_TRUNC('month', signup_date) AS cohort_month
FROM users
WHERE signup_date >= '2026-01-01'
),
events_monthly AS (
SELECT user_id, DATE_TRUNC('month', event_date) AS active_month
FROM events
GROUP BY user_id, DATE_TRUNC('month', event_date)
),
joined AS (
SELECT
c.cohort_month,
(EXTRACT(YEAR FROM AGE(e.active_month, c.cohort_month)) * 12
+ EXTRACT(MONTH FROM AGE(e.active_month, c.cohort_month)))::INT AS month_offset,
c.user_id
FROM cohort c
LEFT JOIN events_monthly e ON e.user_id = c.user_id
)
SELECT
cohort_month,
COUNT(DISTINCT user_id) AS cohort_size,
ROUND(COUNT(DISTINCT CASE WHEN month_offset = 1 THEN user_id END)::NUMERIC
* 100 / NULLIF(COUNT(DISTINCT user_id), 0), 2) AS retention_m1,
ROUND(COUNT(DISTINCT CASE WHEN month_offset = 3 THEN user_id END)::NUMERIC
* 100 / NULLIF(COUNT(DISTINCT user_id), 0), 2) AS retention_m3,
ROUND(COUNT(DISTINCT CASE WHEN month_offset = 6 THEN user_id END)::NUMERIC
* 100 / NULLIF(COUNT(DISTINCT user_id), 0), 2) AS retention_m6
FROM joined
GROUP BY cohort_month
ORDER BY cohort_month;The COUNT(DISTINCT CASE WHEN ... THEN user_id END) pattern is the canonical SQL approach for cohort pivots. Each filtered count produces the numerator at that offset, and NULLIF on the denominator guards against zero-division for an empty cohort. The ROUND(..., 2) keeps the dashboard readable — two decimal places communicate the trend without implying false precision.
When you ship this to a dashboard, cohorts that have not aged enough show NULL in the right-most columns. A 2026-04 cohort cannot have an M6 value in May 2026, and a NULL there is correct. Render those cells as an em-dash or hide them, so nobody mistakes a NULL for zero retention.
Slicing by segment
The follow-up question is always "okay, but where is retention falling — which channel, which platform, which plan?" The answer is the same query with an extra column carried through every CTE. The shape below extends the basic recipe to acquisition channel and keeps the M-offset arithmetic intact.
WITH cohort AS (
SELECT
user_id,
DATE_TRUNC('month', signup_date) AS cohort_month,
acquisition_channel
FROM users
WHERE signup_date >= '2026-01-01'
),
activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('month', event_date) AS active_month
FROM events
)
SELECT
c.cohort_month,
c.acquisition_channel,
EXTRACT(YEAR FROM AGE(a.active_month, c.cohort_month)) * 12
+ EXTRACT(MONTH FROM AGE(a.active_month, c.cohort_month)) AS month_offset,
COUNT(DISTINCT c.user_id) AS active_users
FROM cohort c
LEFT JOIN activity a ON a.user_id = c.user_id
GROUP BY c.cohort_month, c.acquisition_channel, a.active_month
ORDER BY c.cohort_month, c.acquisition_channel, a.active_month;In production this is where most retention investigations actually live. Headline retention drops two points, you slice by channel, and one channel — usually a recent paid push — explains the entire delta. The fix is a budget conversation, not a product change. Without the segmentation you would have launched a quarter of onboarding experiments to chase a marketing artifact.
Common pitfalls
The most damaging pitfall is the EXTRACT(MONTH FROM AGE(...)) shortcut without the YEAR * 12 term. The function returns the month component of the interval, so a thirteen-month gap collapses to one and an eighteen-month gap collapses to six. Retention curves built on this bug look smooth and seductive until somebody notices that the M12 column always matches M0. Always compose the full year-times-twelve plus month-component formula, or switch to DATE_DIFF or DATEDIFF if your warehouse exposes them.
The second trap is an INNER JOIN between cohort and activity. The join silently drops cohorts whose users never returned, so a catastrophically bad acquisition month vanishes from the result and nobody flags it on the dashboard. Use LEFT JOIN, let the zero-activity cohorts surface, and treat their absence in the chart as the loud signal it is.
The third pitfall is reporting on immature cohorts. The 2026-04 cohort cannot have a six-month retention value in May 2026, but careless SQL will compute one based on partial data and present it as a finished number. Filter the reporting frame with cohort_month <= DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '6 months' when you publish an M6 chart, or render unaged cells as em-dashes. Skip this and you will be explaining to leadership why the most recent cohort looks fictionally good.
The fourth pitfall is double-counting. A user who fires five events in March is still one active human, but COUNT(*) will say five and SUM(active) will say five if you forgot the inner deduplication. Use COUNT(DISTINCT user_id) and a DISTINCT or GROUP BY inside the activity CTE — both layers exist on purpose. Interviewers love asking "what happens if a user is hyperactive in one month?" specifically to flush this out.
The fifth pitfall is comparing cohorts whose acquisition mix changed. A January cohort dominated by paid Instagram traffic and a March cohort dominated by organic search are not apples-to-apples — the retention delta is mostly channel composition, not product quality. Either segment by channel before comparing, or weight cohorts to a fixed channel mix. Without this control you will misdiagnose retention drops every quarter when paid budget shifts.
Optimization tips
For large event tables, the bottleneck is the events scan inside the activity CTE. The first lever is partitioning the events table on event_date. Snowflake and BigQuery handle this natively via clustering and partitioning DDL; on Postgres declare monthly partitions explicitly. With partitioning, a query that only needs the trailing twelve months reads twelve partitions instead of the full history — on a billion-row log, the difference is seconds versus minutes.
The second lever is a pre-aggregated monthly_active_users rollup refreshed by a dbt model overnight, shaped (user_id, active_month). Cohort Retention then joins cohort to this pre-aggregate, which is one to two orders of magnitude smaller than the raw event stream. This is the standard Snowflake and Databricks pattern for cohort analytics at scale, and it lets dashboards refresh in under a second.
The third lever is materializing the pivot itself. If ten stakeholders refresh the cohort dashboard each morning, materialize the result on a daily schedule instead of recomputing on demand. The trade-off is freshness — confirm with the consumers that overnight cadence is acceptable before caching, because retention bugs caught at 9 a.m. are far cheaper than retention bugs found in the Friday review.
Related reading
- How to calculate Active Cohort in SQL
- How to calculate MAU in SQL
- How to calculate Churn in SQL
- How to calculate Active Days in SQL
- How to find the Aha Moment in SQL
- SQL window functions interview questions
If you want to drill cohort SQL questions like this daily, NAILDD is launching with 500+ SQL problems on exactly this pattern.
FAQ
Should I use weekly or monthly cohorts?
It depends on the product rhythm. Mobile games, social apps, and most consumer SaaS run on weekly cohorts because the action loop fires multiple times per week and a month is too coarse to see early drop-off. E-commerce, B2B SaaS with annual contracts, and banking products run on monthly cohorts because the natural usage period is at least that long. A weekly cohort on an annual-contract product is mostly noise; a monthly cohort on a daily-engagement game hides the first-week cliff.
What counts as a good retention number?
The honest answer is "compared to what". Mobile app M1 retention of twenty-five to thirty-five percent is normal, forty percent and up is excellent, and below twenty percent usually signals a broken onboarding or a product-market-fit gap. E-commerce six-month retention of twenty to thirty percent is common, SaaS B2B twelve-month retention is healthy at seventy percent and up. Benchmarking against your own historical cohorts is more useful than industry numbers, because every category has wildly different baselines.
What do I do when the retention curve flattens?
A plateau is the goal, not a problem. The users who stay through twelve months are usually the ones who will stay through twenty-four — they have integrated the product into a habit or a workflow. A flat tail in the retention curve is the healthy shape. Worry when the tail is still steep at month twelve or when it never reaches a plateau, because that means the product never crosses the habit threshold and you are constantly refilling the top of the funnel.
Cohort retention is dropping — where do I look first?
Start with segmentation: by acquisition channel, by platform, by plan, by geography. Most retention drops are concentrated in one segment, not spread evenly across the user base. If a single channel explains the drop, the conversation moves to growth and budget. If the drop is even across segments, the conversation moves to product — usually a recent release that quietly changed the onboarding funnel or a pricing experiment that selected for less-committed users. Pair the cohort cut with a calendar overlay of releases and experiments.
Can I compare cohorts of different lengths?
No, and this is the most common rookie mistake. Always compare cohorts at the same M-offset, never the same calendar month. The 2026-01 cohort at M3 versus 2026-03 at M3 is fair. The 2026-01 cohort in May versus 2026-03 in May is not, because one has aged five months and the other two. The whole point of the M-offset framing is to make age the controlled variable.
How do I handle users who churn and come back?
Inside a cohort retention query, a user is simply "active in that period" or not — they are counted in M1 and M3 if they were active in both, and absent from M2 if they were not. The retention number at each offset is the right answer to "how many of the original cohort were active in that month". To track returning users specifically, build a resurrection metric on top: flag users inactive for at least one full period who returned. Resurrection is a useful diagnostic when retention recovers after a dip, because it tells you whether the bounce came from new acquisition or returning users.