Cohort analysis explained simply
Contents:
What cohort analysis actually is
Cohort analysis splits users into groups by the date of their first interaction with the product, then tracks each group's behavior across time. The grouping is what matters: every user inside a cohort shares a fixed "age zero" moment, so day 7 means the same thing for everyone in that row.
The technique is load-bearing because it separates the product you ship today from the product you shipped six months ago. A single rolled-up retention number averages everything together — you cannot tell whether your onboarding got worse or your loyal users are just sticking around. With cohorts, the difference is visible in two adjacent rows of a table.
If your PM walks up on Monday morning and asks why DAU is flat but the new feature was supposed to boost it, the answer almost always lives in a cohort table. Aggregate metrics lie. Cohorts do not.
Load-bearing trick: a cohort is defined by time of first event, not by a static attribute. "Cohort: Chrome users" is a segment. "Cohort: users who signed up in March 2026" is a cohort.
Why it beats a single retention number
Imagine your overall day-30 retention has held at 22% for four months straight. Leadership is happy. Then you slice by signup month:
| Signup cohort | D0 | D1 | D7 | D30 |
|---|---|---|---|---|
| Jan 2026 | 100% | 40% | 20% | 10% |
| Feb 2026 | 100% | 45% | 25% | 15% |
| Mar 2026 | 100% | 35% | 15% | 5% |
| Apr 2026 | 100% | 30% | 12% | 4% |
The aggregate is stable only because older cohorts are larger and their tail behavior dominates the average. The new cohorts — the ones reflecting your current product — are collapsing. Without the breakdown you would have missed this for another quarter.
This is the core argument for cohorting:
- Detect regressions caused by a release, a pricing change, or a new acquisition channel.
- Measure improvements to onboarding by comparing the first 7 days of new cohorts against old ones.
- Validate experiments that have long-tail effects A/B tests cannot catch in 14 days.
- Forecast LTV by integrating under a stabilized retention curve.
This is also why the first dashboard a serious growth team builds is almost never DAU — it is a cohort heatmap.
Building the cohort table in SQL
The recipe is two CTEs and a join. The first CTE pins each user to their signup month. The second pulls all activity. The final select computes months since signup per row and aggregates.
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(event_at)) AS cohort_month
FROM events
GROUP BY user_id
),
activity AS (
SELECT DISTINCT
user_id,
DATE_TRUNC('month', event_at) AS active_month
FROM events
)
SELECT
c.cohort_month,
a.active_month,
(EXTRACT(YEAR FROM AGE(a.active_month, c.cohort_month)) * 12
+ EXTRACT(MONTH FROM AGE(a.active_month, c.cohort_month)))::int AS months_since_signup,
COUNT(DISTINCT a.user_id) AS active_users
FROM cohorts c
JOIN activity a ON a.user_id = c.user_id
GROUP BY 1, 2, 3
ORDER BY 1, 3;To turn this into a retention percentage, divide each cell by the cohort's size at month zero:
WITH base AS (
-- previous query here
),
sized AS (
SELECT cohort_month, COUNT(DISTINCT user_id) AS cohort_size
FROM cohorts
GROUP BY 1
)
SELECT
b.cohort_month,
b.months_since_signup,
1.0 * b.active_users / s.cohort_size AS retention
FROM base b
JOIN sized s USING (cohort_month);In Python the same shape lands in a few lines of pandas:
import pandas as pd
df['cohort'] = df.groupby('user_id')['event_at'].transform('min').dt.to_period('M')
df['period'] = (df['event_at'].dt.to_period('M') - df['cohort']).apply(lambda x: x.n)
cohort_table = df.groupby(['cohort', 'period'])['user_id'].nunique().unstack('period')
retention = cohort_table.divide(cohort_table[0], axis=0)The output is a square matrix: rows are cohorts, columns are periods since signup, cells are retention fractions. That matrix is what every cohort chart in the wild visualizes.
How to read the heatmap
A cohort table is read in three directions, and each direction answers a different question.
Read across a row to see one cohort's lifecycle. Retention should decay — that is just the shape of human behavior. What you care about is the slope and the floor. A steep drop from D1 to D7 means weak activation. A floor near zero by D30 means the product is not sticky.
Read down a column to compare cohorts at the same age. If March D7 is 15% while January D7 was 20%, something shipped between January and March made the product worse for newcomers. This is the most useful read because it isolates product changes from user-mix changes.
Read along the diagonal to see how a single calendar period treated everyone. The diagonal answers questions like "did the holiday season lift retention across all cohorts?" — useful for separating product effects from seasonal effects.
Sanity check: if the rightmost cells of a row are systematically empty, that cohort has not been alive long enough to fill them. Do not compare a 1-month-old cohort's M3 against a year-old cohort's M3. There is no M3 yet.
The shape you want is a smiley curve: sharp early drop, then a stable floor that does not decay to zero. A floor of 30-40% at day 90 is excellent for most consumer apps. 5-10% is typical. Zero is a product that does not retain anyone.
Cohorts beyond signup date
Signup date is the default, but the cohort definition can be any time-anchored event. This is where the technique earns its keep.
| Cohort definition | Question it answers |
|---|---|
| First payment date | Do paying users from Q1 churn faster than Q2? |
| First touch acquisition channel + month | Does paid Google retain worse than organic for D30? |
| First device (iOS / Android) | Are iOS launches stickier after the new onboarding flow? |
| Plan tier at signup | Do free-tier-first users upgrade more than paid-from-day-one? |
| Feature first-use date | When users first try the dashboard, do they stick around? |
Each variant uses the same SQL skeleton — only the cohort_month definition changes. The technique is fixed; the anchor event is the variable you tune to match the question.
Revenue cohorts are especially useful. Instead of COUNT(DISTINCT user_id), the aggregation becomes SUM(revenue) divided by the original cohort size. A healthy revenue cohort grows over time as loyal users expand — this is net dollar retention above 100%, the signal that drives SaaS valuations.
Common pitfalls
The first and most common mistake is calling a segment a cohort. A cohort must have a time origin. "Users in Tier A" is a segment; "users who entered Tier A in March 2026" is a cohort. Without the time anchor, you cannot read a row as a lifecycle, which defeats the whole point. The fix is mechanical: every cohort definition must include MIN(event_at) or an equivalent timestamp grouping.
A second trap is choosing the wrong cohort granularity. Monthly cohorts smooth over weekly product changes — a feature that shipped on the 28th gets averaged with a feature that shipped on the 5th. Weekly cohorts surface those changes but introduce noise when cohort sizes are small. The rule of thumb is to match cohort width to release cadence: weekly cohorts for fast-moving consumer products, monthly for B2B SaaS, quarterly only for very slow contractual products.
The third pitfall is comparing cohorts of unequal observation length. The April cohort has only been alive for one month; the January cohort has been alive for four. You cannot compare their M3 retention because April's M3 does not exist yet. Always truncate comparisons to the shortest available period across the cohorts you are comparing, or you will fool yourself into thinking new cohorts are better when they are simply younger.
Pitfall four is small-cohort noise. A cohort of fifteen users will swing wildly with a single user's behavior. Below roughly 100 users per cohort, the retention numbers are mostly noise. The fix is either to widen the cohort window (weekly into monthly), or to compute confidence intervals around each cell and visualize the uncertainty.
The final classic mistake is ignoring survivorship bias in cohort comparisons. When you look at "users who reached day 30," you are conditioning on survival, which inflates every downstream metric. Cohorts side-step this by reporting the share of the original signup cohort that is still active — not the share of survivors. If you ever see a cohort retention curve that goes up after the first period, someone redefined the denominator mid-table.
Related reading
- What is a cohort explained simply
- SQL for cohort analysis
- SQL for cohort retention
- How to calculate cohort retention in SQL
- How to calculate LTV by cohort in SQL
- Cohort analysis for the data science interview
If you want to drill cohort questions like these every day on real datasets, NAILDD has hundreds of interview-grade SQL problems built around exactly this pattern.
FAQ
When should I use a cohort instead of a funnel?
Funnels and cohorts answer different questions and you usually need both. A funnel tracks what fraction of users complete a sequence of steps — signup, activate, pay — and is bounded by the steps you define. A cohort tracks what fraction of users are still active N periods later and is bounded by time. If your question is "where do people drop off in onboarding," use a funnel. If it is "is the product getting better at retaining people," use a cohort.
What cohort window should I pick?
Match the window to your release cadence and your traffic volume. Consumer apps with daily releases and millions of users do well with weekly cohorts. B2B SaaS with quarterly releases and a few thousand signups per month do better with monthly cohorts. If your cohort cells contain fewer than 100 users on average, the window is too narrow and the table is noise.
How many cohorts should I include in a chart?
Six to twelve is the sweet spot. Fewer than six and you cannot see a trend; more than twelve and the chart becomes unreadable and the older cohorts reflect a product that no longer exists. For trend analysis, the most recent six cohorts carry almost all the signal — older ones are useful as a baseline but rarely tell you what to do next.
Should I cohort by signup date or by first-paid date?
Both, for different reasons. Signup-date cohorts measure top-of-funnel and activation. First-paid-date cohorts measure post-monetization retention and revenue expansion, which is what investors actually care about for SaaS. If you only build one, build signup cohorts — they cover everything downstream. If you can build two, layer first-paid on top to track the paying-customer lifecycle separately.
How do I visualize a cohort table?
The two standard formats are heatmaps (color-coded matrix with cohorts on rows, periods on columns) and retention curves (line chart with one line per cohort, period on the x-axis, retention on the y-axis). Heatmaps are better for spotting anomalies in a dense matrix; curves are better for presenting a clear story to non-analysts. Most analyst dashboards include both.
Can cohorts confirm causation?
No — cohorts are descriptive, not causal. They will show you when a regression started, which is enormously helpful for narrowing a hypothesis, but they cannot tell you why. For causality you need a controlled experiment (A/B test) or a quasi-experimental method like difference-in-differences. Treat the cohort table as a diagnostic instrument, not a verdict.