Cohort analysis in a Data Science interview
Contents:
Why cohorts show up in DS interviews
Cohort analysis is the single most common product-analytics question in Data Scientist interviews at consumer tech. Recruiters at Meta, Uber, DoorDash, and Airbnb almost always include some flavor of "we shipped a change in March — was retention better for the April cohort?" The expected answer is not a vibe; it is a week-by-week retention curve, a sanity check on cohort size, and a story about what could have confounded the result.
The reason interviewers love this topic is that it stress-tests three skills at once: SQL fluency (the join and window function), product judgment (acquisition vs behavioral cohorts), and statistical care around right-censoring for cohorts that have not lived long enough to have a Day 30 number. A candidate who flips the cohort definition mid-question, or who reads the triangle table column-wise when it should be row-wise, loses the loop in under five minutes.
This guide walks the framing a strong interviewer expects: cohort definition first, then the retention curve, then the triangle table, then the traps. The SQL is short. The interpretation is the hard part.
What a cohort actually is
A cohort is a group of users who share a defining characteristic at a point in time. The characteristic is the part you have to nail in the first sentence of your answer.
There are two flavors interviewers care about:
| Cohort type | Defining event | Typical example | Common interview prompt |
|---|---|---|---|
| Acquisition cohort | Signup, install, or first session | "Users who signed up in week 2026-W18" | "How did D7 retention move across April vs May cohorts?" |
| Behavioral cohort | Any action after signup | "Users who completed onboarding" | "Do users who add a friend in the first session retain better at Day 30?" |
| Revenue cohort | First paid event | "Users with first purchase in May" | "Compare LTV across May vs June paying cohorts" |
The acquisition cohort is the default; if the interviewer says "cohort" without a qualifier, assume signup-month. The behavioral cohort is the unlock when the question is about causal-looking claims ("does completing onboarding improve retention?"). The trap is that behavioral cohorts are self-selected, so any comparison against a control needs propensity matching or an actual experiment to mean anything.
Load-bearing rule: every cohort answer starts with one sentence defining (a) the cohort, (b) the granularity (week vs month), and (c) the metric and time horizon. If you skip any of the three, the interviewer will ask, and you will sound less crisp than the candidate who led with it.
Retention cohorts and the SQL
The most common cohort question is retention: of the users who signed up at time T, what fraction were still active at T+1, T+7, T+30, T+90? This is the question Mixpanel and Amplitude charts were built for, and the one you need to be able to write in SQL without thinking.
Here is the canonical pattern. Postgres dialect, week granularity, but the shape is identical in Snowflake and BigQuery — only the date functions change.
WITH cohorts AS (
SELECT
user_id,
DATE_TRUNC('week', signup_date)::DATE AS cohort_week
FROM users
),
activity AS (
SELECT
c.cohort_week,
FLOOR(EXTRACT(EPOCH FROM (e.event_date - c.cohort_week)) / (86400 * 7))::int AS week_n,
COUNT(DISTINCT e.user_id) AS active_users
FROM cohorts c
JOIN events e
ON e.user_id = c.user_id
AND e.event_date >= c.cohort_week
GROUP BY 1, 2
),
sized AS (
SELECT
cohort_week,
week_n,
active_users,
FIRST_VALUE(active_users) OVER (
PARTITION BY cohort_week ORDER BY week_n
) AS cohort_size
FROM activity
)
SELECT
cohort_week,
week_n,
active_users,
cohort_size,
ROUND(100.0 * active_users / cohort_size, 2) AS retention_pct
FROM sized
ORDER BY cohort_week, week_n;Three things to call out when you explain this to the interviewer:
DATE_TRUNCover the signup column sets the cohort granularity. Switching from'week'to'month'is the only change needed to flip the report.FIRST_VALUEover the cohort partition capturesweek_n = 0as the denominator, which is exactly the cohort size. If your data has a hole at week 0, this returns a misleading number — always verify Day 0 is 100% in your output.- The
JOINfiltere.event_date >= c.cohort_weekis mandatory. Without it, you double-count pre-signup events, which is more common than you'd think when the events table is backfilled from a third-party SDK.
A typical retention curve for a healthy consumer app looks like:
Cohort: users who signed up in 2026-W18 (1,000 users)
Week 0: 100.0% (1,000)
Week 1: 52.4% (524)
Week 2: 38.1% (381)
Week 4: 29.7% (297)
Week 8: 23.5% (235)
Week 12: 20.8% (208)If you see Week 1 retention below 30%, the onboarding is broken or the install attribution is loose (bot traffic, paid installs that never opened the app). If you see Week 12 above Week 8, you have a join bug — retention curves are weakly monotonic.
Behavioral cohorts
A behavioral cohort is defined by an action, not by signup. The classic interview prompt is: "users who added a payment method in their first session retain at 65% on Day 30 versus 25% for users who didn't. Should we force everyone to add a payment method during onboarding?"
The right answer is "no, not from this data alone." Users who added a payment method are self-selected: they were already more committed. Forcing the action on the control group will not magically grant them the 65% retention number. The trick the interviewer is checking is whether you spot the selection bias and propose either a (a) propensity-matched cohort comparison or (b) an A/B test that actually randomizes the prompt.
That said, behavioral cohorts are still valuable for finding leading indicators — first-session actions that correlate with Week 12 retention. Adding a friend, finishing the tutorial, importing a CSV become candidate north-star activation events. Power users who used the keyboard shortcut are a clean behavioral cohort for product-variant comparison.
| Behavioral cohort | Comparison group | Watch for |
|---|---|---|
| Completed onboarding | Started but did not complete | Selection bias from motivated users |
| First-session purchase | First-session non-purchaser | Confounded by intent at install |
| Invited a friend | Did not invite a friend | Network effect inflates retention |
| Used desktop within Day 7 | Mobile-only Day 7 | Device cohort, not behavior |
Reading the triangle table
The triangle table is the artifact every PM and analyst stares at. Rows are acquisition cohorts (newer at the bottom). Columns are time since acquisition. Cells are retention percentages. The diagonal is empty for cohorts that have not yet aged into that column — this is right-censoring, and confusing it with 0% retention is the single fastest way to fail this round.
| Cohort | Week 0 | Week 1 | Week 4 | Week 8 | Week 12 |
|---|---|---|---|---|---|
| 2026-Feb | 100% | 50% | 28% | 22% | 19% |
| 2026-Mar | 100% | 53% | 31% | 25% | 21% |
| 2026-Apr | 100% | 56% | 33% | 27% | — |
| 2026-May | 100% | 58% | 35% | — | — |
How to read this out loud to the interviewer:
- Down a column = same time-since-signup, different cohorts. This is the comparison you usually want. "Week 4 retention moved from 28% in February to 33% in April — the March onboarding change is holding."
- Across a row = one cohort decaying over time. Useful for fitting a decay curve and projecting LTV, but not for comparing product versions.
- The empty cells are not zeros. The May cohort has only existed for two weeks. Pretending an em-dash means 0% is the rookie move that interviewers grade on.
Sanity check: before you read a triangle table to anyone, confirm Week 0 is exactly 100% for every cohort. If a row starts at 97% or 102%, you have a date-bucketing bug — usually a timezone mismatch between signup_date and event_date.
Common pitfalls
The first pitfall is mixing acquisition and behavioral cohorts mid-answer. A candidate defines the cohort as "users who signed up in April" then a sentence later adds "and they all completed onboarding." Now the cohort is the intersection and the denominator is no longer the signup count. The fix is to commit to one definition before writing SQL; if you want both, build the acquisition cohort first and join the behavioral filter as a flagged dimension.
The second pitfall is treating right-censoring as zero. The newest cohort hasn't lived long enough to have a Day 30 number. The cell is empty, not zero. If you average across cohorts to get an overall Day 30 retention, exclude cohorts that have not aged to Day 30 — otherwise the average drops every Monday as new, half-aged cohorts dilute the number. This is the bug that makes a dashboard look like retention is falling when it is actually fine.
The third pitfall is using the wrong granularity for the question. A weekly cohort on a B2B SaaS with annual contracts is noise. A monthly cohort on a consumer app with Day 1 churn hides the most important drop. The rule of thumb is to match the cohort granularity to the natural cycle of the product: daily for games and social, weekly for utilities, monthly for SaaS, quarterly for enterprise.
The fourth pitfall is survivor bias when comparing cohort sizes. Older cohorts look smaller in the active-user count because they have churned more, not because they were smaller at signup. Always normalize to the cohort-size denominator (week_n = 0), and report retention percentages, not raw active counts, when comparing cohorts of different ages.
The fifth pitfall is forgetting to deduplicate events. If a user fires three "session_start" events in the same day, a naive COUNT(events) triples them. Use COUNT(DISTINCT user_id) inside the cohort week bucket. This is the bug that makes Week 1 retention look like 120% and ends the loop early.
Related reading
- SQL for cohort analysis
- How to calculate cohort retention in SQL
- How to calculate D1, D7, D30 retention in SQL
- How to calculate LTV by cohort in SQL
- SQL window functions interview questions
If you want to drill cohort and retention questions like this every day, NAILDD is launching with 500+ SQL and product analytics problems built around exactly this interview pattern.
FAQ
What is the difference between acquisition cohorts and behavioral cohorts?
An acquisition cohort groups users by when they signed up, installed, or first opened the product. A behavioral cohort groups users by an action they took afterward — completing onboarding, first purchase, inviting a friend. Acquisition cohorts are clean for tracking product health over time; behavioral cohorts are useful for finding leading indicators, but they carry selection bias and cannot support causal claims without matching or randomization.
How do I handle right-censoring in a cohort table?
Right-censoring means the cohort has not lived long enough to have a value at the time horizon you are reporting. The May cohort, observed in mid-May, does not have a Day 30 number yet. The correct treatment is to leave the cell empty (an em-dash, a NULL, or a blank) and exclude that cohort from any aggregate that asks "average Day 30 retention." Replacing the empty cell with zero will systematically drag your dashboard down every week and trigger false alarms in your weekly review.
Why does my Week 0 retention come back at 97% instead of 100%?
Almost always a timezone or date-bucketing mismatch between the signup table and the events table. If signup_date is UTC and event_date is local, a user who signs up at 11pm UTC Sunday lands in the next week's cohort but has a Week 0 event in the prior week — and the join condition filters them out. Cast both columns to the same timezone before truncating, and verify Week 0 is exactly 100% before publishing.
When should I use weekly cohorts versus monthly cohorts?
Match the cohort granularity to the natural cycle of the product. Consumer apps with daily-active patterns and short signup-to-churn windows usually want daily or weekly cohorts. B2B SaaS with monthly billing usually wants monthly cohorts. Enterprise products with annual contracts use quarterly cohorts. The signal you want to capture is "did the cohort meaningfully change between consecutive buckets" — if you can't see week-to-week change because the cohorts are too small, go monthly; if monthly hides the Day 1 drop, go weekly.
Is cohort analysis the same as survival analysis?
They overlap but are not identical. A retention curve is essentially a non-parametric Kaplan-Meier estimate, and a cohort triangle is a stratified version of the same idea. Survival analysis adds a parametric layer — Cox proportional hazards, Weibull fits — that lets you compare covariates while controlling for censoring. In a DS interview, cohort analysis is the descriptive baseline and survival analysis is the inferential follow-up.
How many cohorts before the trend is meaningful?
For consumer products with thousands of weekly signups, 6 to 8 consecutive cohorts is usually enough. Below that, you are watching noise. For smaller products under 100 users per cohort, smooth across longer windows and treat single-cohort wiggles as suspect until two consecutive cohorts confirm the move.