SQL for cohort analysis

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Contents: Why cohorts matter, full SQL, step-by-step, ClickHouse and BigQuery variants, pivot, retention flavors, cohort definitions, pitfalls, performance, heatmap reading, related reading, FAQ.

Why cohort analysis is non-negotiable

Cohort analysis is the load-bearing wall of any consumer or SaaS product. If your CEO asks whether the product is improving and you point at a flat DAU chart, you answered a different question. DAU goes up when marketing turns on the firehose. Retention by cohort tells you whether the users you acquired in March behave better than those acquired in January once both have had the same time to settle in.

The scenario is the same at Stripe, DoorDash, Notion, and every Snowflake shop in between. A PM messages on Sunday night: "Need a retention heatmap by signup month for standup tomorrow, did onboarding v2 move anything?" If you cannot write that SQL in fifteen minutes from raw events, you do not pass senior analyst screening at Meta, Airbnb, or Linear.

This post walks the query end to end: a working Postgres version, the CTEs explained, ports to ClickHouse and BigQuery, the pivot to wide format, and the mistakes that destroy the numbers.

What a cohort actually is

A cohort is a group of users who share an entry event in the same time bucket. The default is signup month: every user who first registered in January 2026 belongs to the 2026-01 cohort forever. Retention is tracked at offsets — month 0, month 1, month 2 — and compared across rows.

The output is a triangular matrix. Rows are cohorts in chronological order. Columns are offsets from the cohort start. Each cell is the percentage of the cohort active at that offset.

Cohort Month 0 Month 1 Month 2 Month 3
Jan 26 100% 45% 30% 25%
Feb 26 100% 50% 32% 27%
Mar 26 100% 52% 35% n/a

Each row decays left to right. Healthy products show newer cohorts retaining at least as well as older ones at the same offset. Sharp drops on a row reveal a tracking bug; sharp drops on a column reveal something seasonal.

The full SQL query

Postgres-flavored query against a users table with created_at and an events table with user_id and created_at.

WITH cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('month', MIN(created_at)) AS cohort_month
  FROM users
  GROUP BY user_id
),
user_activity AS (
  SELECT DISTINCT
    user_id,
    DATE_TRUNC('month', created_at) AS activity_month
  FROM events
),
cohort_sizes AS (
  SELECT cohort_month, COUNT(*) AS cohort_size
  FROM cohorts
  GROUP BY cohort_month
)
SELECT
  c.cohort_month,
  EXTRACT(MONTH FROM AGE(ua.activity_month, c.cohort_month))
    + EXTRACT(YEAR FROM AGE(ua.activity_month, c.cohort_month)) * 12
    AS months_since_signup,
  COUNT(DISTINCT ua.user_id) AS active_users,
  cs.cohort_size,
  ROUND(COUNT(DISTINCT ua.user_id) * 100.0 / cs.cohort_size, 2) AS retention_pct
FROM cohorts c
JOIN user_activity ua ON ua.user_id = c.user_id
JOIN cohort_sizes cs ON cs.cohort_month = c.cohort_month
WHERE ua.activity_month >= c.cohort_month
GROUP BY c.cohort_month, months_since_signup, cs.cohort_size
ORDER BY c.cohort_month, months_since_signup;

Four CTEs of intent, one aggregate. Producing this from a blank editor on a whiteboard demonstrates the core skill analytics interviewers check.

Step-by-step walkthrough

The cohorts CTE assigns every user to one cohort by taking the month of their first created_at. Some teams prefer MIN(signup_timestamp) from a dedicated signups table; cleaner if you have it. DATE_TRUNC rounds to the first of the month so every cohort key is canonical.

The user_activity CTE collapses raw events to one row per user_id per activity_month. The DISTINCT is doing real work — without it, a heavy user with thousands of events in March would inflate the count downstream.

The offset calculation is the only piece that is genuinely dialect-specific. Postgres uses AGE(). MySQL writes TIMESTAMPDIFF(MONTH, c.cohort_month, ua.activity_month). BigQuery uses DATE_DIFF(activity_month, cohort_month, MONTH). ClickHouse uses dateDiff('month', c.cohort_month, ua.activity_month).

The final aggregate counts distinct users per (cohort_month, months_since_signup) and divides by the cohort size computed in cohort_sizes. Doing the division inside the same SELECT keeps the result one row per cell, exactly what a downstream pivot expects.

ClickHouse and BigQuery variants

ClickHouse prefers native date functions and uniqExact over COUNT(DISTINCT).

WITH cohorts AS (
  SELECT
    user_id,
    toStartOfMonth(min(created_at)) AS cohort_month
  FROM users
  GROUP BY user_id
),
user_activity AS (
  SELECT DISTINCT
    user_id,
    toStartOfMonth(created_at) AS activity_month
  FROM events
)
SELECT
  c.cohort_month,
  dateDiff('month', c.cohort_month, ua.activity_month) AS months_since,
  uniqExact(ua.user_id) AS active_users
FROM cohorts c
JOIN user_activity ua USING user_id
GROUP BY c.cohort_month, months_since
ORDER BY c.cohort_month, months_since;

BigQuery prefers explicit DATE casts.

WITH cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC(DATE(MIN(created_at)), MONTH) AS cohort_month
  FROM `project.dataset.users`
  GROUP BY user_id
),
user_activity AS (
  SELECT DISTINCT
    user_id,
    DATE_TRUNC(DATE(created_at), MONTH) AS activity_month
  FROM `project.dataset.events`
)
SELECT
  c.cohort_month,
  DATE_DIFF(ua.activity_month, c.cohort_month, MONTH) AS months_since,
  COUNT(DISTINCT ua.user_id) AS active_users
FROM cohorts c
JOIN user_activity ua USING (user_id)
GROUP BY c.cohort_month, months_since
ORDER BY c.cohort_month, months_since;

All three produce the same long table. Reshaping is next.

Pivoting to a wide retention matrix

Dashboards want one row per cohort, one column per offset. Postgres has no clean PIVOT; the idiom is MAX(CASE WHEN ... THEN ...) assuming the long result above is materialized as cohort_data.

SELECT
  cohort_month,
  MAX(CASE WHEN months_since = 0 THEN retention_pct END) AS m0,
  MAX(CASE WHEN months_since = 1 THEN retention_pct END) AS m1,
  MAX(CASE WHEN months_since = 2 THEN retention_pct END) AS m2,
  MAX(CASE WHEN months_since = 3 THEN retention_pct END) AS m3,
  MAX(CASE WHEN months_since = 6 THEN retention_pct END) AS m6,
  MAX(CASE WHEN months_since = 12 THEN retention_pct END) AS m12
FROM cohort_data
GROUP BY cohort_month
ORDER BY cohort_month;

Snowflake and BigQuery have a real PIVOT operator with cleaner syntax. Keep both in your snippets — Databricks panels still expect the CASE WHEN idiom.

Sharpen SQL for your next interview
500+ SQL problems with worked solutions — joins, window functions, CTEs.
Join the waitlist

Retention flavors that matter

Strict retention asks: was the user active in exactly month N? The WHERE clause is activity_month = cohort_month + INTERVAL 'N months'. It is the harshest cut and what most papers cite.

Cumulative retention asks: was the user active in month N or any later month? Shift the clause to >= and accept a smoother curve. More useful for sparse-usage products like tax software or once-a-quarter dashboards.

Rolling retention asks: was the user active in the last K days, anchored to today? It blurs cohorts together and works as a guardrail rather than a matrix. Amplitude and Mixpanel default to strict retention because it surfaces churn cliffs.

Cohort definitions beyond signup month

Signup month is the lazy default. Use signup week once volume allows — weekly cohorts catch product changes that monthly buckets average out. One character changes: DATE_TRUNC('week', MIN(created_at)).

Transactional products like Uber or DoorDash should cohort by first purchase, not first signup. Plenty of users sign up, never order, and pollute the denominator. Use MIN(order_at) from the orders table and require at least one order.

Acquisition channel is the most underused dimension. Cohorting by (signup_month, acquisition_channel) reveals organic users retaining at 60% while paid-social users retain at 25% — information that changes the marketing budget conversation.

Feature-adoption cohorts are the advanced move. Cohort by the month a user first hit an activation event — "first three messages sent" for a chat product, "first invited teammate" for a collaboration tool. Retention anchored to activation is dramatically better than retention anchored to signup, and these are the curves product teams use for OKRs.

Common pitfalls

The pitfall in nearly every first attempt is conflating cohort_month with today's date. People write cohort_month = CURRENT_DATE instead of cohort_month = MIN(created_at) and get a single row of garbage. The cohort is defined at entry time and frozen — it does not change as the user ages. Read the query out loud and check that the cohort assignment lives inside an aggregate over the user, not a filter on the current run.

A subtler trap is missing or misfiring instrumentation. If the events table only started recording app_open on March 1, every cohort before March looks like it dropped off a cliff when in fact you have no data. Sanity-check the earliest event date per event type before computing retention; if instrumentation changed, the analysis has to start at the change.

Timezones eat at least one analyst-day per quarter on every team. A user who signs up at 23:50 UTC on January 31 lands in February locally and in January in UTC. If half your pipeline runs in UTC and the other half pulls from a warehouse already converted to local, cohort assignment disagrees with activity and the matrix becomes nonsense. Pick one timezone — UTC almost always — and convert at the edges.

Active definition drift is a slow killer. Last quarter "active" meant app_open. This quarter the new PM redefined it as "completed a session over 30 seconds." The curve looks worse overnight and someone files a ticket about the product regressing. Pin the active definition in a documented view and treat any change as a versioned migration.

Finally, the join. If you LEFT JOIN events onto cohorts, users with zero events get a NULL activity_month and your offset calculation may either drop them silently or treat them as month 0, inflating month 0 retention. An INNER JOIN keeps the matrix honest at the cost of dropping ghost users, which is the right tradeoff for retention reporting.

Performance and optimization

The highest-leverage index on Postgres is a compound index on events(user_id, created_at). It serves both the user-level and time-bucket groupings. Without it, runtime grows quadratically with horizon.

For a ten-million-user table the unoptimized query takes minutes; with the index and a materialized user_activity refreshed nightly, it drops to seconds. Snowflake and BigQuery handle scale but charge for it — pre-aggregating to a daily-active-users table reduces bytes scanned by an order of magnitude.

Sampling is the iteration escape hatch. Pick a deterministic 10% of user_id using a hash-mod trick. Once the query is right, drop the sample and run on full data for the final number.

For ClickHouse, prefer AggregatingMergeTree materialized views that maintain (cohort_month, activity_month, uniqState(user_id)) incrementally. The dashboard queries the pre-aggregated state in milliseconds and never touches raw events.

Reading the heatmap

Once the matrix lands in a dashboard, the reading happens. Trace the diagonals — month-1 retention for the last six cohorts in a row. If that number is climbing, the product is improving. If it is flat, nothing the team did this quarter moved the needle.

Look for cliffs. A sharp drop between offset 1 and 2 across multiple cohorts means something keeps users for a month and then breaks. A common culprit is a trial that expires at day 30 with no in-app reminder — the SQL flags it, the product team fixes it, the next cohort's matrix shows the cliff softened.

To drill exactly this kind of cohort query against real interview questions every day, NAILDD is launching with hundreds of SQL problems mapped to retention, LTV, funnels, and A/B deltas.

FAQ

Should cohorts be weekly or monthly?

Monthly cohorts are the default because they smooth weekly noise — payday weeks, holiday weeks — into a comparable bucket. Weekly cohorts are appropriate once each weekly bucket still contains a thousand or more users; below that the percentage swings are noise. For early-stage consumer apps under a few hundred thousand monthly signups, monthly is right. For high-volume marketplaces, weekly catches product changes that monthly averages out.

Is rolling retention better than static cohort retention?

Static cohort retention is better for trend analysis because it lets you compare like-for-like cohorts across time. Rolling retention compresses everything into a single recent window and hides which cohort is responsible for movement, which makes it useless for deciding whether your last product change helped. Rolling is fine as a single guardrail on an exec dashboard, but the matrix is what product teams need.

How long does the query take on ten million users?

On unoptimized Postgres with no helpful indexes, expect several minutes. With a compound index on events(user_id, created_at) and a materialized view of user-month activity, the same query drops to seconds. On Snowflake or BigQuery the scan is parallelized and cost is the bottleneck, not time. Pre-aggregate to daily user activity before pivoting and cost drops roughly ten times.

What is the difference between cohort retention and LTV by cohort?

Cohort retention measures the percentage of a cohort still active at offset N. LTV by cohort measures the cumulative revenue per user in that cohort up to offset N. Retention is a denominator-free probability; LTV is a dollar amount. They share the same cohort spine, which is why teams build the retention matrix first and layer LTV on top with revenue events joined to the same cohort assignment.

How do I cohort by feature adoption instead of signup?

Replace MIN(created_at) in the cohorts CTE with MIN(event_at) FILTER (WHERE event_name = 'first_invite_sent') — or whatever your activation event is. Drop users who never hit the event. The rest of the query is unchanged, and the resulting matrix is the curve product teams optimize directly because it isolates engaged users.

Should I use INNER JOIN or LEFT JOIN for the activity table?

Use INNER JOIN. A LEFT JOIN keeps users with zero events as NULL rows that either get filtered by your offset calculation or sneak into month 0 as inflated activity. Count them in the cohort_size denominator (already correct) and let the inner join drop them from the numerator.