Churn prediction modeling guide
Why churn prediction matters
Your PM walks over Monday morning and says the team needs a model that flags at-risk paying users before they cancel, so lifecycle can target a save offer. You have two weeks, a Snowflake warehouse, and a vague instruction to use "the standard ML stack". That is the brief for most churn projects at companies like Stripe, Notion, or DoorDash, not a Kaggle competition with a clean train/test split.
Churn prediction means estimating, per user, the probability they will be inactive or canceled by some horizon — usually 7, 14, or 30 days out. If you only need the headline rate, see churn explained simply; this post is about building the model. The business case rests on three facts: retention is roughly 5x–7x cheaper than acquisition, targeted save offers beat blanket discounts on margin, and feature importances feed straight back into the product roadmap.
Defining a churned user
This is the step that quietly kills most churn projects. Skip it and your "AUC 0.91" model predicts something nobody can act on.
For subscription products (Netflix, Linear, any SaaS tool), the definition is mechanical: canceled, downgraded to free, or failed to renew. Involuntary churn — failed payment with no recovery — is usually modeled separately because the levers differ. For free or freemium products there is no cancel event, so you invent a threshold: churned = no app open in N days, where N is anchored to the inter-session-gap distribution. If 90% of returning users come back within 14 days, a 14-day gap is a defensible churn definition.
```sql
-- Distribution of inter-session gaps to pick a churn threshold
WITH user_sessions AS (
    SELECT user_id,
           created_at,
           LAG(created_at) OVER (
               PARTITION BY user_id ORDER BY created_at
           ) AS prev_session
    FROM events
    WHERE event_name = 'session_start'
),
gaps AS (
    SELECT user_id,
           EXTRACT(DAY FROM created_at - prev_session) AS days_gap
    FROM user_sessions
    WHERE prev_session IS NOT NULL
)
SELECT days_gap,
       COUNT(*) AS frequency,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM gaps
GROUP BY days_gap
ORDER BY days_gap;
```

Load-bearing trick: lock the churn definition in writing before you build any features. Half of model-vs-stakeholder fights trace back to silently shifting definitions mid-project.
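To turn a gap distribution into a concrete N, one option is to take the smallest gap that covers a target share of observed returns. The sketch below assumes the gaps have already been pulled into a plain list of day counts; `churn_threshold_days`, its `coverage` parameter, and the sample values are illustrative names and data, not part of any library or of the original analysis.

```python
import math


def churn_threshold_days(gaps_days, coverage=0.90):
    """Smallest N such that at least `coverage` of return gaps are <= N days."""
    if not gaps_days:
        raise ValueError("need at least one inter-session gap")
    ordered = sorted(gaps_days)
    # Position of the first gap at which the cumulative share reaches the target.
    idx = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(idx, 0)]


# Hypothetical gaps (in days) between consecutive sessions of returning users.
gaps = [1, 1, 2, 2, 3, 3, 5, 7, 14, 30]
print(churn_threshold_days(gaps))  # 9 of the 10 gaps are 14 days or shorter
```

The 90% coverage default mirrors the example in the text; it is a judgment call to agree with product, not a statistical constant.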
Features that actually move AUC
Feature engineering is roughly 80% of model quality. The categories below tend to dominate importance plots across most B2C and B2B SaaS use cases.
| Category | Example features | Why it works |
|---|---|---|
| Recency | Days since last session, days since last paid event | Closest thing to a leading indicator |
| Activity volume | Sessions in last 7d / 14d / 30d, key-action counts | Captures absolute engagement level |
| Trend | Ratio of last-7d to prior-7d sessions, weekly activity slope | Catches users still active but declining |
| User profile | Days since signup, acquisition channel, plan tier, device | Stable context that interacts with behavior |
| Engagement depth | Features touched, screens per session, aha-moment reached | Distinguishes habitual users from drive-by ones |
| Monetization | Total spend, days since last purchase, refund history | Strong for paid products; near-useless for free-only |
Recency and activity trend are the top two predictors in almost every churn model I have seen ship. Static profile features (channel, device) tend to land near the bottom of the importance plot. For time-shaped features, patterns from time series feature engineering translate directly: rolling means, ratios over multiple windows, and gap statistics consistently outperform single-window snapshots.
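As a rough illustration of the recency, volume, and trend rows in the table, here is a minimal sketch that derives them from one user's session dates. The function name, the `as_of` argument, and the +1 smoothing in the ratio are assumptions made for this example; in practice the same logic usually lives in SQL or a dataframe pipeline.

```python
from datetime import date, timedelta


def recency_and_trend(session_dates, as_of):
    """Recency, volume, and trend features from one user's session dates.

    Only sessions strictly before `as_of` are used, so nothing from the
    label window leaks into the features.
    """
    past = [d for d in session_dates if d < as_of]
    week = timedelta(days=7)
    last_7d = sum(1 for d in past if as_of - d <= week)
    prev_7d = sum(1 for d in past if week < as_of - d <= 2 * week)
    return {
        "days_since_last_visit": (as_of - max(past)).days if past else None,
        "sessions_last_7d": last_7d,
        "sessions_prev_7d": prev_7d,
        # +1 smoothing (an arbitrary choice) avoids division by zero
        # for users with no sessions in the prior week.
        "activity_trend": (last_7d + 1) / (prev_7d + 1),
    }


# Hypothetical user: busy two weeks ago, one visit this week.
sessions = [date(2024, 3, 1), date(2024, 3, 3), date(2024, 3, 5),
            date(2024, 3, 6), date(2024, 3, 12), date(2024, 3, 20)]
features = recency_and_trend(sessions, as_of=date(2024, 3, 15))
print(features)
```

An `activity_trend` well below 1 is the "still active but declining" pattern the Trend row describes.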
Rules-based baseline
Before training anything, write the dumbest possible scorecard. It gives you a baseline to beat, and if the ML model only matches it, you have a much cheaper system to ship.
```python
import pandas as pd


def churn_risk_score(user: pd.Series) -> int:
    """Simple rules-based score. Higher = more at risk."""
    score = 0
    if user['days_since_last_visit'] > 14:
        score += 3
    elif user['days_since_last_visit'] > 7:
        score += 2
    elif user['days_since_last_visit'] > 3:
        score += 1
    if user['sessions_last_7d'] < user['sessions_prev_7d'] * 0.5:
        score += 2
    if user['features_used'] < 3:
        score += 1
    if user['total_purchases'] == 0:
        score += 1
    return score  # 0–7, higher = higher predicted churn risk
```

Transparent, ships in an afternoon, gives lifecycle a priority list. Downsides: vibes-based thresholds, no way to express interactions like high spend with flat activity versus low spend with falling activity.
The ML approach
Once you have a labeled dataset — one row per user, one binary churned column — churn is plain binary classification. Two models earn their keep before anything fancier: logistic regression and a tree ensemble.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score

df = pd.read_csv('user_features.csv')

features = [
    'days_since_last_visit', 'sessions_last_7d', 'sessions_last_30d',
    'activity_trend', 'features_used', 'session_depth_avg',
    'days_since_signup', 'total_purchases', 'is_premium',
]

# Stratify so train and test keep the same churn rate
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['churned'],
    test_size=0.2, random_state=42, stratify=df['churned'],
)

# Fit the scaler on train only, then apply it to test
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

model_lr = LogisticRegression(
    random_state=42, class_weight='balanced', max_iter=1000,
)
model_lr.fit(X_train_s, y_train)

# Probability of the positive (churned) class
y_prob = model_lr.predict_proba(X_test_s)[:, 1]
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")
```

Logistic regression is the right baseline: fast, interpretable, and the coefficients translate into statements a PM can use, such as "+0.3 log-odds of churn per one-standard-deviation increase in days inactive". The unit is a standard deviation rather than a day because the features are standardized before fitting. The class_weight='balanced' flag matters because churned users are almost always the minority class (5%–15% in most subscription products), and without it the model will happily predict "active" for everyone and look 85%–95% accurate.
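Because the model above is fit on `StandardScaler` output, each coefficient is a change in log-odds per standard deviation of the feature, not per raw unit. A small sketch of the conversion back to raw units follows; the helper name and the numbers are made up for illustration, and in the pipeline above the standard deviations would come from `scaler.scale_`.

```python
import math


def per_unit_effect(coef_standardized, feature_std):
    """Convert a coefficient on a standardized feature to raw units.

    Returns (log-odds change per raw unit, odds ratio per raw unit).
    """
    log_odds_per_unit = coef_standardized / feature_std
    return log_odds_per_unit, math.exp(log_odds_per_unit)


# Hypothetical: coefficient 0.9 on standardized days_since_last_visit,
# where one standard deviation equals 3 days.
log_odds, odds_ratio = per_unit_effect(0.9, 3.0)
print(f"{log_odds:.2f} log-odds per day, odds ratio {odds_ratio:.2f} per day")
```

Quoting the odds ratio ("each extra inactive day multiplies the odds of churn by about 1.35" in this made-up case) tends to land better with non-technical readers than log-odds.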
Random Forest (or gradient boosting via lightgbm / xgboost) is your second swing. It captures non-linear interactions, does not need scaling, and the importance plot is useful for the readout deck.
```python
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(
    n_estimators=200, max_depth=10,
    class_weight='balanced', random_state=42, n_jobs=-1,
)
model_rf.fit(X_train, y_train)

importance = pd.DataFrame({
    'feature': features,
    'importance': model_rf.feature_importances_,
}).sort_values('importance', ascending=False)
```

A healthy top-3 looks like days_since_last_visit, activity_trend, sessions_last_7d. If your top features are static (signup channel, device), suspect leakage or a wrong label window.
For validation, do not just train_test_split on rows — split on time. Train on users labeled Jan–Apr, validate on May, test on June. The cross-validation strategies post covers time-aware splits in detail.
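A time-based split can be as simple as filtering on the date each row was labeled. This is a minimal sketch, assuming each row is a dict carrying a hypothetical `snapshot_date` field; the function name and cutoffs are illustrative.

```python
from datetime import date


def time_split(rows, valid_start, test_start):
    """Split labeled rows into train / validation / test by snapshot date."""
    train = [r for r in rows if r["snapshot_date"] < valid_start]
    valid = [r for r in rows if valid_start <= r["snapshot_date"] < test_start]
    test = [r for r in rows if r["snapshot_date"] >= test_start]
    return train, valid, test


# Hypothetical rows: train on Jan–Apr, validate on May, test on June.
rows = [
    {"user_id": 1, "snapshot_date": date(2024, 1, 15), "churned": 0},
    {"user_id": 2, "snapshot_date": date(2024, 4, 30), "churned": 1},
    {"user_id": 3, "snapshot_date": date(2024, 5, 1), "churned": 0},
    {"user_id": 4, "snapshot_date": date(2024, 5, 31), "churned": 0},
    {"user_id": 5, "snapshot_date": date(2024, 6, 1), "churned": 1},
]
train, valid, test = time_split(rows, date(2024, 5, 1), date(2024, 6, 1))
print(len(train), len(valid), len(test))
```

Any scaler or encoder should then be fit on the train slice only, for the same reason the split is time-ordered.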
Picking the right metric
Accuracy is the wrong default. With a 5% churn rate, the "everyone stays" model gets 95% accuracy and is operationally useless.
| Metric | What it captures | Use when |
|---|---|---|
| ROC AUC | Ranking quality across all thresholds | Default reporting metric; threshold-independent |
| PR AUC | Ranking quality under heavy imbalance | Churn rate below ~5% |
| Precision@K | Of top-K predicted, share who actually churn | You can only act on K users per day |
| Recall@K | Of true churners, share captured in top-K | Each saved user is high-LTV |
| F1 | Harmonic mean at a chosen threshold | You need one number for a binary decision |
Sanity check: if lifecycle can call 500 users a day and you have 50,000 active users, Precision@500 is the only metric that matters at launch. Optimize for it, report it next to AUC, and stop arguing about F1.
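Precision@K and Recall@K are easy to compute by hand once users are ranked by score. The sketch below is one straightforward way to do it; the function name and the toy labels and scores are illustrative.

```python
def precision_recall_at_k(y_true, y_score, k):
    """Precision and recall among the k highest-scored users.

    y_true holds 0/1 churn labels, y_score the predicted probabilities.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    hits = sum(label for _, label in top)
    total_churners = sum(y_true)
    precision = hits / len(top) if top else 0.0
    recall = hits / total_churners if total_churners else 0.0
    return precision, recall


# Toy example: 8 users, 3 true churners, outreach budget of 3 users.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
precision, recall = precision_recall_at_k(y_true, y_score, k=3)
print(f"Precision@3 = {precision:.2f}, Recall@3 = {recall:.2f}")
```

Setting K to the team's real daily capacity keeps the metric tied to what lifecycle can actually act on.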
From scores to interventions
A probability without an action is a research artifact. A reasonable default segmentation: high risk (p > 0.70) gets expensive actions — a CS call for B2B, a hand-tuned save offer for B2C, a cancel-reason survey. Medium risk (0.40 < p ≤ 0.70) gets lightweight nudges — re-engagement email, a push highlighting an underused feature, a small discount. Low risk (p ≤ 0.40) gets nothing; spending on the safe segment dilutes ROI and trains the team to chase the wrong gradient.
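The segmentation above maps directly to a small function. The cutoffs are the defaults from the text; the function name and the segment labels are illustrative, and the thresholds should be revisited once the model is calibrated on your own data.

```python
def risk_segment(p, high=0.70, medium=0.40):
    """Map a churn probability to a risk segment using the default cutoffs."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    if p > high:
        return "high"    # expensive actions: CS call, hand-tuned save offer
    if p > medium:
        return "medium"  # lightweight nudges: email, push, small discount
    return "low"         # no action


for p in (0.85, 0.70, 0.41, 0.40):
    print(p, risk_segment(p))
```

Note the boundary behavior: exactly 0.70 falls in the medium segment and exactly 0.40 in the low one, matching the inequalities in the text.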
The non-negotiable next step is an A/B test. Treatment gets the intervention, holdout does not, churn rate is measured at the same horizon used in training. The mechanics are in holdout vs A/B testing in practice.
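For the readout, a two-proportion z-test is one common way to check whether the churn gap between treatment and holdout is more than noise. The text does not prescribe a specific test, so treat this as a sketch under that assumption; the function name and the counts are made up.

```python
import math


def churn_ab_test(churned_t, n_t, churned_c, n_c):
    """Two-sided two-proportion z-test on churn rates.

    Returns (absolute churn reduction, z statistic, p-value), where the
    reduction is holdout churn rate minus treatment churn rate.
    """
    p_t, p_c = churned_t / n_t, churned_c / n_c
    pooled = (churned_t + churned_c) / (n_t + n_c)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    if se == 0:
        raise ValueError("no variation in outcomes; test is undefined")
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return p_c - p_t, z, p_value


# Hypothetical result: 80 of 1,000 treated users churned vs 100 of 1,000 in holdout.
reduction, z, p_value = churn_ab_test(80, 1000, 100, 1000)
print(f"reduction={reduction:.3f}, z={z:.2f}, p={p_value:.3f}")
```

In this made-up case a two-point drop on 1,000 users per arm is not significant at the usual 5% level, which is a reminder to size the holdout before launch rather than after.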
Then retrain. Behavior drifts, the product ships new features, the funnel changes. Quarterly retrain is the floor; monthly is better. Track ROC AUC and Precision@K of the live model and retrain when either drops more than a few points.
Common pitfalls
Data leakage is the number one killer. You include account_deleted_flag or last_cancellation_reason as a feature, the model nails the test set, then collapses in production because those columns are populated after churn. Audit every feature with one question: would this value be knowable at the moment of prediction? If not, drop it.
Class imbalance is the second. Churned users are typically 5%–15% of the labeled set, and a model trained without rebalancing will happily predict "active" for everyone. Use class_weight='balanced', oversample with SMOTE, or optimize against AUC / F1 instead of accuracy. Whichever you pick, document it — reviewers will ask.
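To see what `class_weight='balanced'` does, it helps to compute the weights by hand. scikit-learn documents the balanced weight for a class as `n_samples / (n_classes * class_count)`; the helper below reproduces that formula in plain Python on made-up labels and is not a replacement for the library.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights proportional to inverse class frequency.

    weight[c] = n_samples / (n_classes * count[c])
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}


# Hypothetical labeled set with a 5% churn rate.
labels = [0] * 95 + [1] * 5
print(balanced_class_weights(labels))
```

With 5% churn, each churned row counts roughly 19 times as much as an active row in the loss, which is why the model can no longer win by predicting "active" for everyone.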
Wrong horizon trips up more juniors than seniors. Predicting 30-day churn from 7 days of behavior is hard because the signal lives in the longer window you did not give the model. The feature window should be at least 2x–3x the prediction horizon — for a 30-day prediction, use 60–90 days of history.
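The window arithmetic is worth writing down explicitly, because it also guards against leakage. This is a minimal sketch, assuming an inactivity-based churn definition and a fixed prediction date per row; the function names and the 3x default multiple are illustrative.

```python
from datetime import date, timedelta


def window_bounds(prediction_date, horizon_days=30, history_multiple=3):
    """Start of the feature window and end of the label window."""
    feature_start = prediction_date - timedelta(days=horizon_days * history_multiple)
    label_end = prediction_date + timedelta(days=horizon_days)
    return feature_start, label_end


def label_churned(session_dates, prediction_date, horizon_days=30):
    """1 if the user has no session in [prediction_date, prediction_date + horizon)."""
    _, label_end = window_bounds(prediction_date, horizon_days)
    active = any(prediction_date <= d < label_end for d in session_dates)
    return 0 if active else 1


prediction_date = date(2024, 6, 1)
print(window_bounds(prediction_date))  # features from 90 days back, label 30 days ahead
print(label_churned([date(2024, 5, 20)], prediction_date))
```

Features should be computed only from events in `[feature_start, prediction_date)`, and labels only from events on or after `prediction_date`.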
Concept drift ages a model after launch. The model was trained on Q1 users who went through the old onboarding; in Q3 the team shipped a new onboarding and feature distributions shifted. Scores are still produced, but they are miscalibrated. Monitor input and prediction distributions, not just downstream churn, so you spot drift before the business does.
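One common way to monitor an input or score distribution is the population stability index (PSI), which compares binned shares between a reference sample and a live sample. The text does not name a specific drift metric, so this is one option among several; the function, the equal-width binning, and the toy data are assumptions made for the example.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between a reference and a live sample.

    Bins are equal-width over the reference range; live values outside
    that range are clipped into the first or last bin.
    """
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back if the reference is constant

    def shares(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0)
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Toy data: live scores squeezed into the lower half of the reference range.
reference = [i / 100 for i in range(100)]
live = [v * 0.5 for v in reference]
print(f"PSI = {psi(reference, live):.2f}")
```

A frequently quoted rule of thumb treats PSI above roughly 0.1 as worth a look and above roughly 0.25 as a significant shift, but those cutoffs are conventions rather than guarantees.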
Confusing the model with the action. A high churn probability is not a cause. The model says a user looks like historical churners; it does not say why. Pair the score with exit surveys and cohort analysis on at-risk segments, or you will ship discount coupons instead of product fixes.
Interview-style questions
How would you approach a churn prediction project from scratch? Nail the churn definition with product — subscription cancel, or N-day inactivity with N from the gap distribution. Build a labeled dataset with a time-respecting split. Engineer recency, trend, and activity features. Train logistic regression as a baseline, then a tree ensemble. Evaluate with ROC AUC plus Precision@K tied to the operational budget. Hand off to lifecycle with risk segments and A/B test before claiming success.
Which features matter most? Recency, short-window activity trend, and last-7d session count dominate. Static profile features land near the bottom. If your top features are static, suspect leakage or a wrong label window.
Why is accuracy a bad default metric? Churn is heavily imbalanced. With 5% churn, predicting "active" for everyone gets 95% accuracy and is worthless. Use ROC AUC for ranking, PR AUC if churn is below 5%, and Precision@K tied to the number of users the team can intervene on.
What is data leakage and how do you prevent it? Leakage is a feature containing information unavailable at prediction time — cancellation reason, deleted-account flag, anything populated after the event. Prevention: time-respecting splits, an explicit prediction timestamp per row, and a feature audit.
How would you measure business impact? A/B test the intervention. Treatment gets the save action, holdout does not, measure churn rate at the training horizon. Business case is (saved users) * LTV - intervention cost. Anything else is a vanity number.
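The business-case formula above is simple enough to check with a worked example. The sketch assumes saved users are estimated as the churn-rate gap between holdout and treatment multiplied by the number of treated users, and that cost scales per treated user; both are assumptions of this example, and all numbers are made up.

```python
def net_save_impact(treated_users, control_churn_rate, treatment_churn_rate,
                    ltv, cost_per_treated_user):
    """(saved users) * LTV - intervention cost, from A/B test results."""
    saved_users = (control_churn_rate - treatment_churn_rate) * treated_users
    intervention_cost = cost_per_treated_user * treated_users
    return saved_users * ltv - intervention_cost


# Hypothetical: 1,000 treated users, churn 10% in holdout vs 8% in treatment,
# LTV of 300 per user, 2 spent per treated user.
impact = net_save_impact(1000, 0.10, 0.08, ltv=300, cost_per_treated_user=2)
print(f"Net impact: {impact:.0f}")
```

Here 20 saved users at 300 each is 6,000 in retained value against 2,000 in cost, so the net is about 4,000; the same function returns a negative number when the offer costs more than it saves.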
Related reading
- Churn explained simply
- How to calculate churn risk score in SQL
- How to calculate involuntary churn in SQL
- Cross-validation strategies
- Feature engineering interview prep
- Holdout vs A/B testing in practice
FAQ
Do I need ML for churn prediction, or are rules enough?
For an MVP or a small user base, rules are often enough. A scorecard built on recency, trend, and feature adoption gives lifecycle a useful priority list. ML earns its keep at tens of thousands of users with dozens of candidate features and a real cost to misallocating outreach. Honest rule of thumb: ship rules in week one, train the ML model in weeks two through four, and keep ML only if it beats rules on Precision@K by a margin that justifies maintenance.
How often should I retrain the model?
Quarterly is the floor for most subscription products. Monthly is better if the product ships features fast or the audience is changing. The trigger to retrain sooner is a measurable drop in live ROC AUC or Precision@K versus deploy-time values, or visible drift in input feature distributions.
Logistic regression or random forest — which should I pick?
Start with logistic regression. It trains in seconds, coefficients are interpretable, and it is the right baseline. Move to random forest or gradient boosting only if the tree model beats it by a meaningful margin — typically 0.02–0.03 of AUC. For interviews, demonstrating you understand the trade-off matters more than picking the "right" model.
How do I handle severe imbalance, like 1% churn rate?
Three options, often combined. Class weighting (class_weight='balanced') is the cheapest. Oversampling with SMOTE helps tree models. Switching the optimization target from accuracy to PR AUC or Precision@K scores the model on the rare class. With 1% churn, prefer PR AUC over ROC AUC, since ROC AUC stays deceptively high under extreme imbalance.
What prediction horizon should I use?
Match the horizon to the action window. If lifecycle can act within a week, predict 7-day churn. If the save campaign runs over a month, predict 30-day churn. The feature window should be at least 2x–3x the horizon. Longer horizons mean lower precision; start short and lengthen only if the action requires it.