A/B test vs holdout test
Contents:
- Short answer first
- The A/B test setup
- The holdout test setup
- Side-by-side differences
- A worked example: checkout redesign
- Long-term holdouts in big tech
- Incrementality, the real reason holdouts exist
- Alternatives when a holdout will not fly
- Practical sizing and duration
- Common pitfalls
- Related reading
- FAQ
Short answer first
An A/B test compares two versions of a product change — control versus treatment — so the team can pick a winner. A holdout test deliberately keeps a small slice of users without the new thing for a long stretch of time, so the team can measure how much the new thing actually added in the wild. Same randomization mechanic, very different question.
Use an A/B test when the decision is "which version do we ship". Use a holdout when the decision has been made and you want to know "how much did the shipped thing buy us over weeks or months". Confusing the two is one of the fastest ways to lose credibility with a senior PM in an interview loop.
The A/B test setup
A classic two-arm A/B test splits the eligible audience roughly evenly. Fifty percent stay on the current experience, fifty percent see the new variant, and randomization happens at user, account, or session level. The experiment usually runs for two to four weeks — long enough to capture weekday and weekend behavior, short enough that calendar effects do not drift in.
The use cases are familiar: a UI tweak on the home feed, a new feature behind a flag, a pricing page redesign, a change to the ranking model on the For You shelf. The team is forced to choose between two well-defined options, and the test answers "which version is better on the metric we care about". When the test reads out, the winning variant gets the ramp.
A/B tests are built around a fixed sample size, a pre-registered metric, and a single planned readout. That is why peeking at the p-value daily is such a famous mistake. See "A/B testing peeking" under Related reading for why repeated daily checks can inflate the false-positive rate from the nominal 5 percent to roughly 30–40 percent, depending on how many looks are taken.
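To make the peeking problem concrete, here is a minimal simulation sketch. It assumes a pooled two-proportion z-test, 14 daily looks, and 100 users per arm per day; all of those parameters and the function name are illustrative choices, not values from a real experiment. Both arms share the same true conversion rate, so every significant result is a false positive.

```python
import random
from statistics import NormalDist


def false_positive_rate(peek, n_sims=1000, days=14, users_per_day=100,
                        base_rate=0.10, alpha=0.05, seed=0):
    """Share of no-effect experiments that get declared significant.

    With peek=True the test is checked after every day and stopped at the
    first significant result; otherwise it is checked only on the last day.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    false_positives = 0
    for _ in range(n_sims):
        conv_a = conv_b = n = 0
        for day in range(1, days + 1):
            # Same true rate in both arms: there is no real effect to find.
            conv_a += sum(rng.random() < base_rate for _ in range(users_per_day))
            conv_b += sum(rng.random() < base_rate for _ in range(users_per_day))
            n += users_per_day
            if not (peek or day == days):
                continue
            pooled = (conv_a + conv_b) / (2 * n)
            se = (2 * pooled * (1 - pooled) / n) ** 0.5
            if se > 0 and abs(conv_a - conv_b) / n / se > z_crit:
                false_positives += 1
                break
    return false_positives / n_sims


if __name__ == "__main__":
    print("single readout:", false_positive_rate(peek=False))
    print("daily peeking: ", false_positive_rate(peek=True))
```

The single-readout rate should sit near the nominal alpha, while the peeking rate comes out well above it. The size of the gap grows with the number of looks, so a longer test with more checks inflates the error further.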
The holdout test setup
A holdout flips the ratio. The treatment group is huge — 95, 99, even 99.5 percent of users — and a small protected slice (1–5 percent) is locked out of the feature for the long term. The duration is months, sometimes years. The point is to keep a clean comparison cohort that never tasted the new thing, so the team can answer "what did this feature add over the lifetime of a user".
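One common way to lock a small slice out for months is deterministic hash-based bucketing, so a user lands in the same group on every request without storing state. The sketch below is an assumption about how such assignment might be done, not a description of any particular platform; the function name, the salt, and the 0.01 percent bucket granularity are all hypothetical.

```python
import hashlib


def assign_variant(user_id: str, salt: str = "checkout_holdout",
                   holdout_pct: float = 1.0) -> str:
    """Deterministically map a user to 'holdout' or 'treatment'.

    The salt is hypothetical. Using a different salt per holdout keeps
    separate holdouts from landing on the same users.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # buckets 0..9999, 0.01% each
    return "holdout" if bucket < holdout_pct * 100 else "treatment"


if __name__ == "__main__":
    print(assign_variant("user_42"))
```

Because the result depends only on the salt and the user ID, the holdout stays stable across sessions and devices for as long as the salt is unchanged.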
The use cases are where it gets interesting. Companies run holdouts to measure the cumulative impact of a whole product line, not just one launch. They estimate the long-run lift from recommendations, push notifications, ads, personalization, and onboarding. Anything where value compounds — habit formation, network effects, content discovery — needs a long-horizon comparison group, and an A/B test that ends in three weeks cannot give you that.
Side-by-side differences
| Dimension | A/B test | Holdout test |
|---|---|---|
| Group split | ~50 / 50 | 95 / 5 or 99 / 1 |
| Duration | 2–4 weeks | months to years |
| Goal | pick a winner | measure incremental lift |
| Timing | before the rollout decision | after the rollout |
| Ethics | well-accepted | sometimes contested |
| Statistical risk | peeking, p-hacking | contamination, attrition, drift |
The dimension that trips people up is timing. An A/B test happens before the decision to launch. A holdout typically happens after — once the team has decided the feature is good enough to ship to everyone, they carve out a tiny piece for long-term measurement instead of going to 100 percent immediately.
A worked example: checkout redesign
Imagine you work on payments at a marketplace, and the team rebuilt the checkout flow. A two-week A/B test runs: control gets the old checkout, treatment gets the new one. The new checkout shows a +3 percent lift in conversion rate, significant at p < 0.01, with no negative downstream metrics. The team ramps to 100 percent.
Six months later leadership asks: "How much revenue did the new checkout actually add over the year?" With only the A/B result, the honest answer is "we measured +3 percent in a two-week window, six months ago, and we are extrapolating". Extrapolation is fine when stakes are low, but a checkout change is anything but low-stakes.
This is where a holdout earns its keep. Suppose instead of going to 100 percent, the team ramped to 99 percent and kept a 1 percent holdout on the old checkout. After six months you compare. The lift might be +5 percent because initial change aversion wore off and sustained behavior change kicked in. Or the lift might decay to +1 percent because the novelty wore off once users got used to the new flow. Either way you have a real number, not a guess.
Here is the SQL spine for computing holdout lift on a revenue metric. Grouping is on experiment assignment, not on whether the user actually saw the feature.
```sql
WITH assignment AS (
    SELECT
        user_id,
        variant,        -- 'holdout' or 'treatment'
        assigned_at
    FROM experiments.checkout_holdout_2026
),
revenue AS (
    SELECT
        user_id,
        SUM(order_amount_usd) AS revenue_6mo
    FROM payments.orders
    -- Half-open range so orders placed during June 30 are not cut off
    WHERE order_ts >= DATE '2026-01-01'
      AND order_ts <  DATE '2026-07-01'
    GROUP BY user_id
)
SELECT
    a.variant,
    COUNT(DISTINCT a.user_id)        AS n_users,
    AVG(COALESCE(r.revenue_6mo, 0))  AS arpu_6mo,
    SUM(COALESCE(r.revenue_6mo, 0))  AS total_revenue_6mo
FROM assignment a
LEFT JOIN revenue r USING (user_id)
WHERE a.assigned_at < DATE '2026-01-01'
GROUP BY a.variant;
```

Treatment ARPU minus holdout ARPU is your incremental lift per user, and multiplying by the treatment group size gives total dollars added by the new checkout.
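A point estimate alone hides how noisy a 1 percent holdout can be. The sketch below turns per-variant summary statistics into a lift with a confidence interval, using a normal approximation with unpooled variances. It assumes you also pull a per-variant standard deviation of revenue (for example with `STDDEV_SAMP`, which the query above does not include); the function name and output keys are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def holdout_lift(mean_t, sd_t, n_t, mean_h, sd_h, n_h, confidence=0.95):
    """Incremental lift (treatment minus holdout) with a normal-approx CI."""
    diff = mean_t - mean_h
    # Unpooled standard error: the tiny holdout dominates this term.
    se = sqrt(sd_t ** 2 / n_t + sd_h ** 2 / n_h)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return {
        "abs_lift": diff,
        "ci_low": diff - z * se,
        "ci_high": diff + z * se,
        "rel_lift": diff / mean_h,
        "total_added": diff * n_t,  # lift per user times treated users
    }
```

Almost all of the uncertainty comes from the holdout side, since its sample is so much smaller. That is the price of the lopsided split, and it is why holdout readouts usually carry wider intervals than a 50/50 test on the same traffic.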
Long-term holdouts in big tech
Google, Meta, Netflix, Amazon, Uber, and Stripe all maintain long-running holdouts. A small percentage of users — often 1 percent — is held out from a major surface: personalized ranking, promotional emails, ad targeting, notifications, the recommendations carousel.
The point is rarely to evaluate a single feature. It is to estimate the cumulative impact of a whole product capability. "What is the lifetime value of personalized recommendations?" "What is the long-run lift from push notifications?" "How much does the entire growth team contribute to MAU?" Long-term holdouts answer these; A/B tests cannot.
The risks are real. Holdout cohorts drift as the surrounding product evolves — the old experience starts to feel broken in ways no one notices. There is also the ethical question of withholding value for months. Most companies cap holdouts at a single-digit percentage and rotate users in and out.
Incrementality, the real reason holdouts exist
A holdout measures incremental impact:
incremental_impact = metric(treated) - metric(holdout)
That subtraction is doing all the heavy lifting. Without a randomized holdout you are stuck with observational comparisons ("users who use the feature versus users who do not"), which are wide open to selection bias. People who choose to use a feature are different from people who do not, and that difference shows up in every downstream metric whether the feature did anything or not.
A randomized holdout sidesteps this. Because assignment is random, the only systematic difference between holdout and treatment is the feature itself. An A/B test gives the same guarantee but only for the short window of the test. A holdout extends it to the long horizon where retention, LTV, and habit formation actually live.
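The selection-bias argument can be checked with a toy simulation. In the sketch below, the feature has zero true effect by default, but more engaged users are more likely to adopt it. The data-generating process (engagement scores, adoption probabilities, a 10 percent holdout) is entirely made up for illustration.

```python
import random
from statistics import mean


def compare_estimates(n_users=50_000, holdout_share=0.10,
                      true_effect=0.0, seed=1):
    """Contrast a naive adopter comparison with a randomized holdout."""
    rng = random.Random(seed)
    holdout, adopters, non_adopters = [], [], []
    for _ in range(n_users):
        engagement = rng.gauss(0, 1)
        outcome = 10 + 2 * engagement + rng.gauss(0, 1)
        if rng.random() < holdout_share:
            holdout.append(outcome)  # randomly withheld, cannot adopt
            continue
        # Engaged users opt in far more often: this is the selection effect.
        adopts = rng.random() < (0.8 if engagement > 0 else 0.2)
        if adopts:
            adopters.append(outcome + true_effect)
        else:
            non_adopters.append(outcome)
    treatment = adopters + non_adopters
    return {
        # Observational: adopters vs non-adopters inside the treated group.
        "naive": mean(adopters) - mean(non_adopters),
        # Randomized: everyone assigned to treatment vs the holdout.
        "randomized": mean(treatment) - mean(holdout),
    }


if __name__ == "__main__":
    print(compare_estimates())
```

With no true effect, the naive comparison still reports a sizeable "lift" because adopters were more engaged to begin with, while the randomized comparison lands near zero. When `true_effect` is non-zero, the randomized number reflects the effect averaged over everyone assigned, adopters and non-adopters alike.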
Alternatives when a holdout will not fly
Sometimes a holdout is impossible. The feature might be infrastructure-level (a payments rail upgrade, a new search backend) where you cannot partition users. Legal might object to withholding a safety feature. The product might be so small that a 1 percent holdout is useless.
Geo experiments are the most popular alternative. You launch the feature in one set of cities, states, or countries and withhold it from another, then compare aggregate metrics. With matched geographies and a difference-in-differences analysis, this can give a credible causal estimate without user-level randomization, provided the two sets of geographies would have trended alike without the launch.
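The difference-in-differences arithmetic is short enough to sketch directly. The weekly revenue figures below are invented for illustration, and the estimate is only as good as the parallel-trends assumption behind it.

```python
from statistics import mean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in launch geos minus change in comparison geos."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Hypothetical weekly revenue (in $k) before and after the launch date.
launch_geos_pre = [100, 102, 98]
launch_geos_post = [112, 110, 114]
comparison_geos_pre = [90, 92, 88]
comparison_geos_post = [95, 97, 93]

if __name__ == "__main__":
    print(diff_in_diff(launch_geos_pre, launch_geos_post,
                       comparison_geos_pre, comparison_geos_post))
```

Subtracting the comparison geos' change removes whatever moved both groups at once, such as seasonality or a marketing push, leaving the part attributable to the launch.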
Marketing mix modeling (MMM) is a step further from clean causal inference. It is an econometric model on aggregated time-series data, used by large advertisers to allocate budget across channels. No holdout required, but the price is heavy modeling assumptions and wider confidence intervals.
Practical sizing and duration
Sizing comes down to two numbers — the variance of the metric and the lift you want to detect. For a high-traffic product with millions of users, a 1 percent holdout is a comfortable 10,000 to 100,000 users, plenty for revenue, conversion, and retention metrics. For a 100,000-user product, a 1 percent holdout is 1,000 users — borderline for anything noisier than basic conversion, and geo experiments may be a better fit.
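A rough way to put numbers on "borderline" is the minimum detectable effect for an unequal split. The sketch below uses the standard two-sample normal approximation with a two-sided test; the 5 percent significance level and 80 percent power are conventional defaults assumed here, not values from the article.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sd, n_treatment, n_holdout,
                              alpha=0.05, power=0.80):
    """Smallest absolute difference in means detectable at the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # With a 99/1 split, 1/n_holdout dominates the standard error.
    return (z_alpha + z_power) * sd * sqrt(1 / n_treatment + 1 / n_holdout)


if __name__ == "__main__":
    sd_conversion = sqrt(0.10 * 0.90)  # binary metric at a 10% base rate
    print(minimum_detectable_effect(sd_conversion, 99_000, 1_000))
    print(minimum_detectable_effect(sd_conversion, 9_900_000, 100_000))
```

For a conversion metric with a 10 percent base rate, a 1,000-user holdout can only detect a gap of roughly 2.7 percentage points, while a 100,000-user holdout gets that down to roughly 0.27. Growing the treatment group barely helps, because the holdout size sets the floor.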
Duration depends on what you are measuring. A holdout aimed at short-term impact only needs a month. One aimed at LTV, retention, or habit formation needs six to twelve months minimum. A permanent holdout is sometimes used for surfaces like recommendations or notifications, where the goal is an ongoing baseline. CUPED variance reduction (see the SQL CUPED walkthrough under Related reading) can shrink the required size by leaning on pre-experiment behavior.
Common pitfalls
The first and most common pitfall is treating an A/B test result as if it told you long-term impact. A 3 percent lift in two weeks does not mean a 3 percent lift over a year. Novelty effects, learning curves, and seasonal patterns can warp the long-run number in either direction. If long-term impact matters, you need a holdout or at minimum a periodic re-test, not a single short A/B readout.
A second trap is running a holdout without proper randomization. If holdout users were selected by region, signup date, device, or any non-random rule, the comparison is contaminated by selection bias and the incrementality estimate is meaningless. The whole point of a holdout is the random assignment; skipping it turns the analysis into an observational study with all the usual confounds.
A third pitfall is making the holdout window too short for the metric. Retention, LTV, and habit-driven outcomes need months to materialize, and a four-week holdout will not show what a six-month holdout would. Teams that rush long-horizon holdouts end up reporting a weak number, concluding the feature was a flop, and rolling it back prematurely.
A fourth issue is contamination. Holdout users can stumble onto the new feature through shared accounts, a friend who has it, or an A/B platform misconfiguration. Even 5 percent contamination cuts your measured incremental lift by roughly that fraction. Audit logs, periodic spot-checks, and a "feature actually seen" event are the standard defenses.
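The dilution arithmetic can be sketched as follows. It assumes exposure is all-or-nothing and that an exposed holdout user gets the same effect as a treated user, which is a simplification; the function names and parameters are illustrative.

```python
def observed_lift(true_lift, holdout_exposed=0.0, treatment_exposed=1.0):
    """Lift you would measure when some users cross over.

    holdout_exposed: share of holdout users who saw the feature anyway.
    treatment_exposed: share of treatment users who actually saw it.
    """
    return true_lift * (treatment_exposed - holdout_exposed)


def corrected_lift(measured_lift, holdout_exposed=0.0, treatment_exposed=1.0):
    """Rescale a measured lift by the exposure gap between the groups."""
    gap = treatment_exposed - holdout_exposed
    if gap <= 0:
        raise ValueError("treatment must be more exposed than holdout")
    return measured_lift / gap


if __name__ == "__main__":
    print(observed_lift(4.0, holdout_exposed=0.05))
```

The exposure shares have to come from logging, which is one reason a "feature actually seen" event is worth instrumenting: without it there is nothing to plug into the correction.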
A fifth pitfall is holding out from a mission-critical feature. If the feature drives retention, withholding it from 1 percent of users for a year means those users churn faster and the holdout cohort shrinks, an effect called differential attrition. The users who remain are no longer representative, so any comparison restricted to surviving users is biased. The direction depends on who drops out, and the true lift is often understated because the holdout users who stay tend to be the most committed. The defenses are intent-to-treat analysis and capping the holdout duration before attrition gets out of hand.
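A toy example shows how intent-to-treat differs from a survivors-only comparison. The records below are invented: each user is a pair of revenue in the measurement window and a flag for whether they are still active at readout.

```python
from statistics import mean


def itt_mean(users):
    """Average over every assigned user, churned or not."""
    return mean([revenue for revenue, _ in users])


def survivor_mean(users):
    """Average over users still active at readout only."""
    return mean([revenue for revenue, active in users if active])


# Hypothetical cohorts: the holdout loses more users over the window.
treatment = [(20.0, True)] * 9 + [(0.0, False)] * 1
holdout = [(20.0, True)] * 6 + [(0.0, False)] * 4

itt_lift = itt_mean(treatment) - itt_mean(holdout)
survivor_lift = survivor_mean(treatment) - survivor_mean(holdout)

if __name__ == "__main__":
    print("intent-to-treat lift:", itt_lift)
    print("survivors-only lift: ", survivor_lift)
```

In this made-up data the two estimates diverge sharply, because the feature's value shows up as retained users rather than as higher spend per active user. Other dropout patterns can push the survivors-only number in a different direction, which is why the comparison should be anchored to the original assignment.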
Related reading
- A/B testing peeking — the most common interview mistake
- CUPED variance reduction in SQL
- Conversion uplift in SQL
- Difference-in-differences in SQL
- Effect size in SQL
- Delta method in SQL
- SQL window functions interview questions
FAQ
Is a holdout always about measuring incrementality?
Effectively yes. The core reason to give up feature value for some users for a long time is to estimate how much value the feature is generating. Adjacent uses — sanity checks against measurement bugs, regression detection — all reduce to the same incrementality question. If you do not need an incremental estimate, you do not need a holdout.
Can you run an A/B test and a holdout for the same feature?
Yes, and large teams routinely do. The A/B test happens before launch to decide whether to ship. If the answer is ship, the team ramps to 99 percent rather than 100 and keeps a 1 percent long-term holdout for six to twelve months. This gives the short-term win-loss decision and the long-term incrementality estimate without forcing a choice.
When should you end a holdout?
When the measurement is stable enough to make the business decision, or when the cost of withholding the feature outweighs the information value. For habit-formation metrics that means at least six months. For retention or LTV, twelve months is more honest. Some companies run permanent rotating holdouts where users cycle in and out every quarter for a continuous baseline.
Do holdouts only work for big tech?
No. Marketing teams in retail, banking, and consumer brands run holdouts constantly, often through geo experiments rather than user-level randomization. The logic — give one group the thing, withhold from another, measure the gap — works at any scale. The constraint is statistical power: small companies might need 6 to 12 months on a 5 percent holdout to detect anything.
How is a holdout different from an A/A test?
An A/A test splits users into two groups but gives both the same experience. You expect no difference; if you see one, your platform has a bug. A holdout deliberately gives the two groups different experiences and expects a difference. A/A tests validate the plumbing, holdouts measure impact. They are complementary, not substitutes.