May 18, 2026·13 min read

How a holdout test works

Prep A/B testing and statistics

300+ questions on experiment design, sample size, p-values, and pitfalls.

Contents:

Why holdouts exist
What a holdout actually is
Types of holdout
How to configure one
SQL for the readout
Common pitfalls
Statistical power and sizing
Holdout in interviews
Related reading
FAQ

Why holdouts exist

You ship a feature. The A/B reads +5% on 30-day retention, the launch review goes well, the team moves on. Three months later your VP asks the obvious question: is that +5% still there, or did it fade? A standard A/B cannot answer this. The experiment ended when you ramped to 100%, the control group dissolved, and from that point on every user is on the new build.

That gap is what holdout tests fill. A holdout is a small slice of users — typically 1 to 5 percent — who never receive new features. While the other 95 to 99 percent of the product evolves week by week, this group sits frozen at the baseline. Six months later you compare the two and get a single number: the cumulative impact of everything shipped in that window. Not feature A or feature B in isolation. The whole portfolio.

This is the only honest way to answer "did our roadmap move the metric?" at the org level. At Meta, Netflix, Airbnb, and DoorDash, holdouts are standard infrastructure. Showing you understand them in an analytics or product DS interview is a clean signal that you have worked on something past the prototype stage.

What a holdout actually is

A holdout is a permanent control group. Permanent is the load-bearing word. Unlike a normal A/B where assignment lasts the duration of the test, holdout assignment lasts months or years. A user assigned to the global holdout on January 1 sees the January 1 product on March 1, on June 1, and possibly into the following year. Every feature flag checks the holdout marker first and serves the legacy path if the user is in it.

The comparison is straightforward: pick a north-star metric — retention, revenue per user, daily active rate — and compute it on the holdout versus everyone else. If your team launched four features that each individually A/B tested at +1 percent, you might naively expect the holdout to show +4 percent. It almost never does. Features overlap, deprecate each other, fight for the same surface area, and lose novelty. The honest number is whatever the holdout says, and it is usually smaller than the sum of the A/B deltas — sometimes by a factor of two. That gap between "sum of A/B wins" and "what the holdout reads" is the most important number a product analytics team can put in front of leadership.

Types of holdout

There are three patterns you will see in practice and they are not interchangeable.

A global holdout is a single group excluded from every feature shipped by every team. This is the cleanest design and gives you a single org-level number. It is also the most expensive — those users get the worst experience indefinitely. Companies that run global holdouts keep them small, usually 1 to 2 percent, and rotate users out periodically so no one is stuck on the frozen build forever.

A team holdout scopes the exclusion to one team's surface area. Growth's holdout never receives Growth's experiments, but users in Search's holdout might still receive them. This is more politically tractable in large orgs because each team owns its own holdout, but the math gets messier: you cannot easily aggregate to a single product-level number because users overlap across teams. Airbnb and Uber both run something close to this model.

A feature-level long-term holdout is temporary. After a feature ships and the A/B ends, you keep a small slice — 0.5 to 1 percent — on the control variant for another three to six months specifically to measure decay. This is the right tool when you suspect a specific feature's win was driven by novelty rather than durable behavior change. It costs less than a permanent holdout because you tear it down after the question is answered.

How to configure one

The mechanics are not complicated, but they have to be implemented consistently or the whole thing fails. Start with assignment: hash the user_id (or device_id for logged-out flows) into a bucket between 0 and 99, and assign bucket 0 (for a 1 percent holdout) or buckets 0 and 1 (for 2 percent) to the global holdout. This must be deterministic: the same ID lands in the same bucket every time, across sessions and, for logged-in users, across devices. LaunchDarkly, Statsig, Eppo, Optimizely, and in-house systems at Meta and Airbnb all expose this primitive directly.
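The assignment step can be sketched in a few lines. This is a minimal illustration, not how any particular vendor implements it: the salt string, the function names, and the choice of SHA-256 are all assumptions made for the example. The one thing that matters is that the hash is stable across processes, which rules out Python's built-in `hash()` for strings because it is randomized per process.

```python
import hashlib

HOLDOUT_SALT = "global-holdout-v1"  # hypothetical salt; any fixed string works
HOLDOUT_BUCKETS = frozenset({0})    # bucket 0 only is ~1%; use {0, 1} for ~2%


def bucket_for(unit_id: str, salt: str = HOLDOUT_SALT, n_buckets: int = 100) -> int:
    """Map a user_id or device_id to a stable bucket in [0, n_buckets)."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    # Read the first 8 bytes as an unsigned integer, then reduce to a bucket.
    return int.from_bytes(digest[:8], "big") % n_buckets


def in_global_holdout(unit_id: str) -> bool:
    """True if this ID falls in one of the buckets reserved for the holdout."""
    return bucket_for(unit_id) in HOLDOUT_BUCKETS
```

Because the bucket depends only on the salt and the ID, any service that knows both can recompute the assignment without a lookup table.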

Once assignment is wired up, every feature flag needs to check the holdout marker before serving the new variant: if user is in holdout, serve legacy; otherwise apply the normal experiment logic. Build this as a wrapper around your flag SDK so an engineer cannot ship a feature that bypasses the check. The most common failure mode for holdouts is not statistical — it is one engineer forgetting to wrap their flag.
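A wrapper of that kind might look like the sketch below. Everything here is hypothetical: `HoldoutAwareFlags`, the `"legacy"` variant name, and the `evaluate` callable standing in for your real flag SDK are illustrative, not any vendor's API. The optional exempt set is one way to let changes that must reach everyone skip the holdout check.

```python
from typing import Callable, FrozenSet

LEGACY = "legacy"  # hypothetical name for the frozen-baseline variant


class HoldoutAwareFlags:
    """Wraps a flag evaluator so the holdout check always runs first."""

    def __init__(
        self,
        is_holdout: Callable[[str], bool],
        evaluate: Callable[[str, str], str],
        exempt_flags: FrozenSet[str] = frozenset(),
    ) -> None:
        self._is_holdout = is_holdout  # e.g. a hash-bucket check on user_id
        self._evaluate = evaluate      # stand-in for the real flag SDK call
        self._exempt = exempt_flags    # flags that ship to everyone

    def variant(self, flag_key: str, user_id: str) -> str:
        # Holdout users get the legacy path unless the flag is exempt.
        if flag_key not in self._exempt and self._is_holdout(user_id):
            return LEGACY
        return self._evaluate(flag_key, user_id)


# Toy wiring: one holdout user, and an SDK stub that always serves "new".
flags = HoldoutAwareFlags(
    is_holdout=lambda uid: uid == "u-holdout",
    evaluate=lambda key, uid: "new",
    exempt_flags=frozenset({"security-patch"}),
)
```

If feature code can only reach flags through `variant`, the holdout check cannot be forgotten on a per-flag basis.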

Reporting needs to be a recurring view: north-star metric, sliced by holdout vs treatment, with rolling 7, 30, and 90-day windows. Most companies surface this in a weekly leadership dashboard.

SQL for the readout

The analysis query is almost embarrassingly simple. The complexity lives in instrumentation, not the readout. Here is a stripped-down version for 30-day retention:

SELECT
    CASE WHEN is_holdout THEN 'holdout' ELSE 'treatment' END AS grp,
    COUNT(DISTINCT user_id)                                  AS users,
    AVG(retained_30d)                                        AS retention_rate,
    AVG(revenue_30d)                                         AS arpu_30d
FROM dim_users
WHERE signup_date BETWEEN '2026-01-01' AND '2026-03-31'
GROUP BY 1;

For a proper readout you also want a confidence interval on the delta. A two-proportion z-test works for retention, and a bootstrap on the difference of means works for revenue (heavy-tailed). At larger samples a normal approximation is fine:

-- Per-group mean, sample variance, and size for 30-day retention.
-- Assumes retained_30d is stored as a numeric 0/1 flag.
WITH stats AS (
  SELECT
    is_holdout,
    AVG(retained_30d)                                AS p,
    VARIANCE(retained_30d)                           AS var,
    COUNT(*)                                         AS n
  FROM dim_users
  WHERE signup_date BETWEEN '2026-01-01' AND '2026-03-31'
  GROUP BY 1
)
-- Treatment minus holdout, with a 95% normal-approximation interval.
SELECT
  (t.p - h.p)                                                              AS lift,
  (t.p - h.p) - 1.96 * SQRT(t.var/t.n + h.var/h.n)                         AS ci_low,
  (t.p - h.p) + 1.96 * SQRT(t.var/t.n + h.var/h.n)                         AS ci_high
FROM stats t
CROSS JOIN stats h
WHERE NOT t.is_holdout AND h.is_holdout;
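For the revenue side, where the normal approximation is shakier, a percentile bootstrap on the difference of means can be sketched as follows. This is a plain standard-library illustration with a hypothetical function name. It assumes you have already pulled per-user revenue into two lists, and at real treatment-group sizes you would subsample or use a vectorized library rather than pure Python.

```python
import random
from statistics import fmean


def bootstrap_diff_ci(treatment, holdout, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(treatment) - mean(holdout).

    Returns (point_estimate, ci_low, ci_high).
    """
    rng = random.Random(seed)  # fixed seed so the readout is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, at its original size.
        t = rng.choices(treatment, k=len(treatment))
        h = rng.choices(holdout, k=len(holdout))
        diffs.append(fmean(t) - fmean(h))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_boot)]
    high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return fmean(treatment) - fmean(holdout), low, high
```

An interval that excludes zero is the bootstrap analogue of the significance check in the SQL above.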

The production query should also break the delta down by signup cohort, country, and platform — long-term holdouts often show very different effects on iOS vs Android, and a single aggregate number can mask segment-level reversals.

Common pitfalls

The most expensive holdout mistake is a leak. A leak happens when one team ships a feature that does not respect the holdout flag, so some users in the holdout end up with the new behavior anyway. Once that happens, your holdout no longer measures what you think it measures: the "control" group is contaminated with treatments. Leaks are usually silent. Nothing breaks, the dashboards still render, and the measured delta between treatment and holdout is quietly biased toward zero. The defense is process: a single shared SDK wrapper, mandatory code-owner review on every flag change, and an automated weekly check comparing feature exposure rates between holdout and non-holdout.
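The automated exposure check could be as small as the sketch below. The `Exposure` record, the `find_leaks` name, and the 0.1 percent tolerance are all assumptions for illustration. In a clean setup the holdout exposure rate for any non-exempt feature should be essentially zero, so the tolerance only absorbs logging noise.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Exposure:
    feature: str
    holdout_exposed: int  # holdout users logged as seeing the new variant
    holdout_total: int
    other_exposed: int    # non-holdout users logged as seeing it
    other_total: int


def find_leaks(rows: List[Exposure],
               max_holdout_rate: float = 0.001) -> List[Tuple[str, float, float]]:
    """Return (feature, holdout_rate, other_rate) for suspected leaks."""
    leaks = []
    for r in rows:
        holdout_rate = r.holdout_exposed / r.holdout_total if r.holdout_total else 0.0
        other_rate = r.other_exposed / r.other_total if r.other_total else 0.0
        if holdout_rate > max_holdout_rate:
            leaks.append((r.feature, holdout_rate, other_rate))
    return leaks


# Synthetic example rows, not real data.
rows = [
    Exposure("new-feed", 0, 1000, 49500, 99000),
    Exposure("new-search", 120, 1000, 60000, 99000),
]
leaks = find_leaks(rows)
```

Here only `new-search` is flagged, because 12 percent of its holdout users were exposed to the new variant.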

Sizing a holdout too small is the second classic failure. If you put 0.5 percent of a 200,000-user app into holdout, you have a control group of 1,000 users and noise dominates the signal. The dashboard bounces 3 to 5 percentage points week to week and you conclude nothing. Size against the smallest lift you need to detect — if the cumulative effect of your roadmap is +2 percent on retention, you need enough users in the holdout to detect that confidently.
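The noise claim is easy to check with the normal approximation for a proportion. The sketch below assumes a 40 percent baseline retention rate purely for illustration. Under that assumption a 1,000-user holdout has a 95 percent interval of roughly plus or minus 3 percentage points, which matches the week-to-week bounce described above.

```python
from math import sqrt


def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation CI for a proportion p on n users."""
    return z * sqrt(p * (1 - p) / n)


# Assumed 40% baseline retention; holdout sizes chosen for comparison.
small_holdout = ci_half_width(0.40, 1_000)    # about 0.030, i.e. ~3.0 points
large_holdout = ci_half_width(0.40, 100_000)  # about 0.003, i.e. ~0.3 points
```

The half-width shrinks with the square root of the holdout size, so a 100 times larger holdout buys a 10 times tighter interval.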

Comparing the holdout against a pre-launch baseline instead of the contemporaneous treatment group is a third trap that ruins the analysis. The whole point of a holdout is that it gives you a parallel control — both groups face the same seasonality, the same external events, the same marketing campaigns. Comparing today's holdout against last year's pre-feature numbers is pre-post analysis with no controls, and seasonality alone can produce double-digit swings.

A subtler problem is operational: holdout users are receiving a worse product than everyone else. For consumer apps this is usually fine — the delta is small and the users do not know. For high-stakes domains (financial, medical, accessibility) you need an explicit policy on what improvements never get held out. Most teams carve out an exception list — security fixes, accessibility improvements, critical bug fixes — that ships to everyone including the holdout.

The last pitfall is treating the holdout number as the only number that matters. The holdout tells you cumulative impact but not which feature drove it. You still need per-feature A/B tests to make ship decisions and understand what is working. The holdout is the audit; the A/Bs are the operating telemetry.

Statistical power and sizing

The math on holdout sizing is the same as any two-sample test, just with a very unbalanced split. If you have 10 million users and a 1 percent holdout, you have 100,000 in control and 9.9 million in treatment. On a 40 percent baseline retention rate measured over a 30-day window, that is comfortably enough to detect a 1 percentage point absolute lift: at 80 percent power and a 5 percent significance level, the minimum detectable shift is roughly 0.4 to 0.5 percentage points.
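The sizing arithmetic above can be reproduced with the standard two-sample formula. This is a sketch under stated assumptions: a two-sided test at a 5 percent significance level, 80 percent power, and the baseline rate used for the variance in both groups. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def mde_absolute(p: float, n_holdout: int, n_treatment: int,
                 alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable absolute shift in a proportion for an unbalanced split."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = NormalDist().inv_cdf(power)
    # Standard error of the difference, using the baseline rate for both arms.
    se = sqrt(p * (1 - p) * (1 / n_holdout + 1 / n_treatment))
    return (z_alpha + z_power) * se


# 10M users with a 1% holdout, 40% baseline retention.
mde_1pct = mde_absolute(0.40, 100_000, 9_900_000)  # about 0.0044 (0.44 points)
```

Under these assumptions the result lands at about 0.44 percentage points. Almost all of the standard error comes from the 100,000-user holdout, so adding more treatment users barely moves it.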

The bottleneck is almost never the treatment side, because it is enormous. The bottleneck is the holdout, which is small by design. Below roughly 50,000 users in the holdout, you lose the ability to detect anything under a few percent and the dashboard becomes too noisy for leadership reporting. For apps below 100,000 weekly actives, a permanent holdout is probably the wrong tool — use sharper individual A/B tests until you cross the million-user mark.

Holdout in interviews

For a senior analytics or product DS role, the holdout question shows up in two forms. The first is the open-ended "how would you measure your team's impact over a year" prompt, where the strong answer goes straight to a holdout design. The second is the technical "what is the difference between a holdout and an A/B test", which tests whether you understand that holdouts measure cumulative long-term impact while A/Bs measure individual feature deltas in a fixed window.

A good answer covers the design (1 to 5 percent permanent control), the configuration (every flag checks the holdout marker), the analysis (north-star metric, treatment minus holdout, with a confidence interval), and the failure modes (leaks, undersizing, comparing against pre-launch baselines). If you can also speak to the org-level use case — running this as the audit that backs the OKR review — you sound like someone who has shipped this in production.

If you want to drill questions like this at a real interview pace, NAILDD ships product analytics scenarios in the same format every day.

FAQ

What is the optimal holdout size?

For most consumer apps with millions of users, 1 to 2 percent is the sweet spot. It is large enough to give tight confidence intervals on cumulative effects of a few tenths of a percentage point, and small enough that the opportunity cost stays manageable. Going above 5 percent rarely buys meaningful statistical gain and materially degrades product velocity, because every feature must justify itself against a larger held-out cohort. Below 0.5 percent the dashboards become too noisy to trust week to week.

Does every company need a holdout test?

No. Holdouts make sense when three conditions are true: a user base large enough that 1 to 2 percent is tens of thousands of people, continuous feature shipping so cumulative impact is a real question, and a north-star metric tracked over multi-month windows. Consumer apps at scale meet all three. B2B SaaS with 500 enterprise customers does not — you cannot meaningfully hold out 5 customers, and the right tool there is account-level case studies. Early-stage products under 100,000 users should skip holdouts and invest in sharper A/B testing.

Can you run a holdout and individual A/B tests at the same time?

Yes, and you should. The two tools answer different questions. Individual A/Bs tell you whether a specific feature is good enough to ship and let you attribute roadmap decisions to outcomes. The holdout tells you whether the sum of those decisions is actually moving the org-level metric. The standard production setup runs both: A/Bs feed weekly ship decisions, and the holdout feeds the quarterly OKR review. The implementations are orthogonal — an A/B variant is a temporary split within the treatment population, and the holdout is excluded from both arms of every A/B.

What is the difference between a holdout and a back-test?

A holdout is forward-looking: you set up the group today, ship features in the months ahead, and read the cumulative effect later. A back-test is retrospective: you reconstruct what would have happened if a feature had not shipped, usually by modeling. Holdouts are causally clean because the control group is real users in real time experiencing real seasonality. Back-tests are only as good as the modeling assumptions. When you can afford the opportunity cost, holdouts are always the stronger evidence.

How long should a holdout run before you read the result?

A minimum of 90 days. Shorter and you measure novelty rather than durable behavior change — many features look great in week one and decay sharply by week six. Six to twelve months is the sweet spot for org-level reads, because by then you have shipped enough into treatment for the cumulative effect to compound. Beyond a year, rotation policies matter — you do not want the same users in the frozen experience for multiple consecutive years.