How to design an A/B test step by step


Why design beats execution

A bad A/B test is not a bug — it is a plan you wrote in your head five minutes before the launch button. By the time the experiment runs, every load-bearing decision is locked in: metric, win bar, duration, bucketing. Statistics at the end will not save those. You ship a feature that hurts the business or kill one that would have helped, and the only person who finds out is the next analyst who has to re-run it.

Senior interviewers at Stripe, Airbnb, Netflix, and DoorDash know this, which is why "design an A/B test for feature X" is one of the most common product DS questions. They want to see whether you know which decisions must be made before traffic starts flowing, and which can wait. The ten-step workflow below is the scaffold that gets you through that question — and through a real launch — without forgetting anything.

The ten-step workflow

Sequential on paper, iterative in practice. Steps 2 to 5 loop: pick a metric, realize the sample size is impossible, walk back the MDE, swap in a proxy, re-check. That looping is normal. What is not normal is jumping from "let us test this" to "let us launch this" without writing anything down.

Step 1: Write the hypothesis

A hypothesis is one sentence with a mechanism: "if we change X, then Y will improve because Z." X is concrete and shippable. Y is a metric you already track. Z is the causal story you believe. Without Z, you are guessing.

Bad: "the new checkout will be better." Good: "if we remove the optional phone field, completed-purchase rate will rise by at least 1.5 pp because fewer mobile users abandon when typing a phone number on a small keyboard." The second is falsifiable: if completion moves only 0.2 pp, you learned the phone field is not the blocker, and that is worth something.

Step 2: Pick the primary metric

One primary. Not two. It must be measurable in your warehouse today, sensitive enough to move within the experiment duration, and pointed at the business outcome you care about. Revenue per user is the cleanest business metric but has heavy variance, so most teams use a proxy like conversion or activation and accept the proxy can drift from revenue over time. Define it as a SQL expression now, not on launch day.

-- Primary metric: lock this definition before launch
WITH assignments AS (
  SELECT user_id, variant, assigned_at
  FROM experiment_assignments
  WHERE experiment_id = 'checkout_phone_field_v1'
),
outcomes AS (
  SELECT
    a.user_id,
    a.variant,
    MAX(CASE WHEN o.order_id IS NOT NULL THEN 1 ELSE 0 END) AS converted
  FROM assignments a
  LEFT JOIN orders o
    ON o.user_id = a.user_id
   AND o.created_at BETWEEN a.assigned_at AND a.assigned_at + INTERVAL '7 days'
  GROUP BY a.user_id, a.variant
)
SELECT variant, COUNT(*) AS users, AVG(converted) AS conversion_rate
FROM outcomes
GROUP BY variant;

Step 3: Define secondary and guardrail metrics

Secondaries explain the primary. If conversion rises but AOV drops because customers are rushing through, you want to know. Guardrails must not get worse: refund rate, support tickets, crash rate, p95 latency. A treatment that lifts the primary 5% but triples refunds is a loss, and you only catch that if refunds are pre-declared. Three to five guardrails maximum.

Step 4: Compute sample size

Four inputs: significance (usually 0.05), power (usually 0.80), baseline rate, and MDE — the smallest lift that would actually change a business decision. If a 0.1 pp lift on a 3% baseline would not ship anyway, do not size for it.

import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.045          # current checkout conversion
mde_relative = 0.05       # 5% relative lift
treatment = baseline * (1 + mde_relative)

# Cohen's h effect size for the difference between two proportions
effect = proportion_effectsize(treatment, baseline)
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.80, alternative='two-sided',
)
# Round up: truncating would leave the test slightly underpowered
print(f"Need {math.ceil(n_per_arm):,} users per arm")

If the answer is "800k per arm and we have 30k weekly," solve that before launch — widen the MDE, change the analysis unit, or use variance reduction like CUPED.

Step 5: Randomization unit and strategy

The randomization unit must match the analysis unit. If a user can see treatment from multiple devices, randomize on user_id, not session. For marketplaces with network effects, cluster-randomize whole cities or supply pools to stop variants leaking into each other. Hash the unit ID with a deterministic function so assignment is stable. Stratify on platform or country if those dimensions move the primary a lot; stratification cuts variance for free.
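
The deterministic hashing described above can be sketched as follows. This is a minimal illustration, not any particular platform's implementation: the function name `assign_variant`, the salt format, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from collections import Counter


def assign_variant(unit_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Map a randomization unit to a variant with a stable, salted hash."""
    # Salting with the experiment ID keeps buckets independent across
    # experiments, so the same users do not always land in treatment.
    key = f"{experiment_id}:{unit_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 8 hex characters as an integer in [0, 2**32).
    bucket = int(digest[:8], 16)
    return variants[bucket % len(variants)]


# Same input always gives the same variant, on any machine.
print(assign_variant("user_42", "checkout_phone_field_v1"))

# Rough balance check over hypothetical user IDs.
counts = Counter(
    assign_variant(f"user_{i}", "checkout_phone_field_v1") for i in range(10_000)
)
print(dict(counts))
```

Because assignment depends only on the unit ID and the experiment ID, a user who returns on a different device while logged in sees the same variant, which is the stability property the step asks for.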

Step 6: Lock the duration

Two weeks is the floor for any consumer experiment because weekly seasonality matters — weekend buyers behave differently from weekday buyers, and a five-day run systematically misses one population. Pre-register the stopping rule: fixed N, fixed date, or a sequential test with proper alpha-spending.
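
To turn a sample size into a duration, a back-of-envelope calculation like the one below is usually enough. It is a sketch under a simplifying assumption: every week brings roughly the same number of new, previously unenrolled users. The function name and parameters are hypothetical.

```python
import math


def weeks_required(n_per_arm: int, arms: int, weekly_eligible_users: int,
                   traffic_share: float = 1.0, min_weeks: int = 2) -> int:
    """Whole weeks needed to enroll the required sample, never below the floor."""
    total_needed = n_per_arm * arms
    # Only a share of eligible traffic may be allocated to the experiment.
    weekly_enrolled = weekly_eligible_users * traffic_share
    weeks = math.ceil(total_needed / weekly_enrolled)
    # Rounding up to whole weeks keeps every weekday equally represented.
    return max(weeks, min_weeks)


print(weeks_required(n_per_arm=50_000, arms=2, weekly_eligible_users=30_000))
```

If returning users make up most of your weekly traffic, unique enrollment slows down after week one, so treat the result as a lower bound.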

Step 7: Launch with safeguards

Ramp: 1% for one day, then 10%, 50%, full. The ramp catches crashes and billing bugs before you commit the whole experiment to a broken treatment. Use a feature flag system you can flip off in seconds. Log variant assignment in your event stream, not just the experiment platform — that is what lets you reconstruct the experiment if the platform has bugs.

Step 8: Monitor without peeking at the primary

Peeking at the primary mid-flight and stopping when it looks significant inflates the false positive rate. A test designed for alpha = 0.05 can have an effective alpha of 0.20 or more if you check daily and stop on good news. Guardrails are different: check those daily, because the point is to kill the experiment if something is on fire. Sample ratio mismatch (SRM) is the most useful daily check: if your 50/50 split is actually 51/49 with a chi-square p-value below 0.001, your assignment is broken.
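
The inflation from daily peeking is easy to demonstrate with a small simulation. The sketch below assumes an A/A test (no true effect) with equal traffic each day, so each day's standardized evidence is modeled as an independent standard normal draw. The function name, seed, and number of simulations are illustrative choices.

```python
import math
import random


def false_positive_rates(n_sims: int = 5000, n_looks: int = 14,
                         z_crit: float = 1.96, seed: int = 7):
    """Compare stop-on-first-significant-look with a single final look."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Under the null, each day adds N(0, 1) evidence.
            running_sum += rng.gauss(0.0, 1.0)
            # Cumulative z-statistic after `look` days of equal traffic.
            z = running_sum / math.sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        if crossed_at_any_look:
            peeking_hits += 1
        # After the loop, z is the statistic at the pre-registered end date.
        if abs(z) > z_crit:
            fixed_hits += 1
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking, fixed = false_positive_rates()
print(f"stop on any significant daily look: {peeking:.3f}")
print(f"single look at the end:             {fixed:.3f}")
```

The single-look rate should land near the nominal 0.05, while the peeking rate comes out several times higher, even though nothing about the treatment changed.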

Step 9: Analyze with intent

Run the analysis you pre-registered, in the order you pre-registered. Verify SRM first. Compute the primary with a confidence interval, not just a p-value: even when the result is significant, the CI tells the PM whether the plausible lift is large enough to be worth shipping. Run only the pre-registered segment cuts; do not fish across twenty new segments. Any decent ML lead will catch that in review.
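
For a conversion-style primary, the confidence interval and p-value can be computed with the standard library alone. This is a sketch assuming a two-proportion z-test with a Wald interval for the difference; the function name and the example counts are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_summary(conv_c: int, n_c: int, conv_t: int, n_t: int,
                           alpha: float = 0.05) -> dict:
    """Absolute lift, two-sided p-value, and Wald CI for treatment minus control."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error for the z-test (null hypothesis: no difference).
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval.
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return {
        "diff": diff,
        "ci_low": diff - z_crit * se_unpooled,
        "ci_high": diff + z_crit * se_unpooled,
        "p_value": p_value,
    }


# Hypothetical counts: 450/10,000 control vs 520/10,000 treatment.
print(two_proportion_summary(450, 10_000, 520, 10_000))
```

Reading the interval against the MDE is the useful part: a significant result whose lower bound sits barely above zero is a weaker case for shipping than the p-value alone suggests.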

Step 10: Make the ship decision

Rarely "ship if p < 0.05." It is a four-way matrix: primary up, guardrails clean — ship. Primary up, guardrail breached — escalate the tradeoff, often do not ship. Primary flat — do not ship, but document what you learned about the MDE. Primary down — definitely do not ship, and write the postmortem.
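
The four-way matrix can be written down as a small function so the rule is fixed before results arrive. This sketch assumes "primary up" means the CI lower bound is above zero and "primary down" means the upper bound is below zero; your own plan may add a lift-versus-MDE condition. The name `ship_decision` and the return strings are hypothetical.

```python
def ship_decision(ci_low: float, ci_high: float, guardrail_breached: bool) -> str:
    """Map the primary's confidence interval and guardrail status to a decision."""
    if ci_low > 0:
        # Primary up: the whole interval is above zero.
        if guardrail_breached:
            return "escalate the tradeoff"
        return "ship"
    if ci_high < 0:
        # Primary down: the whole interval is below zero.
        return "do not ship; write the postmortem"
    # Primary flat: the interval includes zero.
    return "do not ship; document what the test could detect"


print(ship_decision(0.001, 0.013, guardrail_breached=False))
print(ship_decision(0.001, 0.013, guardrail_breached=True))
print(ship_decision(-0.004, 0.006, guardrail_breached=False))
print(ship_decision(-0.012, -0.002, guardrail_breached=False))
```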

A reusable plan template

Paste this into a doc before launch. Filling it in is the design step.

# A/B Test Plan: [feature name]

## Hypothesis
If we change [X], then [primary Y] will improve by at least [MDE]
because [mechanism Z].

## Primary metric
- Definition: [SQL or formula]
- Baseline: [current rate, last 4 weeks]
- MDE: [absolute or relative]

## Secondary metrics
- [Metric 1] — explains primary by [reason]

## Guardrails
- [Refund rate] — must not increase by more than [X]
- [Latency p95] — must not increase by more than [X]

## Sample size
- Alpha: 0.05, Power: 0.80
- N per arm: [computed]
- Required duration: [weeks]

## Randomization
- Unit: user_id, Split: 50/50
- Stratification: platform, country

## Analysis plan
- Primary test: two-proportion z-test
- Pre-declared segments: platform, country, new vs returning
- Stopping rule: fixed N or fixed end date

## Ship criteria
- Primary lift >= MDE AND CI excludes zero AND no guardrail breach

Common pitfalls

Most failed experiments fail in design, not analysis. The pattern that comes up over and over is that the team picked a metric they could not actually move in the time they had, then chased false positives across segment cuts when the primary did not budge. The fix is honesty about MDE before launch. If your traffic cannot detect a 1% relative lift in eight weeks, do not pretend it can — pick a more sensitive metric, use CUPED to reduce variance, or accept that this experiment will only catch large effects.

A second trap is sample ratio mismatch that nobody checks. If your assignment service crashes silently for a subset of users, or your pipeline drops one variant's logs more than the other's, your 50/50 split is no longer 50/50 and every downstream comparison is biased. The fix is to run an SRM chi-square test as the first step of every analysis script and refuse to interpret the primary if SRM fails. This single check has saved more careers than any statistical method.
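
An SRM check for a two-arm test fits in a few lines. The sketch below uses a chi-square goodness-of-fit test with one degree of freedom and computes the p-value from the standard library via the identity P(chi-square with 1 df > x) = erfc(sqrt(x / 2)). The function name and the 0.001 threshold default are assumptions taken from the rule of thumb in this article.

```python
import math


def srm_check(n_control: int, n_treatment: int,
              expected_share_control: float = 0.5,
              threshold: float = 0.001):
    """Return (chi2, p_value, srm_detected) for a two-arm experiment."""
    total = n_control + n_treatment
    expected_control = total * expected_share_control
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


chi2, p_value, srm_detected = srm_check(51_000, 49_000)
if srm_detected:
    print(f"SRM detected (chi2={chi2:.1f}, p={p_value:.2e}); do not read the primary")
```

Putting this call at the top of the analysis script, and stopping when it fires, is what turns the check from a good intention into a habit.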

The third pitfall is treatment leakage. In a recommendations experiment, a treatment user shares a link with a control user, who now sees treatment content. In a pricing experiment, a treatment user mentions a discount in a forum and control users complain to support. Leakage shrinks the measured effect, sometimes to nothing, and the only signal is a primary that moves less than your back-of-envelope model predicted. The fix is cluster randomization at the boundary where leakage actually crosses.

A fourth pitfall is changing the experiment mid-flight. Adding a third variant after week one, swapping the primary because the original looks flat, extending duration "just a little" — each inflates effective alpha and corrupts the result. The fix is the pre-registered plan: sign it before launch, and require a written amendment with re-computed alpha for any change.

Finally, ignoring novelty and primacy effects. Users who see a brand-new UI in week one behave differently from users who see the same UI in week four. If a two-week experiment shows a 10% lift, ask how much survives month three. The fix is to run long enough for novelty to decay (four to six weeks for UI changes), or to track the effect over time and look for a stable plateau rather than the peak.

Design variants beyond classic A/B

Classic two-arm A/B is the right default, not the only option. Multi-arm (A/B/n) tests with three or more arms compare several alternatives at once, at the cost of more total traffic and a multiple-comparison correction. Factorial (2x2) designs, often called multivariate tests, test multiple variables in the same population and measure both main effects and interactions. Holdout tests reserve a slice of users who never get new features, giving long-run cumulative impact across a quarter. Switchback designs alternate a whole market or region between treatment and control over time windows; they are the standard for marketplaces where two-sided network effects make user-level randomization unsound. If traffic is low, CUPED reduces variance by adjusting the primary for a pre-experiment covariate. The reduction is roughly the squared correlation between the covariate and the primary, so a strongly correlated covariate can shrink the required sample size by 40% or more.
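
The CUPED adjustment itself is short. Below is a minimal sketch on synthetic data, assuming one pre-experiment covariate per user (for example, the same metric measured before assignment). In practice the coefficient is usually estimated on both arms pooled; the function name and the data-generating numbers are invented for illustration.

```python
import random
from statistics import mean, variance


def cuped_adjust(y: list, x: list):
    """Return CUPED-adjusted outcomes and the coefficient theta = cov(x, y) / var(x)."""
    x_bar, y_bar = mean(x), mean(y)
    cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    theta = cov_xy / variance(x)
    # Subtract the part of the outcome predicted by the pre-period covariate.
    adjusted = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]
    return adjusted, theta


# Synthetic users: the in-experiment metric correlates with the pre-period metric.
rng = random.Random(3)
pre_period = [rng.gauss(10.0, 3.0) for _ in range(5000)]
in_experiment = [0.8 * xi + rng.gauss(0.0, 2.0) for xi in pre_period]

adjusted, theta = cuped_adjust(in_experiment, pre_period)
reduction = 1 - variance(adjusted) / variance(in_experiment)
print(f"theta={theta:.3f}, variance reduction={reduction:.1%}")
```

The adjusted metric has the same mean as the original, so the treatment effect estimate is unchanged on average; only its variance shrinks.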

To drill A/B design and analysis questions every day, NAILDD is launching with 500+ data science problems built around this kind of end-to-end experimental thinking.

FAQ

Do I always need an A/B test before shipping?

No. Treating A/B as a gate for every change slows your team to a crawl. Big, expensive, or hard-to-reverse changes — checkout flow, pricing, recommendation algorithm — should be tested. Small bug fixes, copy edits, and changes where the worst case is "no measurable effect" can ship straight. Ask "how much would I regret shipping this blind?" — if a lot, design an experiment; if barely, save the traffic for the next big swing.

How is a Bayesian A/B test different?

A Bayesian test computes the probability that the treatment is better than control given the data and a prior. A frequentist test computes the probability of seeing data at least this extreme if no effect exists. Bayesian results are easier to communicate ("92% chance treatment wins"), and the posterior can be updated continuously as data arrives. That does not make peeking free: stopping the moment the posterior crosses a threshold still raises the rate of false wins, so pre-register the decision rule either way. The other cost is the prior: a bad one shifts conclusions.
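
For a conversion metric, the "chance treatment wins" number can be estimated with a Beta-Binomial model and Monte Carlo draws. This sketch assumes a uniform Beta(1, 1) prior on each arm's conversion rate; the function name, seed, and example counts are illustrative.

```python
import random


def prob_treatment_beats_control(conv_c: int, n_c: int, conv_t: int, n_t: int,
                                 draws: int = 20_000, seed: int = 11,
                                 prior_alpha: float = 1.0,
                                 prior_beta: float = 1.0) -> float:
    """Estimate P(rate_treatment > rate_control | data) by posterior sampling."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(prior_alpha + successes, prior_beta + failures).
        p_c = rng.betavariate(prior_alpha + conv_c, prior_beta + n_c - conv_c)
        p_t = rng.betavariate(prior_alpha + conv_t, prior_beta + n_t - conv_t)
        if p_t > p_c:
            wins += 1
    return wins / draws


# Hypothetical counts: 450/10,000 control vs 520/10,000 treatment.
print(f"{prob_treatment_beats_control(450, 10_000, 520, 10_000):.3f}")
```

Changing `prior_alpha` and `prior_beta` shows how much the conclusion leans on the prior when the sample is small, and how little when it is large.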

How many experiments can run in parallel?

Five to ten major experiments per team is fine if they touch different surfaces or user populations. Risk rises when two experiments share a metric or UI element — your clean A vs B becomes A intersected with whatever else is running. Above ten in parallel, log every user's full experiment vector and periodically check whether the primary responds to combinations rather than singletons.

What if the primary is flat but a secondary moved a lot?

You do not ship. The pre-registered primary is the decision metric; a moved secondary is a hypothesis for the next experiment, not a justification for shipping this one. If you check ten secondaries at alpha = 0.05, there is roughly a 40% chance that at least one looks significant purely by chance. Write it up as an insight, design the next experiment with that secondary as the primary, and move on.
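
The arithmetic behind that warning is short enough to keep in the analysis notebook. The sketch assumes the secondary metrics are independent, which overstates the risk somewhat when they are correlated; both function names are hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(k), 3), bonferroni_alpha(k))
```

Bonferroni is the bluntest correction available; it is shown here only because it makes the cost of each extra metric obvious.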

How do I handle ratio metrics like revenue per session?

The standard t-test assumes independent observations, but sessions from the same user are correlated, and the number of sessions varies across users. When you randomize by user and analyze by session, a naive t-test on revenue per session understates the variance and your CIs will be too narrow. The delta method is the standard correction: it computes the variance of the ratio from the user-level variances of numerator and denominator and the covariance between them. Use the delta method, or switch to user-level metrics where each user contributes one observation.
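
A delta-method standard error for one arm's ratio metric can be sketched as below. It assumes the inputs are per-user totals (revenue per user and sessions per user), since the user is the randomization unit; the function name and the toy data are invented for the example.

```python
import math
from statistics import mean, variance


def ratio_with_delta_se(numerators: list, denominators: list):
    """Ratio of means and its delta-method standard error from per-user totals."""
    n = len(numerators)
    mean_x, mean_y = mean(numerators), mean(denominators)
    var_x, var_y = variance(numerators), variance(denominators)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(numerators, denominators)) / (n - 1)
    ratio = mean_x / mean_y
    # First-order Taylor expansion of mean_x / mean_y around the true means.
    var_ratio = (var_x - 2 * ratio * cov_xy + ratio ** 2 * var_y) / (n * mean_y ** 2)
    # Guard against tiny negative values from floating-point rounding.
    return ratio, math.sqrt(max(var_ratio, 0.0))


# Toy per-user totals: revenue and session counts for four users.
revenue_per_user = [10.0, 0.0, 30.0, 20.0]
sessions_per_user = [2, 1, 5, 4]
ratio, se = ratio_with_delta_se(revenue_per_user, sessions_per_user)
print(f"revenue per session = {ratio:.2f}, SE = {se:.3f}")
```

Run it once per arm, then compare arms with a z-test on the difference of the two ratios using the combined standard errors.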

One-sided or two-sided tests?

Two-sided, almost always. A one-sided test assumes the effect can only go in your predicted direction, which is rarely true — UI changes backfire and pricing changes cannibalize. "We only care about positive lift" hides the fact that you also care about negative lift you might miss.