Product manager and experimentation: how to run A/B tests well
Why PMs need experiments
Any feature at scale is an experiment. If you ship a change to 100% of users and ask "did it get better?", you do not actually know the answer: the lift could be seasonality, a marketing push, or a competitor outage. A randomized A/B test is the most reliable practical way to separate the effect of your feature from background noise, and a PM who cannot reason about that gap will either ship on vibes or block on a data scientist for every decision.
The mental trigger: when you say "let's roll it out and see," ask "how will we know the effect is from our change and not from external factors?" If there is no answer, that is a lottery, not a launch. Strong PM orgs at Stripe, Airbnb, and Booking treat experimentation as a literacy, not a service: the PM writes the hypothesis, picks the primary metric, sizes the test, and pre-commits to the decision; the analyst checks the math.
Hypothesis and primary metric
An experiment starts with a hypothesis, not a rollout. A usable hypothesis takes the shape "if we do X, metric Y will move by Z, because W." The "because" part is what makes the hypothesis falsifiable — without it, any result can be rationalized after the fact.
Bad: "let's change the button color." Good: "if we change the primary CTA from gray to red, purchase conversion rises 1.5-3%, because red has higher contrast on a text-heavy page and reduces visual search time."
The metric you pick has to clear four bars: one primary (everything else is guardrail or secondary), sensitive (it moves quickly to changes you can ship in a sprint), tied to business value (not "clicks on the button"), and computable without one-off SQL heroics.
The most common trap is choosing a metric that conveniently shows growth but does not reflect real value. Clicks went up, conversion fell, revenue did not move — congratulations, you optimized a vanity number. Always sanity-check that a 5% move on the primary would be defended by finance at end of quarter.
Load-bearing rule: Any hypothesis you cannot write as "if X, then Y by Z, because W" is not a hypothesis — it is a wish. Stop and rewrite before you spec the test.
MDE and sample size
MDE (Minimum Detectable Effect) is the smallest effect your test can reliably catch. If your MDE is 5% and the real effect is 2%, the test will most likely return "no difference" and you will conclude the feature is dead when in fact you just could not see it.
MDE depends on four levers: sample size, the variance of the metric, the significance level (typically alpha = 0.05), and statistical power (typically 80%). Before you launch, the PM owns this calculation: we have N users per week, so what effect can we catch in two weeks? If the answer is "MDE 10%" and you realistically expect 2%, the test is underpowered before it starts: you will almost certainly get a null and learn nothing.
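A rough way to run that calculation is sketched below. It assumes a binary (conversion-style) metric, a two-sided z-test, equal-sized arms, and the normal approximation; the function name `relative_mde` and the traffic figures are illustrative, not taken from any specific platform.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(baseline: float, users_per_arm: int,
                 alpha: float = 0.05, power: float = 0.80) -> float:
    """Approximate relative MDE for a binary metric, two-sided z-test, equal arms.

    Uses the baseline variance for both arms, which is a common planning
    simplification rather than an exact result.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    std_error = sqrt(2 * baseline * (1 - baseline) / users_per_arm)
    return (z_alpha + z_power) * std_error / baseline


if __name__ == "__main__":
    # Hypothetical traffic: 10,000 users per arm per week, two-week test.
    mde = relative_mde(baseline=0.10, users_per_arm=20_000)
    print(f"Detectable relative lift: {mde:.1%}")
```

Because the sample size sits under a square root, quadrupling traffic only halves the MDE, which is why variance reduction is often cheaper than waiting for more users.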
Rough orders of magnitude, not guarantees, for a single variant against control at alpha = 0.05 and 80% power, before any variance reduction:
| Metric | Baseline | Users per arm | Typical MDE |
|---|---|---|---|
| Checkout conversion | 5% | 50,000 | 7-8% relative |
| 7-day retention | 25% | 30,000 | 3-5% relative |
| Revenue per user | $40 (high variance) | 50,000 | 10-20% relative |
| Search CTR | 12% | 100,000 | 3-4% relative |
If the MDE is too coarse, you have four options: increase the sample, reduce variance (CUPED, stratification, capping outliers), swap to a more sensitive primary, or admit A/B is the wrong tool. The worst option, taken by default, is "run it anyway and hope" — that just spends two weeks producing an inconclusive readout.
Test design
Design is written before launch. Editing conditions mid-flight invalidates the statistics, no matter how reasonable the edit looks in Slack.
A minimal design checklist covers six things: the randomization unit (usually user_id, sometimes session or device), duration (one full week minimum, ideally two), audience segment, guardrail metrics that must not break, an A/A sanity check on the platform, and the pre-committed decision rule for every possible outcome. Writing down the decision before the result is what prevents post-hoc negotiation when the number lands on the boundary.
| Design element | Default | When to change |
|---|---|---|
| Unit | user_id | Session for logged-out flows; device for cross-account abuse work |
| Split | 50/50 | 90/10 for many variants or high risk; 99/1 for canary-style rollouts |
| Duration | 14 days | 7 only with huge traffic; 28+ for retention or weekly-habit metrics |
| Segment | All active users | New users for onboarding; power users for core flow changes |
| Guardrails | p95 latency, error rate, opt-out rate | Add domain-specific ones (refund rate, support tickets) |
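To make the unit and split rows concrete, here is a minimal sketch of deterministic, hash-based assignment. The function name `assign_variant` and the salting scheme (experiment name plus unit id) are assumptions for illustration; real platforms differ in the details.

```python
import hashlib


def assign_variant(unit_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a randomization unit to 'control' or 'treatment'.

    Hashing the experiment name together with the unit id keeps a unit's
    assignment stable across sessions and uncorrelated between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{unit_id}".encode("utf-8")).hexdigest()
    # First 8 hex characters -> integer in [0, 2**32), scaled to [0, 1).
    position = int(digest[:8], 16) / 2**32
    return "treatment" if position < treatment_share else "control"


if __name__ == "__main__":
    # Hypothetical experiment name and user id; a 90/10 split for a risky change.
    print(assign_variant("user-42", "checkout-cta-color", treatment_share=0.10))
```

The same user always lands in the same arm, so no assignment table is needed, and changing `treatment_share` mid-test would move users between arms, which is one reason the split is fixed before launch.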
The platform itself needs to be trusted. Before any consequential test, run an A/A and confirm the null lands inside the expected false-positive rate. Teams that skip this step discover, six months in, that their bucketing leaks signal across cells.
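One simple way to read a batch of A/A results is to compare the share of "significant" runs against alpha plus a tolerance for sampling noise. The sketch below assumes you already have one p-value per A/A run; the function name `aa_health` and the three-standard-error tolerance are illustrative choices, not a standard.

```python
from math import sqrt


def aa_health(p_values, alpha: float = 0.05, z_tolerance: float = 3.0):
    """Return (false_positive_rate, looks_healthy) for a batch of A/A p-values.

    With a sound platform, about `alpha` of A/A runs come out "significant".
    The rate is flagged only if it exceeds alpha by more than `z_tolerance`
    binomial standard errors, so small batches are not over-interpreted.
    """
    n = len(p_values)
    rate = sum(p < alpha for p in p_values) / n
    upper_bound = alpha + z_tolerance * sqrt(alpha * (1 - alpha) / n)
    return rate, rate <= upper_bound


if __name__ == "__main__":
    # Hypothetical batch: 100 A/A runs, 5 of them with p < 0.05.
    healthy_batch = [0.01] * 5 + [0.50] * 95
    print(aa_health(healthy_batch))
```

A handful of A/A runs cannot distinguish 5% from 20%, so the check only becomes informative once dozens of runs have accumulated.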
Reading results
A result is not "the green number on the dashboard." A real readout has five layers: effect size with a confidence interval, p-value or Bayesian posterior, guardrail status, day-over-day stability, and segment breakdowns. If the effect is significant but a guardrail moved against you, do not ship. If it is significant only in one segment, ship to that segment only if the cut was pre-registered; otherwise treat it as a hypothesis for the next test.
Peeking is the silent killer here: if you check the dashboard every day and stop at the first "significant" day, your false-positive rate climbs from the nominal 5% to 20% or more over a two-week test. Either use sequential testing or wait the pre-committed duration.
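The inflation from peeking is easy to see in a simulation. The sketch below models A/A tests (no true effect) in a simplified way: each day adds an independent standard-normal increment to the cumulative test statistic, which approximates a z-test on equal daily batches. The function name and parameters are illustrative.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rates(days: int = 14, simulations: int = 4000,
                                 alpha: float = 0.05, seed: int = 7):
    """Simulate A/A tests and compare two stopping rules.

    Returns (rate_if_peeking_daily, rate_if_reading_only_at_the_end).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    crossed_on_any_day = 0
    significant_at_end = 0
    for _ in range(simulations):
        cumulative = 0.0
        crossed = False
        z = 0.0
        for day in range(1, days + 1):
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / day ** 0.5  # z-statistic after `day` days of data
            if abs(z) > z_crit:
                crossed = True  # a daily peeker would have declared a winner
        crossed_on_any_day += crossed
        significant_at_end += abs(z) > z_crit
    return crossed_on_any_day / simulations, significant_at_end / simulations


if __name__ == "__main__":
    peeking, fixed = peeking_false_positive_rates()
    print(f"stop at first significant day: {peeking:.1%}, fixed horizon: {fixed:.1%}")
```

The fixed-horizon rate stays near alpha while the peeking rate comes out several times higher, even though there is no real effect in any simulated test.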
Sanity check: Before you ship a winner, ask "is the absolute lift worth the cost of carrying this feature forever?" A 0.1% lift that requires a quarter of engineering to maintain is a net negative even when the p-value is beautiful.
Use this decision framework as the final filter on every readout:
Ship / Iterate / Kill:
- Ship when the primary moves in the expected direction with practical significance, guardrails are clean, and the effect holds across the segments you actually serve.
- Iterate when the direction is right but the effect is below MDE, a guardrail wobbled, or the effect concentrates in one segment that suggests a targeting change rather than a global rollout.
- Kill when the primary is flat or negative, when a guardrail breaks, or when the absolute lift cannot pay back the maintenance cost. Killing a feature is a win, not a loss — you just bought back roadmap capacity.
The anti-pattern is reading only the p-value. A change can be statistically significant and practically useless. Always pair the p-value with the absolute effect, the confidence interval, and the carry cost.
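A minimal readout for a conversion metric that reports the p-value alongside the absolute lift and its confidence interval might look like the sketch below. It assumes a two-proportion z-test with the normal approximation; the function name `readout` and the counts in the example are made up.

```python
from math import sqrt
from statistics import NormalDist


def readout(control_conv: int, control_n: int, treat_conv: int, treat_n: int,
            alpha: float = 0.05) -> dict:
    """Two-proportion z-test plus a confidence interval for the difference."""
    p_c = control_conv / control_n
    p_t = treat_conv / treat_n
    diff = p_t - p_c
    # Pooled standard error for the test (rates are equal under the null).
    pooled = (control_conv + treat_conv) / (control_n + treat_n)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treat_n))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_pooled)))
    # Unpooled standard error for the interval around the observed difference.
    se_diff = sqrt(p_c * (1 - p_c) / control_n + p_t * (1 - p_t) / treat_n)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * se_diff
    return {
        "absolute_lift": diff,
        "relative_lift": diff / p_c,
        "ci_low": diff - margin,
        "ci_high": diff + margin,
        "p_value": p_value,
    }


if __name__ == "__main__":
    # Hypothetical counts: 5.0% vs 5.3% conversion on 50,000 users per arm.
    for name, value in readout(2500, 50_000, 2650, 50_000).items():
        print(f"{name}: {value:.4f}")
```

In this made-up example the p-value clears 0.05, yet the lower end of the interval sits just above zero, so the honest summary is "probably positive, possibly tiny," which is exactly where carry cost decides the call.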
When A/B testing does not work
Not every change should be tested. Experiments are expensive in time and team attention, and in several cases they either cannot work or will mislead you.
A/B is the wrong tool when the product is too small (a hundred weekly users only catches elephant-sized effects), when strong network effects mean treatment and control users influence each other (marketplaces, social graphs), when the change is mandatory (legal, compliance, security patch), when the decision is strategic rather than incremental, or when the real effect window is longer than you can run the test.
Alternatives exist for each case: switchback designs for marketplaces, geo-splits for region-bound features, holdout cohorts for long-horizon retention, and pre/post analysis with seasonality controls for one-shot launches. Pick the alternative deliberately rather than running a doomed A/B because it is the default verb.
Experiment document template
Keep a Notion or Linear template handy. The minimum useful structure: title and owner; context; hypothesis in "if/then/because" form; primary metric with its formula; guardrails and halt thresholds; design (unit, split, segment, duration); MDE calculation and duration justification; decision rule for win, loss, and null; technical details (feature flag, event schema, dashboard link); and the post-test result and decision.
If the doc is not written before launch, the readout devolves into "let's slice it this other way" and "can we add this metric retroactively?" That is fitting the conclusion to the desired answer. Pre-registration is the cheapest defense against confirmation bias on your own work. A compact readiness checklist many teams pin to the top of the doc:
| Check | Status | Notes |
|---|---|---|
| Hypothesis written in "if/then/because" form | yes / no | |
| Primary metric defined with SQL or event spec | yes / no | |
| MDE calculated, duration justified | yes / no | |
| Guardrails listed with halt thresholds | yes / no | |
| Randomization unit chosen and bucketed | yes / no | |
| A/A run on the platform in the last 90 days | yes / no | |
| Decision rule pre-committed for ship/iterate/kill | yes / no | |
| Feature flag and event instrumentation merged | yes / no | |
If any row is "no," the test is not ready to launch. This sounds bureaucratic until the third time it saves you from shipping a regression.
Common pitfalls
The most expensive pitfall is peeking and stopping early. PMs check the dashboard on day three, see a "significant" lift, and call it won. With daily checks the false-positive rate is several times the nominal 5%, so a large share of such "wins" fade on a longer run. Fix: sequential testing or strict adherence to the pre-committed duration.
The second is changing conditions mid-test. Swapping the audience segment, the bucket size, or the feature flag logic during the test invalidates the comparison between arms. If the design was wrong, end the test, fix it, and restart; do not patch in flight.
The third is ignoring guardrails. Conversion goes up 3%, page load time degrades 400ms, and the team ships because the primary won. Six weeks later, retention erodes and nobody connects the dots. Guardrails are the contract that says "we will not trade a known good for an unknown bad."
The fourth is segment farming. Slice the result twenty ways and, at alpha = 0.05, you should expect about one segment to show significance by chance alone. Treat segment effects as hypotheses for the next test, not conclusions from this one, unless the segment was pre-registered as a planned cut.
The fifth is forgetting carry cost. A winning feature stays in the codebase forever. A 0.5% lift that requires permanent maintenance is sometimes a net loss after twelve months. Include carry cost in the ship/kill conversation, not just the lift.
The sixth is judging too soon. Novelty fades and habit builds. A feature that wins on week one may lose on week eight. For retention-touching changes, schedule a 30-day reread before declaring final.
FAQ
How long should an A/B test run?
At least one full week, usually two. Less than that misses weekly cycles — weekdays and weekends behave differently for almost every product. The exact duration falls out of the MDE calculation: pick the smallest effect worth catching, plug in your traffic, and compute the required user-days per arm. If the answer is longer than you have patience for, you need more traffic or variance reduction — not a shorter test.
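The calculation described above can be sketched as follows, assuming a binary metric, a two-sided z-test, equal arms, and steady daily traffic. The function names and the traffic numbers are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_users_per_arm(baseline: float, relative_mde: float,
                           alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users per arm needed to detect `relative_mde` on a binary metric."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    delta = baseline * relative_mde          # absolute effect to detect
    variance = baseline * (1 - baseline)     # planning approximation: same in both arms
    return ceil(2 * variance * (z_alpha + z_power) ** 2 / delta ** 2)


def required_days(users_per_arm: int, daily_users_per_arm: int, min_days: int = 7) -> int:
    """Days needed at a given daily traffic, rounded up to whole weeks."""
    days = max(min_days, ceil(users_per_arm / daily_users_per_arm))
    return ceil(days / 7) * 7  # whole weeks so every weekday is equally represented


if __name__ == "__main__":
    # Hypothetical: 5% baseline, want to catch a 5% relative lift, 5,000 new users per arm per day.
    n = required_users_per_arm(baseline=0.05, relative_mde=0.05)
    print(n, "users per arm,", required_days(n, daily_users_per_arm=5_000), "days")
```

Halving the effect you want to catch roughly quadruples the required sample, so the "smallest effect worth catching" is the most expensive number in the design.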
What significance level should I use?
Defaults are alpha = 0.05 and power = 80%. These are conventions, not laws — a high-risk change (pricing, payments) deserves tighter alpha, and a low-stakes change can use looser power. Document the choice in the design doc so the readout cannot be re-litigated later.
What if the effect is significant but a guardrail breaks?
Do not ship. Find the cause of the guardrail regression, fix it, and rerun the test. Buying a 3% conversion lift with a 5% latency regression is usually net zero or net negative at the company level, because latency damage compounds across every other surface.
How do I know my experimentation platform is working?
Run A/A tests periodically and confirm the false-positive rate sits near your alpha. If 5% of A/A tests show "significance" on the primary, the bucketing is healthy. If 20% do, you have a leak somewhere — usually event deduplication or assignment timing.
Can I test against several metrics at once?
You can, but you need a multiple-comparison correction (Bonferroni, Benjamini-Hochberg). Without it, the chance of at least one false positive grows with each extra metric. Cleaner discipline: pick one primary, treat the rest as secondary signals for the next test.
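To illustrate both the problem and the two corrections named above, here is a small sketch. The family-wise formula assumes independent metrics, which real metrics rarely are, so treat it as an upper-level intuition; the p-values in the example are made up.

```python
def familywise_error_rate(alpha: float, metrics: int) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** metrics


def bonferroni(p_values, alpha: float = 0.05):
    """Reject only p-values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg(p_values, alpha: float = 0.05):
    """Benjamini-Hochberg step-up procedure, controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        rejected[idx] = rank <= cutoff_rank
    return rejected


if __name__ == "__main__":
    print(f"10 metrics at alpha 0.05: {familywise_error_rate(0.05, 10):.0%} chance of a false positive")
    p_values = [0.001, 0.012, 0.025, 0.045, 0.20]  # hypothetical metric p-values
    print("Bonferroni:", bonferroni(p_values))
    print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni is the stricter of the two; Benjamini-Hochberg keeps more discoveries at the price of tolerating a controlled share of false ones, which tends to suit secondary metrics better than a ship decision.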
What is CUPED in plain English?
CUPED reduces the variance of your primary metric by adjusting for each user's pre-experiment behavior. Same sample size, smaller MDE. It is the single highest-leverage variance reduction technique most teams under-use.
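A minimal sketch of the adjustment, assuming one pre-experiment covariate per user (for example, the same metric measured in the weeks before the test) and a coefficient estimated on the pooled data from both arms. The function name `cuped_adjust` and the toy numbers are illustrative.

```python
from statistics import fmean, pvariance


def cuped_adjust(metric, pre_metric):
    """Return CUPED-adjusted values: y - theta * (x - mean(x)).

    `metric` is the in-experiment value per user and `pre_metric` the same
    user's pre-experiment value. theta = cov(x, y) / var(x) is the slope that
    removes as much variance as a linear adjustment can.
    """
    mean_x = fmean(pre_metric)
    mean_y = fmean(metric)
    cov_xy = fmean((x - mean_x) * (y - mean_y) for x, y in zip(pre_metric, metric))
    theta = cov_xy / pvariance(pre_metric)
    return [y - theta * (x - mean_x) for x, y in zip(pre_metric, metric)]


if __name__ == "__main__":
    # Toy data: users who spent more before the test also spend more during it.
    pre = [1.0, 2.0, 3.0, 4.0, 5.0]
    during = [2.0, 4.0, 6.0, 8.0, 11.0]
    adjusted = cuped_adjust(during, pre)
    print(f"variance before: {pvariance(during):.2f}, after: {pvariance(adjusted):.2f}")
```

The adjustment leaves the overall mean unchanged and shrinks variance in proportion to how well pre-period behavior predicts the metric, so it helps little for brand-new users with no history.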
How many A/B tests actually win?
Public numbers from large platforms put it in the 10-30% range. Most tests show no significant effect or a negative one, and that is normal. A team where 80% of tests "win" is almost certainly peeking or fishing for segments.