A/B testing explained simply
What an A/B test actually is
An A/B test is the simplest controlled experiment you can run on a product. You randomly assign each incoming user to one of two groups (in effect, a fair coin flip per user), show one group the control experience and the other the treatment experience, then compare a single metric (conversion, revenue per user, or retention) and ask one boring question: did the treatment move the number more than random noise would have on its own?
The reason this matters is not that it sounds scientific. It is that human intuition about what works in a product is unreliable, and the bigger the company, the more expensive that unreliability gets. A senior PM at Netflix or Airbnb is not running tests out of love for statistics. They run them because shipping a confidently wrong redesign to 300 million users is an enormous risk, and a two-week experiment on 1% of traffic costs almost nothing in comparison.
The one rule to remember: if the only reason you believe a change is good is that you shipped it, you do not yet know whether the change is good.
How the experiment loop works
Every A/B test, from the simplest button-color test at a five-person startup to a multivariate launch at Meta, walks through the same six steps in the same order. Skip any of them and your conclusion is decoration, not evidence.
Step 1 — write the hypothesis. A hypothesis is a sentence in the form "if we change X, then metric Y will move by at least Z, because of reason R." The "because" matters. A test without a mechanism teaches you nothing even when it wins — you cannot generalize the result to the next change.
Step 2 — pick the primary metric and the guardrails. The primary metric is the one number you will act on. Guardrails are the metrics you check to make sure you did not break anything else. Picking guardrails before you start is what separates a real experimentation culture from theater.
Step 3 — compute the sample size. Given a baseline rate, a minimum detectable effect (MDE), an alpha of 0.05 and power of 0.80, you can solve for the number of users per arm. For a 5% baseline conversion and a 10% relative MDE, you need roughly 30,000 users per arm. Without this number you are guessing.
Step 4 — split the traffic. A randomization service (LaunchDarkly, Statsig, GrowthBook) hashes each user ID and routes them deterministically into control or treatment, so a user sees the same variant on every visit.
Step 5 — let it run. Do not look. Do not call it early. Do not extend it because the result is "almost significant." This is the single hardest part for first-time experimenters.
Step 6 — analyze and decide. Compute the lift, the confidence interval, and the p-value. Check guardrails. Make a call: ship, do not ship, or iterate.
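The deterministic split in Step 4 can be sketched in a few lines. This is a minimal illustration, not the actual algorithm of any vendor named above: the function name `assign_variant`, the choice of SHA-256, and the `experiment:user_id` key format are all assumptions made for the example.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Map a user to 'control' or 'treatment' deterministically (hypothetical helper)."""
    # Salting with the experiment name keeps assignments independent across experiments.
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# The same user always lands in the same arm for a given experiment.
print(assign_variant("user-42", "checkout-button"))
print(assign_variant("user-42", "checkout-button"))
```

Because the variant is a pure function of the user ID and the experiment name, no assignment table has to be stored, and a returning user sees the same experience on every visit.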
The vocabulary you need
Most A/B test confusion is vocabulary confusion. Here are the words that matter, with the definitions you actually need on the job — not the textbook ones.
| Term | What it means in practice | Typical value |
|---|---|---|
| Primary metric | The single number you decide on | Conversion, RPU, D7 retention |
| MDE | Smallest lift you would care about | 1-10% relative |
| Alpha (α) | False positive rate you accept | 0.05 |
| Power (1 - β) | Chance of catching a real effect | 0.80 |
| Sample size per arm | Users needed in each group | 5k - 500k |
| p-value | Prob. of a lift at least this large if there were no real effect | < 0.05 = significant |
| Confidence interval | Plausible range for the true lift | e.g. +1.2% to +4.8% |
| Guardrail metric | What must not break | Latency, refunds, retention |
| SRM | Sample ratio mismatch — broken split | Alarm if p < 0.001 |
The two that trip people up the most are MDE and p-value. MDE is a planning input: you choose it before you start, based on what lift would be worth shipping. The p-value is an output: it tells you how surprising your data would be if there were actually no effect. Confusing the two leads to the most common rookie mistake: testing forever to "find significance" on a lift that was never large enough to matter.
Load-bearing trick: if the 95% confidence interval includes zero, the test is inconclusive at α = 0.05. If it excludes zero, the direction of the effect is statistically significant at that level. The p-value is just a compressed version of this same fact.
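The SRM row in the table is the easiest one to automate. A minimal sketch, assuming a planned 50/50 split and a chi-square goodness-of-fit test with one degree of freedom; the function name `srm_p_value` and the example counts are hypothetical.

```python
from math import erfc, sqrt


def srm_p_value(n_control: int, n_treatment: int, expected_control_share: float = 0.5) -> float:
    """P-value for the observed split against the planned split (chi-square, 1 df)."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # For one degree of freedom the chi-square tail probability is erfc(sqrt(chi2 / 2)).
    return erfc(sqrt(chi2 / 2))


# A 10,000 vs 10,500 split looks harmless but trips the p < 0.001 alarm.
p = srm_p_value(10_000, 10_500)
print(f"SRM p-value: {p:.5f}", "ALARM" if p < 0.001 else "ok")
```

If the alarm fires, the usual reading is that the assignment or logging pipeline is broken, so the lift numbers should not be trusted until the cause is found.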
Reading the result
Imagine you ran a checkout-button color test for two weeks with about 17,000 users in each arm. The control group had a 5.0% conversion rate and the treatment group had a 5.5% conversion rate. The relative lift is +10%, the absolute lift is +0.5 percentage points, and the two-sided p-value comes back at 0.04.
Three things are true here, and beginners usually get one wrong.
First, the result is statistically significant at the standard 0.05 threshold: if there were truly no effect, we would see a lift this large or larger only 4% of the time by chance. It does not mean "there is a 96% probability the treatment is better" — that is a Bayesian statement requiring different math.
Second, the size of the lift matters more than the p-value. A +0.5 pp lift on a 5% baseline is a +10% relative improvement — large enough to ship. If the same p-value came with a +0.05 pp lift, you would shrug and move on, even though both are "significant."
Third, the confidence interval is doing the real work. A 95% CI of roughly [+0.03 pp, +0.97 pp] tells you the true effect is plausibly anywhere from "barely noticeable" to "double-digit relative lift." That spread tells you whether to ship confidently or run a follow-up.
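The three numbers discussed above (lift, p-value, confidence interval) can be reproduced with a two-proportion z-test. This is a sketch under stated assumptions: a normal approximation, a pooled standard error for the test, an unpooled one for the interval, and an arm size of 17,000, which is an illustrative figure chosen because it is about the size at which 5.0% versus 5.5% yields a two-sided p-value near 0.04. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_c, n_c, conv_t, n_t, confidence=0.95):
    """Return (absolute lift, two-sided p-value, confidence interval for the lift)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error: used for the hypothesis test under "no effect".
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error: used for the interval around the observed lift.
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return diff, p_value, (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)


# 850 of 17,000 convert in control (5.0%), 935 of 17,000 in treatment (5.5%).
diff, p_value, (low, high) = two_proportion_test(850, 17_000, 935, 17_000)
print(f"lift: {diff * 100:+.2f} pp, p-value: {p_value:.3f}")
print(f"95% CI: [{low * 100:+.2f} pp, {high * 100:+.2f} pp]")
```

Rerunning the same call with smaller arms is a quick way to see how fast the interval widens and the p-value drifts above 0.05 for the same observed rates.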
Nobody tells you this in stats class: when you ship a winning test, the realized impact in production is almost always smaller than the point estimate. Regression to the mean, novelty effects, and selection bias all bite. Discount the lift you saw by 20-30% for planning and you will be more right more often.
Common pitfalls
The mistakes below are the ones that show up over and over in real experimentation reviews. Each one has killed real tests at real companies. Read them before you launch anything, not after.
The first and most common failure is sample-size theater: running a test on a few hundred users and treating the result as real. With n=200 per arm on a 5% baseline, you cannot reliably detect anything smaller than a roughly 160% relative lift (5% to about 13%) at the usual 0.05 alpha and 0.80 power, and most product changes come nowhere near that. The test will almost always come back "not significant," but that result tells you nothing, because you never had the horsepower to detect a realistic effect. Run a power calculation before launch; if the required sample is unreachable, do not run the test.
The second is peeking. You launch on Monday, the test looks promising on Wednesday, you call it. The problem is that p-values jiggle around as data comes in, and if you check daily over a few weeks and stop whenever the number crosses 0.05, your real false-positive rate can climb to 20-30%. Either commit to a fixed end date, or use a sequential testing framework designed for early stopping.
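The inflation from peeking is easy to see in a simulation of A/A tests, where there is no real effect and every "win" is a false positive. The sketch below makes a simplifying assumption that is not from the text: each day contributes one standard-normal increment, so the z-statistic after k days is the running sum divided by the square root of k. The function name, the 20 daily peeks, and the 2,000 simulated tests are illustrative choices.

```python
import random
from statistics import NormalDist


def false_positive_rates(n_sims=2000, n_peeks=20, alpha=0.05, seed=0):
    """Compare 'stop at the first significant peek' with 'analyze once at the end'."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        running_sum = 0.0
        crossed_at_any_peek = False
        z = 0.0
        for k in range(1, n_peeks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one more day of null data
            z = running_sum / k ** 0.5          # z-statistic after k days
            if abs(z) > z_crit:
                crossed_at_any_peek = True
        peeking_hits += crossed_at_any_peek
        fixed_hits += abs(z) > z_crit           # only the final look counts
    return peeking_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = false_positive_rates()
print(f"false positives with peeking:       {peeking_rate:.1%}")
print(f"false positives with fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate should land near the nominal 5%, while the peeking rate comes out several times higher, even though both analyses see exactly the same data.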
The third is ignoring guardrails. Your primary metric went up, conversion is statistically significant, congratulations — and meanwhile, refunds rose 15% and 30-day retention is down 2 percentage points. Write guardrails into the experiment spec before launch and treat any significant degradation as a blocker, no matter how good the primary looks.
The fourth is multiple comparisons without correction. You look at the primary metric, then 12 segment slices, then 8 secondary metrics. That is 21 comparisons, and at a 0.05 threshold you should expect about one of them to come up "significant" by chance alone even when nothing changed. Pre-register your analyses, apply a correction like Bonferroni or Benjamini-Hochberg, or treat anything outside the pre-registered primary as exploratory only.
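Both corrections named above fit in a few lines. This is a minimal sketch with hypothetical function names and made-up p-values; it returns, for each comparison, whether it still counts as significant after correction.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only p-values at or below alpha divided by the number of comparisons."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Control the false discovery rate: find the largest rank k with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_passing_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_passing_rank = rank
    rejected = [False] * m
    # Everything ranked at or below the largest passing rank is rejected.
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_passing_rank:
            rejected[idx] = True
    return rejected


# Illustrative p-values for five comparisons (not real data).
p_values = [0.001, 0.012, 0.025, 0.038, 0.20]
print("Bonferroni:        ", bonferroni(p_values))
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
```

Bonferroni is the stricter of the two: it guards against any false positive at all, while Benjamini-Hochberg tolerates a controlled share of false discoveries and therefore keeps more findings.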
The fifth is shipping on a tiny absolute lift. A statistically significant +0.05 pp lift on a 5% baseline is technically real, but it is only a 1% relative improvement, which is unlikely to survive the noise of production. Act on lifts that matter, not lifts that pass an arbitrary threshold.
When an A/B test is the wrong tool
A/B testing is the default at companies that ship to large user bases, but it is not always the right choice. There are at least four situations where reaching for a test is the wrong instinct.
If the change is a non-negotiable fix — a security patch, a legal requirement, a bug that breaks the product — you do not test it, you ship it. You are not asking whether users like the GDPR cookie banner. You are required to have it.
If your traffic is too small to reach statistical power within a reasonable window, an A/B test is mostly an excuse to delay. With 500 sign-ups per week, a conversion test sized for a 5% baseline and a 10% relative MDE (roughly 30,000 users per arm) would need to run for more than two years. In that case, ship and instrument carefully, or use pre/post analysis with a control market.
If you are running dozens of small variants and want the platform to learn for you, multi-armed bandits beat A/B tests on efficiency. Bandits allocate more traffic to winning arms as evidence accumulates, which is great for short-lived content (homepage hero copy, email subject lines) and bad for long-term product decisions.
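To make the "more traffic to winning arms" behavior concrete, here is a small simulation of one common bandit strategy, Thompson sampling with Beta priors. The text does not name a specific algorithm, so this choice, the function name, and the conversion rates are all assumptions for illustration.

```python
import random


def thompson_sampling(true_rates, n_rounds=5000, seed=0):
    """Simulate a Bernoulli bandit and return how many users each arm received."""
    rng = random.Random(seed)
    n_arms = len(true_rates)
    successes = [0] * n_arms
    failures = [0] * n_arms
    pulls = [0] * n_arms
    for _ in range(n_rounds):
        # Draw a plausible conversion rate for each arm from its Beta posterior.
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(n_arms)]
        arm = samples.index(max(samples))  # send this user to the best-looking arm
        pulls[arm] += 1
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return pulls


# Three hypothetical subject lines with true conversion rates of 4%, 5%, and 12%.
print(thompson_sampling([0.04, 0.05, 0.12]))
```

The unequal allocation is the efficiency gain, and it is also why bandits suit short-lived content better than long-term decisions: the losing arms end up with too little data for a clean estimate of how much worse they were.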
If the change is inherently un-A/B-testable — a brand redesign, a market-launch in a new geo, a one-shot pricing change — you have to use other tools: switchback experiments, geo splits, synthetic controls, or just careful pre/post measurement.
FAQ
How long should an A/B test run?
Most product A/B tests run one to three weeks. Less than a week and you have not seen a full weekly cycle, which means you can be fooled by weekend-vs-weekday effects, payday spikes, or other periodic patterns. More than three weeks and you start running into external noise — competitor launches, marketing campaigns, seasonality — that contaminates the comparison. The right answer is "long enough to hit your pre-computed sample size and at least one full weekly cycle."
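The "pre-computed sample size and at least one full weekly cycle" rule can be turned into a small planning helper. A minimal sketch: the function name and the traffic figure of 5,000 eligible users per day are hypothetical.

```python
from math import ceil


def run_length_days(users_per_arm: int, daily_users: int, n_arms: int = 2) -> int:
    """Days needed to reach the sample size, rounded up to whole weeks."""
    days_needed = ceil(users_per_arm * n_arms / daily_users)
    # Rounding up to a multiple of 7 covers every weekday equally often.
    return ceil(days_needed / 7) * 7


# About 31,000 users per arm at 5,000 eligible users per day.
print(run_length_days(31_000, 5_000), "days")
```

If the result comes out far beyond three weeks, that is a signal to revisit the MDE or the metric rather than to quietly shorten the test.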
How many users do I actually need?
It depends on three things: the baseline rate of the metric, the smallest effect you would care about (the MDE), and your significance and power thresholds. For a 5% baseline conversion and a 10% relative MDE at the standard 0.05 / 0.80 settings, the answer is roughly 30,000 users per arm. For a much rarer event — say a 1% baseline — the same MDE pushes you to over 150,000 per arm. The cheapest way to need fewer users is to pick a more sensitive metric upstream of the rare event.
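The calculation behind these numbers is short enough to write down. This sketch assumes the standard normal-approximation formula for comparing two proportions with a two-sided test; dedicated calculators may use slightly different variants and give slightly different answers. The function name is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline: float, relative_mde: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Users needed in each arm to detect the given relative lift."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


print(sample_size_per_arm(0.05, 0.10))  # 5% baseline, 10% relative MDE
print(sample_size_per_arm(0.01, 0.10))  # 1% baseline, same relative MDE
```

The required sample scales with the inverse square of the absolute difference, which is why halving the MDE roughly quadruples the traffic you need.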
Can I test more than two variants at once?
You can run A/B/C/D tests, and large companies do all the time. There are two costs to be aware of. First, you need more traffic: the total sample required grows roughly linearly with the number of arms. Second, you have to correct for multiple comparisons when interpreting results. If your traffic is constrained, run two-arm A/B tests one after another instead of one big multi-arm test. If you have plenty of traffic and want to compare many small variants quickly, multi-armed bandits will outperform a fixed-allocation A/B/C/D.
Why did my "significant" winner not move the metric in production?
Usually one of three things: regression to the mean (you picked a variant because its observed lift was high, which on average overstates the true effect), novelty effects (users react to anything new and the lift decays as novelty fades), or interaction effects with other launches shipping in parallel. Discount your observed lift by 20-30% for planning, and run a holdout if the launch is large enough.
Is A/B testing always the right call?
No. It is right when you have enough traffic, the metric can be measured in a reasonable window, and the cost of a wrong decision is larger than two weeks of slower shipping. If any condition fails — small startup, slow-loop metric like annual retention, or a tiny reversible change — the test is overhead. Ask "what would I do with each possible outcome?" If the answer is the same, do not run the test.
What is the difference between statistical and business significance?
Statistical significance says the lift is unlikely to be zero. Business significance says the lift is large enough to matter. These are separate questions. With enough sample size, almost any difference becomes statistically significant, and that does not mean it is worth shipping. Focus on confidence intervals and effect sizes, not just whether the p-value crossed an arbitrary threshold.