Why run an A/A test in A/B testing
What an A/A test actually is
Imagine a Tuesday at Stripe. The growth team has spent three weeks building a redesigned checkout, the PM is pacing in the doorway, and the experimentation lead pushes back: "We have not shipped an A/A test since the identity service rerouting. Hold the experiment for a week." By Friday they find a bug where 0.7% of mobile Safari sessions never write to the experiment log, which would have invisibly skewed every conversion comparison for the next quarter. The A/A test cost them five business days and saved them from shipping the wrong checkout to fifty million people.
An A/A test is an experiment where both buckets get the exact same product. No flag flip, no copy change, no model swap. Control versus control. On paper it sounds pointless because there should be no effect to measure. In practice it is one of the most powerful diagnostic tools a data team owns. If you split traffic into two identical groups and the system reports a statistically significant difference, something in the pipeline is broken — randomization, metric computation, preprocessing, or the test itself. Finding any of those before you run a real A/B test is the difference between trusting your readouts and accidentally promoting random noise to a roadmap decision.
Why senior teams run A/A tests
The A/A test does four jobs, each catching a different class of bug.
It validates the randomization layer. Your bucketing service is supposed to split users 50/50 with no correlation to any attribute that matters. In production that breaks more often than people admit. A hash function can skew when the salt collides with internal user IDs. A new mobile SDK can short-circuit assignment on older OS versions. A CDN can cache the experiment payload for one geography. If a 50/50 split ends up at 52/48 on more than a few thousand users, you have Sample Ratio Mismatch, and SRM in an A/A test is almost always a real bug rather than a fluke.
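One common way to check for SRM is a chi-square goodness-of-fit test on the bucket counts. The sketch below assumes SciPy is available; the counts and the 0.001 alert threshold are illustrative choices, not values from any particular platform.

```python
from scipy import stats

def srm_p_value(n_a: int, n_b: int, expected_share_a: float = 0.5) -> float:
    """Chi-square goodness-of-fit p-value for the observed bucket sizes."""
    total = n_a + n_b
    expected = [total * expected_share_a, total * (1 - expected_share_a)]
    _, p = stats.chisquare(f_obs=[n_a, n_b], f_exp=expected)
    return float(p)

# Hypothetical counts: a 52/48 split on 100,000 users
p_skewed = srm_p_value(52_000, 48_000)
# Hypothetical counts: a split that is off by only 100 users
p_healthy = srm_p_value(50_050, 49_950)

SRM_THRESHOLD = 0.001  # assumed alert threshold, stricter than 0.05 to limit false alarms
print(f"52/48 split: p = {p_skewed:.3g}, flagged: {p_skewed < SRM_THRESHOLD}")
print(f"near 50/50:  p = {p_healthy:.3g}, flagged: {p_healthy < SRM_THRESHOLD}")
```

The same 52/48 ratio on a few hundred users would not be flagged, which is why the check works on counts rather than on percentages.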
It checks metric stability. Some metrics are so noisy they generate "significant" results on pure noise — revenue per user with a long tail, time-on-page with idle-tab outliers, minutes-watched on a streaming product. An A/A test exposes that floor. If your p-values cluster below 0.05 without any treatment, the metric needs winsorizing, log transformation, a percentile cut, or variance reduction with CUPED.
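As an illustration of the variance-reduction option, here is a minimal CUPED sketch on synthetic data. The variable names (`pre_spend`, `spend`) and the data-generating process are assumptions made for the example; in practice the covariate would be the same metric measured before the experiment started.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
n = 20_000
# Synthetic data (assumption): pre-period spend partially predicts in-experiment spend
pre_spend = rng.exponential(scale=100, size=n)
spend = 0.8 * pre_spend + rng.exponential(scale=50, size=n)

# theta is the regression slope of the metric on the pre-experiment covariate
theta = np.cov(pre_spend, spend)[0, 1] / np.var(pre_spend, ddof=1)
# Subtract the part of the metric explained by the centered covariate
spend_cuped = spend - theta * (pre_spend - pre_spend.mean())

print(f"raw variance:   {spend.var(ddof=1):.0f}")
print(f"CUPED variance: {spend_cuped.var(ddof=1):.0f}")
print(f"mean unchanged: {np.isclose(spend.mean(), spend_cuped.mean())}")
```

Because the covariate is centered, the adjustment leaves the mean alone and only shrinks the variance, so a difference between buckets is estimated on the same scale as before.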
It calibrates the actual false-positive rate. The textbook claim is that at alpha equal to 0.05 you get a false positive 5% of the time. That assumes a correctly specified test. Run a hundred A/A tests and count how many cross 0.05. Get fifteen and the platform is broken in a way that will poison every decision. Get only two and you cannot say much yet, because two in a hundred is still within sampling error of five; a rate that stays that low as you add runs means your standard errors are inflated, killing power.
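How far the count can drift from 5% by chance alone depends on how many A/A runs you have. A rough way to size that band is a binomial quantile range, sketched here under the assumption that runs are independent; `false_positive_band` is a hypothetical helper name.

```python
from scipy import stats

def false_positive_band(n_runs: int, alpha: float = 0.05, coverage: float = 0.95):
    """Range of significant-result counts a calibrated test would plausibly produce."""
    tail = (1 - coverage) / 2
    low = int(stats.binom.ppf(tail, n_runs, alpha))
    high = int(stats.binom.ppf(1 - tail, n_runs, alpha))
    return low, high

for n_runs in (100, 1_000, 10_000):
    low, high = false_positive_band(n_runs)
    print(f"{n_runs:>6} runs: expect between {low} and {high} significant results")
```

The band narrows in relative terms as the number of runs grows, which is the practical argument for offline simulation: thousands of runs are cheap there and make small miscalibrations visible.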
It audits the preprocessing pipeline. Bot filtering, outlier trimming, null handling, deduplication, late-arriving events — each step can introduce asymmetry between buckets. A late-event handler that credits refunds to whichever bucket processes them, rather than whichever bucket triggered them, is invisible in a healthy-looking dashboard until an A/A exposes it.
How to run an A/A test
There are two flavors, and good teams use both.
The online A/A test is the high-fidelity version. You configure GrowthBook, Optimizely, Statsig, or your in-house tool to assign users to two buckets, both shipping the production variant. Users flow through assignment in real time, events log through the same instrumentation, the metric pipeline computes results end-to-end. This tests the whole stack. The trade-off is wall-clock time: typically one to two weeks to cover a full weekly seasonality cycle.
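For readers who have not seen the assignment step, the sketch below shows one common pattern: deterministic hash-based bucketing. It is a generic illustration, not the implementation used by any of the tools named above; the function name, salt value, and bucket labels are all hypothetical.

```python
import hashlib
from collections import Counter

def assign_bucket(user_id: str, salt: str = "aa-test-salt") -> str:
    """Deterministically map a user to one of two identical buckets."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer and split on parity
    value = int.from_bytes(digest[:8], "big")
    return "A1" if value % 2 == 0 else "A2"

# Simulate assignment for 100,000 synthetic user IDs and inspect the split
counts = Counter(assign_bucket(f"user-{i}") for i in range(100_000))
share_a1 = counts["A1"] / sum(counts.values())
print(dict(counts), f"share in A1: {share_a1:.3f}")
```

Because the bucket depends only on the salt and the user ID, a returning user always lands in the same bucket, and changing the salt reshuffles everyone for the next experiment.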
The offline A/A test is the fast version. You pull historical events, randomly partition users into two synthetic buckets, compute the metric on each side, and run your statistical test. Repeat thousands of times and look at the distribution of p-values. If everything is calibrated, p-values are approximately uniform between 0 and 1 and about 5% fall under 0.05. This catches problems in the statistical test and metric definition, but not in the assignment service, because you never go through it.
Here is a minimal Python simulation on a revenue metric:
```python
import numpy as np
from scipy import stats

# Per-user revenue, heavy-tailed like real e-commerce
rng = np.random.default_rng(seed=42)
data = rng.exponential(scale=500, size=10_000)

n_simulations = 1_000
p_values = []
for _ in range(n_simulations):
    # Reshuffle in place, then split into two synthetic buckets of equal size
    rng.shuffle(data)
    group_a = data[:5_000]
    group_b = data[5_000:]
    _, p = stats.ttest_ind(group_a, group_b)
    p_values.append(p)

false_positive_rate = float(np.mean(np.array(p_values) < 0.05))
print(f"False positive rate: {false_positive_rate:.1%}")
# Expect ~5% if t-test is calibrated for this metric
```
If the printed rate is meaningfully different from 5%, you have a problem. With 1,000 simulations, anything from roughly 3.6% to 6.4% is within sampling error. Heavy-tailed revenue distributions can pull the t-test away from its nominal rate because the central limit theorem has not fully kicked in at realistic sample sizes. Switching to Mann-Whitney, a bootstrap, or a log transformation usually pulls the false-positive rate back to alpha. For many metrics tested at once, you also need a Bonferroni or Holm correction to keep the family-wise error rate honest.
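To make the multiple-metric correction concrete, here is a small sketch of Holm's step-down procedure next to plain Bonferroni. The p-values are hypothetical and `holm_reject` is an illustrative helper, not a library function.

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Return a boolean array: True where Holm's step-down procedure rejects."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value (k from 0) against alpha / (m - k)
        if p[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one hypothesis survives, all larger p-values survive too
    return reject

# Hypothetical p-values for four metrics read from the same experiment
metric_p_values = [0.030, 0.004, 0.600, 0.015]
bonferroni = np.asarray(metric_p_values) <= 0.05 / len(metric_p_values)
holm = holm_reject(metric_p_values)
print("Bonferroni rejects:", bonferroni.tolist())
print("Holm rejects:      ", holm.tolist())
```

Both procedures control the family-wise error rate; Holm never rejects fewer hypotheses than Bonferroni, so it is usually the better default when you read several metrics from one test.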
Reading the results
A healthy A/A test has three properties. Bucket sizes are within tolerance of the planned split, which you confirm with a chi-square SRM check. P-values across many runs are roughly uniform between 0 and 1, which you confirm with a Kolmogorov-Smirnov test or a visual histogram. And the realized false-positive rate sits within sampling error of alpha.
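The uniformity check can be sketched with SciPy's one-sample Kolmogorov-Smirnov test. The metric here is synthetic normal data, and the squared p-values are a deliberately distorted stand-in for a test that over-rejects; both are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)
# Synthetic per-user metric (assumption); both groups come from the same distribution
p_values = np.empty(2_000)
for i in range(p_values.size):
    group_a = rng.normal(loc=10, scale=3, size=500)
    group_b = rng.normal(loc=10, scale=3, size=500)
    p_values[i] = stats.ttest_ind(group_a, group_b).pvalue

# Compare the simulated p-values against the Uniform(0, 1) distribution
_, ks_p = stats.kstest(p_values, "uniform")

# Squaring pushes p-values toward zero, mimicking an over-rejecting test
distorted = p_values ** 2
_, ks_p_distorted = stats.kstest(distorted, "uniform")

print(f"calibrated test: KS p = {ks_p:.3f}, share below 0.05 = {np.mean(p_values < 0.05):.1%}")
print(f"distorted test:  KS p = {ks_p_distorted:.3g}, share below 0.05 = {np.mean(distorted < 0.05):.1%}")
```

A histogram of `p_values` tells the same story visually: flat when the test is calibrated, piled up near zero when it over-rejects.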
When something is off, the failure mode usually points to the cause. SRM means randomization is broken; audit the hash function and the assignment service first. A false-positive rate that stays above seven or eight percent as you scale from a hundred simulations to a thousand or more almost always means the statistical test is mis-specified: violated normality on a heavy-tailed metric, peeking, or a wrong standard error formula. At only a hundred runs, seven or eight significant results are still within sampling error of five. A rate that is persistently too low means standard errors are over-inflated, which kills power. A systematic mean shift between buckets, even when p-values look fine, points to a metric definition bug, usually in attribution windows, currency conversion, or session stitching.
How often to run them
Run an A/A test on day one of any new experimentation platform. Without it you genuinely do not know whether randomization works. Run another after any change to the platform — new hash function, new assignment service, new metric pipeline, vendor migration. Run a periodic A/A for key metrics, roughly quarterly, because data drifts, code drifts, and vendor SDKs push silent updates. Some teams at Netflix and Airbnb keep a perpetual A/A running in the background for their north-star metrics, flagging any week where the realized false-positive rate exceeds threshold.
Common pitfalls
The first pitfall is treating SRM as the only diagnostic you need. SRM only checks whether bucket sizes match the planned split. It does not check whether assignment is correlated with user attributes, whether the metric pipeline is symmetric, or whether the statistical test is calibrated. Plenty of post-mortems read "SRM passed, so we shipped the result," when the real failure was a metric definition that silently truncated revenue at a different cutoff per bucket.
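One way to see the gap SRM leaves is a covariate balance check: a chi-square test of independence between bucket and a user attribute. The counts below are hypothetical, with columns standing for desktop, Android, and iOS users. Both tables have exactly 50,000 users per bucket, so both would pass an SRM check.

```python
import numpy as np
from scipy import stats

# Rows: bucket A1, bucket A2. Columns: desktop, Android, iOS (hypothetical counts)
balanced = np.array([[30_050, 15_000, 4_950],
                     [29_950, 15_000, 5_050]])
skewed = np.array([[30_000, 15_000, 5_000],
                   [32_000, 15_000, 3_000]])

def balance_p_value(table: np.ndarray) -> float:
    """P-value for independence between bucket and the attribute."""
    result = stats.chi2_contingency(table)
    return float(result[1])  # index 1 holds the p-value

for name, table in (("balanced", balanced), ("skewed", skewed)):
    sizes = table.sum(axis=1).tolist()
    print(f"{name}: bucket sizes {sizes}, balance p = {balance_p_value(table):.3g}")
```

In the skewed table the second bucket is missing iOS users and has extra desktop users, which bucket totals alone cannot reveal.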
The second pitfall is running a single A/A test and treating one significant result as a verdict. With alpha at 0.05, you will get a "significant" A/A about one time in twenty by pure chance — that is the entire point of the false-positive rate. A single bad result tells you nothing. To know whether your platform is broken, run many A/As and look at the distribution of outcomes, not the verdict on any one.
The third pitfall is using an underpowered A/A. A calibrated test flags about 5% of A/As at any sample size, but with only a thousand users in your simulation a small asymmetry in the pipeline will almost never push a metric to significance, and you will conclude the platform is healthy when the real risk is unobserved. The sample size of your A/A should match the sample size of the real A/B you plan to run, because the failure modes you care about scale with traffic.
The fourth pitfall is checking only the mean. A working A/A in aggregate can hide a broken variance estimate, a broken tail, or a broken segment. Slice by platform, country, plan tier, and signup cohort. If the metric looks healthy overall but the false-positive rate is 12% for iOS users, you have a real bug — and you will eventually ship a feature that interacts with iOS to produce a spurious win.
The fifth pitfall is fixing the A/A without re-running it. After you patch the assignment service or the metric pipeline, run a fresh A/A on the new code path. The temptation to declare victory after a hotfix is real, but every patch can introduce a new bug, and the only way to confirm health is to put the platform back through the same diagnostic.
Interview questions
Why does anyone run an A/A test? To validate that the experimentation platform is correctly randomizing users, computing metrics symmetrically, and applying a statistical test that is calibrated for the data. If an A/A test shows a significant difference, the platform cannot be trusted to evaluate a real A/B, and any decision made from one is suspect.
What do you do when the A/A test comes back significant? Do not panic from one significant result. Check SRM to confirm buckets are balanced. Re-examine the metric definition for asymmetries. Re-run the A/A several more times. One significant result in twenty is expected by design. Persistent significance across many runs points to a real bug in randomization, metric computation, or statistical inference.
How do you tell a bug from random chance in an A/A result? Run the A/A as a simulation across many iterations. If a hundred A/A simulations give a realized false-positive rate near alpha, the platform is healthy. If the rate is materially higher than alpha, the issue is systematic and you go fix the test or the pipeline. The single-run verdict is uninformative; the distribution across many runs is what matters.
Can SRM checks replace A/A testing? No. SRM checks only the size of each bucket. An A/A test checks the entire end-to-end pipeline, including how the metric is computed and whether the statistical test is calibrated. A platform can pass SRM and still produce a 20% false-positive rate because of a broken variance estimate or an attribution bug.
How does A/A testing relate to statistical power in an A/B? If your A/A test shows an inflated false-positive rate, your A/Bs will overstate winners. If it shows a deflated rate, your A/Bs are too conservative and you will miss real wins. A calibrated A/A is a precondition for any reliable A/B power calculation. You cannot trust an effect size or confidence interval if you have not first confirmed the platform reports alpha correctly on a null effect.
Related reading
- A/B testing peeking — the mistake that fails interviews
- How to calculate CUPED variance reduction in SQL
- How to calculate Bonferroni correction in SQL
FAQ
How long should an online A/A test run?
About the same length as a typical A/B test on the same metric. The point is to exercise the assignment service and the metric pipeline under realistic traffic, so the test needs to cover at least one full weekly seasonality cycle. One to two weeks is the usual range. Offline simulations on historical data run in minutes because no live traffic is involved, but they only catch a subset of bugs.
Do you have to run an A/A test before every A/B test?
No, and doing so would burn enormous platform capacity for marginal gain. The right cadence is one A/A on platform launch, one A/A after any meaningful change to the assignment or metric pipeline, and a periodic A/A every quarter for the core metrics. For every individual A/B, the SRM check is a cheap proxy that catches the most common randomization bugs in real time.
Can an A/A test estimate the baseline variance of my metric?
Yes, and many teams use it as a free side benefit. The A/A gives a clean estimate of how much the metric fluctuates without any treatment, which is exactly the variance you need to compute the minimum detectable effect for a future A/B test. Using the A/A-derived variance instead of a back-of-envelope guess is one of the easiest ways to plan an experiment that is actually powered for the effect size you care about.
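As a sketch of that calculation, the standard normal-approximation formula for a two-sided comparison of means with equal group sizes is MDE = (z_{1-alpha/2} + z_{power}) * sqrt(2 * variance / n). The variance and sample size below are hypothetical placeholders for values you would take from your own A/A.

```python
import math
from scipy import stats

def minimum_detectable_effect(variance: float, n_per_group: int,
                              alpha: float = 0.05, power: float = 0.80) -> float:
    """Absolute MDE for a two-sided, two-sample comparison of means, equal group sizes."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return float((z_alpha + z_power) * math.sqrt(2 * variance / n_per_group))

aa_variance = 250_000.0  # hypothetical per-user revenue variance measured in an A/A
mde = minimum_detectable_effect(aa_variance, n_per_group=50_000)
mde_4x = minimum_detectable_effect(aa_variance, n_per_group=200_000)
print(f"MDE at 50k users per group:  {mde:.2f}")
print(f"MDE at 200k users per group: {mde_4x:.2f}")
```

The MDE shrinks with the square root of the sample size, so quadrupling traffic halves the smallest effect you can reliably detect.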
What if the A/A shows zero significance — does that mean the platform is perfect?
Not by itself. A single A/A with no significant result is consistent with both a healthy platform and an under-powered test that could not have detected a bug if one existed. To confirm health, you want many A/A simulations and a realized false-positive rate near alpha, not zero. A rate of zero across a hundred simulations is itself a red flag because it suggests standard errors are inflated and the test is over-conservative.
Should A/A tests be visible to the rest of the company?
The result, no. The fact that they happen, yes. Senior PMs and engineers should know the experimentation platform is being audited routinely, because that is what builds trust in the readouts they consume. Sharing intermediate A/A p-values invites bikeshedding and creates pressure to "do something" about a result that is, by design, supposed to look like noise. Publish the audit cadence and the realized false-positive rate, not the per-run details.