Null hypothesis explained simply
Why H0 matters in interviews
Picture a Tuesday at Stripe. A PM pings you with a Looker screenshot: "we shipped the new checkout button last Thursday, conversion looks up, can we declare a win?" The candidate who answers "looks good to me" fails the senior bar. The candidate who writes back "what is the null hypothesis and what alpha did the team pre-register?" gets put on the path to staff. That question is the difference between an analyst who runs queries and one who defends decisions.
Every A/B readout at Meta, every fraud flag at DoorDash, and every uplift claim at Netflix sits on top of one explicit null hypothesis. Interview loops at all of these companies probe whether you can write H0 out loud, in symbols, before touching a t-statistic. If you cannot, the rest of the conversation about p-values, power, and Type I errors collapses, because all of them are defined relative to H0.
For a middle or senior analyst, fluency with H0 separates "I can read a stats output" from "I can argue with a director about whether a launch was real". The framework is small and the traps are well known, yet almost every junior candidate trips on at least one in a recorded loop.
The courtroom analogy
A courtroom gives the cleanest mental model for H0. The defendant is presumed innocent, which is the null: no wrongdoing, no effect. The prosecution argues the alternative: the defendant did it, there is an effect. The jury never proves innocence. They either find enough evidence to reject the presumption, or they do not, and in the second case the verdict is "not guilty" rather than "innocent".
Statistics works the same way. H0 says the new button color does not change conversion. H1 says it does. We never prove H0. We either gather enough data to reject it at the chosen significance level, or we report "we failed to reject H0" rather than "H0 is true". An interviewer who hears "we accepted the null" knows the candidate has not internalized the asymmetry, because the data cannot distinguish "no effect" from "effect too small to detect with this sample".
Three worked examples
The fastest way to lock the framework in is three small examples that cover the patterns interviewers reuse.
Example one is the canonical A/B test of a button color at Airbnb. H0 says the treatment conversion rate equals control. H1 says they differ. The team runs for two weeks, computes a two-proportion z-test, and gets p = 0.03. Because 0.03 is below the pre-registered alpha of 0.05, the team rejects H0 and concludes the new color shifted conversion. The decision is not symmetric: p = 0.18 would lead to "fail to reject", not "H0 is true".
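A minimal sketch of the two-proportion z-test from example one, written by hand so every step is visible. The conversion counts are hypothetical, and `two_proportion_ztest` is an illustrative helper name, not a library function.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for H0: CR_A = CR_B, using the pooled rate under H0."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    # Two-sided p-value: 2 * P(Z >= |z|), expressed through erfc
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only
z, p = two_proportion_ztest(conv_a=1000, n_a=10000, conv_b=1090, n_b=10000)
alpha = 0.05  # pre-registered before looking at the data
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"z = {z:.3f}, p = {p:.4f}, {decision}")
```

Note that the only two outputs the decision line can produce are "reject H0" and "fail to reject H0"; there is no branch that prints "accept H0".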
Example two is a comparison against a benchmark. A Snowflake-using marketplace suspects its conversion rate is above the industry average of 3 percent. H0 says the true rate equals 3 percent; H1 says it differs, which keeps the test two-sided so that a rate below the benchmark is also detectable. A one-sample t-test on daily conversion returns a p-value. The framing is identical to the A/B case, except the comparison is against a fixed number rather than a sibling group. This variant trips candidates who think hypothesis testing always needs two groups.
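The benchmark comparison can be sketched with `scipy.stats.ttest_1samp`. The daily conversion rates below are simulated, so the numbers are assumptions made for illustration rather than real marketplace data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Hypothetical data: 60 days of daily conversion rates
daily_rate = rng.normal(loc=0.033, scale=0.006, size=60)

benchmark = 0.03  # mu_0 in H0: mean = mu_0
# H1: mean != mu_0 (two-sided, the default for ttest_1samp)
result = stats.ttest_1samp(daily_rate, popmean=benchmark)

alpha = 0.05
decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}, {decision}")
```

There is only one sample here. The second "group" is the fixed number `benchmark`.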
Example three is a latency regression. Users complain that a Notion-style note app feels slower after a release. H0 says the median page load time after the release equals the median before it. H1 says it grew. A one-sided Mann-Whitney U or permutation test returns a p-value. The pattern stays the same: pre-register alpha, state H0 in symbols, run the test, then decide whether to reject.
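One way to sketch the latency check is a one-sided permutation test on the difference in medians. The load times are simulated from a lognormal distribution, which is an assumption, and `permutation_test_median` is a hypothetical helper written for this example.

```python
import numpy as np

def permutation_test_median(before, after, n_permutations=5000, seed=0):
    """One-sided test of H0: median(after) = median(before) vs H1: it grew."""
    rng = np.random.default_rng(seed)
    observed = np.median(after) - np.median(before)
    pooled = np.concatenate([before, after])
    n_before = len(before)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so shuffle them
        shuffled = rng.permutation(pooled)
        diff = np.median(shuffled[n_before:]) - np.median(shuffled[:n_before])
        if diff >= observed:
            at_least_as_extreme += 1
    # The +1 terms count the observed labeling itself, so p is never zero
    return observed, (at_least_as_extreme + 1) / (n_permutations + 1)

rng = np.random.default_rng(1)
before = rng.lognormal(mean=6.0, sigma=0.4, size=300)  # hypothetical load times, ms
after = rng.lognormal(mean=6.1, sigma=0.4, size=300)

observed_diff, p_value = permutation_test_median(before, after)
print(f"median shift = {observed_diff:.1f} ms, p = {p_value:.4f}")
```

The shuffle is the null hypothesis made literal: if the release changed nothing, relabeling which sessions came "before" and "after" should produce shifts as large as the observed one fairly often.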
Why we reject and never accept H0
The hardest sentence to drill into junior analysts is "absence of evidence is not evidence of absence". A non-significant result can mean two very different things. Either there really is no effect, or there is a real effect that the sample was too small to detect. Without a power calculation, you cannot distinguish them.
This is why senior analysts attach a power statement to every "fail to reject" outcome. At a Meta onsite, the strong answer to "the A/B came back insignificant, what do you do" is "compute the minimum detectable effect at the observed sample size and check whether that MDE is above or below the lift the product team would care about; if MDE is above the bar, the test was inconclusive, not negative". That sentence handles the asymmetry and shows you understand H0 is never accepted.
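The MDE check described above can be sketched with the usual normal approximation for two equal-sized arms. The baseline rate, sample size, and lift of interest are hypothetical, and `minimum_detectable_effect` is an illustrative helper, not a library call.

```python
import math
from scipy.stats import norm

def minimum_detectable_effect(baseline_rate, n_per_arm, alpha=0.05, power=0.8):
    """Approximate absolute MDE for a two-sided two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    # Standard error of the difference, using the baseline variance for both arms
    se = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_arm)
    return (z_alpha + z_power) * se

mde = minimum_detectable_effect(baseline_rate=0.10, n_per_arm=8000)
lift_of_interest = 0.005  # hypothetical: smallest absolute lift the team cares about

if mde > lift_of_interest:
    verdict = "inconclusive: the test could not have detected a lift this small"
else:
    verdict = "informative negative: a lift of that size would likely have shown up"
print(f"MDE = {mde:.4f} absolute, {verdict}")
```

With these assumed inputs the MDE is well above the lift of interest, so a non-significant result would say more about the sample size than about the feature.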
Standard formulations of H0
Most analyst interview questions reuse one of five patterns. Memorizing the standard H0 and H1 saves thirty seconds of fumbling on stage.
| Pattern | H0 | H1 |
| --- | --- | --- |
| Two-sided A/B | CR_A = CR_B | CR_A != CR_B |
| One-sided A/B | CR_B <= CR_A | CR_B > CR_A |
| Benchmark comparison | mean = mu_0 | mean != mu_0 |
| Independence test | X and Y independent | X and Y associated |
| Model versus random | AUC = 0.5 | AUC > 0.5 |

H0 always carries the equality and H1 always carries the strict inequality or negation. If your H0 reads "the new design is better", you have inverted the framework, because "better" is the claim you are trying to prove rather than the position you defend. Interviewers count this exact mistake on the rubric at Amazon and Stripe analyst loops.
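The independence pattern is the one that looks least like an A/B test, so here is a minimal sketch using `scipy.stats.chi2_contingency`. The 2x2 counts (device type by converted or not) are hypothetical.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table
# rows: mobile, desktop; columns: converted, did not convert
table = np.array([
    [120, 880],
    [90, 910],
])

# H0: device and conversion are independent
# H1: device and conversion are associated
# Note: for 2x2 tables SciPy applies Yates' continuity correction by default
chi2, p, dof, expected = chi2_contingency(table)

alpha = 0.05
decision = "reject H0" if p < alpha else "fail to reject H0"
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4f}, {decision}")
```

The `expected` array holds the counts implied by H0, which is a useful thing to show an interviewer: the test measures how far the observed table sits from that independence table.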
One-sided vs two-sided
A two-sided test says "any difference counts". A one-sided test says "only differences in one direction count". The choice changes the rejection region and the required sample size, and that is where interviewers probe.
Two-sided is the default for product launches because a regression in the opposite direction is information you need. If you ship a checkout redesign at DoorDash with a one-sided H1 of "conversion goes up", you forfeit the right to detect a tank in conversion with the same test. One-sided is defensible only when the opposite direction is physically impossible or operationally uninteresting. The safe interview answer is "default to two-sided unless the team pre-registered a one-sided hypothesis with a written justification".
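A small sketch of how the same z-statistic yields different p-values under the two framings. The z values are hypothetical inputs chosen for illustration.

```python
from scipy.stats import norm

z = 1.8  # hypothetical z-statistic, treatment minus control

# Two-sided H1: CR_B != CR_A, both tails count
p_two_sided = 2 * norm.sf(abs(z))
# One-sided H1: CR_B > CR_A, only the upper tail counts
p_one_sided = norm.sf(z)
print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")

# A large drop in conversion under the one-sided "goes up" framing
z_drop = -3.0
p_one_sided_drop = norm.sf(z_drop)
print(f"one-sided p for z = {z_drop}: {p_one_sided_drop:.4f}")
```

At z = 1.8 the one-sided test rejects at alpha 0.05 while the two-sided test does not. At z = -3.0 the one-sided p-value is close to 1, so the test stays silent about a regression that a two-sided test would flag.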
Type I and Type II errors
The Type I and Type II grid is the second-most common follow-up after "what is H0".
| | H0 is actually true | H0 is actually false |
| --- | --- | --- |
| Rejected H0 | Type I (alpha, FP) | Correct decision |
| Did not reject H0 | Correct decision | Type II (beta, FN) |

Alpha is the probability of rejecting a true H0, conventionally 0.05. Beta is the probability of failing to reject a false H0, conventionally 0.2, which gives statistical power of 1 - beta = 0.8. The choice of both is a business trade-off. Lowering alpha to 0.01 protects against false launches at the cost of more sample. Lowering beta to 0.1 raises power to 0.9 at the same cost. Senior candidates discuss these trade-offs in dollars and weeks of data collection, not in abstractions.
Alpha is only meaningful relative to a stated H0. "Significant at alpha 0.05" has no content until the reader knows which null was tested. That is why staff analyst rubrics at Linear and Vercel start with "did the candidate state H0 explicitly before running the test".
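To see alpha as a long-run error rate rather than an abstraction, here is a sketch that simulates many A/A experiments in which H0 is true by construction. The experiment count, arm size, and normal metric are assumptions made for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_arm, alpha = 2000, 500, 0.05

# Both arms come from the same distribution, so H0 is true in every experiment
control = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_arm))
treatment = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per_arm))

# One Welch t-test per row (per simulated experiment)
_, p_values = stats.ttest_ind(control, treatment, axis=1, equal_var=False)

# Every rejection here is a Type I error
false_positive_rate = np.mean(p_values < alpha)
print(f"share of A/A tests that rejected H0: {false_positive_rate:.3f}")
```

The share of rejections should land near alpha. That is the top-left cell of the grid measured empirically.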
p-value and H0 in Python
The p-value is "the probability of seeing data at least as extreme as observed, assuming H0 is true". The conditional clause "assuming H0 is true" is the part candidates skip and the part interviewers grade. A p-value is not the probability that H0 is true and it is not the probability that the result is due to chance. It is a conditional tail probability defined under the null distribution, full stop.
Here is the smallest end-to-end demonstration for a two-sample test.
```python
import numpy as np
from scipy import stats

# Simulate 0/1 conversion outcomes with a fixed seed for reproducibility
rng = np.random.default_rng(42)
control = rng.binomial(1, 0.10, size=8000)     # true rate 10.0%
treatment = rng.binomial(1, 0.105, size=8000)  # true rate 10.5%

# H0: CR_control == CR_treatment
# H1: CR_control != CR_treatment
# Welch's t-test on 0/1 outcomes; at this sample size it closely
# approximates a two-proportion z-test
stat, p = stats.ttest_ind(control, treatment, equal_var=False)
print(f"t = {stat:.3f}, p = {p:.4f}")

alpha = 0.05  # chosen before looking at the data
print("reject H0" if p < alpha else "fail to reject H0")
```

Running this block and reading out the decision is the whiteboard motion that senior analyst loops at OpenAI and Anthropic look for. The seed is fixed so the result is reproducible. H0 and H1 sit as comments next to the call, which is the discipline that separates analysts who defend a launch from analysts who paste code.
Common pitfalls
The first pitfall is the "we accepted H0" wording. Replace it forever with "we failed to reject H0". The replacement is not stylistic; it marks that you understand the asymmetric logic of falsification. A hiring manager who hears "we accepted the null" adjusts the candidate down a level on the rubric, because the wording leaks a misunderstanding of how the test works.
The second pitfall is putting direction inside H0. Junior candidates write H0: new design is better, which is not falsifiable because "better" is the claim you want to support, not the position you defend. The correct one-sided framing places the equality and bounding inequality inside H0 and the strict inequality inside H1, for example H0: CR_new <= CR_old and H1: CR_new > CR_old. The grader watches for this inversion on the rubric.
The third pitfall is treating p < 0.05 as automatic permission to ship. Statistical significance is necessary but not sufficient. Three things must accompany the rejection of H0 before launch: a guardrail check on counter-metrics, a multiple-testing correction if more than one hypothesis was tested in the same family, and a practical-significance check that the effect size is worth the engineering cost.
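The multiple-testing correction mentioned above can be sketched with two standard procedures, Bonferroni and Holm, implemented by hand. The p-values are hypothetical, and the function names are illustrative rather than library APIs.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject each hypothesis only if p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Holm step-down: test sorted p-values against alpha / (m - rank)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

# Hypothetical p-values from four metrics tested in the same family
p_values = [0.004, 0.015, 0.030, 0.200]
print("uncorrected:", [p < 0.05 for p in p_values])
print("bonferroni: ", bonferroni(p_values))
print("holm:       ", holm(p_values))
```

Three of the four metrics look significant uncorrected, one survives Bonferroni, and two survive Holm. Which procedure to use belongs in the pre-registration, not in the readout.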
The fourth pitfall is forgetting to pre-register alpha and the test family. Picking alpha after seeing the data, or running twenty post-hoc tests and reporting only the smallest p-value, are p-hacking and HARKing-style behaviors that invalidate the procedure. Pre-registration in a shared doc before the experiment starts is the discipline that survives an audit. At Meta and Airbnb, experimentation platforms enforce it by tooling.
The fifth pitfall is mixing up P(data | H0) and P(H0 | data). The p-value is the first; the Bayesian posterior over H0 is the second. They are not equal and not proportional in general, and confusing them is the same category error as confusing sensitivity with positive predictive value. When the interviewer asks "what is the probability H0 is true given p equals 0.03", the correct response is "p-values do not answer that question; to get P(H0 | data) you need a prior and Bayes' theorem".
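A sketch of why the two probabilities differ, using the same arithmetic as the sensitivity versus positive predictive value analogy. It computes P(H0 | test rejected), not P(H0 | p = 0.03) exactly, and the prior share of true nulls is an assumption you would have to defend; `prob_null_given_rejection` is a hypothetical helper.

```python
def prob_null_given_rejection(prior_null, alpha=0.05, power=0.8):
    """Bayes' theorem: P(H0 true | test rejected H0)."""
    false_positives = alpha * prior_null        # H0 true, rejected anyway
    true_positives = power * (1 - prior_null)   # H0 false, correctly rejected
    return false_positives / (false_positives + true_positives)

# Assumed priors: share of tested ideas that truly have no effect
for prior_null in (0.5, 0.9):
    posterior = prob_null_given_rejection(prior_null)
    print(f"prior P(H0) = {prior_null:.1f} -> P(H0 | rejected) = {posterior:.3f}")
```

With alpha fixed at 0.05, the chance that a rejected null was actually true moves from about 6 percent to 36 percent as the assumed prior changes, which is why a p-value alone cannot answer the interviewer's question.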
Related reading
- Bayes theorem explained simply
- A/B testing peeking mistake
- Confidence intervals data science interview
- How to calculate chi-square test in SQL
- How to design an A/B test step by step
- Why run an A/A test in A/B testing
FAQ
Can you prove H0?
No. The machinery can only reject H0 or fail to reject. Failing to reject is not proof of the null; it is a statement that the data did not give enough signal to dislodge the default. The strongest interview answer pairs every "fail to reject" outcome with a power calculation, because power tells you whether the negative result is informative or simply underpowered. Without that pairing, "no effect" and "we did not look hard enough" are indistinguishable.
What does H0 mean in plain English?
H0 is the default position the test starts from, almost always phrased as "no effect" or "the groups are the same". The experiment's job is to gather enough evidence to make that default untenable at a pre-registered significance level. If you cannot, the default stands, not because it is correct but because it has not been overturned.
Why is H0 always "no effect" rather than "effect exists"?
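Because a precise "no effect" claim is one you can actually test against. It pins the parameter to a single value, so the sampling distribution under H0 is fully specified and a p-value can be computed from it. "Effect exists" covers infinitely many effect sizes and cannot be evaluated without an effect size and a power statement. Karl Popper's falsifiability principle is the scaffolding behind the convention.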
How do you choose alpha?
The default is 0.05, inherited from Fisher's 1925 writing, but align alpha with the cost of a false positive. High-stakes medical or financial decisions often use 0.01 to lower Type I risk. Fast-moving product experiments sometimes use 0.10 when the cost of a false launch is small and reversible. The interview-grade answer names the trade-off and connects it to the business context.
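The trade-off can be put in numbers with the standard normal-approximation sample size formula for a two-sided two-proportion test. The baseline rate and target lift are hypothetical, and `sample_size_per_arm` is an illustrative helper.

```python
import math
from scipy.stats import norm

def sample_size_per_arm(baseline_rate, absolute_lift, alpha=0.05, power=0.8):
    """Approximate users per arm needed to detect absolute_lift."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_power = norm.ppf(power)
    variance = 2 * baseline_rate * (1 - baseline_rate)
    return math.ceil((z_alpha + z_power) ** 2 * variance / absolute_lift ** 2)

# Hypothetical scenario: 10% baseline, want to detect a 1 point absolute lift
sizes = {}
for alpha in (0.10, 0.05, 0.01):
    sizes[alpha] = sample_size_per_arm(0.10, 0.01, alpha=alpha)
    print(f"alpha = {alpha:.2f} -> {sizes[alpha]} users per arm")
```

Dividing each sample size by daily traffic turns the alpha choice into days of data collection, which is the unit a product partner negotiates in.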
What is the difference between statistical and practical significance?
Statistical significance asks whether the observed effect is unlikely under H0. Practical significance asks whether the effect is large enough to act on. A test on ten million users can return p = 0.001 for a 0.05 percent lift in click-through rate, which is statistically significant but practically meaningless. Senior analysts report both numbers and let product partners decide whether the lift clears the practical bar.
How does H0 connect to Bayesian A/B testing?
Frequentist tests reject or fail to reject a fixed H0 using a p-value defined under the null. Bayesian tests skip the null entirely and report a posterior over the effect, from which you read P(B > A) directly. The Bayesian version is often described as less sensitive to the peeking problem, although that depends on the stopping rule and the prior, and it communicates better to non-technical stakeholders, at the cost of defending a prior. Mentioning both in an A/B interview signals that you understand the trade-offs rather than parroting one school.
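A minimal sketch of reading P(B > A) off a posterior, assuming a Beta-Binomial model with a uniform Beta(1, 1) prior on each arm. Both the prior and the conversion counts are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical observed data
conv_a, n_a = 1000, 10000
conv_b, n_b = 1090, 10000

# Beta(1, 1) prior + binomial likelihood -> Beta posterior for each rate
n_draws = 100_000
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=n_draws)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=n_draws)

# Monte Carlo estimate of P(CR_B > CR_A | data)
prob_b_beats_a = np.mean(samples_b > samples_a)
print(f"P(B > A | data) = {prob_b_beats_a:.3f}")
```

The output is a direct probability statement about the effect, which is what stakeholders usually think a p-value is. The price is the prior: a different choice than Beta(1, 1) would shift the answer, most visibly at small sample sizes.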