T-test vs chi-square
The 30-second answer
A t-test compares means on continuous data — average order value, session duration, revenue per user, page load time. Chi-square compares proportions or frequencies on categorical data — converted vs not converted, free vs paid plan, clicked vs ignored. The choice is dictated by the measurement scale of your metric, not by sample size or personal preference.
If the interviewer hands you a number that lives on a continuous axis and you care about the average, reach for Welch's t-test. If the interviewer hands you a count of "how many out of N did X", reach for chi-square or its sibling, the two-proportion z-test. Mixing them up is one of the fastest ways to lose a data science loop — it signals you don't understand the statistic you're actually testing.
When to use a t-test
Use a t-test when your metric is a number you can take a mean of and the question is "are the means different between groups". Average order value, time-on-site, retention days, mean latency, revenue per active user: all are continuous and all are fair game for a t-test. (A t-test says nothing about medians or percentiles, so median or p95 latency needs a different tool.)
There are three flavors you should know cold. One-sample asks whether a single group's mean differs from a fixed reference (e.g., "is our average checkout time slower than the 4-second SLA?"). Two-sample independent asks whether two groups drawn from different populations have different means (e.g., control vs treatment in an A/B test). Paired asks whether the same units changed (e.g., revenue per user before and after a UI change, with each user contributing two observations).
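The three flavors map onto three SciPy calls. The sketch below uses synthetic data and made-up parameters (a 4-second SLA, invented means and spreads) purely to show which function goes with which question; it is an illustration, not a recommended analysis pipeline.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# One-sample: does mean checkout time differ from the 4-second SLA?
checkout_seconds = rng.normal(loc=4.3, scale=1.0, size=200)  # synthetic data
one_sample = stats.ttest_1samp(checkout_seconds, popmean=4.0)

# Two-sample independent (Welch's): control vs treatment in an A/B test
control = rng.normal(loc=50.0, scale=10.0, size=500)    # synthetic order values
treatment = rng.normal(loc=52.0, scale=14.0, size=500)  # different spread on purpose
two_sample = stats.ttest_ind(control, treatment, equal_var=False)

# Paired: the same users measured before and after a change
before = rng.normal(loc=20.0, scale=5.0, size=300)
after = before + rng.normal(loc=0.5, scale=2.0, size=300)
paired = stats.ttest_rel(before, after)

for name, res in [("one-sample", one_sample),
                  ("two-sample (Welch)", two_sample),
                  ("paired", paired)]:
    print(f"{name}: t={res.statistic:.2f}, p={res.pvalue:.4f}")
```

The paired test is the same as a one-sample test on the per-user differences, which is why it needs both observations from every user.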
In Python, the workhorse is scipy.stats.ttest_ind with equal_var=False. That switch turns on Welch's t-test, which does not assume the two groups have equal variance. A/B treatments often change the variance along with the mean: a discount that converts more people replaces some $0 sessions with nonzero orders, which typically widens the spread of revenue. Welch's is therefore the safer default.
from scipy.stats import ttest_ind
# control and test are arrays of per-user metric values
stat, p = ttest_ind(control, test, equal_var=False)  # Welch's
if p < 0.05:
    print('significant')
The classic mistake on assumptions is worrying about normality when you have a large sample. By the central limit theorem, the sampling distribution of the mean is approximately normal once N is in the hundreds, even if individual observations are heavily skewed. What you should actually worry about is independence of observations and presence of extreme outliers that drag the mean around.
When to use chi-square
Use chi-square when your metric is a count of how many units fall into each category. Conversion rate (converted vs not), plan distribution (free vs pro vs enterprise), click vs no-click, churned vs retained, gender by product preference — all categorical, all chi-square territory.
There are two flavors. Goodness of fit asks whether one observed distribution matches an expected one ("do weekday signups follow the historical split, or has Monday gotten heavier?"). Test of independence asks whether two categorical variables are related ("is plan tier independent of acquisition channel?"). In A/B testing, you'll almost always use the independence flavor, with the two variables being "group" (control/test) and "outcome" (converted/not).
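The independence flavor gets a code example further down, so here is a minimal sketch of the goodness-of-fit flavor using scipy.stats.chisquare. The weekday counts and the flat 20% historical split are invented for illustration.

```python
from scipy.stats import chisquare

# Hypothetical signups observed this week, Monday through Friday
observed = [260, 190, 185, 180, 185]

# Hypothetical historical share per weekday (assumed flat); must sum to 1
historical_share = [0.20, 0.20, 0.20, 0.20, 0.20]

# chisquare expects expected COUNTS with the same total as the observed counts
total = sum(observed)
expected = [share * total for share in historical_share]

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={stat:.2f}, p={p:.5f}")
```

A small p-value here says only that the weekday mix differs from the historical split somewhere; it does not by itself single out Monday.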
The Python call is chi2_contingency from scipy.stats, fed a 2D table of counts. Each row is a group, each column is an outcome, and every cell is a raw count — not a rate.
from scipy.stats import chi2_contingency
# contingency table of raw counts
#            converted  not_converted
# control       100         900
# test          120         880
table = [[100, 900], [120, 880]]
# For 2x2 tables SciPy applies Yates' continuity correction by default (correction=True)
chi2, p, dof, expected = chi2_contingency(table)
if p < 0.05:
    print('significant difference')
The trap here is that chi-square needs reasonably large expected counts in every cell, at least five per cell by the usual rule of thumb. For a rare event like a 0.5% conversion rate with N=300 per arm, only about 1.5 conversions are expected per arm, so chi-square's approximation breaks down and you should switch to Fisher's exact test.
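One way to make the expected-count check routine is to read it off the expected array that chi2_contingency already returns and fall back to Fisher's exact test when it fails. The helper name compare_proportions and the example tables are hypothetical; the threshold of five is the rule of thumb from the text, not a hard boundary.

```python
from scipy.stats import chi2_contingency, fisher_exact

def compare_proportions(table, min_expected=5.0):
    """Return (method, p_value) for a 2x2 table of raw counts.

    Uses chi-square when every expected cell count reaches min_expected,
    otherwise Fisher's exact test. Hypothetical helper for illustration.
    """
    chi2, p_chi2, dof, expected = chi2_contingency(table)
    if expected.min() >= min_expected:
        return "chi-square", p_chi2
    _, p_fisher = fisher_exact(table)
    return "fisher", p_fisher

# Rows are control/test, columns are converted/not converted
rare = [[1, 299], [4, 296]]        # expected conversions per arm: 2.5
common = [[100, 900], [120, 880]]  # expected conversions per arm: 110

method_rare, p_rare = compare_proportions(rare)
method_common, p_common = compare_proportions(common)
print(method_rare, round(p_rare, 4))
print(method_common, round(p_common, 4))
```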
A/B test on conversion rate, worked end-to-end
Conversion rate is the canonical interview scenario where candidates either nail the choice in one sentence or wobble. The metric is a proportion: converted users divided by total users. There are two legitimate approaches, and one popular wrong answer.
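Approach 1: chi-square on a 2x2 contingency table. Build the table of converted/not-converted by control/test, hand it to chi2_contingency, read the p-value. Works whenever expected cell counts are at least five.
from scipy.stats import chi2_contingency
# rows are groups, columns are outcomes, cells are raw counts
table = [
    [converted_control, not_converted_control],
    [converted_test, not_converted_test],
]
chi2, p, dof, expected = chi2_contingency(table)
Approach 2: two-proportion z-test. For a 2x2 table, the Pearson chi-square statistic without continuity correction equals the square of the pooled z statistic, so the two-sided p-values match exactly. Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default; pass correction=False to reproduce the z-test. statsmodels exposes the z-test via proportions_ztest, and the z framing is convenient when you also want a confidence interval on the difference of proportions.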
from statsmodels.stats.proportion import proportions_ztest
count = [converted_control, converted_test]  # successes per group
nobs = [n_control, n_test]                   # users per group
stat, p = proportions_ztest(count, nobs)
The two approaches are mathematically equivalent for the 2x2 case (with the chi-square continuity correction turned off), so use whichever your stats library makes easier. Many A/B testing platforms, both in-house at Meta, Airbnb, Netflix and off-the-shelf like Statsig or Eppo, report the z-test version because it pairs naturally with a confidence interval on the lift.
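The equivalence is easy to check numerically. The sketch below computes the pooled z statistic by hand, compares it with chi2_contingency with and without the continuity correction, and adds a simple Wald interval on the difference. The Wald interval with an unpooled standard error is one common choice, picked here as an assumption; the counts reuse the 100/1000 vs 120/1000 example.

```python
import math
from scipy.stats import chi2_contingency, norm

x_c, n_c = 100, 1000  # control: conversions, users
x_t, n_t = 120, 1000  # test: conversions, users

p_c, p_t = x_c / n_c, x_t / n_t
diff = p_t - p_c

# Two-proportion z-test with the pooled standard error under H0
p_pool = (x_c + x_t) / (n_c + n_t)
se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z = diff / se_pooled
p_z = 2 * norm.sf(abs(z))

# Pearson chi-square on the same data, with and without Yates' correction
table = [[x_c, n_c - x_c], [x_t, n_t - x_t]]
chi2_plain, p_plain, _, _ = chi2_contingency(table, correction=False)
chi2_yates, p_yates, _, _ = chi2_contingency(table)  # SciPy default for 2x2

# 95% Wald interval on the difference, using the unpooled standard error
se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
z_crit = norm.ppf(0.975)
ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

print(f"z^2={z**2:.4f}  chi2(no correction)={chi2_plain:.4f}")
print(f"p_z={p_z:.4f}  p_chi2={p_plain:.4f}  p_yates={p_yates:.4f}")
print(f"lift={diff:.3f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

The corrected p-value is larger than the uncorrected one, which is why the two tools can appear to disagree if the default is left on.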
The wrong answer: t-test on the 0/1 indicator. You can technically run Welch's t-test on the binary outcome (1 if converted, 0 if not). For large N the result is numerically close to the z-test, but this is the answer that makes interviewers wince: you're using a continuous-data tool on a Bernoulli outcome, and its standard error (unpooled sample variances with an n−1 divisor) and t reference distribution differ slightly from the pooled z-test, most visibly in small samples.
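To see how close "numerically close" is, the following sketch runs Welch's t-test on 0/1 arrays and a hand-computed pooled z-test on the same hypothetical counts. It illustrates the size of the gap at N=1000 per arm only; it is not an argument for using the t-test here.

```python
import math
import numpy as np
from scipy.stats import norm, ttest_ind

x_c, n_c = 100, 1000  # control: conversions, users (hypothetical)
x_t, n_t = 120, 1000  # test: conversions, users (hypothetical)

# Expand the counts into per-user 0/1 indicators
control = np.concatenate([np.ones(x_c), np.zeros(n_c - x_c)])
test = np.concatenate([np.ones(x_t), np.zeros(n_t - x_t)])

# Welch's t-test on the binary indicator
t_stat, p_ttest = ttest_ind(test, control, equal_var=False)

# Pooled two-proportion z-test on the same data
p_pool = (x_c + x_t) / (n_c + n_t)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
z_stat = (x_t / n_t - x_c / n_c) / se
p_ztest = 2 * norm.sf(abs(z_stat))

print(f"Welch t={t_stat:.4f}, p={p_ttest:.4f}")
print(f"z     ={z_stat:.4f}, p={p_ztest:.4f}")
```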
What about average revenue per user? Revenue per user is continuous (a sum of money, not a category), so you need a t-test — or, if revenue is heavily skewed by a few whales, a bootstrap confidence interval on the mean lift. Running chi-square on continuous revenue, or t-test on a conversion indicator, is the most common error in junior A/B test writeups.
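For whale-heavy revenue, a percentile bootstrap on the difference in means is one simple option. The sketch below uses synthetic revenue (mostly zeros, lognormal spend for converters) and a hypothetical helper bootstrap_mean_diff_ci; the conversion rates, lognormal parameters and 2,000 resamples are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility

def simulate_revenue(n, conv_rate, rng):
    """Synthetic revenue per user: $0 for non-converters, lognormal otherwise."""
    converted = rng.random(n) < conv_rate
    spend = rng.lognormal(mean=3.0, sigma=1.2, size=n)
    return np.where(converted, spend, 0.0)

control = simulate_revenue(5000, 0.10, rng)
test = simulate_revenue(5000, 0.11, rng)

def bootstrap_mean_diff_ci(a, b, n_boot=2000, alpha=0.05, rng=None):
    """Percentile bootstrap CI for mean(b) - mean(a), resampling each arm."""
    if rng is None:
        rng = np.random.default_rng()
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        resampled_a = rng.choice(a, size=a.size, replace=True)
        resampled_b = rng.choice(b, size=b.size, replace=True)
        diffs[i] = resampled_b.mean() - resampled_a.mean()
    low, high = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return low, high

observed_lift = test.mean() - control.mean()
ci_low, ci_high = bootstrap_mean_diff_ci(control, test, rng=rng)
print(f"lift={observed_lift:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")
```

If the interval excludes zero, that plays the role of a significant result; resampling by user (not by session) keeps the independence assumption intact.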
Side-by-side comparison
| | T-test | Chi-square |
|---|---|---|
| Data type | continuous (real numbers) | categorical (counts in cells) |
| What you compare | means | proportions / frequencies |
| Test statistic | t | chi-squared |
| Example metrics | AOV, session time, latency | CR, click rate, plan mix |
| Sample size | works fine at modest N | needs ≥5 expected per cell |
| Default in Python | ttest_ind(equal_var=False) | chi2_contingency(table) |
| Effect size companion | Cohen's d | Cramér's V or risk ratio |
The "sample size" row is the one candidates miss most often. T-tests can run on N=30 per group if the effect size is large. Chi-square on a 2x2 with N=30 per group is fine for a 30% conversion rate but falls apart for 1%, because expected conversions per cell drop near zero. The right diagnostic is always expected cell counts, not row totals.
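The arithmetic behind that diagnostic is short enough to write down. Under the null both arms share one rate, so each arm's expected conversions are simply N times that rate. The helper name min_expected_cell is hypothetical.

```python
def min_expected_cell(n_per_arm, rate):
    """Smallest expected count in a 2x2 table when both arms share `rate`."""
    expected_converted = n_per_arm * rate
    expected_not_converted = n_per_arm * (1 - rate)
    return min(expected_converted, expected_not_converted)

for rate in (0.30, 0.01):
    smallest = min_expected_cell(30, rate)
    verdict = "chi-square is fine" if smallest >= 5 else "use Fisher's exact"
    print(f"N=30 per arm, rate={rate:.0%}: min expected={smallest:.1f} -> {verdict}")
```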
When neither test is the right tool
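Heavily skewed continuous data (long-tail revenue, latency, sessions dominated by marathon users) punishes the t-test's reliance on the mean. Mann-Whitney U compares the ranks of two groups and is the standard switch when distributions are visibly skewed. Keep in mind that it tests whether one group tends to produce larger values, not whether the means differ, so it answers a slightly different question. For more than two groups, Kruskal-Wallis extends it.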
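Both rank-based tests live in scipy.stats. The sketch below runs them on synthetic lognormal data standing in for skewed session durations; the group sizes and parameters are invented.

```python
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(7)  # fixed seed for reproducibility

# Synthetic, heavily right-skewed metric for three hypothetical groups
control = rng.lognormal(mean=0.00, sigma=1.0, size=400)
variant_a = rng.lognormal(mean=0.15, sigma=1.0, size=400)
variant_b = rng.lognormal(mean=0.00, sigma=1.0, size=400)

# Two groups: Mann-Whitney U on ranks
u_stat, p_mw = mannwhitneyu(control, variant_a, alternative="two-sided")

# Three or more groups: Kruskal-Wallis
h_stat, p_kw = kruskal(control, variant_a, variant_b)

print(f"Mann-Whitney U={u_stat:.0f}, p={p_mw:.4f}")
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_kw:.4f}")
```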
Ordinal data — survey responses on a 1-5 scale, severity levels — is neither truly continuous nor truly categorical (the order matters). Treat it with Mann-Whitney, or justify in writing why you're treating the scale as interval.
Paired proportions — same users, binary outcome before vs after — break chi-square's independence assumption. The correct tool is McNemar's test, which uses only the discordant pairs (users who switched outcomes). This shows up in retention experiments and before/after consent flow changes.
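Because McNemar's test depends only on the discordant pairs, its exact version reduces to a binomial test: under the null, a user who switched is equally likely to have switched in either direction. The sketch below uses scipy.stats.binomtest (available in recent SciPy releases) on made-up counts, alongside the asymptotic statistic without continuity correction.

```python
from scipy.stats import binomtest, chi2

# Hypothetical paired outcomes for the same users, before vs after:
#                after=yes   after=no
# before=yes         40         12
# before=no          25        123
yes_to_no = 12  # discordant: had the outcome before, lost it after
no_to_yes = 25  # discordant: gained the outcome after

# Exact McNemar: binomial test on the discordant pairs with p = 0.5
exact = binomtest(yes_to_no, n=yes_to_no + no_to_yes, p=0.5,
                  alternative="two-sided")
p_exact = exact.pvalue

# Asymptotic McNemar statistic (no continuity correction), 1 degree of freedom
mcnemar_stat = (yes_to_no - no_to_yes) ** 2 / (yes_to_no + no_to_yes)
p_asymptotic = chi2.sf(mcnemar_stat, df=1)

print(f"exact p={p_exact:.4f}")
print(f"asymptotic chi2={mcnemar_stat:.3f}, p={p_asymptotic:.4f}")
```

The 40 and 123 concordant users never enter the calculation, which is the whole point of the test.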
Small expected counts — anything below five in a chi-square cell — call for Fisher's exact test. It computes the exact probability of the observed table without leaning on the chi-square approximation. Use scipy.stats.fisher_exact.
Interview questions you will actually get
"T-test or chi-square for conversion rate?" Chi-square (or equivalently the two-proportion z-test). Conversion rate is a proportion. T-test belongs to AOV, session length, revenue per user — anything you'd compute with AVG() in SQL.
"What are the assumptions of the t-test?" Independence of observations, approximate normality of the sampling distribution of the mean (basically free with N>30 by CLT), and — for Student's — equal variances. Welch's drops the equal-variance requirement and is the safer default.
"When does chi-square stop working?" When expected cell counts drop below five. Run Fisher's exact instead. Also watch out for non-independent observations — same user counted twice across cells inflates significance.
"Can I use a t-test on a 0/1 indicator?" Numerically similar to the z-test for large N, but the wrong framing. The variance of a Bernoulli is p(1-p), not the sample variance the t-test computes. Senior interviewers will dock you for this.
"You ran chi-square and got p=0.04. What do you tell the PM?" Only that you have evidence against the null if the experiment had a pre-registered sample size and a single primary metric. If you peeked five times and stopped on the first significant result, p=0.04 is not really 4%. See the A/B testing peeking mistake for the full story.
Common pitfalls
When teams first run statistical tests on experiment data, the most frequent error is reaching for a t-test on conversion rate because "we have a lot of users, CLT covers us". The math works in the limit, but you're throwing away the natural structure of the problem — a Bernoulli variable with a known variance formula — and risking a miscalibrated standard error. Chi-square or the z-test gives the same answer with cleaner reasoning, and reasoning is what the interviewer grades.
Another trap is running Student's t-test instead of Welch's. Student's assumes equal variance, which is almost never true in A/B testing because the treatment usually moves both the mean and the spread. Welch's drops that assumption at almost no cost: same call, just equal_var=False. Reach for Student's only when you've verified equal variance, not the other way around.
Ignoring small expected cell counts in chi-square is the third frequent mistake. The chi-square approximation needs roughly five expected per cell, and below that it bends. The fix is Fisher's exact test. Don't argue with this in the interview — just say "small counts, Fisher's exact" and move on.
Running many tests without a multiple-testing correction is what separates juniors from seniors. If you check ten independent metrics at alpha=0.05 and none has a real effect, the chance of at least one false positive is about 40% (1 − 0.95^10 ≈ 0.40). The basic fix is Bonferroni (divide alpha by the number of tests), though it's conservative. Benjamini-Hochberg's false discovery rate is the workhorse for analytics teams running large experiment portfolios.
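Both corrections fit in a few lines of plain Python. The sketch below is a hand-rolled illustration on ten invented p-values, written out so the mechanics are visible; in practice a library routine would normally be used instead.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 for each p-value at or below alpha / number_of_tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure controlling the FDR at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k/m * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    # Reject every hypothesis ranked at or below that k
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_rank:
            reject[idx] = True
    return reject

# Hypothetical p-values for ten metrics from one experiment
p_values = [0.001, 0.008, 0.012, 0.041, 0.049, 0.20, 0.35, 0.50, 0.74, 0.91]

family_wise_error = 1 - (1 - 0.05) ** len(p_values)
n_bonferroni = sum(bonferroni(p_values))
n_bh = sum(benjamini_hochberg(p_values))
print(f"chance of >=1 false positive, uncorrected: {family_wise_error:.3f}")
print(f"rejections: Bonferroni={n_bonferroni}, Benjamini-Hochberg={n_bh}")
```

On these numbers Bonferroni keeps one metric and Benjamini-Hochberg keeps three, which is the conservatism gap the text refers to.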
Finally, watch out for hidden non-independence. If your unit of randomization is user but you're testing on sessions, you've inflated your effective sample size and your p-values are too small. The fix is to aggregate to user-level first, or use a method that accounts for clustering (delta method, bootstrap by user, or cluster-robust standard errors).
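The aggregation fix is mostly bookkeeping: collapse session rows to one number per user, then test on users. The sketch below uses a tiny hypothetical session log (far too small for real inference) to show the reshaping step before the Welch call.

```python
from collections import defaultdict
from scipy.stats import ttest_ind

# Hypothetical session-level rows: (user_id, group, session_revenue)
sessions = [
    ("u1", "control", 0.0), ("u1", "control", 12.0), ("u1", "control", 0.0),
    ("u2", "control", 5.0),
    ("u3", "control", 0.0), ("u3", "control", 8.0),
    ("u4", "test", 20.0), ("u4", "test", 0.0), ("u4", "test", 15.0),
    ("u5", "test", 0.0),
    ("u6", "test", 9.0), ("u6", "test", 0.0),
]

# Aggregate to the unit of randomization: one revenue total per user
revenue_per_user = defaultdict(float)
group_of_user = {}
for user_id, group, revenue in sessions:
    revenue_per_user[user_id] += revenue
    group_of_user[user_id] = group

control = [rev for user, rev in revenue_per_user.items()
           if group_of_user[user] == "control"]
test = [rev for user, rev in revenue_per_user.items()
        if group_of_user[user] == "test"]

# The test now sees 6 independent users, not 12 correlated sessions
stat, p = ttest_ind(control, test, equal_var=False)
print(f"users: control={len(control)}, test={len(test)}, p={p:.3f}")
```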
Related reading
- A/B testing peeking mistake
- How to calculate chi-square test in SQL
- P-value explained simply
- Effect size explained simply
- Bootstrap explained simply
FAQ
What is the difference between t-test and chi-square in one sentence?
T-test compares means on continuous data (revenue, latency, time-on-site), chi-square compares proportions or frequencies on categorical data (converted vs not, plan tier, click vs no click). The test you pick is determined entirely by the type of metric, not by sample size or company convention.
Which test do I use for A/B testing conversion rate?
Chi-square on the 2x2 contingency table of group vs converted, or equivalently the two-proportion z-test. For the 2x2 case they are mathematically equivalent and produce identical p-values as long as the chi-square is run without a continuity correction (SciPy's chi2_contingency applies Yates' correction to 2x2 tables unless you pass correction=False), so use whichever your stack exposes more cleanly. Avoid running a t-test on the 0/1 conversion indicator: it works in the limit but signals weak fundamentals.
Can I use a t-test if my data is not normally distributed?
For sample sizes above a few hundred per group, yes — the CLT says the sampling distribution of the mean is approximately normal even when individual observations are skewed. For smaller samples or heavily skewed data (revenue with a few whales, latency with a long tail), prefer Mann-Whitney U or a bootstrap confidence interval on the mean.
What is the difference between Student's and Welch's t-test?
Student's assumes the two groups have equal variance, Welch's does not. In A/B testing, treatments usually shift both the mean and the spread, so equal-variance is fragile. Welch's costs almost nothing — same call, just equal_var=False — and should be your default.
When should I use Fisher's exact test instead of chi-square?
When any cell has an expected count below five. Below that threshold, the chi-square approximation isn't reliable. Fisher's exact computes the exact probability under the null and works on any 2x2 regardless of sample size, making it the safer fallback when conversions are rare.
What about paired proportions, like before-and-after on the same users?
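Use McNemar's test, not chi-square. McNemar's looks only at the discordant pairs (users who flipped outcome) and accounts for the dependency between two measurements on the same person. Running regular chi-square on paired data ignores that dependency, so its standard error is miscalibrated; when before and after outcomes are positively correlated, as they usually are for the same user, it typically overstates the error and loses power.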