Statistics for data analysts: the interview minimum

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why analysts need statistics

Statistics is the language analysts use to justify decisions. Without it, you can only describe what happened. With it, you can argue that a change actually worked, quantify how confident you are, and translate noisy data into a defensible recommendation. The PMs and engineers around you will not be convinced by a dashboard screenshot — they want to know whether the lift is real or whether you are about to ship a regression.

The job-relevant surface area is narrower than the average textbook implies. Analysts at Google, Stripe, Airbnb, and Notion lean on the same handful of ideas: descriptive statistics, a few distributions, two or three hypothesis tests, confidence intervals, and a healthy skepticism about causation. The four places this shows up at work are A/B testing, anomaly review, forecasting, and segmentation — and every senior analyst loop at Meta, Uber, or DoorDash touches at least three.

Descriptive statistics

If you cannot describe a distribution out loud in one minute, the panel will not trust you with an A/B test in week three.

Central tendency comes in three flavors. The mean is sensitive to outliers — one whale customer can drag your AOV up by 30 percent. The median is the middle value of a sorted series and is robust to outliers. The mode is the most frequent value and matters more for categorical variables. A classic prompt: "AOV is $185, median is $62, what does that tell you?" The distribution is heavily right-skewed; a small share of large orders drags the mean upward, so the median is the more useful number for product decisions.
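
To see the mean and median pull apart, here is a minimal sketch. The eight order values are hypothetical, chosen so that the mean and median land on the $185 and $62 from the prompt above.

```python
from statistics import mean, median

# Hypothetical order values in dollars: seven ordinary orders and one whale.
order_values = [50, 55, 60, 61, 63, 70, 75, 1046]

mean_aov = mean(order_values)      # pulled upward by the single large order
median_aov = median(order_values)  # middle of the sorted values

# Dropping the whale shows how much of the mean it was responsible for.
mean_without_whale = mean(order_values[:-1])

print(f"mean = {mean_aov}, median = {median_aov}")
print(f"mean without the whale = {mean_without_whale}")
```

In this toy data, removing one order brings the mean back in line with the median, which is the pattern a large mean-median gap usually hints at.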

Dispersion matters just as much. Variance is the average squared deviation from the mean. Standard deviation is the square root of variance, in the same units as the data. Quartiles and IQR — Q1, Q2, Q3, IQR = Q3 − Q1 — give you a non-parametric view of spread. Percentiles like P90, P95, P99 are the bread and butter of latency and AOV analysis. When the SRE team says "tail latency," they mean P99, which behaves very differently from the average.

SELECT
  AVG(order_value)                       AS mean_aov,
  APPROX_PERCENTILE(order_value, 0.5)    AS median_aov,
  APPROX_PERCENTILE(order_value, 0.95)   AS p95_aov,
  STDDEV_SAMP(order_value)               AS sd_aov,
  APPROX_PERCENTILE(order_value, 0.75)
    - APPROX_PERCENTILE(order_value, 0.25) AS iqr_aov
FROM orders
WHERE order_date >= DATE '2026-04-01';
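
The same kind of summary can be sketched in Python. The latency values below are hypothetical; the sketch only illustrates that a single extreme value inflates the standard deviation far more than it moves the IQR.

```python
import numpy as np

# Hypothetical request latencies in milliseconds, with one slow outlier.
latency_ms = np.array([120, 125, 130, 135, 140, 150, 160, 180, 240, 900])

def spread(values: np.ndarray) -> tuple[float, float]:
    """Return (sample standard deviation, interquartile range)."""
    q1, q3 = np.percentile(values, [25, 75])
    return float(values.std(ddof=1)), float(q3 - q1)

sd_all, iqr_all = spread(latency_ms)
sd_trimmed, iqr_trimmed = spread(latency_ms[:-1])  # same data without the outlier

print(f"with outlier:    sd = {sd_all:.1f}, IQR = {iqr_all:.2f}")
print(f"without outlier: sd = {sd_trimmed:.1f}, IQR = {iqr_trimmed:.2f}")
print(f"P90 = {np.percentile(latency_ms, 90):.0f} ms, mean = {latency_ms.mean():.0f} ms")
```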

For a deeper dive on the median, see median explained simply. For dispersion, variance and standard deviation walks through the formulas.

Distributions you must know

Four distributions cover roughly 90 percent of analyst work.

The normal distribution is the bell curve, parameterized by mean μ and standard deviation σ. About 68 percent of data falls within μ ± σ, 95 percent within μ ± 2σ, and 99.7 percent within μ ± 3σ. Most parametric tests assume either the data or the sampling distribution of the mean is normal. See normal distribution explained simply.
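
The 68-95-99.7 shares can be checked directly from the normal CDF. This is a small sketch using the standard library's `NormalDist`; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Probability that a normal value lies within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(k):.4f}")

# The exact multiplier for a central 95 percent region is close to 1.96, not 2.
z_95 = std_normal.inv_cdf(0.975)
print(f"z for a central 95 percent region: {z_95:.3f}")
```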

The central limit theorem (CLT) is why parametric tests work on data that is clearly not normal. For independent observations with finite variance, the sampling distribution of the mean approaches normal as sample size grows, regardless of the underlying shape. In practice n above 30 is the rule of thumb; for heavily skewed data like revenue, you want more.
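
A quick simulation makes the CLT visible. The sketch below assumes exponentially distributed values as a stand-in for skewed data (an arbitrary choice, as are the seed and sample sizes) and tracks how the skewness of the sample mean shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(42)

def skewness(values: np.ndarray) -> float:
    """Moment-based skewness: 0 for a symmetric distribution."""
    centered = values - values.mean()
    return float((centered ** 3).mean() / (centered ** 2).mean() ** 1.5)

# Raw data: heavily right-skewed (theoretical skewness of an exponential is 2).
raw = rng.exponential(scale=50.0, size=100_000)
skew_raw = skewness(raw)

# Sampling distribution of the mean for increasing sample sizes.
skew_by_n = {}
for n in (5, 30, 200):
    sample_means = rng.exponential(scale=50.0, size=(20_000, n)).mean(axis=1)
    skew_by_n[n] = skewness(sample_means)

print(f"raw data skewness: {skew_raw:.2f}")
for n, skew in skew_by_n.items():
    print(f"skewness of the mean at n = {n}: {skew:.2f}")
```

Even at n = 30 the sample means are still visibly skewed here, which is why heavily skewed metrics need more than the rule-of-thumb sample size.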

The binomial distribution counts successes in n independent trials with probability p. This is the model for conversion: 1,000 visitors land on pricing, 32 upgrade, so n = 1,000 and the estimated p is 0.032. See the binomial distribution guide.
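
With n and p in hand, the binomial model answers questions like "how surprising would 40 or more upgrades be?" A minimal sketch using only the standard library; the threshold of 40 is a hypothetical choice.

```python
from math import comb, sqrt

n, p = 1_000, 0.032  # visitors and conversion rate from the example above

def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

expected_upgrades = n * p
sd_upgrades = sqrt(n * p * (1 - p))

# P(X >= 40) = 1 - P(X <= 39)
p_at_least_40 = 1 - sum(binom_pmf(k, n, p) for k in range(40))

print(f"expected upgrades: {expected_upgrades:.0f}, sd: {sd_upgrades:.2f}")
print(f"P(40 or more upgrades): {p_at_least_40:.3f}")
```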

The Poisson distribution counts events in a fixed window: orders per hour, errors per day, tickets per week. Mean equals variance for a Poisson — a quick sanity check; if variance is much larger, the process is over-dispersed and a negative binomial may fit better.
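
The mean-equals-variance check is easy to script. In this sketch both series are simulated (hypothetical hourly order counts with the same mean of 12), and the helper name `dispersion_index` is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical hourly order counts.
poisson_counts = rng.poisson(lam=12, size=5_000)
# Negative binomial with mean n(1-p)/p = 12 but variance n(1-p)/p^2 = 60.
overdispersed_counts = rng.negative_binomial(n=3, p=0.2, size=5_000)

def dispersion_index(counts: np.ndarray) -> float:
    """Variance divided by mean: close to 1 for Poisson, above 1 if over-dispersed."""
    return float(counts.var(ddof=1) / counts.mean())

print(f"Poisson series:        {dispersion_index(poisson_counts):.2f}")
print(f"over-dispersed series: {dispersion_index(overdispersed_counts):.2f}")
```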

Hypothesis testing

The five-step recipe is non-negotiable and should roll off your tongue.

  1. State the hypotheses. H0 is the null — no effect. H1 is the alternative. Be precise about one-sided vs two-sided.
  2. Pick the test. Two-proportion z-test for binary outcomes like conversion. T-test for continuous outcomes like AOV or session length. Chi-square for contingency tables. Mann-Whitney if data is heavily skewed and the sample is small.
  3. Set alpha. Default 0.05; lower to 0.01 with many comparisons or when a false positive is expensive.
  4. Compute the test statistic and the p-value.
  5. Decide. If p is less than alpha, reject H0; otherwise fail to reject. "Fail to reject" is not "accept."

A worked two-proportion z-test:

import numpy as np
from scipy import stats

# Variant A: 320 / 10,000. Variant B: 350 / 10,000.
x = np.array([320, 350])        # conversions per variant
n = np.array([10_000, 10_000])  # visitors per variant

# Pooled rate under H0 (both variants share one conversion rate).
p_pool = x.sum() / n.sum()
se = np.sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = (x[1] / n[1] - x[0] / n[0]) / se
# Two-sided p-value; sf is the upper tail, 1 - cdf.
p_value = 2 * stats.norm.sf(abs(z))

print(f"z = {z:.3f}, p = {p_value:.4f}")

For continuous metrics use Welch's t-test, which does not assume equal variances. See t-test vs z-test for statistics for the decision tree and t-tests for the data science interview for the assumptions panels probe.
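
A sketch of Welch's t-test with `scipy.stats.ttest_ind` and `equal_var=False`. The two AOV samples are simulated lognormal data (hypothetical parameters and seed), and the manual statistic is included only to show what the library call computes.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical order values for two variants with unequal variances.
aov_a = rng.lognormal(mean=4.0, sigma=0.6, size=400)
aov_b = rng.lognormal(mean=4.1, sigma=0.9, size=350)

# equal_var=False selects Welch's test instead of Student's pooled-variance test.
result = stats.ttest_ind(aov_b, aov_a, equal_var=False)

# The same statistic by hand: difference in means over the unpooled standard error.
se = np.sqrt(aov_a.var(ddof=1) / aov_a.size + aov_b.var(ddof=1) / aov_b.size)
t_manual = (aov_b.mean() - aov_a.mean()) / se

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4f}")
print(f"manual t = {t_manual:.3f}")
```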

P-value is the probability of observing data at least as extreme as what you saw, assuming H0 is true. It is not the probability that H0 is true. Conflating these is the single most common mistake in stats interviews. Full breakdown in p-value explained simply.
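
One way to make the definition concrete is to simulate the null. The sketch below reuses the counts from the worked z-test (320 and 350 conversions out of 10,000 each), assumes both variants share the pooled rate, and counts how often chance alone produces a gap at least as large as the observed one. The seed and number of simulations are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2024)

n_per_arm = 10_000
observed_gap = 350 - 320            # difference in conversions, B minus A
p_null = (320 + 350) / (2 * n_per_arm)  # shared rate assumed under H0

# Simulate many experiments in which H0 is true by construction.
n_sims = 100_000
sim_a = rng.binomial(n_per_arm, p_null, size=n_sims)
sim_b = rng.binomial(n_per_arm, p_null, size=n_sims)

# Two-sided: a gap at least as extreme in either direction.
p_value_sim = float((np.abs(sim_b - sim_a) >= observed_gap).mean())

print(f"simulated p-value: {p_value_sim:.3f}")
```

The simulated value should sit close to the analytic p-value from the z-test. Note what it measures: how often data this extreme appears when H0 holds, not how likely H0 is.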

Confidence intervals

A 95 percent CI is a range of plausible values for the true parameter, built by a procedure such that, if you repeated the experiment many times, 95 percent of the constructed intervals would contain the true value. The 95 percent describes the procedure, not any specific interval: once computed, a given interval either contains the true value or it does not.
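
The "repeat the experiment many times" reading can be simulated. This sketch assumes a true conversion rate of 3.5 percent and 2,000 visitors per experiment (both hypothetical), and uses the simple normal-approximation interval to keep the code short.

```python
import numpy as np

rng = np.random.default_rng(1)

true_p = 0.035          # hypothetical true conversion rate
n = 2_000               # visitors per experiment
n_experiments = 10_000
z = 1.96                # multiplier for a 95 percent interval

conversions = rng.binomial(n, true_p, size=n_experiments)
p_hat = conversions / n
half_width = z * np.sqrt(p_hat * (1 - p_hat) / n)

# Share of intervals that actually contain the true rate.
covered = (p_hat - half_width <= true_p) & (true_p <= p_hat + half_width)
coverage = float(covered.mean())

print(f"coverage across {n_experiments} experiments: {coverage:.3f}")
```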

In practice, analysts should reach for CIs more often than p-values. A 95 percent CI on conversion of [3.1 percent, 3.9 percent] tells your PM that the metric is clearly non-zero and that the estimate is precise to about plus or minus 0.4 percentage points. When stakeholders ask "by how much," CIs answer; p-values do not.

from statsmodels.stats.proportion import proportion_confint

ci_low, ci_high = proportion_confint(count=350, nobs=10_000, alpha=0.05, method="wilson")
print(f"95% CI for conversion: [{ci_low:.4f}, {ci_high:.4f}]")

The Wilson interval behaves better than the textbook normal approximation when proportions are near 0 or 1, which is most A/B tests in real life. See confidence intervals for the data science interview.
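
For an A/B readout the interval stakeholders usually want is on the difference between variants. A minimal sketch for the counts from the worked z-test, using the unpooled normal-approximation (Wald) interval; that choice is an assumption made for brevity, not the only option.

```python
from math import sqrt
from statistics import NormalDist

x_a, n_a = 320, 10_000  # variant A conversions and visitors
x_b, n_b = 350, 10_000  # variant B conversions and visitors

p_a, p_b = x_a / n_a, x_b / n_b
diff = p_b - p_a

# Unpooled standard error: each arm keeps its own variance.
se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = NormalDist().inv_cdf(0.975)

ci_low, ci_high = diff - z * se, diff + z * se
print(f"lift = {diff * 100:.2f}pp, 95% CI = [{ci_low * 100:.2f}pp, {ci_high * 100:.2f}pp]")
```

The interval spans zero, which matches the non-significant z-test, and it also shows how large a lift the data cannot rule out.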

Correlation vs causation

Correlation measures the strength of a linear relationship. Pearson's r ranges from -1 to +1. Causation means one variable actually drives the other. Confusing them wrecks analyst credibility.

The canonical example is that ice-cream sales correlate with drowning deaths; the common cause is hot weather. In your day job confounders are subtler: a feature flag may correlate with retention only because it was rolled out to power users first. To establish causation you need randomization (A/B test) or a quasi-experimental design like difference-in-differences, instrumental variables, or synthetic control. Correlation explained simply covers Pearson vs Spearman.
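
The rollout confounder can be reproduced in a few lines. In this simulated sketch the flag has zero true effect on retention; every rate below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_users = 200_000

# Power users retain better regardless of the flag (the flag does nothing here).
power_user = rng.random(n_users) < 0.20
retained = rng.random(n_users) < np.where(power_user, 0.60, 0.30)

# Staged rollout: power users are far more likely to have the flag.
flag_rollout = rng.random(n_users) < np.where(power_user, 0.80, 0.10)
# Randomized assignment: a coin flip, independent of user type.
flag_random = rng.random(n_users) < 0.50

def retention_gap(flag: np.ndarray) -> float:
    """Retention of flagged users minus retention of unflagged users."""
    return float(retained[flag].mean() - retained[~flag].mean())

naive_gap = retention_gap(flag_rollout)
randomized_gap = retention_gap(flag_random)

print(f"gap under staged rollout: {naive_gap * 100:.1f}pp")
print(f"gap under randomization:  {randomized_gap * 100:.1f}pp")
```

The staged rollout shows a large retention gap driven entirely by who received the flag, while randomization recovers a gap near zero.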

Common pitfalls

Analysts who have done one or two A/B tests fall into the same traps experienced analysts have learned to avoid. Walking through them in advance saves you from learning them on a live launch.

The first trap is treating p-value as the probability that the null is true. P-value is a property of the data conditional on the null, not the other way around. This sounds pedantic until your PM asks "so there's a 5 percent chance the feature doesn't work?" and you have to gently correct them in a leadership review.

The second trap is peeking at A/B tests before they reach the planned sample size. Each peek inflates your type-I error rate, so a test designed for alpha = 0.05 may actually carry a false-positive rate of 15 to 30 percent if you check it daily and ship the first time it crosses the threshold. The fix is to predefine sample size, peek only at guardrails, and use sequential testing methods like always-valid p-values when you genuinely need early stopping. Full breakdown in the peeking mistake in A/B testing.
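
The inflation from peeking shows up clearly in simulated A/A tests, where any "win" is a false positive by construction. The traffic, baseline rate, test length, and seed below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(11)

n_sims, n_days, daily_n, p = 5_000, 14, 1_000, 0.03

# A/A tests: both arms share the same true rate.
daily_a = rng.binomial(daily_n, p, size=(n_sims, n_days))
daily_b = rng.binomial(daily_n, p, size=(n_sims, n_days))

# Cumulative conversions and visitors per arm at each daily look.
cum_a = daily_a.cumsum(axis=1)
cum_b = daily_b.cumsum(axis=1)
cum_n = daily_n * np.arange(1, n_days + 1)

# Two-proportion z statistic at every look.
p_pool = (cum_a + cum_b) / (2 * cum_n)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / cum_n)
z = (cum_b - cum_a) / cum_n / se
significant = np.abs(z) > 1.96

fpr_peeking = float(significant.any(axis=1).mean())   # stop at the first crossing
fpr_final_only = float(significant[:, -1].mean())     # look once, at the end

print(f"false-positive rate with daily peeking: {fpr_peeking:.3f}")
print(f"false-positive rate with one final look: {fpr_final_only:.3f}")
```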

The third trap is reporting only statistical significance and ignoring effect size. A test with 10 million observations can flag a lift of a few hundredths of a percentage point as significant; that does not mean it is worth the engineering cost. Always pair p-values with effect size. Senior analysts lead with effect size and treat significance as a hygiene check.
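
A short calculation shows how sample size alone turns a tiny lift into a significant one. The sketch assumes a hypothetical 3 percent baseline and a 0.03pp lift (1 percent relative), evaluated at two sample sizes; the function name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(p_a: float, p_b: float, n_per_arm: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test with equal arm sizes."""
    p_pool = (p_a + p_b) / 2
    se = sqrt(p_pool * (1 - p_pool) * 2 / n_per_arm)
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

baseline, treated = 0.0300, 0.0303  # same effect size in both scenarios

p_small_n = two_proportion_p_value(baseline, treated, n_per_arm=50_000)
p_large_n = two_proportion_p_value(baseline, treated, n_per_arm=5_000_000)

print(f"100 thousand observations: p = {p_small_n:.3f}")
print(f"10 million observations:   p = {p_large_n:.4f}")
```

The effect is identical in both rows; only the p-value changes, which is why the effect size has to be reported alongside it.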

The fourth trap is running many comparisons without correcting for multiple testing. If you test 20 independent segments at alpha = 0.05 and no segment has a real effect, you expect one false positive by chance, and the probability of at least one is about 64 percent. The fix is Bonferroni for a few comparisons or Benjamini-Hochberg for many. Without correction, slice-and-dice analysis produces spurious wins that fail to replicate.
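
Both corrections fit in a few lines of plain Python. This is a hand-rolled sketch for illustration (in practice a library routine is the safer choice), and the ten segment p-values are hypothetical.

```python
def bonferroni(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject only p-values below alpha divided by the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= k/m * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        reject[idx] = rank <= max_rank
    return reject

# Hypothetical p-values from ten segment-level tests.
segment_p = [0.001, 0.004, 0.012, 0.018, 0.030, 0.20, 0.35, 0.50, 0.70, 0.90]

n_uncorrected = sum(p <= 0.05 for p in segment_p)
n_bonferroni = sum(bonferroni(segment_p))
n_bh = sum(benjamini_hochberg(segment_p))

# Chance of at least one false positive across 20 independent null tests.
fwer_20 = 1 - (1 - 0.05) ** 20

print(f"significant: uncorrected = {n_uncorrected}, Bonferroni = {n_bonferroni}, BH = {n_bh}")
print(f"P(at least one false positive in 20 tests) = {fwer_20:.2f}")
```

Bonferroni controls the chance of any false positive and is the stricter of the two; Benjamini-Hochberg controls the expected share of false discoveries, so it keeps more of the segment wins.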

The fifth trap is assuming normality on small samples without checking. The CLT helps once n is large, but for n under 30 with heavily skewed data — revenue is the classic offender — your t-test may be invalid. Bootstrap or a non-parametric test like Mann-Whitney is safer.
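
A percentile bootstrap is one way to get an interval without leaning on normality. The sketch below uses a simulated sample of 25 skewed revenue values (hypothetical parameters and seed). With samples this small the bootstrap can still under-cover, so treat the result as approximate.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical small, heavily skewed revenue sample.
revenue = rng.lognormal(mean=3.5, sigma=1.2, size=25)

# Resample with replacement and recompute the mean each time.
n_boot = 10_000
boot_idx = rng.integers(0, revenue.size, size=(n_boot, revenue.size))
boot_means = revenue[boot_idx].mean(axis=1)

# Percentile bootstrap interval for the mean.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(f"sample mean = {revenue.mean():.1f}")
print(f"95% bootstrap CI for the mean: [{ci_low:.1f}, {ci_high:.1f}]")
```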

What gets asked in interviews

Across loops at Google, Meta, Amazon, Stripe, Airbnb, DoorDash, and Snowflake, the top statistics questions for analyst roles cover almost the same ground:

  1. What is a p-value, in your own words?
  2. What is the difference between type I and type II errors?
  3. How do you compute sample size for an A/B test?
  4. What is a confidence interval, and how is it different from a credible interval?
  5. When do you use a t-test vs a z-test vs a chi-square test?
  6. Explain the central limit theorem and why it matters.
  7. How do you check whether data is normally distributed, and what do you do if it isn't?
  8. What is the multiple comparisons problem and how do you correct for it?
  9. How do you tell correlation from causation, and how would you design a study to establish it?
  10. What is statistical power and how does it relate to sample size and effect size?

A typical case prompt: "We ran an A/B test, p is 0.03, conversion rose from 2.0 to 2.1 percent. What do you recommend?" The strong answer notes the lift is significant but small (5 percent relative, 0.1pp absolute), weighs business value against engineering cost, checks guardrails, and inspects whether the effect is stable across segments and over time. The weak answer is "p < 0.05, ship it."

A four-week study plan

Week one is descriptive statistics. Lock in mean, median, mode, standard deviation, quartiles, percentiles, and box plots. Practice describing a distribution out loud in under a minute given summary stats.

Week two is hypothesis testing. Drill H0, H1, alpha, p-value, type I and II errors, z-test, t-test, chi-square. Translate any business question into a null and alternative within thirty seconds.

Week three is confidence intervals and A/B design. Construct CIs for proportions and means, compute sample size given baseline rate, minimum detectable effect, alpha, and power, and walk through one end-to-end design including guardrails and segments.
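
The sample-size calculation for a two-proportion test can be sketched with the usual normal-approximation formula. The function name and defaults are illustrative, and the inputs (2.0 percent baseline, 0.1pp minimum detectable effect) are borrowed from the case prompt earlier in the article.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(
    baseline: float, mde_abs: float, alpha: float = 0.05, power: float = 0.80
) -> int:
    """Approximate visitors needed per arm for a two-sided two-proportion test."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / mde_abs ** 2)

n_small_mde = sample_size_per_arm(baseline=0.02, mde_abs=0.001)
n_large_mde = sample_size_per_arm(baseline=0.02, mde_abs=0.002)

print(f"MDE 0.1pp: {n_small_mde:,} per arm")
print(f"MDE 0.2pp: {n_large_mde:,} per arm")
```

Required sample size scales with the inverse square of the minimum detectable effect, so doubling the MDE cuts the requirement to roughly a quarter.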

Week four is the advanced layer: multiple comparisons (Bonferroni, Benjamini-Hochberg), correlation vs causation, an overview of Bayesian A/B testing, and variance reduction methods like CUPED. You do not need to implement them from scratch — you need to know when to reach for them.

If you want to drill these every day against a real interviewer, NAILDD is launching with 500+ SQL and statistics problems in exactly this pattern.

FAQ

Does a data analyst need advanced mathematics?

No. The working surface for the job and interviews is descriptive statistics, a few distributions, hypothesis testing, and confidence intervals. Linear algebra and multivariable calculus matter for data scientists and ML engineers who train models from scratch, but they are essentially never tested in an analyst loop. Spend prep time on intuition and applied problems rather than proofs.

What's the minimum for a junior analyst role?

Mean, median, standard deviation, and a solid grasp of the normal distribution will get you through a junior screen. For mid-level, add hypothesis testing, p-values, confidence intervals, and basic A/B design. For senior, layer on multiple comparisons, variance reduction, quasi-experimental methods, and the language to discuss tradeoffs with PMs.

Should I use Python or Excel for statistics?

Both, depending on context. Excel and Google Sheets are fine for quick scratch work with T.TEST and NORM.DIST — stakeholders can open the file and follow along. Python with scipy.stats, statsmodels, and pandas is the right tool for any serious analysis. In interviews, panels care about the choice of test, not whether you can recite the scipy signature.

Which books should I read?

For an intuitive start, Naked Statistics by Charles Wheelan. For a textbook with worked examples, Statistics by Freedman, Pisani, and Purves. For A/B testing specifically, Trustworthy Online Controlled Experiments by Kohavi, Tang, and Xu is the modern reference. Skim, then do problems — reading without exercises does not build interview fluency.

What if I freeze on a stats question?

Verbalize the recipe out loud. "Let me state the hypotheses first, then pick a test, then think about assumptions, then compute." Panels reward a structured approach over a fast wrong answer. If you do not know a formula, say so, then describe what the answer should look like — a probability between 0 and 1, an interval centered on the point estimate, a test statistic that grows with effect size and shrinks with variance. Showing the shape of the answer earns most of the credit.