Variance and standard deviation
What variance actually measures
Your PM at a Stripe-style payments team pings you on a Friday afternoon: "Two checkout variants have the same conversion rate, but the team swears variant B feels different." If you only report means, you miss the story. Variance is the number that catches what the average hides — how stretched out values are around the center.
Formally, variance is the average of squared deviations from the mean:
Var(X) = Σ(xᵢ - x̄)² / N
We square the deviations for two reasons. Positive and negative gaps from the mean would otherwise cancel out and sum to zero. Squaring also penalizes large deviations more heavily, which matches the intuition that an outlier should not look the same as a small wobble.
Worked example. Five candidates finish a take-home in 10, 12, 14, 16, and 18 minutes.
- Mean: (10 + 12 + 14 + 16 + 18) / 5 = 14
- Deviations from the mean: -4, -2, 0, 2, 4
- Squared deviations: 16, 4, 0, 4, 16
- Population variance: (16 + 4 + 0 + 4 + 16) / 5 = 8
The variance is 8. The unit is minutes squared, which is awkward to talk about in a standup. That is exactly where standard deviation comes in.
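The arithmetic above can be reproduced in a few lines of plain Python. This is a minimal sketch using only the five completion times from the example; the variable names are illustrative.

```python
import statistics

times = [10, 12, 14, 16, 18]  # completion times in minutes, from the example

n = len(times)
mean = sum(times) / n                       # 14.0
deviations = [x - mean for x in times]      # -4, -2, 0, 2, 4
squared = [d ** 2 for d in deviations]      # 16, 4, 0, 4, 16
pop_variance = sum(squared) / n             # 8.0, in minutes squared

print(mean, pop_variance)
# Cross-check against the standard library's population variance
print(statistics.pvariance(times))
```

Writing the steps out once by hand is mostly useful for checking that a library call divides by the number you think it does.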
Standard deviation in the same units as your data
Standard deviation (often called std, sigma, or σ) is just the square root of variance:
σ = √Var(X)
For the example above, σ = √8 ≈ 2.83 minutes. Now the unit lines up with the data, and you can say something interpretable like "completion time typically sits about 2.8 minutes away from the mean."
Standard deviation is the workhorse of analytics. It shows up everywhere — inside confidence intervals, inside z-scores, inside the power formula for A/B tests. If you only had room for one spread statistic on a dashboard, σ is the right pick because it preserves the original unit.
Population vs sample: why N-1
This is the single most common interview tripwire on variance, so it is worth getting right.
| | Population | Sample |
|---|---|---|
| Divisor | N | N - 1 |
| Symbol | σ² | s² |
| When to use | You have the entire universe of data | You only have a sample of it |
The sample variance formula is:
s² = Σ(xᵢ - x̄)² / (N - 1)
For the same five candidates: s² = 40 / 4 = 10, so s = √10 ≈ 3.16 minutes.
Why N - 1, also known as Bessel's correction. When you compute the sample mean x̄ from the data itself, that mean sits closer to the data than the true population mean μ would. That nudges the sum of squared deviations downward on average, which biases the variance estimate downward. Dividing by N - 1 instead of N corrects exactly this bias and gives an unbiased estimator. A more intuitive framing: by spending one piece of information to estimate the mean, you have one fewer "free" data point left for estimating spread, so the divisor drops by one.
In practice, when N is in the thousands, the gap between dividing by N and dividing by N - 1 is negligible. But for small samples — say a 20-customer pilot or an early experiment — using N instead of N - 1 will visibly understate the spread, and downstream confidence intervals will be too narrow. Bessel's correction is one of those tiny details that quietly fixes a lot of bad inference.
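One way to see the bias rather than take it on faith is a quick simulation. The sketch below assumes a normal population with a known variance of 4.0 (an arbitrary choice for illustration), draws many samples of size 5, and averages the two estimators across them.

```python
import numpy as np

rng = np.random.default_rng(0)    # fixed seed, chosen arbitrarily
true_sigma = 2.0                  # assumed population std, so true variance = 4.0
n, trials = 5, 100_000            # many small samples of size 5

samples = rng.normal(loc=0.0, scale=true_sigma, size=(trials, n))
var_n = samples.var(axis=1, ddof=0).mean()    # average of divide-by-N estimates
var_n1 = samples.var(axis=1, ddof=1).mean()   # average of divide-by-(N - 1) estimates

print(f"divide by N:     {var_n:.2f}")
print(f"divide by N - 1: {var_n1:.2f}")
```

In expectation the divide-by-N average lands near 3.2 (that is, 4.0 × 4/5) while the N - 1 average lands near 4.0, so with N = 5 the uncorrected estimator understates the variance by about a fifth.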
The 68-95-99.7 rule
When data is roughly normal, standard deviation has a clean geometric interpretation:
- About 68.3% of values fall within μ ± 1σ
- About 95.4% of values fall within μ ± 2σ
- About 99.7% of values fall within μ ± 3σ
Suppose a DoorDash analyst sees an average basket of $42 with σ = $6, and basket size is approximately normal. Then roughly 95% of baskets sit between $30 and $54. A $65 basket is more than 3σ from the mean, which is a candidate outlier worth a second look before you let it drive a forecast. This rule is a fast sanity check when you do not yet want to fire up a full distributional test.
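The percentages and the basket example can be checked with the standard library's `statistics.NormalDist`. This sketch assumes the basket distribution is exactly normal with the mean and σ quoted above; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

baskets = NormalDist(mu=42, sigma=6)   # figures from the basket example

def share_within(k):
    """Share of values within k standard deviations of the mean."""
    return baskets.cdf(42 + k * 6) - baskets.cdf(42 - k * 6)

for k in (1, 2, 3):
    print(k, round(share_within(k), 4))

z_65 = (65 - 42) / 6   # how many sigmas a $65 basket sits above the mean
print(round(z_65, 2))
```

The $65 basket comes out at roughly 3.8σ, which is why it gets flagged. Real basket data is usually right-skewed, so treat this as a first-pass screen, not a verdict.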
Why analysts care about variance
A/B test power. The minimum sample size formula has σ² in the numerator: n ∝ σ² / δ². Higher variance means you need more observations to separate signal from noise. This is the structural reason revenue tests at Airbnb or Uber require huge traffic: basket size has a heavy right tail, so variance is large, so the smallest effect you can reliably detect stays large unless you collect a lot of data.
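To make the σ² dependence concrete, here is a rough sample-size sketch. It assumes a common textbook approximation (two-sided z-test on means, equal-sized arms, known σ); the function name, defaults, and input numbers are illustrative, not a prescribed method.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided, two-sample z-test on means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # quantile for the significance level
    z_beta = NormalDist().inv_cdf(power)            # quantile for the target power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)

n_low = sample_size_per_arm(sigma=10, delta=1)    # hypothetical low-variance metric
n_high = sample_size_per_arm(sigma=20, delta=1)   # same effect, double the std
print(n_low, n_high)
```

Doubling σ roughly quadruples the required sample per arm, because the formula scales with σ², not σ.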
Confidence interval width. The CI formula is x̄ ± z · (σ / √n). Standard deviation sets the margin of error directly. Double σ and you double the interval width at the same sample size. To tighten, gather more data, reduce variance with a technique like CUPED, or accept lower confidence.
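The same relationship in code, as a small sketch of the margin-of-error term z · (σ / √n). The function name and the input numbers are assumptions for illustration.

```python
import math
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Half-width of a z-based confidence interval for a mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sigma / math.sqrt(n)

base = margin_of_error(sigma=6, n=400)
doubled_sigma = margin_of_error(sigma=12, n=400)     # twice the spread, same n
four_times_n = margin_of_error(sigma=12, n=1600)     # twice the spread, 4x the data

print(round(base, 3), round(doubled_sigma, 3), round(four_times_n, 3))
```

Doubling σ doubles the margin, and it takes four times the data to win it back, since the margin shrinks with √n.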
Comparing distributions. Two datasets can share an average and describe different realities. A latency dashboard with mean 100ms and σ = 5ms feels stable. The same mean with σ = 50ms feels broken. Reporting only the mean is half the picture.
SQL: VAR_SAMP and STDDEV_SAMP
PostgreSQL, Snowflake, BigQuery, and Databricks all ship native aggregates for variance and standard deviation:
SELECT
    topic,
    AVG(score) AS avg_score,
    VAR_SAMP(score) AS variance,      -- sample variance, divides by N - 1
    STDDEV_SAMP(score) AS std_dev,    -- sample standard deviation
    VAR_POP(score) AS var_pop,        -- population variance, divides by N
    STDDEV_POP(score) AS std_pop      -- population standard deviation
FROM training_events
GROUP BY topic;
VAR_SAMP and STDDEV_SAMP divide by N - 1; VAR_POP and STDDEV_POP divide by N. You almost always have a sample, so the _SAMP variants are the safer default. Being explicit also avoids surprises during code review.
For grouped comparisons, this pattern is handy when you want a quick spread-vs-mean table per segment:
SELECT
    country,
    COUNT(*) AS n_orders,
    AVG(order_value_usd) AS mean_order,
    STDDEV_SAMP(order_value_usd) AS std_order,
    STDDEV_SAMP(order_value_usd)
        / NULLIF(AVG(order_value_usd), 0) AS coefficient_of_variation
FROM orders
-- PostgreSQL-style interval; date arithmetic syntax varies by warehouse
WHERE order_date >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY country
ORDER BY n_orders DESC;
The coefficient of variation (std divided by mean) is unitless and lets you compare spread across countries with very different price levels.
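For a notebook-side equivalent of the query above, a pandas sketch might look like this. The data frame is invented for illustration (two countries whose order values differ by a factor of ten), and the column names simply mirror the SQL.

```python
import pandas as pd

# Hypothetical orders, invented for illustration
orders = pd.DataFrame({
    "country": ["US", "US", "US", "IN", "IN", "IN"],
    "order_value_usd": [40.0, 50.0, 60.0, 4.0, 5.0, 6.0],
})

summary = orders.groupby("country")["order_value_usd"].agg(
    n_orders="count",
    mean_order="mean",
    std_order="std",   # pandas uses ddof=1 here, in line with STDDEV_SAMP
)
summary["coefficient_of_variation"] = summary["std_order"] / summary["mean_order"]
print(summary)
```

The two countries have standard deviations of 10 and 1, yet the same coefficient of variation of 0.2, which is the point of the unitless ratio. Unlike the SQL, this sketch does not guard against a zero mean.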
Python: numpy and pandas
import numpy as np
import pandas as pd
data = [10, 12, 14, 16, 18]
# numpy: ddof=0 by default (population)
print(np.var(data))          # 8.0
print(np.std(data))          # ≈ 2.83
# Sample variance: pass ddof=1
print(np.var(data, ddof=1))  # 10.0
print(np.std(data, ddof=1))  # ≈ 3.16
# pandas: ddof=1 by default (sample)
s = pd.Series(data)
print(s.var())               # 10.0
print(s.std())               # ≈ 3.16
The classic gotcha: numpy and pandas pick different defaults for ddof. np.std() divides by N (ddof=0), while Series.std() divides by N - 1 (ddof=1). Forgetting this is a frequent reason a notebook number does not match a dashboard number. Make it a habit to pass ddof explicitly any time spread matters, and your future self will thank you during a stakeholder grilling.
Common pitfalls
The first trap is confusing population variance and sample variance. In practice you almost always have a sample — a slice of users, sessions, or orders — so dividing by N - 1 is the right call. Reach for VAR_POP only when you really do hold the entire universe of data, which is rarer than analysts assume.
The second trap is ignoring variance during A/B test reviews. A readout that says "conversion went up 0.5%" without any spread statistic does not tell you whether to ship. With high variance and a small sample, that 0.5% lift can easily be noise. Always pair the point estimate with a confidence interval or a p-value, both of which already encode the variance for you.
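As a sketch of what "pair the point estimate with a confidence interval" can look like, the snippet below computes a simple Wald interval for the difference in conversion rates. The counts are hypothetical, the function name is made up, and the Wald interval is only one of several reasonable choices.

```python
import math
from statistics import NormalDist

def lift_ci(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for the difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = p_b - p_a
    return diff - z * se, diff + z * se

# Hypothetical counts: 10.0% vs 10.5% conversion at two traffic levels
small = lift_ci(200, 2_000, 210, 2_000)
large = lift_ci(20_000, 200_000, 21_000, 200_000)
print(small)
print(large)
```

The observed lift is 0.5 percentage points in both cases, but with 2,000 users per arm the interval straddles zero, while with 200,000 per arm it does not.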
The third trap is forgetting ddof in Python. np.std() and pd.Series.std() give different answers on the same data, and the gap grows on small samples. Pass the parameter explicitly, check for it in code reviews, and teach the convention early, before someone exports a chart that disagrees with the SQL right behind it.
The fourth trap is interpreting variance in the original units. Variance is in squared units — minutes squared, dollars squared, milliseconds squared — which is almost never what you want to discuss. When speaking to a PM or leadership, convert to standard deviation so the unit matches the metric.
The fifth trap is treating high variance as a problem to remove rather than a property to engineer around. Sometimes the right answer is variance reduction with a technique like CUPED, and sometimes it is redesigning the metric — for example, switching from raw revenue to a capped or binarized version. Understand the source of the spread before throwing data away.
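To illustrate the capped-metric idea, here is a small sketch on synthetic data. The lognormal revenue and the 99th-percentile cap are assumptions picked for illustration, not a recommended threshold.

```python
import numpy as np

rng = np.random.default_rng(1)   # fixed seed, chosen arbitrarily
# Synthetic revenue with a heavy right tail (lognormal is an assumption)
revenue = rng.lognormal(mean=3.0, sigma=1.2, size=50_000)

cap = np.quantile(revenue, 0.99)       # hypothetical cap at the 99th percentile
capped = np.minimum(revenue, cap)      # values above the cap are pulled down to it

raw_var = np.var(revenue, ddof=1)
capped_var = np.var(capped, ddof=1)
print(f"raw variance:    {raw_var:.0f}")
print(f"capped variance: {capped_var:.0f}")
print(f"raw mean: {revenue.mean():.2f}, capped mean: {capped.mean():.2f}")
```

Capping shrinks the variance but also lowers the mean, so the capped metric answers a slightly different question than raw revenue. That trade-off is what needs to be understood before choosing it.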
Interview questions
"Define variance and standard deviation."
Variance is the average of squared deviations from the mean and captures how spread out the data is. Standard deviation is the square root of variance and lives in the same units as the data, which is why it is preferred for interpretation. If pushed for one sentence: σ is the typical distance of a data point from the mean.
"Population vs sample variance — what is the difference?"
Population variance divides the sum of squared deviations by N; sample variance divides by N - 1. The N - 1 adjustment is Bessel's correction. It exists because the sample mean is computed from the data itself, so the sum of squared deviations is biased downward as an estimate of the true population spread. Dividing by N - 1 cancels that bias and yields an unbiased estimator.
"How does variance affect A/B tests?"
Variance drives both the minimum sample size and the width of the resulting confidence interval. For a two-sided test the sample size per arm is roughly n = 2 · (z_α/2 + z_β)² · σ² / δ², where z_α/2 and z_β are the normal quantiles for the significance level and the power, and δ is the minimum detectable effect. Higher σ² inflates the required sample size, which is why revenue tests at marketplaces like Airbnb or Uber demand far more traffic than conversion tests on the same product.
"Why do np.std() and pd.Series.std() return different numbers?"
Different defaults for ddof. NumPy defaults to ddof=0 and divides by N. Pandas defaults to ddof=1 and divides by N - 1. To keep results consistent across notebooks and dashboards, always pass ddof explicitly and document the convention in your repo.
"How would you reduce variance in an A/B test metric?"
A handful of standard moves. CUPED adjusts for pre-experiment covariates and can cut variance by 30-50% on the right metrics. Switching the metric — moving from raw revenue to bounded or binarized versions — trades a little signal for a lot of variance reduction. Winsorizing or using robust estimators like the median helps when outliers dominate. Stratified randomization on a high-variance covariate is another lever, and the unglamorous answer is often simply running the test for longer.
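A bare-bones sketch of the CUPED adjustment, on synthetic data. It assumes a single pre-experiment covariate and the usual adjustment Y − θ(X − mean(X)) with θ = cov(X, Y) / var(X). The data-generating numbers are invented, so the size of the reduction here says nothing about real metrics.

```python
import numpy as np

rng = np.random.default_rng(2)   # fixed seed, chosen arbitrarily
n = 20_000
pre = rng.normal(100, 20, size=n)               # synthetic pre-experiment covariate
post = 0.8 * pre + rng.normal(0, 10, size=n)    # synthetic in-experiment metric

# theta minimizes the variance of the adjusted metric
theta = np.cov(pre, post, ddof=1)[0, 1] / np.var(pre, ddof=1)
post_cuped = post - theta * (pre - pre.mean())

print(f"variance before: {np.var(post, ddof=1):.1f}")
print(f"variance after:  {np.var(post_cuped, ddof=1):.1f}")
print(f"mean before: {post.mean():.2f}, mean after: {post_cuped.mean():.2f}")
```

The adjustment leaves the mean unchanged and removes the share of variance explained by the covariate. In a real experiment θ is typically estimated on data pooled across arms, and the gain depends entirely on how well the pre-period metric predicts the in-experiment one.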
FAQ
What is variance in plain English?
Variance is a single number that tells you how spread out a set of values is around its average. Small variance means most points cluster tightly near the mean; large variance means values are scattered. It is the average of squared distances from the mean, which is why its unit is the original unit squared.
Why is standard deviation preferred over variance for reporting?
Standard deviation is the square root of variance, so it lives in the same units as the data. If your metric is in dollars, σ is in dollars and you can say "orders deviate from the average by about $6," which lands in any stakeholder meeting. Variance is in dollars squared, which is hard to translate into a useful sentence.
When should I divide by N versus N - 1?
Divide by N only when you have the entire population — every customer, every event, every row that will ever exist. Divide by N - 1 when you have a sample and want an unbiased estimate of the population variance. In analytics work the sample case dominates, so N - 1 (and the _SAMP SQL functions or ddof=1 in code) should be your default.
How does variance connect to A/B testing?
Variance shows up in three places at once. It sets the minimum sample size you need to detect a target effect, the width of any confidence interval, and the p-value of any frequentist test. That is why variance reduction techniques such as CUPED or careful metric design can shrink the traffic budget for an experimentation roadmap.
Is high variance always bad?
No. High variance is a property of the data — sometimes it reflects real heterogeneity you want to preserve, like the long tail of basket sizes at a marketplace. The right response is not to delete data but to choose statistical tools that handle spread well: robust estimators, bootstrap intervals, capped metrics, or variance-reduction techniques.