How to calculate Bonferroni correction in SQL
Contents:
Why Bonferroni matters
Picture a Friday afternoon at Stripe. A PM pings on Slack: "Checkout v3 launched two weeks ago, can you confirm it won over v2 on all our north-star metrics by Monday morning?" You open the dashboard. Five variants, ten metrics, four segments the team cares about. That is two hundred individual hypothesis tests on one experiment. If you treat each one with the standard alpha of 0.05, the chance that at least one comparison comes back significant by pure chance is essentially one. You will ship a winner that does not exist, and the PM will quietly stop trusting your dashboards.
Family-wise error rate, or FWER, is the probability that any one of the comparisons in your "family" of tests is a false positive. Multiple-comparison corrections keep that family-wise rate at or below the alpha you committed to in the PRD. Bonferroni is the oldest, simplest, and most conservative member of that family. It will never let your false-positive rate exceed alpha, and it costs four lines of SQL to apply. The trade-off is power: because the correction is blunt, it can mask real wins when the number of tests grows. Every analyst should know how to compute it, when to apply it, and when to swap it for Holm or Benjamini-Hochberg.
This post walks through the math, the SQL, and the failure modes I see most often when reviewing experiment readouts. It assumes you already know what a p-value is and that you can read a CTE.
The Bonferroni formula
Suppose you ran m hypothesis tests in the same experiment family and you want the joint probability of any false positive to stay at or below alpha (typically 0.05). Bonferroni gives you two equivalent ways to enforce that bound.
The first form adjusts the threshold downward. You divide your family-wide alpha by the number of tests and compare each individual p-value against the smaller, per-test alpha:
alpha_per_test = alpha / mThe second form adjusts the p-values upward. You leave the alpha at 0.05 and multiply every p-value by m, capping at 1.0:
p_adjusted = min(p * m, 1)Both forms produce identical accept/reject decisions. The second is more useful in reports because you can show stakeholders a single column called p_bonferroni and let them apply the familiar 0.05 cutoff themselves. The first is more useful when you are pre-registering an experiment and want to lock the per-test threshold before any data lands.
The math behind it is the Bonferroni inequality. If A_i is the event that test i is a false positive, then P(union of A_i) is at most the sum of P(A_i). Each P(A_i) is at most alpha / m, so the union is at most alpha. The inequality is loose when tests are correlated, which is the source of the conservativeness everyone complains about.
Adjusted alpha in SQL
The simplest case: ten metrics in one experiment, family-wide alpha of 0.05. Compute the per-test threshold inside a CTE and join it onto your results table.
WITH config AS (
SELECT 0.05::NUMERIC AS family_alpha, 10::INT AS m_tests
)
SELECT
family_alpha,
m_tests,
family_alpha / m_tests AS bonferroni_alpha
FROM config;The result is 0.005. Every metric is now judged against alpha of 0.005, and the joint family-wise false-positive rate stays at or below five percent.
In practice you do not hard-code m. You count it from your hypothesis registry. If you keep one row per pre-registered comparison in a table called experiment_hypotheses, the count comes straight out of SQL:
WITH config AS (
SELECT
0.05::NUMERIC AS family_alpha,
(SELECT COUNT(*) FROM experiment_hypotheses WHERE experiment_id = 'checkout_v3') AS m_tests
)
SELECT family_alpha / m_tests AS bonferroni_alpha
FROM config;This approach forces the discipline of registering hypotheses up front. If a stakeholder asks for a sixth metric mid-experiment, the registry insert raises m and the cutoff tightens automatically. No silent p-hacking.
Adjusted p-values in SQL
Most reports are easier to read when the alpha is fixed and the p-values are adjusted instead. Imagine five primary metrics from a checkout test, each with a raw p-value already computed via your chi-square or two-proportion Z routines:
WITH raw_p AS (
SELECT * FROM (VALUES
('conversion', 0.003),
('revenue', 0.020),
('retention_d7', 0.045),
('time_on_page', 0.120),
('refunds', 0.500)
) AS t(metric, p_value)
),
correction AS (
SELECT COUNT(*)::NUMERIC AS m FROM raw_p
)
SELECT
r.metric,
r.p_value,
LEAST(r.p_value * c.m, 1.0) AS p_bonferroni,
LEAST(r.p_value * c.m, 1.0) < 0.05 AS significant_bonferroni
FROM raw_p r
CROSS JOIN correction c
ORDER BY r.p_value;Reading the output: conversion keeps its win after the correction because 0.003 * 5 = 0.015 remains below 0.05. revenue loses significance because 0.020 * 5 = 0.10 exceeds the threshold. Three metrics that looked promising on their own collapse into "not enough evidence" once Bonferroni accounts for the fact that you went looking five times.
The CROSS JOIN on a one-row correction table is intentional. It keeps m symbolic so you can swap five for fifty without rewriting the SELECT list, and it avoids a window function for a value that is constant across rows.
Holm-Bonferroni step-down
Holm-Bonferroni is the modern default for FWER control. It is the sequential, step-down cousin of plain Bonferroni and dominates it strictly: any metric Holm rejects, Bonferroni rejects too, plus some that Bonferroni misses. The math is to sort p-values ascending and multiply each by a shrinking factor based on its rank.
p_(1) * m, p_(2) * (m - 1), ..., p_(m) * 1The smallest p-value is still scaled by m, just like Bonferroni, so the most extreme finding is judged just as harshly. As you walk up the sorted list, the multiplier shrinks, so borderline winners get a real chance to clear the bar.
WITH raw_p AS (
SELECT * FROM (VALUES
('conversion', 0.003),
('revenue', 0.020),
('retention_d7', 0.045),
('time_on_page', 0.120),
('refunds', 0.500)
) AS t(metric, p_value)
),
ordered AS (
SELECT
metric,
p_value,
ROW_NUMBER() OVER (ORDER BY p_value) AS rnk,
COUNT(*) OVER () AS m
FROM raw_p
)
SELECT
metric,
p_value,
LEAST(p_value * (m - rnk + 1), 1.0) AS p_holm,
LEAST(p_value * (m - rnk + 1), 1.0) < 0.05 AS significant_holm
FROM ordered
ORDER BY rnk;A correctness detail Holm requires: once a test fails, every test ranked higher (with a larger raw p-value) is also non-significant. You can enforce that monotonicity in SQL by replacing p_holm with a running maximum over rank.
When you have a choice between Bonferroni and Holm, pick Holm. FWER control is identical, but Holm preserves more power.
Common pitfalls
The most common error is miscounting m. If for each metric you run both a one-sided "treatment greater than control" and a one-sided "treatment less than control" test, that is two tests per metric, not one. Five metrics tested in both directions gives m = 10. The same logic applies across segments: ten metrics on four segments is forty comparisons, not ten. Always document the unit of analysis and count every comparison the team will look at.
A second trap is applying Bonferroni to strongly correlated metrics. Conversion rate, revenue per visitor, and add-to-cart rate move together by construction. Bonferroni treats them as independent and divides alpha as if no overlap existed, which is why analysts call it "too conservative". When metrics share denominators or are aliases of the same underlying behavior, the Sidak correction or Hochberg's step-up procedure preserves more power while still controlling FWER. For very large m, Benjamini-Hochberg is usually a better fit.
A third pitfall is applying any correction post-hoc. The moment someone says "we got eight non-significant metrics, let us throw in five more and see if those move", you have left the world of pre-registered tests and entered p-hacking territory. Bonferroni gives the promised guarantee only when m is fixed before the experiment runs. Sneaking in extra hypotheses after looking at the data invalidates the correction.
A fourth mistake is conflating FWER with FDR. Bonferroni and Holm control FWER, the probability of any false positive in the entire family. False discovery rate is a different goal: the expected proportion of false positives among the metrics you flag as significant. When m is small, say five to ten, FWER methods are appropriate. When m is large, say fifty primary metrics on a feature flag rollout, FWER methods leave most real wins on the floor and FDR methods like Benjamini-Hochberg become the right tool.
A fifth gotcha is numeric casting. In Postgres the literal 1.0 is a double precision, and mixing it with a NUMERIC p-value can introduce float-rounding surprises at the boundary of the 0.05 cutoff. Cast either side to NUMERIC explicitly. The same caution applies in Snowflake, BigQuery, and Redshift when mixing decimal and float types. Consistency in numeric types pays dividends when an audit asks why a metric flipped significance between two runs.
Optimization tips
For production reporting you will run these corrections over thousands of metric-experiment pairs, not a five-row VALUES list. Compute m per experiment family once, store the result alongside each raw p-value, and let the correction become a cheap arithmetic expression on top of a materialized table.
A materialized view keyed on experiment_id is the right shape. Snowflake, Databricks, and BigQuery let you incrementally refresh such a view; Postgres requires a manual REFRESH MATERIALIZED VIEW but the cost is trivial because the input table is small. For the Holm variant, store the precomputed rnk column alongside m so the correction is a single multiplication at read time rather than a window function on every dashboard load.
Indexing is rarely the bottleneck for correction tables because they are tiny. The expensive part of an A/B pipeline is the metric computation that produces the raw p-values upstream. Profile that first. If your raw p-values come from a chi-square or two-proportion Z test in SQL, partition the underlying event tables by experiment and metric date so the test queries hit a slice rather than a scan.
One trick that pays off at scale: store both p_bonferroni and p_holm columns in the same row. Letting the dashboard pick the column is cheaper than letting the analyst pick the correction.
Related reading
- SQL window functions interview questions
- A/B testing peeking mistake
- How to calculate AUC ROC in SQL
- How to detect anomalies in SQL
If you want to drill questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
Bonferroni or Holm-Bonferroni in interviews?
Both answers are defensible, but Holm-Bonferroni is stronger because it controls FWER at the same level while preserving more power. Plain Bonferroni earns mention as the simplest case and the one you apply when the protocol requires a fixed per-test threshold known before data collection. Naming both and explaining when each fits signals that you understand the trade-off rather than memorizing a single recipe.
When does Bonferroni become too conservative?
The rule of thumb is m greater than about twenty. Past that point the per-test alpha drops below 0.0025 and the procedure rejects real effects so aggressively that you can run a winning experiment for a quarter and never call a single metric. For families that large, Benjamini-Hochberg controlling the false discovery rate is the right method. It accepts a small expected fraction of false positives among your significant metrics in exchange for vastly better power on the true positives.
Can I use Bonferroni for pairwise comparisons in A/B/C/D tests?
Yes, and this is one of the cleanest applications. For a four-arm test with six pairwise comparisons, set m = 6 and divide alpha by six, giving a per-comparison threshold of about 0.0083. You can apply the same logic for any k arms with m = k * (k - 1) / 2 pairwise comparisons. If you are only interested in comparisons against a single control arm, drop the count to m = k - 1 and tighten less.
Where do the raw p-values come from?
For binomial outcomes like conversion or click-through, compute them in SQL using a chi-square or two-proportion Z approximation. The arithmetic fits inside a CTE and you only need the inverse normal CDF. For continuous metrics like revenue or session time, the t-test or a Mann-Whitney U is more appropriate and is easier to compute in Python with scipy. Most production pipelines compute exact p-values in Python and feed them into a metric table that SQL then corrects.
FWER versus FDR, plain English?
FWER is the probability that any of your significant findings is a false positive. FDR is the expected share of false positives within your significant findings. Pick FWER when each false positive is expensive on its own, such as a launched feature that costs millions if wrong. Pick FDR when you are screening many candidates and a small fraction of mistakes is acceptable, such as a feature-flag rollout with fifty tracked metrics where only a handful will be acted on.
Does Bonferroni work for sequential or peeking analyses?
No. Bonferroni assumes a fixed set of tests at a single decision point. Sequential analysis, where you look at the data multiple times during the experiment and might stop early, needs a different family of methods: alpha-spending functions, group-sequential designs, or always-valid p-values. Applying Bonferroni to peeking is a common interview tripwire because the candidate has the right vocabulary but the wrong method.