Bias and fairness in ML for DS interviews
Why this comes up on the loop
Fairness has stopped being a niche ethics chapter and has become a required topic on senior DS loops at Google, Meta, Stripe, and Anthropic. The trigger was a wave of public model failures — credit decisioning that approved men at twice the rate of women on identical inputs, resume screeners that penalized women's college names, healthcare risk scores that under-allocated care to Black patients at equal medical need. Every incident fed an internal post-mortem that now seeds the interview prep doc.
The interviewer is not looking for a memorized ethics speech. They want three concrete things: that you can name at least four bias types with a one-line example each, that you can write the four-fifths rule on the whiteboard without prompting, and that you understand why demographic parity and equalized odds cannot both hold for a useful classifier unless base rates across groups are identical. A real mitigation story (re-weighting, threshold calibration, dropping a proxy feature) puts you above bar.
Skip the buzzwords. The fastest way to lose points is to say "we just removed the protected attribute" and pretend the problem is solved.
Load-bearing trick: When the interviewer asks "which fairness metric would you use?", they want you to refuse to pick one in isolation. Name the stakeholder, name the cost of a false positive vs false negative for each group, then pick.
The bias taxonomy interviewers actually probe
Most candidates rattle off six or seven bias names from a Medium article and stop. The bar at top-tier shops is to explain the mechanism that produces each bias and the detection signal in your data.
Selection bias appears when your training sample is not drawn uniformly from the population the model will serve. A loan-approval model trained only on customers who already passed a manual underwriting filter inherits that filter's blind spots. Detection signal: large distribution shift between training data and live application traffic on demographic features.
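One rough way to put a number on that detection signal is to compare the category mix of a demographic feature in training data against live traffic. The sketch below uses total variation distance; the function names and the age-band data are hypothetical, and the distance measure is one choice among several.

```python
from collections import Counter

def category_shares(values):
    """Return the fraction of records falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {k: c / total for k, c in counts.items()}

def total_variation_distance(train_values, live_values):
    """Half the sum of absolute share differences: 0 = same mix, 1 = disjoint."""
    p = category_shares(train_values)
    q = category_shares(live_values)
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical age-band mix: training sample vs. live application traffic.
train_age_band = ["25-40"] * 70 + ["41-60"] * 25 + ["60+"] * 5
live_age_band = ["25-40"] * 50 + ["41-60"] * 30 + ["60+"] * 20

shift = total_variation_distance(train_age_band, live_age_band)
print(f"TVD on age band: {shift:.2f}")
```

What counts as a large shift depends on the feature and the traffic volume, so any alerting cutoff would need to be set per use case.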
Sampling bias is the special case where some subgroups are mechanically under-represented — for example, a fraud model trained on transactions from the US East region that gets deployed globally. The signal is uneven error rates across geographies that disappear when you reweight.
Historical bias is the one that trips up the most candidates. The data is representative of reality but reality itself encodes a past injustice. Hiring data from 2005 perfectly reflects the gender mix of 2005 software engineering — and is therefore a terrible target for a 2026 model. Removing the protected feature does not help here, because dozens of proxies (school, GitHub activity dates, hobby keywords) leak the same signal.
Measurement bias shows up when the feature itself measures something different across groups. Arrest counts are not a clean proxy for crimes committed, because policing intensity varies by neighborhood. Doctor visits are not a clean proxy for health, because access to care varies by income. Interviewers love this one, so be ready with a non-criminal-justice example so you don't sound rehearsed.
Aggregation bias is the failure mode of "one model for everyone". A diabetes risk model fit on a pooled population may underfit a subgroup whose biomarker thresholds differ — the textbook example is HbA1c cutoffs varying between ethnic groups. The detection signal: residuals are systematically larger inside one segment.
Survivorship bias is the startup example: if your churn model trains only on companies that survived long enough to enter your CRM, it cannot learn from those that died in week one. Always ask where the data was cut off.
| Bias type | Mechanism | Detection signal |
|---|---|---|
| Selection | Non-uniform draw from population | Distribution shift train vs prod |
| Sampling | Subgroup under-represented | Uneven error rate by segment |
| Historical | Past injustice encoded in labels | Proxy features carry the signal |
| Measurement | Feature means different things per group | Residual analysis by subgroup |
| Aggregation | One model for heterogeneous groups | Subgroup loss > pooled loss |
| Survivorship | Failures filtered out before collection | Suspiciously good base rates |
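Several of the detection signals in the table reduce to the same check: compute the error per segment and compare it with the pooled figure. A minimal sketch, assuming mean absolute error as the loss and hypothetical segment labels:

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute gap between labels and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def error_by_segment(y_true, y_pred, segments):
    """Return (pooled error, {segment: error}) so outlier segments stand out."""
    pooled = mean_abs_error(y_true, y_pred)
    by_segment = {}
    for seg in set(segments):
        idx = [i for i, s in enumerate(segments) if s == seg]
        by_segment[seg] = mean_abs_error(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    return pooled, by_segment

# Hypothetical labels, scores, and segment membership.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0.9, 0.1, 0.8, 0.7, 0.6, 0.4, 0.5, 0.7]
segments = ["a", "a", "a", "a", "b", "b", "b", "b"]

pooled, by_segment = error_by_segment(y_true, y_pred, segments)
for seg, err in sorted(by_segment.items()):
    print(f"segment {seg}: error {err:.3f} (pooled {pooled:.3f})")
```

A segment whose error sits well above the pooled value is the cue to investigate sampling, measurement, or aggregation bias for that segment.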
Fairness metrics you should be able to derive
The single formula every DS candidate must reproduce cold is the four-fifths rule from the 1978 Uniform Guidelines on Employee Selection Procedures, adopted by the US Equal Employment Opportunity Commission. The ratio it tests is usually called disparate impact:
disparate_impact = P(ŷ=1 | group=A) / P(ŷ=1 | group=B) ≥ 0.80
If group A's positive prediction rate falls below 80% of group B's, the model is flagged for an adverse impact review. This is the metric plaintiffs' lawyers and regulators use, so it appears in interview prompts framed as "your model just got a legal complaint, what do you check first".
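The ratio is simple enough to compute by hand, but a short sketch makes the direction of the division explicit. The function and variable names are hypothetical; the protected (lower-rate) group goes in the numerator, matching the formula.

```python
def selection_rate(predictions):
    """Share of positive (1) predictions in a list of 0/1 decisions."""
    return sum(predictions) / len(predictions)

def disparate_impact(preds_protected, preds_reference):
    """P(yhat=1 | protected group) / P(yhat=1 | reference group)."""
    return selection_rate(preds_protected) / selection_rate(preds_reference)

# Hypothetical decisions: 45% approved in one group, 60% in the other.
preds_protected = [1] * 45 + [0] * 55
preds_reference = [1] * 60 + [0] * 40

ratio = disparate_impact(preds_protected, preds_reference)
flagged = ratio < 0.80
print(f"disparate impact: {ratio:.2f}, flagged for review: {flagged}")
```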
Beyond disparate impact there are three families of group fairness metrics worth knowing by name: independence (the prediction is independent of the protected attribute), separation (the prediction is independent of the protected attribute conditional on the true label), and sufficiency (the true label is independent of the protected attribute conditional on the prediction). Demographic parity belongs to the independence family. Equalized odds belongs to separation. Calibration belongs to sufficiency. Interviewers ask "which family is your metric from" to test whether you actually understand the structure or just memorized acronyms.
A worked example anchors this. Suppose your loan model approves 60% of group A and 45% of group B. Taking the lower-rate group as the numerator, disparate impact is 0.45 / 0.60 = 0.75, below the 0.80 threshold, which triggers review. Adjust thresholds so both groups are approved at 55%: demographic parity holds. But if true repayment is 80% in A and 65% in B, equal opportunity will generally fail: a creditworthy applicant in A is now less likely to be approved than a creditworthy applicant in B, and you are approving relatively more bad-risk applicants in group B, which the lender catches in P&L.
Demographic parity vs equalized odds
Demographic parity requires P(ŷ=1 | A) = P(ŷ=1 | B). In plain English: the model approves the same fraction of each group regardless of ground truth. This is what most non-technical stakeholders mean when they say "fair".
Equalized odds requires both P(ŷ=1 | y=1, A) = P(ŷ=1 | y=1, B) and P(ŷ=1 | y=0, A) = P(ŷ=1 | y=0, B). In plain English: the model has the same true positive rate and the same false positive rate in every group. This is what lenders and clinicians usually want, because it controls error cost rather than approval count.
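Checking equalized odds in practice means computing the true positive rate and false positive rate per group and looking at the gaps. A minimal sketch with hypothetical labels and predictions:

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for 0/1 labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return tp / positives, fp / negatives

def equalized_odds_gaps(group_a, group_b):
    """Each argument is a (y_true, y_pred) pair; returns (|TPR gap|, |FPR gap|)."""
    tpr_a, fpr_a = tpr_fpr(*group_a)
    tpr_b, fpr_b = tpr_fpr(*group_b)
    return abs(tpr_a - tpr_b), abs(fpr_a - fpr_b)

# Hypothetical data: same FPR in both groups, lower TPR in group B.
group_a = ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])
group_b = ([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 1, 0, 0, 0])

tpr_gap, fpr_gap = equalized_odds_gaps(group_a, group_b)
print(f"TPR gap: {tpr_gap:.2f}, FPR gap: {fpr_gap:.2f}")
```

Here the FPR condition holds but the TPR condition does not, so the model fails equalized odds even though half of the definition is satisfied.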
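The impossibility result you need to be able to state: if base rates differ across groups, you cannot simultaneously satisfy demographic parity, equalized odds, and calibration, except in degenerate cases such as a perfect or uninformative predictor. Chouldechova 2017 and Kleinberg, Mullainathan, Raghavan 2016 formalized the core of it: a calibrated score cannot also equalize false positive and false negative rates across groups when base rates differ. Demographic parity conflicts with equalized odds under the same condition, because equal TPR and FPR applied to unequal base rates yield unequal approval rates. The practical implication is that picking a fairness metric is a value judgment, not a technical optimization. Interviewers will push on this; the right answer is to clarify with the stakeholder which type of error matters more.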
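The clash between demographic parity and equalized odds can be checked with the law of total probability: the positive rate in a group equals TPR times the base rate plus FPR times one minus the base rate. The sketch below uses illustrative numbers (TPR 0.90, FPR 0.20, base rates 0.80 and 0.65), all of which are assumptions for the demonstration.

```python
def positive_rate(tpr, fpr, base_rate):
    """P(yhat=1) = TPR * P(y=1) + FPR * P(y=0)."""
    return tpr * base_rate + fpr * (1.0 - base_rate)

# Equalized odds holds by construction: both groups share the same TPR and FPR.
tpr, fpr = 0.90, 0.20
rate_a = positive_rate(tpr, fpr, base_rate=0.80)
rate_b = positive_rate(tpr, fpr, base_rate=0.65)

# The gap equals (TPR - FPR) * (base_rate_a - base_rate_b), so it vanishes only
# when base rates match or the classifier is uninformative (TPR == FPR).
gap = rate_a - rate_b
print(f"positive rate A: {rate_a:.3f}, B: {rate_b:.3f}, gap: {gap:.3f}")
```

With equal error rates and unequal base rates, the approval rates differ, so demographic parity fails for any classifier that actually separates the classes.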
| Metric | Family | Formula | Best for |
|---|---|---|---|
| Demographic parity | Independence | Equal positive rate | Allocation under quotas |
| Equal opportunity | Separation | Equal TPR | High-stakes denials (loans, hiring) |
| Equalized odds | Separation | Equal TPR and FPR | Cost-sensitive classification |
| Calibration | Sufficiency | P(y=1 \| ŷ=s) equal across groups | Risk scoring (medical, credit) |
Sanity check: If your interviewer asks for "the fairest model", say out loud that no model can be fair under every definition simultaneously, and ask which type of error the business cannot tolerate. This single move flips you from a junior to a senior signal.
Mitigation strategies in production
Mitigation tactics split cleanly into three buckets by where in the pipeline they live.
Pre-processing modifies training data before the model sees it: reweighting examples so each subgroup contributes equally to the loss, resampling to balance representation, and suppression of the protected attribute plus its strongest proxies. Tools like AIF360 (IBM) and Fairlearn (Microsoft) ship reweighing implementations out of the box.
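As a rough illustration of reweighting so that each subgroup contributes equally to the loss, the sketch below assigns each example a weight of n / (k * n_g), where n_g is the size of its group and k is the number of groups. This is a simplified scheme with hypothetical names, not the API of any toolkit; library reweighing implementations may also condition on the label.

```python
from collections import Counter

def group_balancing_weights(groups):
    """Weight each example by n / (k * n_g) so every group carries equal total weight."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return [n / (k * counts[g]) for g in groups]

# Hypothetical training set: group "a" outnumbers group "b" four to one.
groups = ["a"] * 8 + ["b"] * 2
weights = group_balancing_weights(groups)

total_a = sum(w for w, g in zip(weights, groups) if g == "a")
total_b = sum(w for w, g in zip(weights, groups) if g == "b")
print(f"total weight a: {total_a:.1f}, b: {total_b:.1f}")
```

The weights sum to the original sample size, so they can usually be passed to a trainer that accepts per-example sample weights without rescaling the learning rate or regularization.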
In-processing modifies the objective. Add a fairness penalty — typically the absolute TPR difference between groups — to the loss, weighted by a hyperparameter that controls the accuracy-fairness trade-off. Adversarial debiasing is the sophisticated version: train a second network to predict the protected attribute from the model's representation, with a gradient-reversal term so the main model learns features the adversary cannot decode.
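A fairness-penalized objective can be sketched as log loss plus a weighted TPR gap. The version below is an assumption-laden illustration: it uses the mean score over true positives as a smooth stand-in for TPR, assumes exactly two groups, and only evaluates the objective without training anything. All names are hypothetical.

```python
import math

def log_loss(y_true, scores):
    """Average binary cross-entropy, clipped to avoid log(0)."""
    eps = 1e-12
    return -sum(
        t * math.log(max(s, eps)) + (1 - t) * math.log(max(1 - s, eps))
        for t, s in zip(y_true, scores)
    ) / len(y_true)

def soft_tpr(y_true, scores):
    """Mean score over true positives: a smooth surrogate for the hard TPR."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    return sum(pos) / len(pos)

def penalized_loss(y_true, scores, groups, lam):
    """Log loss plus lam * |soft TPR gap| between the two groups present."""
    a, b = sorted(set(groups))  # assumes exactly two groups

    def subset(g):
        idx = [i for i, x in enumerate(groups) if x == g]
        return [y_true[i] for i in idx], [scores[i] for i in idx]

    gap = abs(soft_tpr(*subset(a)) - soft_tpr(*subset(b)))
    return log_loss(y_true, scores) + lam * gap

y_true = [1, 1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.2, 0.6, 0.5, 0.3]
groups = ["a", "a", "a", "b", "b", "b"]

for lam in (0.0, 1.0):
    print(f"lambda={lam}: loss={penalized_loss(y_true, scores, groups, lam):.3f}")
```

Sweeping the hyperparameter from zero upward traces out the accuracy-fairness trade-off curve the paragraph describes.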
Post-processing keeps the model fixed and adjusts thresholds per group. Hardt, Price, Srebro 2016 showed that you can derive an equalized-odds predictor from a fixed score by choosing group-specific thresholds, randomizing between two thresholds where needed, and that this can be done at minimum accuracy cost among such post-processed predictors. The downside: per-group thresholds are visible and create legal exposure, since you are explicitly using the protected attribute at inference.
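A simplified version of per-group thresholding is sketched below. It targets a common true positive rate only (equal opportunity), which is a narrower goal than full equalized odds, and it picks deterministic cutoffs from the observed scores. Function names and data are hypothetical.

```python
def tpr_at(y_true, scores, threshold):
    """True positive rate when predicting positive for score >= threshold."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    return sum(1 for s in pos if s >= threshold) / len(pos)

def threshold_for_tpr(y_true, scores, target_tpr):
    """Highest observed score cutoff whose TPR meets the target."""
    for cutoff in sorted(set(scores), reverse=True):
        if tpr_at(y_true, scores, cutoff) >= target_tpr:
            return cutoff
    return min(scores)

# Hypothetical scores: group B's scores run lower for equally qualified applicants.
y_a, s_a = [1, 1, 1, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.5, 0.3]
y_b, s_b = [1, 1, 1, 1, 0, 0], [0.7, 0.6, 0.5, 0.4, 0.45, 0.2]

thr_a = threshold_for_tpr(y_a, s_a, target_tpr=0.75)
thr_b = threshold_for_tpr(y_b, s_b, target_tpr=0.75)
print(f"threshold A: {thr_a}, threshold B: {thr_b}")
```

Both groups end up with the same TPR at different cutoffs, which is exactly the visible, documented use of the protected attribute that creates the legal exposure noted above.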
Pick the bucket that matches your constraints. Vendor model or legacy system → post-processing. Cannot expose protected attribute at inference → pre- or in-processing. Full pipeline control with a regulator who wants the calibration table → post-processing with documented thresholds.
Common pitfalls
The most damaging mistake is assuming that dropping the protected feature solves the problem. In any dataset of reasonable size, the protected attribute is usually reconstructible from proxies: zip code carries race in the US, given name is a strong signal of gender, and GitHub activity dates correlate with age. The fix is to keep the protected attribute available for measurement on a separate audit table, run proxy detection (a model predicting the protected attribute from your features; as a rule of thumb, an AUC above 0.7 indicates leakage), and either suppress those proxies or use in-processing.
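A lightweight first pass at proxy detection is to score each feature on its own as a predictor of the protected attribute. This is a simplification of the model-based check described above, since it misses proxies that only emerge from feature combinations. The feature names, the data, and the default cutoff of 0.7 are illustrative assumptions.

```python
def auc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def proxy_screen(features, protected, threshold=0.7):
    """Return features whose single-feature AUC for the protected attribute exceeds threshold."""
    flagged = {}
    for name, values in features.items():
        score = auc(protected, values)
        score = max(score, 1.0 - score)  # direction of the relationship does not matter
        if score > threshold:
            flagged[name] = score
    return flagged

# Hypothetical audit table: 1 marks membership in the protected group.
protected = [1, 1, 1, 0, 0, 0]
features = {
    "zip_income_index": [0.9, 0.8, 0.7, 0.3, 0.2, 0.6],
    "tenure_months": [5, 1, 4, 3, 2, 6],
}

flagged = proxy_screen(features, protected)
print(flagged)
```

Features that pass this screen can still leak jointly, so a multivariate model remains the stronger test.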
A second pitfall is measuring fairness only at the population level. Two attributes can each pass the four-fifths rule on their own while an intersection of them, say women over 50, fails badly. This is intersectional bias, and single-attribute audits will not surface it. Always evaluate at the smallest segment your data supports stably, typically n ≥ 1000 per cell.
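The sketch below shows how marginal checks can pass while an intersection fails. It computes selection rates per cell, skips cells below a minimum size, and compares the lowest rate to the highest. The records are synthetic, the names are hypothetical, and the minimum cell size is lowered to 100 purely to keep the demonstration small.

```python
from collections import defaultdict

def selection_rates_by_cell(records, keys, min_n=1000):
    """Selection rate per intersection of `keys`, skipping cells smaller than min_n."""
    totals, positives = defaultdict(int), defaultdict(int)
    for r in records:
        cell = tuple(r[k] for k in keys)
        totals[cell] += 1
        positives[cell] += r["pred"]
    return {c: positives[c] / n for c, n in totals.items() if n >= min_n}

def worst_ratio(rates):
    """Lowest cell rate divided by highest cell rate."""
    return min(rates.values()) / max(rates.values())

def make_cell(gender, age, n, approved):
    """Synthetic records: the first `approved` of n rows get a positive prediction."""
    return [{"gender": gender, "age": age, "pred": int(i < approved)} for i in range(n)]

records = (
    make_cell("m", "<50", 100, 60) + make_cell("m", "50+", 100, 65)
    + make_cell("f", "<50", 100, 65) + make_cell("f", "50+", 100, 40)
)

for keys in (["gender"], ["age"], ["gender", "age"]):
    rates = selection_rates_by_cell(records, keys, min_n=100)
    print(keys, f"worst ratio: {worst_ratio(rates):.2f}")
```

Gender and age each clear 0.80 on their own, while the gender-by-age breakdown does not.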
A third pitfall is picking the fairness metric your model already passes, the analyst's equivalent of HARKing (hypothesizing after the results are known). Decide the metric before you look at disparities, ideally in a pre-registered metric doc reviewed by legal or a non-DS stakeholder.
The fourth pitfall is treating the threshold as a fairness control while leaving the score uncalibrated per group. If calibration curves differ — a score of 0.7 means 65% probability in group A and 45% in group B — any single threshold produces mismatched error rates. Calibration per group is a prerequisite, not an alternative, to threshold tuning.
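A quick per-group calibration check is to take one score bucket and compare the observed positive rate in each group. The sketch below reproduces the 0.7-score example with synthetic data; the function name and bin edges are hypothetical, and a full audit would repeat this across all score bins.

```python
def observed_rate_in_bin(y_true, scores, lo, hi):
    """Observed positive rate among examples with lo <= score < hi (None if the bin is empty)."""
    hits = [t for t, s in zip(y_true, scores) if lo <= s < hi]
    return sum(hits) / len(hits) if hits else None

# Synthetic data: every example scores 0.7, but outcomes differ by group.
scores_a, labels_a = [0.7] * 20, [1] * 13 + [0] * 7
scores_b, labels_b = [0.7] * 20, [1] * 9 + [0] * 11

rate_a = observed_rate_in_bin(labels_a, scores_a, 0.65, 0.75)
rate_b = observed_rate_in_bin(labels_b, scores_b, 0.65, 0.75)
print(f"score ~0.7 -> observed rate A: {rate_a:.2f}, B: {rate_b:.2f}")
```

When the two observed rates diverge like this, the same score means different things per group, so recalibrating within each group comes before any threshold tuning.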
The fifth pitfall is ignoring distribution shift after launch. Training-time fairness audits hold for the training distribution only. A new marketing channel or product line can re-introduce disparities your monthly dashboard misses. Bake the disparate-impact ratio into your model-monitoring SLOs with a paged alert at 0.75 and a soft alert at 0.85.
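The two alert levels can be expressed as a small mapping from the monitored ratio to a severity. This is a sketch with hypothetical names; wiring it into an actual paging or dashboard system is left out.

```python
def disparate_impact_alert(ratio, page_below=0.75, warn_below=0.85):
    """Map a disparate-impact ratio to an alert level: 'page', 'warn', or 'ok'."""
    if ratio < page_below:
        return "page"
    if ratio < warn_below:
        return "warn"
    return "ok"

# Hypothetical weekly readings of the production ratio.
for week, ratio in enumerate([0.91, 0.84, 0.73], start=1):
    print(f"week {week}: ratio {ratio:.2f} -> {disparate_impact_alert(ratio)}")
```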
FAQ
Can bias ever be fully eliminated from a model?
Practically, no. You can drive measurable disparity below regulatory thresholds (the four-fifths rule is the most common bar) and you can document the residual trade-off. But the impossibility results from Chouldechova 2017 and Kleinberg et al. 2016 show that a calibrated model cannot also equalize error rates across groups when base rates differ, short of perfect prediction, and demographic parity conflicts with equalized odds under the same condition. Base rates almost always differ. The honest answer in an interview is "minimize and monitor, do not promise elimination".
Is removing the protected attribute enough to make a model fair?
No, and saying yes in an interview is a near-instant downgrade. Protected attributes leak through proxies — zip code, school, name, browsing pattern. A useful diagnostic is to fit a separate model that tries to predict the protected attribute from your features alone; if its AUC is well above 0.5, you have proxy leakage that suppression has not removed. Use adversarial debiasing or proxy suppression with an audited held-out set.
How is fairness related to model interpretability?
They are complementary. Fairness metrics tell you whether the model treats groups equitably; interpretability tools — SHAP, integrated gradients, counterfactual explanations — tell you why. In a regulated industry you usually need both: a fairness audit to pass legal review, and feature-attribution explanations to respond to individual adverse-action notices. Many candidates conflate the two; keep them separate in your answer.
When does a model trigger a four-fifths rule violation in practice?
When the positive prediction rate of the disadvantaged group divided by the rate of the advantaged group falls below 0.80. For example, if your model approves 70% of group B and only 50% of group A, the ratio is 0.71 — a violation. Some regulators apply a stricter 0.90 or a statistical-significance test in place of the bright line, so check the specific jurisdiction and use case before quoting the number.
What do interviewers want me to say about accuracy versus fairness?
That they are usually in tension and that the trade-off is a business decision, not a technical one. Quantify it: train two models, one optimized purely for accuracy and one with a fairness constraint, and report the delta in both metrics. A typical pattern is a 1-3% accuracy drop to bring disparate impact from 0.65 up to 0.85. Putting numbers on the trade-off is what separates a senior answer from a junior one.