Correlation explained simply

What correlation actually is

Correlation is a single number that tells you how tightly two variables move together. If a user spends more minutes in the app and also clicks more buttons, those two variables are positively correlated. If raising the price of a coffee makes fewer people buy it, price and units are negatively correlated. The correlation coefficient compresses that relationship into one value between minus one and plus one, which is why it shows up in every analytics deck and every data interview at companies like Stripe, Airbnb, DoorDash, Netflix, and Snowflake.

The most common flavor is the Pearson correlation, written as r. It measures the strength of the linear relationship between two numeric variables. A value of plus one means the points fall on a perfectly upward-sloping line, minus one means a perfectly downward-sloping line, and zero means no linear pattern. Everything in between is a partial signal, and the closer the absolute value is to one, the cleaner the line.

Three things make correlation useful in practice. It is symmetric, so the correlation of x with y equals y with x. It is scale-free, so the units do not matter — revenue in dollars or cents gives the same number. And it is cheap to compute, both in Python with a single NumPy call and in SQL with a single CORR aggregate.

How to read the number

The mental table below is the one you should be able to recite in an interview without looking it up.

r value         Strength of linear relationship
±1.0            Perfect: points fall on a straight line
±0.7 to ±0.9    Strong
±0.4 to ±0.6    Moderate
±0.1 to ±0.3    Weak
0               No linear relationship

The sign tells you direction. Positive means both variables move together; negative means one goes up when the other goes down. The magnitude tells you how tight the relationship is around a straight line — not how steep that line is.

A subtle point that trips people up: r is invariant to slope. A line with slope 0.001 and a line with slope 1000 can both produce r equal to one as long as the points sit exactly on those lines. If you want to know how much y changes per unit of x, you want a regression slope, not correlation.
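A quick way to convince yourself of this is to compute r for two exact lines with very different slopes. The sketch below uses made-up x values purely for illustration:

```python
import numpy as np

x = np.arange(1, 11, dtype=float)   # illustrative x values
y_shallow = 0.001 * x               # points exactly on a line with slope 0.001
y_steep = 1000.0 * x                # points exactly on a line with slope 1000

r_shallow = np.corrcoef(x, y_shallow)[0, 1]
r_steep = np.corrcoef(x, y_steep)[0, 1]
print(r_shallow, r_steep)           # both are 1 up to floating-point error
```

Both lines give the same r because correlation only asks how tightly the points sit on a line, not how steep the line is.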

R-squared is the square of r and lives between zero and one. It is the share of the variance in y explained by x in a simple linear regression. An r of 0.7 maps to an r-squared of 0.49, meaning the line explains about half the variability in y. That is also why interviewers will downgrade an "r equals 0.6" claim: r-squared is only 0.36, so nearly two-thirds of the variance is still unexplained.
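The link between r and explained variance can be checked numerically. This is a minimal sketch on a small made-up dataset: it fits a straight line with np.polyfit, computes one minus the ratio of residual to total sum of squares, and compares that to r squared.

```python
import numpy as np

# Illustrative data, not from any real product.
x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([2, 4, 5, 4, 6, 7], dtype=float)

r = np.corrcoef(x, y)[0, 1]

# Fit y = slope * x + intercept and measure what the line leaves unexplained.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
ss_res = np.sum(residuals ** 2)             # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)        # total variation in y
r_squared = 1 - ss_res / ss_tot

print(round(r ** 2, 4), round(r_squared, 4))   # the two values agree
```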

A worked example by hand

Say you collected five users and recorded their weekly active days and weekly spend.

user  days  spend
A     1     5
B     2     11
C     3     14
D     4     21
E     5     24

The means are 3 and 15. Subtract the means from every value: deviations of (-2, -1, 0, +1, +2) for days and (-10, -4, -1, +6, +9) for spend. Multiply pairs and sum: 20 + 4 + 0 + 6 + 18 = 48. Square the day deviations and sum: 4 + 1 + 0 + 1 + 4 = 10. Square the spend deviations: 100 + 16 + 1 + 36 + 81 = 234.

Plug it into the Pearson formula. Numerator 48, denominator sqrt(10 * 234) which is about 48.37. So r equals roughly 0.992 — a near-perfect linear relationship matching the eye test on the table.
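The same hand calculation can be written out step by step in plain Python. The formula used is the sum of paired deviation products divided by the square root of the product of the two sums of squared deviations; variable names are chosen here for readability.

```python
import math

days = [1, 2, 3, 4, 5]
spend = [5, 11, 14, 21, 24]

mean_days = sum(days) / len(days)        # 3.0
mean_spend = sum(spend) / len(spend)     # 15.0

dev_days = [d - mean_days for d in days]
dev_spend = [s - mean_spend for s in spend]

sum_products = sum(a * b for a, b in zip(dev_days, dev_spend))   # 48.0
sum_sq_days = sum(a * a for a in dev_days)                       # 10.0
sum_sq_spend = sum(b * b for b in dev_spend)                     # 234.0

r = sum_products / math.sqrt(sum_sq_days * sum_sq_spend)
print(round(r, 3))   # 0.992
```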

Now change user E's spend from 24 to 6, so the spend values become 5, 11, 14, 21, 6. Suddenly user E is an outlier: five days but only six dollars. Recompute and r drops to roughly 0.29, even though four of the five points still fit a tight line. One bad row wiped out most of the signal. That is the outlier sensitivity you have to flag whenever you use Pearson on real product data.
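To check the effect of the single changed row yourself, a short sketch like the following recomputes r on the original table, on the table with user E changed, and on the four unchanged rows alone:

```python
import numpy as np

days = np.array([1, 2, 3, 4, 5])
spend_clean = np.array([5, 11, 14, 21, 24])
spend_outlier = np.array([5, 11, 14, 21, 6])    # user E changed to 6

r_clean = np.corrcoef(days, spend_clean)[0, 1]
r_outlier = np.corrcoef(days, spend_outlier)[0, 1]
r_first_four = np.corrcoef(days[:4], spend_outlier[:4])[0, 1]   # drop user E

print(round(r_clean, 3), round(r_outlier, 3), round(r_first_four, 3))
```

Comparing the last two values shows how much of the drop is due to one row.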

Pearson, Spearman, and Kendall

Pearson is the workhorse and the first thing you reach for. It assumes the relationship between x and y is linear, the variables are reasonably symmetric in distribution, and outliers are not pulling the result around. When all three hold, Pearson is the most efficient of the three measures here, because it uses the actual values rather than only their ranks.

Spearman correlation is Pearson applied to the ranks of the values rather than the raw values. It measures monotonic association — does y consistently increase as x increases, regardless of whether the relationship is a straight line or a smooth curve. Spearman is robust to outliers (a single huge value still only has rank N) and works on ordinal data like NPS buckets or pricing tiers. Reach for Spearman when the scatter looks curved but consistently uphill, or when one column has a long tail you do not want to log-transform.
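The "Pearson on ranks" description can be verified directly. This sketch uses an illustrative exponential curve, which is strictly increasing but far from a straight line, and compares SciPy's spearmanr with Pearson computed on scipy.stats.rankdata output:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.exp(x)                        # curved but strictly increasing

pearson_raw = np.corrcoef(x, y)[0, 1]
pearson_on_ranks = np.corrcoef(rankdata(x), rankdata(y))[0, 1]
rho, _ = spearmanr(x, y)

print(round(pearson_raw, 3), round(pearson_on_ranks, 3), round(rho, 3))
```

Pearson on the raw values falls short of one because the curve bends, while the rank-based value is one because y never decreases as x increases.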

Kendall's tau is a third option, also rank-based, that counts concordant and discordant pairs. It is more conservative than Spearman and handles ties more gracefully, but most teams default to Spearman because it is faster on big tables. Rule of thumb: Pearson for linear numeric data with no heavy outliers, Spearman for monotonic but non-linear or ordinal data, Kendall on small samples with many ties.
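Counting concordant and discordant pairs is easy to sketch by hand. The example below uses small illustrative data with no ties, where the simple pair-count formula should match what scipy.stats.kendalltau reports:

```python
from itertools import combinations
from scipy.stats import kendalltau

x = [1, 2, 3, 4, 5]
y = [5, 11, 14, 21, 6]    # illustrative values with no ties

concordant = discordant = 0
for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
    direction = (x2 - x1) * (y2 - y1)
    if direction > 0:
        concordant += 1    # both variables move the same way
    elif direction < 0:
        discordant += 1    # they move in opposite directions

n_pairs = len(x) * (len(x) - 1) // 2
tau_by_hand = (concordant - discordant) / n_pairs
tau_scipy, _ = kendalltau(x, y)

print(concordant, discordant, tau_by_hand, round(tau_scipy, 3))
```

With ties present, SciPy applies a tie correction, so the plain pair-count formula would no longer match exactly.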

Correlation in Python and SQL

In Python with NumPy, Pearson r is a single call. For Spearman or Kendall, switch to SciPy.

import numpy as np
from scipy.stats import spearmanr, kendalltau

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 11, 14, 21, 24])

r = np.corrcoef(x, y)[0, 1]    # Pearson, ~0.992
rho, _ = spearmanr(x, y)        # Spearman
tau, _ = kendalltau(x, y)       # Kendall

In pandas, the .corr() method on a DataFrame returns the full correlation matrix — the move when you want to scan dozens of feature pairs at once.

import pandas as pd

df = pd.DataFrame({"days": x, "spend": y, "sessions": [2, 5, 7, 9, 12]})
df.corr()                       # Pearson by default
df.corr(method="spearman")      # Spearman across every pair

In SQL, Pearson is built into Postgres, Snowflake, BigQuery, Redshift, and ClickHouse as the CORR aggregate. Filter to non-null rows, call the function, and ship.

SELECT
    CORR(user_age_days, lifetime_value) AS r,
    COUNT(*)                            AS n
FROM users
WHERE user_age_days IS NOT NULL
  AND lifetime_value IS NOT NULL;

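For Spearman, compute ranks first with PERCENT_RANK or RANK, then call CORR on the rank columns. Note that RANK gives tied values the same lowest rank rather than the average rank that textbook Spearman uses, so results can differ slightly from SciPy when there are many ties. MySQL lacks CORR entirely, so you expand the formula by hand or push the computation to the application layer.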
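When you do have to expand the formula by hand, the usual route is the "sums" form of Pearson r, which needs only a row count and five running sums. The sketch below shows that form in Python so the arithmetic is easy to check before translating it into SUM and COUNT expressions; the function name corr_from_sums is hypothetical.

```python
import math

def corr_from_sums(xs, ys):
    """Pearson r from a row count and five sums, as SUM/COUNT could provide."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    sum_y2 = sum(b * b for b in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r = corr_from_sums([1, 2, 3, 4, 5], [5, 11, 14, 21, 24])
print(round(r, 3))   # 0.992
```

This form can lose precision when values are very large relative to their spread, so it is worth sanity-checking against a library result on a sample.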

Correlation is not causation

This is the line every interviewer wants to hear and that every junior analyst forgets. A high correlation between two variables tells you they move together in the data you have. It does not tell you that one is causing the other.

The classic example is ice cream sales and drowning rates. Both rise together across the summer and Pearson r between them is near plus 0.9. Eating ice cream does not push anyone into the water. The hidden third variable is hot weather, which drives both. Statisticians call this a confounder, and confounders are everywhere in observational product data.
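A small simulation makes the confounder visible. Everything here is synthetic: temperature drives both series, the coefficients are assumptions picked for illustration, and the helper residualize is a hypothetical name. Once the part explained by temperature is removed from each series, the remaining correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic, standardized quantities; the coefficients are assumptions.
temperature = rng.normal(size=n)
ice_cream = temperature + 0.5 * rng.normal(size=n)
drownings = temperature + 0.5 * rng.normal(size=n)

r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

def residualize(values, control):
    """Remove the part of `values` that a straight-line fit on `control` explains."""
    slope, intercept = np.polyfit(control, values, 1)
    return values - (slope * control + intercept)

r_partial = np.corrcoef(
    residualize(ice_cream, temperature),
    residualize(drownings, temperature),
)[0, 1]

print(round(r_raw, 2), round(r_partial, 2))
```

This only works when the confounder is measured, which is rarely guaranteed in observational product data.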

The fix is not a statistical trick on top of correlation — it is a different method entirely. The cleanest is a randomized experiment, an A/B test, where you flip a coin to assign users to treatment versus control. Random assignment breaks the link between treatment and any confounder, so the difference you measure has a causal interpretation. When you cannot randomize, causal inference methods like difference-in-differences, instrumental variables, regression discontinuity, and synthetic control let you make causal claims under specific assumptions.
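The difference between an observational comparison and a randomized one can be sketched with simulated users. In this toy setup, all of it assumed for illustration, a hidden "engagement" trait raises the outcome and also makes users more likely to adopt the feature on their own. The true effect of the feature is set to 2.0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
true_effect = 2.0

engagement = rng.normal(size=n)           # hidden confounder

def outcome(treated):
    """Outcome depends on treatment and, more strongly, on engagement."""
    return true_effect * treated + 3.0 * engagement + rng.normal(size=n)

def diff_in_means(y, treated):
    return y[treated].mean() - y[~treated].mean()

# Observational: engaged users are more likely to adopt the feature.
self_selected = (engagement + rng.normal(size=n)) > 0
naive_estimate = diff_in_means(outcome(self_selected), self_selected)

# Randomized: a coin flip decides, independent of engagement.
coin_flip = rng.random(n) < 0.5
ab_estimate = diff_in_means(outcome(coin_flip), coin_flip)

print(round(naive_estimate, 2), round(ab_estimate, 2))
```

The self-selected comparison overstates the effect because it mixes in the engagement gap, while the coin-flip comparison lands near the true value.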

The interview-grade version: correlation gives you a hypothesis, an A/B test gives you a conclusion. If you see a high correlation in a dashboard, ask "what experiment would I run to verify this is causal" — not "what launch doc do I write".

Common pitfalls

The most common pitfall is calling correlation causation in a meeting. You report that retention is correlated with using feature X, the room hears "feature X drives retention", and three weeks later there is a roadmap line item that should never have shipped. The fix is to lead with the words "correlated with" every time, then add "we would need a holdout test to confirm a causal effect". Never let an executive walk away with the wrong mental model from your dashboard.

A second pitfall is using Pearson on a non-linear relationship. You compute r on a parabolic scatter, get zero, and conclude the variables are unrelated when the eye test shows the opposite. Always plot before trusting the number. If the shape is curved but consistently uphill or downhill, switch to Spearman. If it changes direction, as a parabola does, Spearman will also sit near zero, so reach for mutual information or model the curve directly.
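The parabola case is easy to reproduce. This sketch builds an exact y = x squared on a symmetric range and shows that rank-based correlation does not rescue a relationship that changes direction; it only helps once the data is restricted to a monotonic stretch.

```python
import numpy as np
from scipy.stats import spearmanr

x = np.arange(-30, 31) / 10.0        # symmetric range from -3 to 3
y = x ** 2                           # exact parabola

pearson_full = np.corrcoef(x, y)[0, 1]
spearman_full, _ = spearmanr(x, y)

# On the right half only, the curve is monotonic, so Spearman picks it up.
right = x >= 0
pearson_right = np.corrcoef(x[right], y[right])[0, 1]
spearman_right, _ = spearmanr(x[right], y[right])

print(pearson_full, spearman_full, pearson_right, spearman_right)
```

On the full range both coefficients are zero up to floating-point error, even though y is completely determined by x.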

A third pitfall is letting outliers dominate the answer. One whale user with a thousand dollars of spend can swing Pearson r between session count and revenue from 0.7 to 0.95 or down to 0.2, depending on where the outlier sits. The fix is to flag outliers visually, then report two numbers: the correlation with outliers, and the correlation after winsorizing or switching to Spearman, which downweights outliers automatically because it only looks at ranks.
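Reporting the numbers side by side might look like the sketch below. The data is invented: twenty ordinary users whose revenue tracks sessions closely, plus one whale with two sessions and 1000 of revenue. Winsorizing is done here by clipping revenue to its 5th and 95th percentiles, which is one common choice rather than the only one.

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative data: 20 ordinary users plus one whale (the last row).
base_sessions = np.arange(1, 21)
wiggle = np.where(base_sessions % 2 == 1, 1.0, -1.0)
sessions = np.append(base_sessions, 2).astype(float)
revenue = np.append(3.0 * base_sessions + wiggle, 1000.0)

r_raw = np.corrcoef(sessions, revenue)[0, 1]

# Winsorize revenue by clipping to its 5th and 95th percentiles.
low, high = np.percentile(revenue, [5, 95])
r_winsorized = np.corrcoef(sessions, np.clip(revenue, low, high))[0, 1]

rho, _ = spearmanr(sessions, revenue)

print(round(r_raw, 2), round(r_winsorized, 2), round(rho, 2))
```

In this setup the raw Pearson value even flips sign, while the winsorized and rank-based values stay clearly positive.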

A fourth pitfall is computing correlation on a non-representative subsample. You filter to "high-value subscribers", find r equals 0.9 between retention and feature usage, and conclude pushing feature usage will fix retention. The conclusion does not generalize — you measured it on a segment that is already retaining. Compute correlations on the population you want to act on, or clearly label your scope.

A fifth pitfall is confusing statistical significance with practical significance. With ten million rows, a correlation of 0.03 will come back with a p-value below 0.001, and a junior analyst will report it as "significant". It is technically significant, but r-squared is 0.0009 — the variable explains less than a tenth of a percent of the variance. Always report effect size alongside the p-value, and use the r magnitude table above to call out "weak" or "negligible" in plain language.
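The gap between the two kinds of significance shows up quickly in a simulation. This sketch assumes a true correlation of 0.03 and uses 200,000 synthetic rows rather than ten million to keep it fast; the p-value is already tiny at that size.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
n = 200_000
true_r = 0.03                        # assumed population correlation

x = rng.normal(size=n)
y = true_r * x + np.sqrt(1 - true_r ** 2) * rng.normal(size=n)

r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p_value:.2e}, r^2 = {r ** 2:.4f}")
```

The p-value says the correlation is unlikely to be exactly zero; r squared says it explains almost nothing.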

A sixth pitfall is confusing correlation with the size of the effect in real units. An r of minus 0.7 between price and units says the relationship is fairly tight; it does not say how many fewer units a one-dollar hike costs you. That number comes from the regression slope. For forecasts and pricing decisions, fit a regression and quote the slope.
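As a sketch of the difference, the snippet below computes both r and the fitted slope on a small invented price table. It also checks the identity that ties them together in simple linear regression: the slope equals r times the ratio of the two standard deviations.

```python
import numpy as np

# Illustrative numbers, not real pricing data.
price = np.array([3, 4, 5, 6, 7, 8], dtype=float)           # dollars
units = np.array([120, 100, 95, 70, 60, 40], dtype=float)   # units sold

r = np.corrcoef(price, units)[0, 1]
slope, intercept = np.polyfit(price, units, 1)   # units lost per extra dollar

# The two are linked: slope = r * (std of units / std of price).
slope_from_r = r * units.std(ddof=1) / price.std(ddof=1)

print(round(r, 3), round(slope, 2), round(slope_from_r, 2))
```

The slope carries units (units sold per dollar), which is what a pricing decision needs; r does not.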

If you want to drill correlation and stats questions like this every day, NAILDD is launching with hundreds of SQL and statistics interview problems built around exactly this pattern.

FAQ

What does the correlation coefficient actually measure?

It measures the strength and direction of the linear relationship between two variables on a scale from minus one to plus one. The sign tells you direction: positive means both move up together, negative means they move in opposite directions. The magnitude tells you how tightly the points hug a straight line. It does not measure the slope of that line or the size of the effect, only how clean the linear pattern is.

Should I use Pearson or Spearman?

Default to Pearson when both variables are numeric, the relationship looks linear in a scatter plot, and there are no extreme outliers. Switch to Spearman when the relationship is monotonic but curved, when one variable is ordinal like a satisfaction rating or pricing tier, or when outliers are dragging Pearson around.

Does a high correlation mean one variable causes the other?

No. Correlation tells you two variables move together in the observed data, not that one is responsible for the other. A confounder — like hot weather driving both ice cream sales and drowning rates — can produce a strong correlation between two variables with no direct causal link. To make a causal claim you need an A/B test or a causal inference method like difference-in-differences.

Does a zero correlation mean the variables are independent?

Not necessarily. Pearson r equals zero means no linear relationship, but a strong non-linear relationship can still exist. The textbook example is y equals x squared on a symmetric range: Pearson is exactly zero, yet knowing x tells you y precisely. Always plot the scatter. If it looks curved but still monotonic, Spearman will pick it up; for a shape that changes direction, like this parabola, Spearman is also near zero, so compute mutual information instead.

How big does the sample need to be for correlation to be reliable?

With fewer than thirty observations, the confidence interval around r is so wide that you can rarely make a confident call. A sample correlation of 0.5 on twenty rows has a 95% confidence interval of roughly 0.07 to 0.77 under the standard Fisher z approximation. Aim for at least one hundred observations before quoting r as meaningful, and at least one thousand if you want a tight enough interval to act on.
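One common way to get such an interval is the Fisher z-transform, which assumes the data is roughly bivariate normal. The helper below is a minimal sketch with a hypothetical name, using 1.96 as the critical value for 95% coverage:

```python
import math

def pearson_confidence_interval(r, n, z_crit=1.96):
    """Approximate confidence interval for Pearson r via the Fisher z-transform."""
    z = math.atanh(r)                 # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)       # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

for n in (20, 100, 1000):
    low, high = pearson_confidence_interval(0.5, n)
    print(n, round(low, 2), round(high, 2))
```

Running it for the three sample sizes shows how quickly the interval narrows as n grows.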

How is correlation different from covariance?

Covariance is the unstandardized version and depends on the units of both variables. Correlation divides covariance by the product of the two standard deviations, which strips out units and forces the result into minus one to plus one. Covariance shows up as an intermediate in formulas, but you almost never report it directly because the magnitude is hard to interpret. Correlation is what you put in the deck.
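The relationship is straightforward to check in NumPy. This sketch reuses the five-user table and a cents version of the spend column; the helper name cov_and_corr is hypothetical.

```python
import numpy as np

days = np.array([1, 2, 3, 4, 5], dtype=float)
spend_dollars = np.array([5, 11, 14, 21, 24], dtype=float)
spend_cents = spend_dollars * 100

def cov_and_corr(x, y):
    cov = np.cov(x, y)[0, 1]                        # sample covariance (ddof=1)
    corr = cov / (x.std(ddof=1) * y.std(ddof=1))    # divide by both std devs
    return cov, corr

cov_d, corr_d = cov_and_corr(days, spend_dollars)
cov_c, corr_c = cov_and_corr(days, spend_cents)

print(cov_d, cov_c)                           # covariance scales with the units
print(round(corr_d, 3), round(corr_c, 3))     # correlation does not
```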