Correlation vs covariance
The one-line answer
Covariance tells you the direction of joint variation between two variables and lives in the units of X times Y. Correlation is the same idea, rescaled to the interval from minus one to plus one and stripped of units — which is why it is the number you actually quote to product managers, executives, and interviewers.
If you remember one sentence, remember this: correlation is normalized covariance. Pearson correlation divides covariance by the product of the two standard deviations. That single division fixes the biggest problem with raw covariance, which is that the number itself is meaningless without knowing the scales involved.
A data scientist at Stripe analyzing merchant volume against chargeback rate does not say "covariance is 4,217.5." They report "the correlation is 0.62, moderately positive." The first is correct and useless; the second can be acted on.
Formulas you should memorize
The sample covariance between X and Y over n observations is the average centered product, with the standard Bessel correction:
cov(X, Y) = sum_i (x_i - x_bar)(y_i - y_bar) / (n - 1)
Pearson correlation r divides covariance by the product of the sample standard deviations:
r(X, Y) = cov(X, Y) / (sigma_X * sigma_Y)
Each standard deviation carries the same units as its variable, so the units cancel and r is dimensionless. The bound |r| <= 1 is a consequence of the Cauchy-Schwarz inequality applied to the centered vectors.
Two practical notes. When interviewers say "correlation" they usually mean Pearson, but with skewed or non-linear data you might switch to Spearman or Kendall. And the formula above is the sample version; the population version uses n and population means. In interviews, use the sample form unless told otherwise.
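The two formulas can be sketched from scratch in plain Python. This is a minimal illustration rather than a production routine: the function names `sample_cov` and `pearson_r` and the toy data are made up for this example, and the code assumes neither variable is constant (a zero standard deviation would make r undefined).

```python
import math

def sample_cov(xs, ys):
    """Sample covariance with the n - 1 (Bessel) denominator."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / (n - 1)

def pearson_r(xs, ys):
    """Pearson r: covariance divided by the product of sample standard deviations."""
    s_x = math.sqrt(sample_cov(xs, xs))  # cov(X, X) is the sample variance of X
    s_y = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)

# Toy data, invented for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

print(sample_cov(xs, ys))  # 1.5
print(pearson_r(xs, ys))   # about 0.775
```

Because the n - 1 factor appears in both the covariance and the two standard deviations, it cancels in r; the sample and population conventions give the same correlation even though they give different covariances.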
Worked example: height and weight
Take five adults with heights in centimeters and weights in kilograms.
X (height, cm): 170, 175, 180, 185, 190
Y (weight, kg): 60, 65, 75, 80, 90
The mean of X is 180 and the mean of Y is 74. The centered cross-products are 140, 45, 0, 30, and 160, which sum to 375; dividing by n - 1 = 4 gives a sample covariance of 93.75 in units of cm * kg. The sample standard deviations are roughly 7.91 and 11.94. Dividing gives a Pearson correlation of about 0.993.
Is 93.75 "a lot" of covariance? You cannot say without knowing the scales. Convert heights to meters and the same relationship gives a covariance of 0.9375, a hundred times smaller, even though nothing about the association changed. Is 0.993 "a lot" of correlation? Yes, unambiguously, on every scale, in every unit system. That invariance under linear rescaling is the entire point.
A favourite follow-up in Meta and Airbnb data science loops is "would correlation change if I add ten centimeters to every person?" No. Adding a constant shifts the mean but leaves the centered deviations unchanged, so covariance, both standard deviations, and r are all preserved. Multiplying by a positive constant scales covariance by that constant, but the same factor appears in the standard deviation and cancels in correlation; a negative constant flips the sign of both covariance and r.
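The shift and scale behaviour is easy to check numerically. The sketch below reuses the five height and weight values from the worked example; the helper name `cov_and_r` is invented for this illustration.

```python
import numpy as np

height_cm = np.array([170.0, 175.0, 180.0, 185.0, 190.0])
weight_kg = np.array([60.0, 65.0, 75.0, 80.0, 90.0])

def cov_and_r(x, y):
    """Return (sample covariance, Pearson r) for two 1-D arrays."""
    return np.cov(x, y)[0, 1], np.corrcoef(x, y)[0, 1]

base_cov, base_r = cov_and_r(height_cm, weight_kg)

# Adding a constant: neither covariance nor correlation moves.
shift_cov, shift_r = cov_and_r(height_cm + 10, weight_kg)

# Rescaling cm -> m: covariance shrinks by 100, correlation is unchanged.
scale_cov, scale_r = cov_and_r(height_cm / 100, weight_kg)

# Multiplying by a negative constant: both signs flip.
flip_cov, flip_r = cov_and_r(-height_cm, weight_kg)

print(base_cov, shift_cov, scale_cov, flip_cov)
print(base_r, shift_r, scale_r, flip_r)
```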
Key differences at a glance
| Property | Covariance | Correlation |
|---|---|---|
| Units | X * Y | dimensionless |
| Range | (-infinity, +infinity) | [-1, +1] |
| Interpretation | hard, scale-dependent | easy, scale-free |
| Invariant to linear rescaling? | no | yes |
| Comparable across pairs? | no | yes |
| Used in PCA, regression math | yes | sometimes |
| Used in business reports | almost never | almost always |
In a live interview, sketch the worked example first, then point at the table.
When to use covariance vs correlation
Reach for raw covariance when you are doing the math of a downstream model and units matter. Portfolio optimization in quantitative finance is the textbook case: the covariance matrix of asset returns feeds a mean-variance optimizer, and the answer has economic meaning. Another is PCA on features that share a meaningful scale — eigenvectors of the covariance matrix point in directions of largest variance in original units.
Reach for correlation whenever you want to report, compare, or interpret. Feature selection in ML inspects correlation because the goal is to detect relationships independent of scale. Exploratory analysis is almost always correlation-first because heatmaps in [-1, +1] are readable while raw-covariance heatmaps look like noise. Business communication is exclusively correlation.
The hybrid case is PCA on features measured in different units. PCA on the covariance matrix is dominated by whichever feature has the largest variance in absolute terms. PCA on the correlation matrix — equivalently, the covariance matrix of standardized features — lets every feature contribute equally. Neither is universally correct; the right answer depends on whether the original units carry real meaning.
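A small simulation makes the contrast concrete. The data below is synthetic and the feature names (`income_usd`, `age_years`) are hypothetical; the two features are generated independently, so neither "deserves" to dominate. This is a sketch using plain eigenvalues rather than a full PCA library.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical, independent features on very different scales.
income_usd = rng.normal(60_000, 15_000, n)
age_years = rng.normal(40, 10, n)
X = np.column_stack([income_usd, age_years])

cov = np.cov(X, rowvar=False)        # 2 x 2 sample covariance matrix
corr = np.corrcoef(X, rowvar=False)  # 2 x 2 correlation matrix

# The correlation matrix equals the covariance matrix of standardized features.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
assert np.allclose(np.cov(Z, rowvar=False), corr)

# Share of total variance captured by the first principal component.
cov_eigvals = np.linalg.eigvalsh(cov)    # eigenvalues in ascending order
corr_eigvals = np.linalg.eigvalsh(corr)
cov_share = cov_eigvals[-1] / cov_eigvals.sum()
corr_share = corr_eigvals[-1] / corr_eigvals.sum()

print(f"covariance PCA:  first component explains {cov_share:.4%}")
print(f"correlation PCA: first component explains {corr_share:.4%}")
```

On the covariance matrix the first component is essentially the income axis, purely because dollars produce larger numbers than years. On the correlation matrix the two components split the variance roughly evenly, which is what independent features should look like.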
In Python and pandas
Both NumPy and pandas hand you the numbers in one call. NumPy returns a two-by-two matrix and you read the off-diagonal.
import numpy as np
x = [170, 175, 180, 185, 190]
y = [60, 65, 75, 80, 90]
cov_xy = np.cov(x, y)[0, 1] # 93.75
corr_xy = np.corrcoef(x, y)[0, 1] # ~0.9934
pandas returns the full labeled matrix for every pair of columns in a DataFrame, which is closer to what production code actually uses.
import pandas as pd
df = pd.DataFrame({"height": x, "weight": y})
df.cov() # covariance matrix
df.corr() # Pearson by default
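df.corr(method="spearman") # rank correlation, robust to outliers
For warehouses, most engines expose CORR directly; otherwise compute the centered sum SUM((x - avg_x) * (y - avg_y)) inside a grouped aggregation, divide by n - 1 to get the covariance, and divide again by the product of the two standard deviations to get the correlation. For a SQL walkthrough see the deep dive linked below.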
Covariance and correlation matrices
For multivariate data, both concepts generalize. The covariance matrix for three variables is a three-by-three symmetric matrix with variances on the diagonal and pairwise covariances off the diagonal.
X Y Z
X [ var(X), cov(X,Y), cov(X,Z) ]
Y [cov(Y,X), var(Y), cov(Y,Z) ]
Z [cov(Z,X), cov(Z,Y), var(Z) ]
This matrix is the central object in several techniques. PCA diagonalizes it for orthogonal directions of maximum variance. The closed-form ordinary least squares solution beta = (X^T X)^-1 X^T y is, for centered features, manipulating a scaled covariance matrix, since X^T X then equals n - 1 times the sample covariance. Multivariate Gaussians are parameterized by a mean vector and a covariance matrix. In quantitative finance, the covariance matrix of asset returns is the heart of every risk model.
The correlation matrix is the same shape, normalized so the diagonal is exactly one and every off-diagonal sits in [-1, +1].
X Y Z
X [1.00 0.85 0.40]
Y [0.85 1.00 0.20]
Z [0.40 0.20 1.00]
Convert between them in two lines: divide the covariance matrix entrywise by the outer product of the standard deviation vector with itself to get correlation. Going back requires the standard deviations: multiply the correlation matrix entrywise by the same outer product.
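The conversion in both directions can be sketched with NumPy. The correlation matrix is the illustrative one shown above; the standard deviation vector `std` is a made-up assumption, since the text does not supply one.

```python
import numpy as np

corr = np.array([
    [1.00, 0.85, 0.40],
    [0.85, 1.00, 0.20],
    [0.40, 0.20, 1.00],
])
std = np.array([2.0, 5.0, 0.5])  # hypothetical standard deviations for X, Y, Z

# Correlation -> covariance: scale entry (i, j) by std[i] * std[j].
cov = corr * np.outer(std, std)

# Covariance -> correlation: the standard deviations sit on the diagonal.
std_back = np.sqrt(np.diag(cov))
corr_back = cov / np.outer(std_back, std_back)

print(cov)
print(np.allclose(corr_back, corr))  # True
```

The round trip works because the diagonal of a covariance matrix holds the variances, so the standard deviations never have to be stored separately when starting from covariance.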
Interview answers ready to ship
"What is covariance?" The average product of how much two variables deviate from their means, measured in the units of the product of those variables.
"How is it different from correlation?" Correlation is covariance divided by the product of the two standard deviations, which removes the units and rescales into [-1, +1]. Consequence: correlation is comparable across pairs of variables, covariance is not.
"What is a good covariance value?" None exists without knowing the units. Pivot to correlation, where rough conventions are: above 0.7 in absolute value is strong, 0.3 to 0.7 is moderate, below 0.3 is weak. Rules of thumb, not laws.
"Why do we care about the covariance matrix specifically?" Two concrete uses: PCA, where its eigendecomposition gives the projection directions, and OLS, where the closed-form solution involves the inverted Gram matrix X^T X, which is proportional to the feature covariance matrix when the features are centered.
Common pitfalls
The first pitfall is interpreting raw covariance as if it were correlation. A candidate sees cov(revenue, sessions) = 1,000,000 and concludes the relationship is "very strong." That has no basis: maybe revenue is in cents and sessions are in millions, and the same underlying relationship in different units could yield 100 or 100,000,000. The fix is mechanical: always normalize before judging strength.
The second pitfall is conflating zero correlation with statistical independence. Pearson correlation measures only linear association. The classic counterexample is Y = X^2 over a symmetric range like X in [-1, +1]: the variables are deterministically related but their Pearson correlation is exactly zero. The fix is to always pair a correlation report with a scatter plot or a non-linear diagnostic before claiming independence.
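The Y = X^2 counterexample takes a few lines to reproduce. This sketch assumes an evenly spaced grid on [-1, +1] as the "symmetric range"; any distribution symmetric around zero would behave the same way.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 201)  # symmetric grid around zero
y = x ** 2                       # fully determined by x

r_full = np.corrcoef(x, y)[0, 1]

# Restrict to the non-negative half and the relationship becomes monotonic,
# so Pearson picks it up again.
x_pos = x[x >= 0]
r_half = np.corrcoef(x_pos, x_pos ** 2)[0, 1]

print(f"r on [-1, 1]: {r_full:.6f}")  # zero up to floating-point error
print(f"r on [0, 1]:  {r_half:.3f}")
```

The dependence did not change between the two calculations, only the range over which it was measured. That is why a scatter plot belongs next to any correlation used to argue independence.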
The third pitfall is using covariance to compare strengths across different variable pairs. "Pair A has covariance 50, pair B has covariance 100, so B is stronger" ignores that A and B might live on completely different scales. The fix is again to standardize: convert both to correlations, then compare. If you genuinely need raw covariances — for instance, in a risk model aggregating dollar variances — be explicit you are comparing in the same units, not measuring association strength.
The fourth pitfall is forgetting that Pearson correlation is sensitive to outliers. A single extreme point in the upper right of a scatter plot can drag r from 0.05 to 0.8. The fix is twofold: visualize first to catch the outlier, and consider Spearman rank correlation when the data has heavy tails or when you trust the ranks more than the values.
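A quick simulation shows the effect of one extreme point. The data is synthetic (two independent standard normal samples plus one injected outlier), and the `spearman` helper is a simplified sketch that assumes no tied values; library implementations handle ties properly.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=50)
y = rng.normal(size=50)  # generated independently of x

def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]

def spearman(a, b):
    # Pearson applied to ranks; argsort of argsort gives ranks when there are no ties.
    rank_a = np.argsort(np.argsort(a))
    rank_b = np.argsort(np.argsort(b))
    return np.corrcoef(rank_a, rank_b)[0, 1]

# Inject a single extreme point in the upper right of the scatter plot.
x_out = np.append(x, 20.0)
y_out = np.append(y, 20.0)

print(f"Pearson  without outlier: {pearson(x, y):+.2f}")
print(f"Pearson  with outlier:    {pearson(x_out, y_out):+.2f}")
print(f"Spearman without outlier: {spearman(x, y):+.2f}")
print(f"Spearman with outlier:    {spearman(x_out, y_out):+.2f}")
```

Pearson jumps because the outlier's centered product dwarfs the other fifty combined. Spearman barely moves because, in rank terms, the outlier is just one more point at the top of both orderings.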
The fifth pitfall, specific to multivariate analysis, is treating the covariance matrix as if it were always invertible. With more features than observations, or with perfectly collinear features, the matrix is singular and any model relying on its inverse — including closed-form OLS — will blow up. The fix is to regularize (ridge adds a small multiple of the identity), to drop collinear features, or to switch to an iterative solver.
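The collinear case can be reproduced directly. In this sketch the data is synthetic, the third feature is built as an exact linear combination of the first two, and the ridge strength `lam` is an arbitrary illustrative value rather than a recommendation.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.normal(size=100)
b = rng.normal(size=100)
c = 2.0 * a - b  # perfectly collinear with a and b
X = np.column_stack([a, b, c])

cov = np.cov(X, rowvar=False)

# Three features, but only two independent directions of variation.
rank = np.linalg.matrix_rank(cov, tol=1e-8)
smallest_eig = np.linalg.eigvalsh(cov)[0]

# Ridge-style fix: add a small multiple of the identity before inverting.
lam = 1e-3
cov_ridge = cov + lam * np.eye(3)
rank_ridge = np.linalg.matrix_rank(cov_ridge, tol=1e-8)
cov_ridge_inv = np.linalg.inv(cov_ridge)

print(f"rank of covariance matrix:   {rank}")  # 2
print(f"smallest eigenvalue:         {smallest_eig:.2e}")
print(f"rank after ridge adjustment: {rank_ridge}")  # 3
```

Adding `lam` to the diagonal lifts every eigenvalue by the same amount, so the zero eigenvalue becomes `lam` and the inverse exists. The price is a small bias in whatever is computed from that inverse.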
Related reading
- Correlation explained simply
- How to calculate correlation in SQL
- How to calculate linear regression in SQL
- How to calculate cross-correlation in SQL
FAQ
Can covariance be negative?
Yes. A negative covariance means the two variables tend to move in opposite directions relative to their means: when X is above its average, Y tends to be below, and vice versa. A real-world case is the relationship between a defensive asset and a risk asset in portfolio analytics, where negative covariance is precisely what makes diversification valuable. The sign of correlation always matches the sign of covariance, since the denominator in the correlation formula is a product of standard deviations, which are strictly positive whenever the correlation is defined.
Why is correlation bounded between minus one and plus one?
It is a direct consequence of the Cauchy-Schwarz inequality applied to the centered vectors of X and Y. The inequality says the absolute value of the inner product of two vectors is at most the product of their norms. In statistics: the absolute value of covariance is at most the product of the two standard deviations, which is exactly the denominator in the correlation formula. The ratio is therefore bounded by one in absolute value, and the bound is achieved exactly when one variable is an affine function of the other — the formal meaning of "perfectly linearly related."
Is the covariance matrix always symmetric?
Yes, because cov(X, Y) = cov(Y, X) by definition — the centered product does not depend on the order of factors. A covariance matrix is also always positive semi-definite: all its eigenvalues are non-negative. This guarantees that variances of linear combinations of variables are non-negative, which is essential for the math of PCA and portfolio variance to be coherent.
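Both properties, and the link to variances of linear combinations, can be checked numerically. The data and the weight vector `w` below are arbitrary synthetic values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 3))  # synthetic data: 200 observations, 3 variables
cov = np.cov(X, rowvar=False)

is_symmetric = np.allclose(cov, cov.T)
min_eig = np.linalg.eigvalsh(cov).min()  # non-negative for a PSD matrix

# Variance of the linear combination X @ w, computed two ways.
w = np.array([0.5, -1.0, 2.0])
var_quadratic = w @ cov @ w           # w^T Sigma w
var_direct = np.var(X @ w, ddof=1)    # sample variance of the combined series

print(is_symmetric, min_eig)
print(var_quadratic, var_direct)
```

The quadratic form w^T Sigma w is exactly the portfolio-variance formula; positive semi-definiteness is what guarantees it can never come out negative for any choice of weights.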
Should I use covariance or correlation in machine learning?
For feature selection and exploratory data analysis, correlation is almost always the right call because it is scale-free and easy to threshold. For PCA you have a genuine choice: use the covariance matrix when features share meaningful units and you want larger-variance features to dominate, use the correlation matrix when features live on different scales and you want each to contribute equally. For algorithms that rely on the inverse of X^T X — OLS, linear discriminant analysis, Mahalanobis distance — you are working with covariance whether you call it that or not.
Does a high correlation imply causation?
No, and this is the single most common confusion in applied statistics. Correlation only tells you two variables move together; it says nothing about which causes which, whether a third variable causes both, or whether the association is a sampling artifact. The classic example is ice cream sales and drowning deaths, both spiking in summer because of temperature. Establishing causation requires experimental design — randomized A/B tests are cleanest — instrumental variables, regression discontinuity, or other quasi-experimental techniques. If an interviewer asks about correlation and you mention causation, flag the distinction explicitly.
What is the difference between Pearson, Spearman, and Kendall correlation?
Pearson measures linear association between two continuous variables and is optimal under roughly elliptical distributions. Spearman is Pearson applied to the ranks rather than the raw values, which makes it robust to outliers and able to detect any monotonic relationship, not just linear ones. Kendall's tau is also rank-based but measures concordance directly — the fraction of pairs that move in the same direction minus the fraction that move in opposite directions — and is more interpretable but more expensive on large samples. In an interview, Pearson is the default and Spearman is the answer when the data is skewed, heavy-tailed, or only ordinal.
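The three coefficients can be compared on a relationship that is monotonic but not linear. This is a from-scratch sketch in plain Python: the helper names are invented, the data is a toy exponential curve, ties are assumed absent, and the Kendall variant shown is the simple tau-a (concordant minus discordant pairs over all pairs).

```python
import math
from itertools import combinations

def pearson(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

def ranks(values):
    """Ranks starting at 1; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

def kendall_tau_a(xs, ys):
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        s = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

xs = [1, 2, 3, 4, 5, 6]
ys = [math.exp(x) for x in xs]  # monotonic, strongly non-linear

print(f"Pearson:  {pearson(xs, ys):.3f}")   # below 1: the curve is not a line
print(f"Spearman: {spearman(xs, ys):.3f}")  # 1.000: the ordering is perfectly preserved
print(f"Kendall:  {kendall_tau_a(xs, ys):.3f}")  # 1.000: every pair is concordant
```

The pairwise loop in `kendall_tau_a` is quadratic in the sample size, which is the cost the paragraph above alludes to; library implementations use faster algorithms.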