How to calculate autocorrelation in SQL
Contents:
Why autocorrelation matters
Your PM pings you on a Friday afternoon: "Daily signups look noisy, but is there a real weekly rhythm or are we chasing ghosts?" You have one chart to bring to the standup on Monday. The cheapest, most defensible answer is the autocorrelation function — the correlation of a series with a shifted copy of itself at lag one, seven, fourteen, and so on. If the lag-seven bar is tall, your data has a weekly heartbeat. If the bars decay slowly across every lag, you have trend. If they sit near zero, what looks like a pattern is noise.
This question shows up in time-series rounds at Stripe, DoorDash, Uber, Netflix, Airbnb, Snowflake, and pretty much any company that runs forecasts on consumer metrics. The framing is almost always the same: here is a daily series, compute the ACF in SQL, then tell me what to do before fitting a forecast. Strong candidates write the LAG plus CORR pattern in under five minutes and call out one trap — usually trend or sample size — before the interviewer asks. Weak candidates reach for Python because they think SQL cannot do statistics.
Postgres, Snowflake, BigQuery, and Redshift all ship a CORR aggregate and a LAG window function — the entire toolbox you need. A short query gives you ACF for a single lag, and the same pattern extends to a full ACF table. This post walks the formula end to end, builds the table, then collects the traps that get candidates dinged.
The formula in one line
The autocorrelation function at lag k is the Pearson correlation between the series and itself shifted by k steps:
ACF(k) = corr(y_t, y_{t-k})
= sum( (y_t - mean) * (y_{t-k} - mean) ) / sum( (y_t - mean)^2 )The range is minus one to one. A value of one means the series at time t is a perfect linear function of the series at time t minus k — a strong rhythm at that lag. A value near zero means the two are unrelated. Negative values mean the shifted series moves in the opposite direction, which is rare on business metrics but shows up on engineered features and over-corrected forecasts.
The mean used inside the formula is the mean of the full series, not the mean of either half. That detail matters when you implement ACF from scratch — and it is the reason every SQL engine ships CORR so you do not have to compute it by hand.
ACF for a single lag
In Postgres or any engine with CORR and LAG, computing ACF at a chosen lag is short enough to fit in one screen. The pattern is: shift the column by k rows with LAG(value, k) OVER (ORDER BY date), drop the rows where the shifted value is null, then aggregate with CORR.
WITH lagged AS (
SELECT
event_date,
value AS y,
LAG(value, 7) OVER (ORDER BY event_date) AS y_lag_7
FROM daily_metrics
WHERE metric_name = 'dau'
)
SELECT
CORR(y, y_lag_7) AS acf_7
FROM lagged
WHERE y_lag_7 IS NOT NULL;Reading the output: if acf_7 lands around 0.85 on a daily DAU series, you have a strong weekly cycle and any forecast you fit needs a seven-day seasonal component. If it lands around 0.1, the rhythm is too weak to model and a flat baseline will beat any seasonal model on holdout. As a rule of thumb, anything above 0.3 on more than a hundred points is real signal.
One subtle thing about LAG: the first k rows of the result have y_lag_7 set to null. CORR skips those rows, which is what you want, but the effective sample size is N minus k, not N. On short series this matters — covered below.
Building the ACF table
A single lag rarely tells the whole story. The standard move is to compute the ACF for lag one through thirty on a daily series, or lag one through twelve on a monthly series, and plot the bars side by side. A single spike at lag seven means clean weekly seasonality, a slow linear decay means trend, a tall lag-one bar that drops off cleanly means high day-to-day memory but no longer-term structure.
The query below builds a five-column ACF table — lags one, seven, fourteen, twenty-one, thirty — which is the typical interview answer when the prompt says "show me weekly and monthly autocorrelation."
WITH base AS (
SELECT
event_date,
value AS y
FROM daily_metrics
WHERE metric_name = 'dau'
),
lags AS (
SELECT
event_date,
y,
LAG(y, 1) OVER (ORDER BY event_date) AS y1,
LAG(y, 7) OVER (ORDER BY event_date) AS y7,
LAG(y, 14) OVER (ORDER BY event_date) AS y14,
LAG(y, 21) OVER (ORDER BY event_date) AS y21,
LAG(y, 30) OVER (ORDER BY event_date) AS y30
FROM base
)
SELECT
CORR(y, y1) AS acf_1,
CORR(y, y7) AS acf_7,
CORR(y, y14) AS acf_14,
CORR(y, y21) AS acf_21,
CORR(y, y30) AS acf_30
FROM lags;The output is a single row with five columns, each between minus one and one. In a dashboard the same query feeds a horizontal bar chart with one bar per lag. That is usually enough to ship the answer to the PM and pick a forecast model on Monday morning.
For every lag from one to thirty without writing thirty LAG lines, cross-join the series against generate_series(1, 30) AS k and roll the correlation up in the outer query. In an interview, five hand-written lags are clearer — the reviewer is looking for the pattern, not the elegance.
How to read the ACF
A tall first bar — acf_1 above 0.9 — means the series has strong short-term memory. Revenue per day, monthly active users, and rolling-thirty-day retention all sit in this regime: the value tomorrow is almost the value today plus a small change. Tall first bars by themselves do not tell you about trend or seasonality, only that day-to-day shocks are small.
A slow linear decay across lags one through thirty means the series has a trend. The autocorrelation is not telling you about the dynamics of the metric — it is telling you the series is going up or down over time. Detrend before fitting any model: subtract a linear regression of value on date, take first differences, or use a moving-average baseline.
A spike at lag seven that towers over lags one through six is the signature of weekly seasonality, the dominant pattern on consumer metrics — DAU, orders, signups, support tickets. The spike repeats at lags fourteen, twenty-one, and twenty-eight, each one a little smaller. If you see this shape, your forecast model needs a weekly seasonal component or you will systematically miss Mondays and Fridays.
A spike at lag twelve on a monthly series is the equivalent for yearly seasonality, which shows up on retail, travel, and any business with holiday cycles. You need at least two years of monthly data to see this spike reliably.
A flat ACF near zero across every lag is a random walk after differencing, or pure noise before. There is no structure to model: a naive forecast that uses yesterday's value is as good as anything more elaborate.
Common pitfalls
The most common mistake is computing ACF on a series that is too short. Below fifty points the sample correlation is dominated by noise, and any spike at any lag can look impressive without meaning anything. The practical floor is fifty observations for an exploratory ACF and a hundred for anything you would present to a PM. If you only have thirty days, say so and lean on a different signal.
A close second is leaving the trend in. If your series is growing linearly because the product is growing, ACF at every lag will look high — two nearby points will both sit near the local mean of the trend. This is a measurement artifact, not structure. Detrend first: subtract a linear regression of value on date, take first differences with value - LAG(value, 1), or apply a seven-day rolling mean and subtract it. Re-run ACF on the residual.
A third trap is silently dropping rows. LAG(value, 7) produces nulls in the first seven rows, and CORR aggregates over the non-null pairs. The math is right, but the effective N drops, which changes the confidence band. On long series this is invisible; on series under two hundred points it can flip a borderline lag from "significant" to "noise." Always print COUNT(*) alongside CORR.
A fourth pitfall is comparing ACFs from series of very different lengths. The standard error of the ACF estimate at lag k is roughly one over the square root of N — about 0.2 for a hundred-point series, 0.06 for a thousand-point series. A spike of 0.15 is noise on the short series and a clean signal on the long one. Always anchor the significance threshold to the sample size.
A fifth pitfall is ignoring the confidence band entirely. An ACF value is statistically significant if its absolute value exceeds two over the square root of N — about 0.2 for N equal to a hundred. Anything inside that band is consistent with noise. If you are reporting raw numbers, mention the threshold so your reader does not over-interpret a 0.15 as "weekly seasonality."
Optimization tips
On daily metric tables under a million rows the queries above run in milliseconds. The work starts when you compute ACF on raw event tables with hundreds of millions of rows — LAG has to sort the entire result set before shifting. Roll up to a daily metric table first: one row per metric per day, indexed on date. The cost of the shift drops from minutes to milliseconds.
The second lever is partitioning the source by date. On Snowflake, BigQuery, and Databricks the daily metric table should be clustered or partitioned on event_date from the start. ACF queries always filter on a date range, and clustered storage cuts scanned bytes by one to two orders of magnitude. On Postgres the equivalent is a B-tree index on event_date plus declarative range partitioning.
The third lever is caching the ACF results. Yesterday's ACF is almost identical to today's — one new point in a hundred barely moves the correlation. A small acf_daily table refreshed once a day is enough for any dashboard that needs ACF as input.
Related reading
- SQL window functions interview questions
- How to detect anomalies in SQL
- How to calculate active days in SQL
- A/B testing peeking mistake
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
What counts as a significant ACF value?
The conventional threshold is two over the square root of N — roughly 0.2 for a hundred-point series and 0.06 for a thousand-point series. Anything outside that band has less than a five-percent chance of arising from pure noise. Inside the band you have no evidence of structure, so resist the urge to read patterns into bars that are not there. This is the Bartlett confidence band that statsmodels and R both draw on their ACF plots by default.
What is the difference between ACF and PACF?
ACF at lag k measures the total correlation between the series and itself shifted by k, including correlation flowing through intermediate lags. PACF removes those intermediate effects and gives you the direct correlation at lag k after controlling for lags one through k minus one. In practice the ACF tells you about seasonality and the PACF tells you the order of the AR component in an ARIMA model. For a forecast model interview you usually want both.
Can I run ACF on a ratio metric like conversion rate?
You can, but the variance of a ratio depends on the denominator, so raw ACF is biased toward the days with small denominators. Take a log transform first — LN(value) — to stabilize the variance and make the correlation estimate well behaved. For very small numerators or denominators, switch to a beta-binomial baseline instead of computing ACF on the raw ratio.
Is there a Postgres way to do this without LAG?
Technically yes — self-join the table on t.date = u.date - INTERVAL '7 days' and apply CORR on the joined columns. The output is the same, but the planner usually picks a worse plan and the query is harder to read. Stick with LAG.
Which lags should I look at?
For a daily metric: lag one for short-term memory, lag seven for weekly seasonality, lag fourteen and twenty-one to confirm the weekly spike repeats, and lag thirty for monthly structure. For a monthly metric: lag one, lag six for half-yearly cycles in retail and travel, and lag twelve for yearly seasonality. For an hourly metric add lag twenty-four for daily cycles. The right set depends on the natural cycles of the business.
What if my series has gaps?
Gaps break the ACF estimator because LAG shifts by row, not by date. Either fill the gaps with forward-fill or zero-fill and accept some bias, or compute ACF on the longest contiguous segment and report N alongside the result. For production monitoring, build a calendar table and left-join your metric onto it so every date has a row — the gaps become visible in the data instead of hiding inside the lag.