How to calculate data freshness in SQL
Contents:
Why data freshness matters
It is Monday at 9:14 a.m., the executive team is about to walk into a revenue review, and someone notices the dashboard's "yesterday" column is still showing Saturday's numbers. The warehouse is online, but the analyst is now writing an apology in Slack while the meeting starts without the numbers it was built around. The metric that catches this is data freshness, and "how would you measure it in SQL?" is now standard at every data engineering interview at Snowflake, Databricks, Airbnb, Netflix, Stripe, and DoorDash.
Data freshness is the gap between when an event happened and when it became queryable in the warehouse. It sounds like one number, and candidates who give one number end the interview early. Real systems have streaming pipelines that report in seconds, hourly batches in minutes, daily exports in hours, and edge cases — late-arriving mobile events, paused Kafka consumers, time-zone drift — that distort every one of those numbers.
This post walks the full SQL pattern: a baseline query, a per-pipeline breakdown, an SLA compliance query that powers alerting, a historical trend that catches regressions, and the pitfalls that make freshness numbers lie. Queries run in Postgres and need only mechanical changes for Snowflake, BigQuery, or Redshift.
Two formulas — wall clock and event time
Wall-clock freshness is the operational view. You measure the gap between now and the last successful pipeline run, regardless of what data that run loaded. The formula is now() - last_successful_run_timestamp. The strength is simplicity — it tells you the machinery is running. The weakness is that a job can finish on time while loading zero rows, and you only find out when the dashboard renders a flat line.
Event-time freshness is the analytical view. You measure the gap between now and the most recent event in the table — now() - max(event_time). The strength is that it surfaces real staleness: if the latest order is two hours old, the data is two hours old, no matter how recently the loader ran. The weakness is sensitivity to upstream distribution. A retail dataset with a 3 a.m. quiet period will look stale overnight even when everything is healthy.
A serious freshness program tracks both. If wall-clock says four minutes and event-time says four hours, the loader is running on no input — broken upstream, paused consumer, schema-change rejection. If event-time says four minutes and wall-clock says four hours, the loader is stuck on the last batch while events pile up in a queue nobody is draining. The reconciliation is the most useful diagnostic you will write.
The metadata data model
Every example assumes two tables. The first is a raw events table — orders, page views, clicks — with an event_time column. The second is a pipeline_metadata table where each row records the last attempt, the last success, the SLA in minutes, and the row count of the last batch.
CREATE TABLE pipeline_metadata (
pipeline_name TEXT PRIMARY KEY,
last_run_timestamp TIMESTAMPTZ,
last_successful_timestamp TIMESTAMPTZ,
rows_processed_last_run BIGINT,
status_last_run TEXT, -- 'success' | 'error' | 'running'
sla_minutes INT NOT NULL,
owner_team TEXT
);
CREATE INDEX ON pipeline_metadata (last_successful_timestamp);If your stack already emits pipeline runs as rows — Airflow, Dagster, Prefect, dbt Cloud — point the queries at that table and use a view exposing only the latest record per pipeline.
Baseline freshness in SQL
The first query every interviewer wants is a one-liner against a single events table. It computes the latest event_time, wall-clock now, and the lag in minutes. On an empty table MAX(event_time) returns null and an unguarded subtraction silently becomes null too — dangerous if you pipe it into an alert that treats null as zero.
SELECT
MAX(event_time) AS latest_event,
NOW() AS now_ts,
EXTRACT(EPOCH FROM (NOW() - MAX(event_time))) / 60 AS lag_minutes,
EXTRACT(EPOCH FROM (NOW() - MAX(event_time))) / 3600 AS lag_hours
FROM events
WHERE event_time IS NOT NULL;A streaming pipeline with a healthy producer should land under one minute. An hourly batch should land under sixty minutes after the loader finishes. Daily aggregates land under twenty-four hours. Numbers far below those bounds usually mean a clock-forward bug — future-dated events — and a negative lag rendered as a small positive.
For multiple tables, union them with a label so the result is one row per source — the version that goes on a dashboard.
SELECT
'events' AS table_name,
MAX(event_time) AS latest,
EXTRACT(EPOCH FROM (NOW() - MAX(event_time))) / 60 AS lag_minutes
FROM events
UNION ALL
SELECT
'transactions',
MAX(created_at),
EXTRACT(EPOCH FROM (NOW() - MAX(created_at))) / 60
FROM transactions
UNION ALL
SELECT
'sessions',
MAX(started_at),
EXTRACT(EPOCH FROM (NOW() - MAX(started_at))) / 60
FROM sessions
ORDER BY lag_minutes DESC;Ordering descending puts the staleest table on top — the row on-call cares about first.
Per-pipeline breakdown
Once you have more than a handful of pipelines, you stop looking at individual tables and start looking at the metadata layer. The query below pulls every critical pipeline, computes minutes since the last success, surfaces the row count as a sanity check, and sorts staleest first. Naming this query before being asked tells the panel you have been on-call for a data platform.
SELECT
pipeline_name,
owner_team,
last_run_timestamp,
last_successful_timestamp,
status_last_run,
rows_processed_last_run,
EXTRACT(EPOCH FROM (NOW() - last_successful_timestamp)) / 60 AS minutes_since_success,
sla_minutes
FROM pipeline_metadata
WHERE pipeline_name LIKE 'critical_%'
ORDER BY minutes_since_success DESC;The rows_processed_last_run column is the cheap fix for the empty-load failure mode — a pipeline that ran successfully but loaded zero rows usually points to an upstream schema change that rejected every record. The owner_team column lets on-call route the incident in one Slack message.
SLA compliance and status
The next query turns minutes into a status. Compare lag against the per-pipeline SLA, escalate to CRITICAL past 1.5x, flag WARN past 1.0x, and compute the SLA ratio so dashboards can sort by severity. The pattern mirrors a latency SLO, except the units are minutes-since-load instead of milliseconds-per-request.
WITH freshness AS (
SELECT
pipeline_name,
owner_team,
EXTRACT(EPOCH FROM (NOW() - last_successful_timestamp)) / 60 AS lag_min,
sla_minutes
FROM pipeline_metadata
)
SELECT
pipeline_name,
owner_team,
lag_min,
sla_minutes,
CASE
WHEN lag_min > sla_minutes * 1.5 THEN 'CRITICAL'
WHEN lag_min > sla_minutes THEN 'WARN'
ELSE 'OK'
END AS status,
lag_min / NULLIF(sla_minutes, 0) AS sla_ratio
FROM freshness
ORDER BY sla_ratio DESC NULLS LAST;The NULLIF(sla_minutes, 0) guard protects against pipelines with a missing SLA. The 1.5x threshold is a convention worth defending: a small overshoot is noise, a 50% overshoot is a degraded loader. Use 2.0x if alerts are too noisy, 1.2x if too quiet.
Historical freshness trend
Current freshness tells you whether the loader is healthy now. Historical freshness tells you whether it has been drifting over the last week — the regression that turns into a 9 a.m. incident if you do not catch it. The query below assumes a freshness_log table where you append per-pipeline lag every fifteen minutes, and reports average, max, and p95 per hour for the last seven days.
SELECT
DATE_TRUNC('hour', logged_at) AS hour,
pipeline_name,
AVG(lag_minutes) AS avg_lag,
MAX(lag_minutes) AS max_lag,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY lag_minutes) AS p95_lag
FROM freshness_log
WHERE logged_at >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY 1, 2
ORDER BY 1, 2;A slow upward drift in p95 across days usually means the loader's input has grown past its compute capacity — same SLA, more data, more time per run. A sudden step on a single day usually means a deploy: a new transformation added latency, or a schema change forced a slower code path. The anomaly detection in SQL post walks the rolling-window technique that catches both without you reading the chart every morning.
Common pitfalls
The first pitfall is conflating wall-clock and event-time freshness in one number. A loader that finishes on time but loads no rows reads as healthy on wall-clock and severely stale on event-time, and a dashboard that shows only one misses half of all incidents. The fix is to surface both side by side and treat the gap as a signal.
The second pitfall is late-arriving events. Mobile clients buffer events offline and flush them on reconnect, so an event with event_time = 09:00 may arrive at ingest_time = 14:00. If you compute freshness from event_time alone, a perfectly healthy backfill looks like five-hour staleness. The fix is to compute both event-time and ingest-time lag and flag staleness only when ingest lag exceeds the SLA.
The third pitfall is the empty-table trap. MAX(event_time) on zero rows returns null, and a naive lag expression returns null too. Some dashboards render null as zero, which displays as "perfectly fresh" while no data has arrived in a week. The fix is to alert on null lag as if it were unbounded — absence of data is the worst possible freshness, not the best.
The fourth pitfall is time-zone drift. A source writes timestamps in UTC, the warehouse stores them in a local zone, and the freshness query subtracts the two without converting. The result is a constant offset that looks like staleness on every query. The fix is to standardize on TIMESTAMPTZ and convert at ingest, not at query time.
The fifth pitfall is treating streaming and batch pipelines on the same axis. Streaming freshness is measured in seconds, batch in hours, and putting both on one chart trains the team to ignore streaming because it is always near zero. The fix is to bucket pipelines by SLA tier and report the SLA ratio rather than absolute lag, so a 30-second streaming overrun and a 30-minute batch overrun render as comparable.
Optimization tips
Pre-aggregate freshness logs into hourly summaries. A platform with 500 pipelines emitting a row every fifteen minutes generates 50 million rows per year — small in absolute terms, but a dashboard that recomputes 7-day averages on every refresh will scan all of them. A materialized view of hourly average, max, and p95 per pipeline drops the dashboard query from a full scan to a hundred-row pick.
Partition the events tables by event_time. The baseline query reads exactly one row per table — the row with the maximum event_time — but on an unpartitioned billion-row table the planner often falls back to a full scan to find it. Partitioning by date plus an index on event_time turns the same query into a metadata-only operation that returns in milliseconds. Snowflake clustering, BigQuery time-partitioning, and Postgres native partitioning all produce the same property.
Materialize the SLA ratio, not just the lag. The dashboard sorts by ratio, alerts on ratio, and visualizes ratio, but most teams store only lag_minutes and recompute the ratio every query. A generated column for lag_minutes / NULLIF(sla_minutes, 0) lets the dashboard do straight selection instead of arithmetic.
Related reading
- How to detect anomalies in SQL
- How to calculate availability and uptime in SQL
- How to calculate API latency in SQL
- SQL window functions interview questions
If you want to drill SQL questions like this every day, NAILDD is launching with 500+ SQL problems across exactly this pattern.
FAQ
What is a reasonable freshness SLA for a customer-facing dashboard?
For a B2B SaaS dashboard backed by an hourly batch, under sixty minutes is the bar customers expect — older feels broken. For consumer-facing dashboards driven by streaming pipelines, the bar drops to under five minutes for headline metrics and under thirty seconds for anything tied to a real-time alert. Internal exploratory dashboards can tolerate daily freshness, but flag the cutoff visibly so analysts do not draw conclusions from twenty-three-hour-old data.
Should I track freshness on wall-clock time or on event time?
Track both, side by side. Wall-clock catches loader failures — paused jobs, crashed workers, expired credentials. Event-time catches upstream failures — quiet producers, dropped topics, schema rejections. Reporting one alone misses half of all freshness incidents, and the gap between the two is itself the most useful diagnostic. If wall-clock is small and event-time is large, your loader is happily running on no input.
How do I handle late-arriving events without flagging them as staleness?
Compute two lags — event-time and ingest-time — and use ingest-time lag for alerting. Late events arrive with old event_time values but recent ingest_time values, so a five-hour event lag with a thirty-second ingest lag is a healthy backfill, not an incident. Streaming systems formalize this with watermarks, and you can implement the same idea in SQL by ignoring events older than the watermark when computing event-time lag.
How is freshness different from data latency?
Latency is a per-event measurement — how long a single event takes from producer to warehouse — usually a percentile distribution. Freshness is an aggregate snapshot — how stale the most recent data is, right now — usually one number per pipeline. Latency is what an SRE looks at when debugging a streaming pipeline; freshness is what an analyst checks before opening a dashboard.
Should freshness alerts page on-call, or only post to Slack?
Tier by SLA. A real-time pipeline missing its 30-second SLA by 1.5x should page — the metrics depending on it are driving customer-facing alerts. An hourly batch missing its SLA by 1.5x should ping a channel and create a ticket. A daily aggregate missing its SLA should land in a morning digest. The mistake is wiring every violation to the same channel, which trains on-call to ignore it within a month.