Airflow backfill and catchup in a Data Engineering interview
Why interviewers ask about backfill
If you have ever shipped an Airflow DAG to production, you have eventually had to backfill it. Either the business logic changed and last quarter's numbers need to be recomputed, an upstream source landed late, or a silent bug poisoned three weeks of partitions before anyone noticed. Backfill is the operational tax of every batch pipeline, and interviewers at Snowflake, Databricks, Stripe, and Airbnb know it. So they ask. A typical DE phone screen at a mid-sized data org includes at least one question about it, and a senior loop will go deeper — into catchup, depends_on_past, and the subtleties of data_interval_start.
The reason it shows up so often is that backfill cleanly separates two kinds of candidates: those who have read the docs and those who have actually had to recover from a bad deploy at 2 a.m. The first group can recite airflow dags backfill -s ... -e .... The second group knows that idempotency is the single property without which backfill is a footgun, that catchup=False is the safer default for new DAGs, and that running backfill on a non-idempotent DAG is how you get duplicated revenue numbers in a board deck.
Treat backfill questions as a test of your operational instincts, not your CLI memory.
Catchup and the first DAG run
When you activate a DAG whose start_date is in the past, Airflow looks at every schedule interval between start_date and now and decides whether to create runs for them. The control is the catchup flag on the DAG definition. With catchup=True — the historical default — Airflow will spawn one run per missed interval as soon as the scheduler picks the DAG up. That can be hundreds of runs at once for a daily DAG with a six-month-old start_date, which is great if you actually want the history filled in and terrible if you do not.

    from datetime import datetime

    from airflow.decorators import dag


    @dag(
        dag_id="daily_revenue",
        start_date=datetime(2026, 1, 1),
        schedule="@daily",
        catchup=True,  # historical default
    )
    def daily_revenue():
        ...


    # Calling the decorated function is what registers the DAG.
    daily_revenue()

Flipped to catchup=False, the scheduler only creates a run for the most recent interval and any future ones. No retroactive avalanche. This is the right choice for most new DAGs, because if you genuinely need history you can backfill it manually and observe the load. When the flag is omitted, Airflow 2.x falls back to the catchup_by_default setting, which is True out of the box; Airflow 3.0 changed that default to False.
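To get a feel for the size of that avalanche, here is a minimal sketch that counts the daily intervals a catchup=True DAG would have to fill. It is a simplified model, not Airflow's scheduler logic: it assumes one run per fully elapsed 24-hour interval, and the helper name and the unpause timestamp are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def missed_daily_intervals(start_date: datetime, now: datetime) -> int:
    """Count fully elapsed 24-hour intervals between start_date and now.

    Simplified model (an assumption of this sketch): a "@daily" DAG with
    catchup=True gets one run per complete interval since start_date.
    """
    if now <= start_date:
        return 0
    # timedelta // timedelta yields an int: the number of whole days.
    return (now - start_date) // timedelta(days=1)


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
# Hypothetical moment at which someone unpauses the DAG.
unpaused_at = datetime(2026, 7, 1, 9, 30, tzinfo=timezone.utc)

runs_with_catchup = missed_daily_intervals(start, unpaused_at)
print(runs_with_catchup)  # 181 runs queued at once
```

Six months of history on a daily schedule is 181 runs competing for the same workers and warehouse, which is the load you would rather introduce deliberately through a bounded backfill.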
Gotcha: changing start_date on a deployed DAG does not retroactively change its runs. Existing DAG runs stay in the metadata database, and the scheduler keeps working forward from the most recent run rather than from the new start_date. To re-anchor, delete the existing DAG runs or deploy the DAG under a new dag_id.
Manual backfill from the CLI
The CLI command every DE should have in muscle memory is:

    airflow dags backfill \
        --start-date 2026-04-01 \
        --end-date 2026-04-30 \
        daily_revenue

That creates and runs DAG runs for every day in April, respecting dependencies. The interesting flags are the ones you reach for during incident response:
- --reset-dagruns wipes existing runs in the window and re-creates them. Use when bad data has already been written and you need a clean slate.
- --rerun-failed-tasks only re-runs tasks marked failed. Cheaper than --reset-dagruns when most of the window succeeded.
- --ignore-dependencies skips upstream task checks. Useful when you know upstream is fine but want to re-run a single downstream aggregation. Dangerous when you are wrong about upstream.
- --task-regex 'agg_.*' limits the backfill to matching task IDs. Critical when you changed one transformation and do not want to re-run a three-hour ML training step.
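In Airflow 3.0 the airflow backfill create subcommand replaces the old single-shot invocation and gives you a backfill ID you can monitor and cancel. The 2.x airflow dags backfill form should not be relied on there: the 3.0 release notes describe it as removed rather than aliased, so check airflow backfill create --help for the flags your version accepts.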
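During an incident it helps to assemble the command from reviewed parameters instead of typing it by hand. The sketch below builds an argument list for the Airflow 2.x command using only the flags discussed above. The function name is hypothetical, and the guard that ties --ignore-dependencies to --task-regex is a design choice of this sketch, reflecting that skipping upstream checks only makes sense for a task-scoped backfill.

```python
import shlex
from typing import List, Optional


def build_backfill_command(
    dag_id: str,
    start_date: str,
    end_date: str,
    task_regex: Optional[str] = None,
    reset_dagruns: bool = False,
    rerun_failed_tasks: bool = False,
    ignore_dependencies: bool = False,
) -> List[str]:
    """Return an argv list for the Airflow 2.x `airflow dags backfill` command."""
    if ignore_dependencies and task_regex is None:
        # Guard rail chosen for this sketch: refuse an unscoped dependency skip.
        raise ValueError("--ignore-dependencies requires --task-regex")
    cmd = [
        "airflow", "dags", "backfill",
        "--start-date", start_date,
        "--end-date", end_date,
    ]
    if task_regex is not None:
        cmd += ["--task-regex", task_regex]
    if reset_dagruns:
        cmd.append("--reset-dagruns")
    if rerun_failed_tasks:
        cmd.append("--rerun-failed-tasks")
    if ignore_dependencies:
        cmd.append("--ignore-dependencies")
    cmd.append(dag_id)  # the DAG ID is the positional argument
    return cmd


cmd = build_backfill_command(
    "daily_revenue",
    "2026-04-01",
    "2026-04-30",
    task_regex="agg_.*",
    ignore_dependencies=True,
)
# shlex.join quotes the regex so the line can be pasted into a shell.
print(shlex.join(cmd))
```

Returning a list rather than a string keeps the command safe to pass to subprocess.run without shell quoting problems, and the printed form can go straight into the incident channel for review.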
depends_on_past and wait_for_downstream
depends_on_past=True is a per-task setting that says: this task will only run if the same task in the immediately previous DAG run succeeded. It is the right answer when your pipeline is stateful — a running balance, a cumulative metric, a Type 2 SCD where today's row references yesterday's effective_to date.

    from airflow.operators.python import PythonOperator


    def run_aggregate():
        """Placeholder for the real aggregation logic."""


    aggregate = PythonOperator(
        task_id="aggregate_running_balance",
        python_callable=run_aggregate,
        depends_on_past=True,  # wait for this task's previous run to succeed
    )

The cost is that a single failed run halts every subsequent run for that task until you intervene. During backfill this gets worse, because Airflow processes intervals in order and one bad day pauses the whole sequence. The fix is usually to fix the bad day rather than to drop depends_on_past: the flag is doing its job.
A close cousin is wait_for_downstream, which makes a task also wait for the tasks immediately downstream of its previous run's instance to succeed. You almost never want this unless you have a specific reason, usually a fan-out into multiple sinks where partial completion is dangerous.
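The stall is easy to see in a toy model. The sketch below simulates a single task with depends_on_past=True across consecutive daily runs. It is not Airflow code: the function name is made up, and "blocked" is a label for this sketch rather than an Airflow task state (in practice the later task instances simply stay unscheduled).

```python
from typing import Dict, List, Set


def simulate_depends_on_past(days: List[str], failing: Set[str]) -> Dict[str, str]:
    """Model one task with depends_on_past=True over consecutive runs.

    A run executes only if the previous run's instance succeeded; the
    first run has no predecessor, so it always executes.
    """
    states: Dict[str, str] = {}
    previous_ok = True
    for day in days:
        if not previous_ok:
            # Everything after a failure waits until that day is fixed.
            states[day] = "blocked"
        elif day in failing:
            states[day] = "failed"
            previous_ok = False
        else:
            states[day] = "success"
    return states


days = ["2026-04-01", "2026-04-02", "2026-04-03", "2026-04-04"]
states = simulate_depends_on_past(days, failing={"2026-04-02"})
for day, state in states.items():
    print(day, state)
```

One failure on 2026-04-02 leaves both later days waiting, which is exactly why the operational response is to repair and clear the failed day instead of removing the flag.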
Idempotency — the load-bearing rule
If you take only one thing from this post into your interview, take this: a DAG run must produce the same result whether it runs once or ten times. That property is idempotency, and backfill is the operation that tests it. A non-idempotent DAG appears to work in production because every interval only runs once. Backfill it and you discover the truth.
Load-bearing trick: every write should be keyed to data_interval_start and should overwrite, not append. DELETE WHERE date = ds then INSERT, or MERGE, or INSERT OVERWRITE PARTITION on Hive/Iceberg/Delta.
A bare INSERT INTO ... SELECT ... is the canonical anti-pattern. Run it twice and you have twice the rows. The fix in Postgres or Snowflake looks like this:
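
    def aggregate(conn, **context):
        """Rebuild one day of agg_daily_revenue; safe to run more than once.

        `conn` is assumed to be an open DB-API connection (for example
        psycopg2), passed in via op_kwargs or created by a hook.
        """
        # The partition this run owns is the start of its data interval.
        ds = context["data_interval_start"].strftime("%Y-%m-%d")
        cur = conn.cursor()
        # Clear the partition first so a replay cannot double the rows.
        cur.execute(
            "DELETE FROM agg_daily_revenue WHERE event_date = %s",
            (ds,),
        )
        cur.execute(
            """
            INSERT INTO agg_daily_revenue (event_date, country, revenue_usd)
            SELECT event_date, country, SUM(amount_usd)
            FROM raw_orders
            WHERE event_date = %s
            GROUP BY 1, 2
            """,
            (ds,),
        )
        # Commit once so the delete and insert become visible together.
        conn.commit()

On Snowflake or BigQuery the same intent is usually expressed with MERGE. On Hive-style storage it is INSERT OVERWRITE PARTITION (event_date='{{ ds }}'). On Iceberg or Delta you can lean on MERGE INTO with the partition predicate. The shape varies; the rule does not.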
The cleanest pipelines treat every partition as a function of data_interval_start — pure, replayable, and explicit about what it owns.
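The difference between the two write patterns can be reproduced without a warehouse. The sketch below uses Python's built-in sqlite3 module as a stand-in, so it uses `?` placeholders instead of `%s`. The table names agg_naive and agg_idempotent and the sample rows are invented for the demonstration; each writer is run three times to mimic a replayed backfill.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE raw_orders (event_date TEXT, country TEXT, amount_usd REAL);
    CREATE TABLE agg_naive (event_date TEXT, country TEXT, revenue_usd REAL);
    CREATE TABLE agg_idempotent (event_date TEXT, country TEXT, revenue_usd REAL);
    INSERT INTO raw_orders VALUES
        ('2026-04-01', 'US', 10.0),
        ('2026-04-01', 'US', 5.0),
        ('2026-04-01', 'DE', 7.0);
    """
)

AGGREGATE = """
    INSERT INTO {table} (event_date, country, revenue_usd)
    SELECT event_date, country, SUM(amount_usd)
    FROM raw_orders
    WHERE event_date = ?
    GROUP BY event_date, country
"""


def run_naive(ds: str) -> None:
    """Anti-pattern: a bare INSERT appends a fresh copy on every run."""
    with conn:
        conn.execute(AGGREGATE.format(table="agg_naive"), (ds,))


def run_idempotent(ds: str) -> None:
    """Delete the partition, then insert, inside a single transaction."""
    with conn:
        conn.execute("DELETE FROM agg_idempotent WHERE event_date = ?", (ds,))
        conn.execute(AGGREGATE.format(table="agg_idempotent"), (ds,))


def total(table: str) -> float:
    return conn.execute(f"SELECT SUM(revenue_usd) FROM {table}").fetchone()[0]


# Replay the same interval three times, as a backfill would.
for _ in range(3):
    run_naive("2026-04-01")
    run_idempotent("2026-04-01")

print(total("agg_naive"))       # 66.0: tripled revenue
print(total("agg_idempotent"))  # 22.0: same answer as a single run
```

The true revenue for the day is 22.0. The naive table reports three times that after three runs, while the partition-overwriting version is unaffected by how often it is replayed.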
data_interval_start and data_interval_end
Airflow 2.2 introduced data_interval_start and data_interval_end, a pair of timestamps that describe the interval the task is responsible for, not the moment the task runs, and renamed execution_date to logical_date. This is one of the most-asked clarifications in DE interviews because the naming was historically confusing.
| Macro | Meaning | Daily DAG example for run on 2026-05-08 00:00 |
|---|---|---|
| data_interval_start | Start of the interval the run processes | 2026-05-07 00:00 |
| data_interval_end | End of the interval (exclusive) | 2026-05-08 00:00 |
| logical_date | Legacy execution_date, kept for compatibility | 2026-05-07 00:00 |
| ts / ds | Convenience formats of logical_date | 2026-05-07T00:00:00+00:00 / 2026-05-07 |
The key intuition: a daily DAG run that starts at midnight UTC on 2026-05-08 is processing yesterday's data, because yesterday is the interval that just closed. data_interval_start is the partition you should be writing to, and logical_date exists mostly so that older operator code keeps working.
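The same arithmetic in code, as a sketch of the interval semantics described above. It models only a midnight-aligned daily schedule in UTC and does not call Airflow; the helper name is hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def daily_data_interval(run_after: datetime) -> Tuple[datetime, datetime]:
    """Return (data_interval_start, data_interval_end) for a daily run.

    Simplified model of a midnight-aligned "@daily" schedule: the run
    fires once its interval has closed, so the interval ends at the most
    recent midnight and starts one day before that.
    """
    end = run_after.replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=1)
    return start, end


run_time = datetime(2026, 5, 8, 0, 0, tzinfo=timezone.utc)
interval_start, interval_end = daily_data_interval(run_time)

# The partition to write is named after the start of the interval.
partition = interval_start.strftime("%Y-%m-%d")
print(partition)  # 2026-05-07
```

A run that fires on 2026-05-08 therefore writes the 2026-05-07 partition, matching the first row of the table.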
Backfill strategies compared
Not every recovery needs the same approach. The table below maps four common scenarios to the right tool — interviewers like candidates who pick deliberately rather than reflexively reaching for --reset-dagruns.
| Scenario | Strategy | Flags / approach | Risk profile |
|---|---|---|---|
| New DAG, want last 90 days filled | Manual backfill, capped concurrency | airflow dags backfill -s ... -e ... with max_active_runs=4 set on the DAG (in Airflow 3, airflow backfill create accepts --max-active-runs 4) | Low: DAG is idempotent by design |
| One transformation logic changed | Backfill one task only | --task-regex 'agg_revenue' | Low if downstream is independent |
| Partition has corrupt data | Reset specific runs | --reset-dagruns over the bad window | Medium: overwrites must be safe |
| Stateful pipeline with depends_on_past | Sequential backfill, fix-forward on failure | Default, expect serial execution | High: one bad day stalls the chain |
The pattern: pick the smallest blast radius that solves the problem. Reaching for --reset-dagruns across a quarter of history when you only changed one column is the kind of move that gets flagged in a post-incident review.
Common pitfalls
The first trap is leaving catchup=True on a brand-new DAG with a start_date several months in the past. The moment you unpause it the scheduler tries to launch every missed interval at once, saturates your pool, and either OOMs your workers or floods downstream warehouses with concurrent writes. The fix is catchup=False on new DAGs as a project convention, with manual backfill when history actually matters.
The second trap is non-idempotent writes. An INSERT INTO without a corresponding DELETE or MERGE looks fine on the green-field run and silently doubles your rows on every replay. You will not notice until the finance team flags a 200% revenue spike on a backfilled month. The fix is to make every task partition-keyed on data_interval_start and to overwrite that partition rather than append.
The third trap is ignoring depends_on_past for stateful pipelines. If today's row references yesterday's closing balance and you backfill out of order — or worse, in parallel — you get inconsistent state. The fix is to mark the task depends_on_past=True and to accept the serial cost in exchange for correctness.
The fourth trap is backfilling the entire DAG when only one task changed. This wastes compute and can re-trigger expensive downstream steps like model training or vendor API calls. The fix is --task-regex plus --ignore-dependencies once you have verified upstream is healthy, scoped to exactly the tasks whose logic moved.
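Scoping by regex has its own trap: a loose pattern can pull in more tasks than intended. The sketch below assumes the pattern is applied with search semantics, meaning it may match anywhere in the task ID. That is my understanding of how Airflow 2.x filters tasks, but treat it as an assumption and verify it against your version. The task IDs are invented.

```python
import re
from typing import List


def tasks_matching(pattern: str, task_ids: List[str]) -> List[str]:
    """Return task IDs in which the pattern matches anywhere (search semantics)."""
    return [task_id for task_id in task_ids if re.search(pattern, task_id)]


# Hypothetical task IDs for a revenue DAG.
task_ids = ["extract_orders", "agg_revenue", "agg_revenue_by_region", "train_model"]

loose = tasks_matching("agg_revenue", task_ids)
strict = tasks_matching("^agg_revenue$", task_ids)

print(loose)   # ['agg_revenue', 'agg_revenue_by_region']
print(strict)  # ['agg_revenue']
```

Anchoring the pattern with ^ and $ keeps the blast radius to the one task whose logic changed. A dry run of the selection before the real backfill is a cheap way to confirm which tasks were picked.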
The fifth trap is relying on execution_date in user code. It has been deprecated since Airflow 2.2, Airflow 3 removes it from the task context, and even in 2.x its semantics confuse readers. The fix is to use data_interval_start for the data you are processing and logical_date only when interoperating with older operators that still expect it.
Related reading
- What is Apache Airflow
- Apache Iceberg deep dive — DE interview
- MERGE and UPSERT — DE interview
- SQL for Data Engineer interview
FAQ
What is max_active_runs and why does it matter for backfill?
max_active_runs caps the number of DAG runs that can be active (not yet finished) at the same time. Unless the DAG sets it explicitly, the value comes from the max_active_runs_per_dag setting, which defaults to 16. During a wide backfill that limit becomes the natural throttle: runs queue and are processed in groups of at most that size. Lower it explicitly (say to 2 or 4) when your backfill window hits an external system with strict rate limits, like a SaaS API or a small Postgres replica.
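A rough way to reason about the throttle is to split the backfill window into waves of at most max_active_runs runs. This is a deliberately simplified model: the real scheduler starts a new run as soon as a slot frees up rather than waiting for a whole wave to finish. The function name is hypothetical.

```python
from typing import List


def plan_waves(run_dates: List[str], max_active_runs: int) -> List[List[str]]:
    """Split runs into groups of at most max_active_runs.

    Simplified model: each wave finishes before the next one starts.
    """
    if max_active_runs < 1:
        raise ValueError("max_active_runs must be at least 1")
    return [
        run_dates[i : i + max_active_runs]
        for i in range(0, len(run_dates), max_active_runs)
    ]


# Ten daily runs, 2026-04-01 through 2026-04-10.
run_dates = [f"2026-04-{day:02d}" for day in range(1, 11)]
waves = plan_waves(run_dates, max_active_runs=4)

for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

With a cap of 4, ten runs take three waves of 4, 4, and 2. Multiplying the number of waves by a typical run duration gives a first estimate of how long the backfill will occupy the external system.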
Can I backfill in production without downtime?
Yes, if every task is idempotent and partition-keyed, and the overwrite itself is atomic per partition (a single transaction, or a snapshot commit on a table format such as Iceberg or Delta). Readers then see either the old or the new version, and there is no global lock. Without idempotency you usually need either a maintenance window or a write to a shadow table followed by an atomic swap. Most teams converge on idempotency precisely because it removes the downtime question entirely.
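The shadow-table route can be sketched with sqlite3 as a stand-in for a warehouse. The corrected data is built in a side table while readers keep using the live one, then the two are swapped inside one transaction. Table names, the bad value, and the helper name are invented for illustration; real warehouses have their own swap primitives, so treat this as the shape of the idea rather than a recipe.

```python
import sqlite3
from typing import List, Tuple

# isolation_level=None puts the connection in autocommit mode so the
# transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript(
    """
    CREATE TABLE agg_daily_revenue (event_date TEXT, revenue_usd REAL);
    INSERT INTO agg_daily_revenue VALUES ('2026-04-01', 999.0);
    """
)


def rebuild_and_swap(rows: List[Tuple[str, float]]) -> None:
    """Build corrected data in a shadow table, then swap it in atomically."""
    # Step 1: build the shadow table; the live table is untouched.
    conn.execute("DROP TABLE IF EXISTS agg_daily_revenue_shadow")
    conn.execute(
        "CREATE TABLE agg_daily_revenue_shadow (event_date TEXT, revenue_usd REAL)"
    )
    conn.executemany("INSERT INTO agg_daily_revenue_shadow VALUES (?, ?)", rows)
    # Step 2: swap inside one transaction so there is no half-renamed state.
    conn.execute("BEGIN")
    try:
        conn.execute("ALTER TABLE agg_daily_revenue RENAME TO agg_daily_revenue_old")
        conn.execute("ALTER TABLE agg_daily_revenue_shadow RENAME TO agg_daily_revenue")
        conn.execute("DROP TABLE agg_daily_revenue_old")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


rebuild_and_swap([("2026-04-01", 22.0)])
live_rows = conn.execute("SELECT * FROM agg_daily_revenue").fetchall()
print(live_rows)  # [('2026-04-01', 22.0)]
```

The expensive rebuild happens outside the transaction, so the window in which readers could be blocked is limited to the two renames.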
Should I use catchup=True or catchup=False by default?
Default to catchup=False for new DAGs. It removes the "scheduler avalanche on first unpause" failure mode, and it makes the team's mental model simple — "the DAG starts running from now". When you actually want history, run an explicit airflow dags backfill with a controlled date range and capped concurrency. That is also easier to review in a PR than a config flag whose blast radius depends on start_date.
How is backfill different in Airflow 3 vs 2.x?
Airflow 3 introduces a first-class backfill object with its own ID, lifecycle, and UI surface. You create one with airflow backfill create, monitor it, and cancel it cleanly without touching DAG runs directly. The semantics of catchup and depends_on_past are unchanged, although the default for catchup flipped to False in 3.0, and the data_interval macros are the same. If you are coming from 2.x, the mental upgrade is that backfill is no longer a side effect of run creation: it is its own thing.
Does backfill respect Airflow pools and queues?
Yes. In Airflow 3 backfill runs are created and scheduled by the scheduler itself; in 2.x the CLI process drives them, but their tasks still consume the same pool slots and executor capacity as regular runs. That is why an unbounded backfill can starve your other DAGs: they share the pool. The two main controls are max_active_runs on the DAG and slot counts on the pool. In practice senior DEs combine both: small DAG-level concurrency for the backfilling DAG plus a dedicated pool for heavy historical jobs.
When should I rebuild vs backfill?
If the schema changed in a backward-incompatible way — a column was renamed, a join key shifted, partitioning changed — rebuild a new table and swap. Backfill is for cases where the data shape is stable and only the values need to be recomputed. The rule of thumb: if your downstream consumers would not notice the difference between a rebuild and a backfill, choose backfill for the simpler operational story; otherwise pay the migration cost up front.
Is this official Airflow documentation?
No. This is interview-oriented synthesis based on Airflow 2.7+ and 3.0 behavior. For authoritative semantics, consult the official Apache Airflow docs for your exact version.