Apache Airflow in the Data Engineering interview

Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Why Airflow shows up in DE interviews

Apache Airflow is the de facto orchestrator at most modern data platforms — Airbnb (where it was born), Stripe, DoorDash, Snowflake, and Databricks all run it in some form. If you are interviewing for a Data Engineer role at a company with a real warehouse, Airflow knowledge is table stakes: what a DAG is, how dependencies are wired, why idempotency matters.

The deeper questions separate juniors from mid-level candidates: how the scheduler parses DAG files, when a sensor is the wrong choice, how to design a pipeline that survives a partial failure and a Monday-morning rerun. Hiring managers care less about reciting operator names and more about reasoning about state, retries, and side effects under load. The code below stays short on purpose; the interesting part is the conversation around it.

DAGs, operators, dependencies

A DAG (Directed Acyclic Graph) is the unit of orchestration: a set of tasks plus the edges between them. The shape must stay acyclic — if transform depends on extract and extract somehow depends on transform, the scheduler refuses to run the DAG. Three concepts you should be able to define cold:

  • Operator — the template for what a task does (PythonOperator, BashOperator, KubernetesPodOperator, SnowflakeOperator, etc.).
  • Task — a concrete instance of an operator inside a DAG.
  • Dependency — the order between tasks, declared with >> / << or set_upstream / set_downstream.

A minimal DAG looks like this:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    "simple_etl",
    start_date=datetime(2026, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:

    extract = PythonOperator(task_id="extract", python_callable=extract_fn)
    transform = PythonOperator(task_id="transform", python_callable=transform_fn)
    load = PythonOperator(task_id="load", python_callable=load_fn)

    extract >> transform >> load

Interviewers like to probe the DAG-level arguments. Know what each of these does and the trade-offs:

Argument What it controls Common interview trap
schedule Cron expression or preset (@daily, @hourly, None) Confusing schedule with execution_date — the run for 2026-05-17 00:00 actually starts at the end of that interval.
catchup Whether the scheduler backfills from start_date on first deploy Leaving it as the default and triggering thousands of historical runs by accident.
max_active_runs Concurrent DAG runs allowed Setting it too high causes contention on shared warehouse credits.
max_active_tasks Concurrent task instances per DAG Often confused with task_concurrency (per-task across runs).
default_args.retries Retry count per task Putting non-deterministic side effects in a task that retries 3 times.

The operator landscape is also fair game:

Operator Best fit When to avoid
PythonOperator Glue code, calling a library Heavy CPU/GPU work — it blocks a worker slot.
BashOperator spark-submit, dbt run Anything with secrets in the command string.
KubernetesPodOperator Isolated heavy jobs with their own image Tiny tasks where pod startup dwarfs the work.
SQLExecuteQueryOperator Idempotent SQL against Snowflake/Postgres/BigQuery Multi-statement scripts that need cross-operator transactions.
TriggerDagRunOperator Cross-DAG orchestration Replacing what should be a single DAG with a chain of triggers.

Idempotency and backfill

Idempotency is the single most load-bearing concept in Airflow. A task is idempotent if running it twice with the same parameters produces the same end state. Without idempotency, retries corrupt your data and backfill becomes a liability instead of a tool.

The classic non-idempotent task:

-- bad: a retry produces duplicate rows
INSERT INTO events_agg (DATE, count)
VALUES ('2026-05-17', 10000);

Three reliable ways to make it idempotent — delete-then-insert on the partition key, a MERGE upsert, or INSERT OVERWRITE PARTITION:

-- option 1: delete-then-insert on the partition key
DELETE FROM events_agg WHERE DATE = '{{ ds }}';
INSERT INTO events_agg (DATE, count)
SELECT '{{ ds }}', COUNT(*) FROM events WHERE event_date = '{{ ds }}';

-- option 2: partitioned table with INSERT OVERWRITE
INSERT OVERWRITE TABLE events_agg PARTITION (DATE='{{ ds }}')
SELECT COUNT(*) FROM events WHERE event_date = '{{ ds }}';

Backfill is the operation that reruns a DAG for past execution_date values. You backfill when you fix a bug and need to regenerate downstream data, when a DAG was paused for a week, or when you ship a new pipeline and need to populate history. The CLI is straightforward:

airflow dags backfill -s 2026-01-01 -e 2026-01-31 my_dag

Load-bearing trick: never backfill a DAG that contains a non-idempotent task. The first run produces correct data, every rerun produces duplicates, and you spend the rest of the day reconciling. If a colleague asks you to backfill, your first question should be "is every task idempotent on {{ ds }}?"

For a deeper walk-through of backfill semantics and the edge cases interviewers love (gaps, partial reruns, schema drift), see the companion piece on Airflow backfill in the Data Engineering interview.

XCom and Jinja templates

XCom ("cross-communication") is how tasks pass small values to each other. The default metadata backend is your Airflow database, which means anything stored in XCom is serialized JSON and capped by the column size — practically, under a megabyte, and you want to keep individual values well below that.

def push_value(**context):
    context["ti"].xcom_push(key="user_count", value=12345)

def pull_value(**context):
    count = context["ti"].xcom_pull(task_ids="push_task", key="user_count")
    print(f"Got {count} users")

For larger payloads — a Pandas DataFrame, a model artifact, a list of millions of IDs — write the object to S3 or GCS and push only the path through XCom. A custom XCom backend can hide this for you, but the contract is the same: small references, not large blobs.

Jinja templating is built into most operators. The macros let you parameterize SQL paths and file locations by the run's logical date:

Macro Value Typical use
{{ ds }} 2026-05-17 Partition filter in SQL: WHERE date = '{{ ds }}'
{{ ds_nodash }} 20260517 S3 prefixes, Hive-style paths
{{ ts }} ISO timestamp Logging, audit columns
{{ macros.ds_add(ds, -7) }} 7 days before Rolling-window SQL
{{ data_interval_start }} Start of the interval New-style replacement for execution_date

A short Bash example that templates the date into a Spark job:

load = BashOperator(
    task_id="load",
    bash_command="spark-submit /jobs/etl.py --date={{ ds }}",
)
Train for your next tech interview
1,500+ real interview questions across engineering, product, design, and data — with worked solutions.
Join the waitlist

Sensors and how to use them safely

A sensor is a task that waits for a condition — a file appearing on S3, a row landing in a control table, an external DAG finishing — and then completes. Sensors are useful when an upstream system is genuinely outside your control. They are also the single biggest cause of cluster wedging in Airflow deployments.

from airflow.sensors.filesystem import FileSensor

wait_file = FileSensor(
    task_id="wait_for_input",
    filepath="/data/input/{{ ds }}.csv",
    poke_interval=60,
    timeout=3600,
    mode="reschedule",
)

Gotcha: the default mode="poke" holds a worker slot for the entire wait. A sensor with a 24-hour timeout in poke mode will keep one slot pinned all day. Multiply that by ten DAGs and your cluster starves. Use mode="reschedule" so the sensor frees its slot between checks, or use a deferrable operator (Airflow 2.2+) backed by the triggerer process for nearly free waiting.

The other rule worth memorizing: if the upstream system can emit an event (Kafka message, S3 event notification, a webhook), prefer event-driven scheduling — Airflow 2.4+ datasets, or a small webhook handler that triggers the DAG — over polling. Polling is the cheapest thing to write and the most expensive thing to run.

Typical scenario questions

Interviewers rarely ask "define a DAG." They give you a scenario and watch how you reason. Four prompts that come up over and over:

1. Design a daily ETL DAG. "Every morning we pull yesterday's transactions from a partner API, normalize them, and load into our Snowflake warehouse. Sketch the DAG." A solid answer covers the three-stage skeleton (extract → transform → load), the idempotency strategy (INSERT OVERWRITE on the daily partition keyed by {{ ds }}), error handling (retries=3, exponential backoff, on_failure_callback to a Slack webhook), and an SLA so paging fires if the DAG runs past 09:00 UTC.

2. Debug a broken DAG. They paste 40 lines of Python with a planted bug. The classics: a cycle in >> declarations, a schedule string that doesn't parse, a non-idempotent SQL statement inside a task that retries, a race between two parallel branches writing to the same table. Read the code top to bottom before guessing.

3. Speed up a slow DAG. "Our 50-task DAG takes four hours; cut it in half." Walk through parallelizing independent branches, grouping micro-tasks into a TaskGroup or a single operator (Airflow overhead per task is real — roughly one to three seconds), offloading heavy work to KubernetesPodOperator so the scheduler isn't bottlenecked, and profiling the actual slow task before optimizing the wrong thing.

4. Design monitoring. Which metrics matter, where they come from, what triggers a page. Expect to mention scheduler heartbeat lag, DAG parse time, per-task duration percentiles, retry rates, and SLA misses; expect to mention StatsD and a Prometheus scrape, plus a dashboard split by team-owned DAGs. The ability to name two metrics and explain why they page is worth more than naming twenty.

Common pitfalls

Heavy code at the top of a DAG file. The scheduler parses every DAG file on a tight loop — by default every 30 seconds — to detect changes. If your DAG file opens a database connection or pulls a large config from S3 at module load, you pay that cost every parse cycle and eventually wedge the scheduler. Keep the top of the file declarative and move expensive work inside python_callable so it runs at task execution, not at parse time.

Treating module-level variables as state. Engineers new to Airflow sometimes assign a variable at the top of the DAG file and read it inside a task. Because the scheduler reparses constantly and tasks run on different workers, there is no shared in-process state between parse and execution. Use Airflow Variables, Connections, or an external store like Redis instead.

Sensors in poke mode with long timeouts. Already covered, but it bears repeating because it sinks more clusters than any other single mistake. The symptom is innocuous — DAGs queue up, tasks sit in running forever — and the cause is one team that wrote a 12-hour FileSensor in poke mode and pinned half the worker pool. Audit your sensors and switch to reschedule or deferrable.

Pushing large payloads through XCom. A million-row DataFrame will technically serialize, will balloon your metadata database, and will slow every DAG because the UI loads XCom on the task instance page. XCom carries pointers, not payloads — anything bigger than a few kilobytes goes to object storage, with the path in XCom.

Forgetting catchup=False on a backdated DAG. You ship a new DAG with start_date=datetime(2020, 1, 1), leave catchup at its default True, unpause the DAG, and the scheduler tries to run six years of historical dates in parallel. Your warehouse gets hammered and on-call gets paged. Pick a recent start_date, set catchup=False unless you specifically need it, and commit both to muscle memory.

If you want to drill Airflow scenarios and SQL patterns like these every day, NAILDD is launching with hundreds of DE-flavored problems built around exactly this kind of design conversation.

FAQ

Which Airflow version should I prepare on?

Target Airflow 2.x, ideally 2.7 or newer. The 1.x line is end-of-life. The features interviewers probe in 2.x are the TaskFlow API (decorator-based authoring with implicit XCom), dynamic task mapping (expand / partial for fan-out), datasets for event-driven scheduling, and deferrable operators for cheap waiting. If you have 3.x available, the concepts transfer cleanly.

Airflow vs Dagster vs Prefect — which one to learn?

Airflow is the answer for interview prep because it is what most companies actually run. Dagster has a stronger asset-oriented model and better local dev ergonomics; Prefect 2 has a cleaner Python-first API. You will encounter Dagster at some Snowflake-adjacent shops and Prefect at a handful of startups, but the gravitational center of the job market is still Airflow. Learn it first, mention the others if asked.

How much hands-on practice does a junior need?

One or two end-to-end projects with a non-trivial DAG: a daily pull from a public API, a transformation step in Python or dbt, a load into a free-tier Snowflake or Postgres, plus a sensor, an XCom, and at least one demonstrably idempotent SQL task. Run it for a few weeks so you have stories about retries, schema drift, and one real outage. Lived experience beats theoretical answers in the interview.

Is this an official Apache Airflow guide?

No. This is an interview-prep digest based on public Airflow documentation and patterns we see come up with candidates interviewing for Data Engineer roles. Always check the official Airflow docs for the version your target company runs — deferrable operators and dataset scheduling have changed between minor releases.