What is Apache Airflow

What Airflow actually is

Apache Airflow is a workflow orchestrator. It decides what runs, in what order, on what schedule, and what to do when something breaks. The unit of work is a DAG (a directed acyclic graph of tasks), and the runtime's job is to walk that graph, respect dependencies, retry failures, and surface logs in a UI you can stare at on a Monday morning.

Picture any data team's daily routine: pull yesterday's orders from the production database, clean them, push them into Snowflake, refresh the executive dashboard. Without an orchestrator you have four scripts wired to cron, the second silently fails on a schema change, the next two run on stale data, and the VP of Finance asks why MRR dropped 12% overnight. Airflow's pitch: never let step 3 run if step 2 didn't succeed, and tell a human within minutes when something is off.

Analyst interviews at Stripe, Airbnb, DoorDash, and Snowflake increasingly assume you can read a DAG. Nobody expects you to write one from scratch on day one — but if "we use Airflow for our ETL" hits a system-design question and you blink, that's a signal.

The problem Airflow solves

Picture the pre-Airflow world. Twenty Python scripts run on cron, and nobody remembers which depends on which. One falls over because a column was renamed in production. The next runs anyway and produces a partial table. The dashboard shows yesterday's revenue at $4,200 instead of $420,000 because a join silently dropped 99% of rows. Logs live on a box you SSH into. Retries are a Slack message at 7am saying "can someone rerun the loader."

Airflow replaces that chaos with one system giving you five things:

  • Dependencies — task B never starts until task A succeeds.
  • Scheduling — cron expressions, calendar intervals, or data-aware triggers.
  • Retries — declarative, per-task: three tries with exponential backoff before alerting.
  • Observability — a web UI with every run, every task, every log line, color-coded.
  • Alerting — Slack, PagerDuty, email on failure or SLA miss.
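
The retry bullet above can be made concrete with a little arithmetic. The sketch below is plain Python, not Airflow's API: backoff_delays is a hypothetical helper that doubles the wait after each failed try and caps it. In Airflow itself the corresponding task arguments are retries, retry_delay, retry_exponential_backoff, and max_retry_delay, and the real computation differs in detail, so treat the numbers as illustrative.

```python
from datetime import timedelta


def backoff_delays(retries: int, base: timedelta, cap: timedelta) -> list[timedelta]:
    """Wait before each retry: base, 2*base, 4*base, ... never longer than cap."""
    return [min(base * (2 ** attempt), cap) for attempt in range(retries)]


# Three retries, starting at one minute, capped at ten minutes.
delays = backoff_delays(retries=3, base=timedelta(minutes=1), cap=timedelta(minutes=10))
print([str(d) for d in delays])
```

The cap matters more than it looks: without it, a task with many retries can end up waiting hours between attempts.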

The mental model that matters: Airflow doesn't do the work — your Python, SQL, dbt, or Spark job does. Airflow decides when and whether the work runs. Mixing those two responsibilities is the most common architectural mistake junior engineers make.

Core concepts: DAG, task, operator

DAG (Directed Acyclic Graph)

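A DAG describes your pipeline. Directed means dependencies point one way, from upstream task to downstream task. Acyclic means no task can end up depending on itself, directly or through a chain. In code, a DAG is a Python file in dags/ that Airflow re-parses on a loop (every 30 seconds by default in Airflow 2.x), not just once at deploy time.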

from datetime import datetime

from airflow import DAG

dag = DAG(
    "daily_etl",
    # `schedule` replaced the deprecated `schedule_interval` in Airflow 2.4
    schedule="0 3 * * *",            # every day at 03:00 (UTC unless the DAG sets a timezone)
    start_date=datetime(2026, 1, 1),
    catchup=False,                   # don't backfill intervals missed before deployment
)

The catchup=False is load-bearing — leave it on True by accident and Airflow will try to backfill every missed interval since start_date, which can mean thousands of historical runs the moment you deploy.

Task

A task is a single unit of work: run a SQL query, call an API, copy a file to S3, kick off a Spark job. Tasks chain with the >> operator, which reads like an arrow:

extract >> transform >> load >> notify

If transform fails, load and notify never start — they're marked upstream_failed in the UI, which is far more useful than the cron equivalent (silence, then surprise).
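
To see why a failure stops everything downstream, here is a minimal sketch in plain Python. It uses the standard library's graphlib rather than Airflow, and run_pipeline and the upstream mapping are hypothetical names for illustration. It walks the four tasks in dependency order and marks anything below a failure as upstream_failed, which roughly mirrors Airflow's default behavior.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (its upstream tasks).
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "notify": {"load"},
}


def run_pipeline(upstream, failing):
    """Walk tasks in dependency order; never start a task below a failure."""
    states = {}
    for task in TopologicalSorter(upstream).static_order():
        if any(states[dep] != "success" for dep in upstream[task]):
            states[task] = "upstream_failed"
        elif task in failing:
            states[task] = "failed"
        else:
            states[task] = "success"
    return states


states = run_pipeline(upstream, failing={"transform"})
print(states)
```

If the mapping contained a cycle (say, extract depending on notify), graphlib would raise CycleError instead of producing an order. That is the "acyclic" requirement in practice.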

Operator

An operator is a template for a task. Instead of writing boilerplate to connect to Postgres or upload to S3 every time, you instantiate a pre-built operator. The ones an analyst sees most often:

  • PythonOperator — runs a Python callable.
  • BashOperator — executes a shell command.
  • PostgresOperator / SnowflakeOperator / BigQueryOperator — runs SQL against a connection.
  • S3ToRedshiftOperator — copies a file from S3 into Redshift.
  • DbtCloudRunJobOperator — triggers a dbt Cloud job.

Airflow's providers ecosystem ships hundreds more — Databricks, Kubernetes, Docker, Slack, Datadog. If something has an API, somebody has built an operator for it.

A daily ETL example

The cliché — because every company has it — is the nightly orders pipeline. Extract new orders from the production replica, compute daily aggregates, land them in the warehouse, ping a channel when done.

from airflow import DAG
from airflow.operators.python import PythonOperator
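# Recent Postgres provider releases drop PostgresOperator in favor of
# SQLExecuteQueryOperator (airflow.providers.common.sql.operators.sql).
from airflow.providers.postgres.operators.postgres import PostgresOperator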
from datetime import datetime, timedelta

default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    "daily_orders_etl",
    default_args=default_args,
    schedule="0 3 * * *",  # `schedule` replaced the deprecated `schedule_interval` in Airflow 2.4
    start_date=datetime(2026, 1, 1),
    catchup=False,
) as dag:

    extract = PostgresOperator(
        task_id="extract_orders",
        postgres_conn_id="prod_db",
        sql="""
            INSERT INTO staging.raw_orders
            SELECT *
            FROM orders
            WHERE created_at >= '{{ ds }}'
              AND created_at <  '{{ next_ds }}'
        """,
    )

    transform = PostgresOperator(
        task_id="build_daily_stats",
        postgres_conn_id="warehouse",
        sql="""
            INSERT INTO mart.daily_order_stats
            SELECT
                DATE(created_at)   AS order_date,
                COUNT(*)           AS order_count,
                SUM(amount)        AS revenue,
                AVG(amount)        AS avg_order_value
            FROM staging.raw_orders
            WHERE DATE(created_at) = '{{ ds }}'
            GROUP BY DATE(created_at)
        """,
    )

    notify = PythonOperator(
        task_id="send_alert",
        python_callable=lambda: print("ETL finished"),
    )

    extract >> transform >> notify

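The {{ ds }} and {{ next_ds }} placeholders are Jinja template variables: Airflow substitutes the run's logical date, and the following one, as YYYY-MM-DD. ({{ next_ds }} has been deprecated since Airflow 2.2 in favor of {{ data_interval_end | ds }}.) After DAGs, this is the most important Airflow concept: every task knows what date it is processing, not what wall-clock time it ran. That's what makes backfills reproducible: rerun February 14th six months from now and you select the same rows you would have selected that morning, as long as the source rows haven't changed. For the rerun to be fully idempotent, the load also has to replace that date's rows rather than append to them; a bare INSERT run twice writes everything twice.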
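
A rerun only reproduces the same table if the load replaces its date's rows instead of appending to them. The sketch below shows one way to do that: delete-then-insert keyed on the logical date. It uses an in-memory SQLite database as a stand-in for the warehouse. The table names are simplified from the example above, and load_daily_stats is a hypothetical helper, not part of Airflow.

```python
import sqlite3


def load_daily_stats(conn, ds):
    """Rebuild the stats row for logical date ds (YYYY-MM-DD): delete, then insert."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM daily_order_stats WHERE order_date = ?", (ds,))
        conn.execute(
            """
            INSERT INTO daily_order_stats (order_date, order_count, revenue)
            SELECT DATE(created_at), COUNT(*), SUM(amount)
            FROM raw_orders
            WHERE DATE(created_at) = ?
            GROUP BY DATE(created_at)
            """,
            (ds,),
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (created_at TEXT, amount REAL)")
conn.execute(
    "CREATE TABLE daily_order_stats (order_date TEXT, order_count INTEGER, revenue REAL)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [
        ("2026-02-14 09:30:00", 40.0),
        ("2026-02-14 18:05:00", 60.0),
        ("2026-02-15 08:00:00", 25.0),
    ],
)

# Run the same logical date twice, as a retry or a backfill would.
load_daily_stats(conn, "2026-02-14")
load_daily_stats(conn, "2026-02-14")
rows = conn.execute("SELECT * FROM daily_order_stats").fetchall()
print(rows)
```

Running the load twice leaves a single row for the date. Many warehouses offer MERGE or partition overwrite for the same purpose; delete-then-insert is just the most portable form.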

Airflow vs cron vs the alternatives

The first question every team asks is: do we even need this? Cron is one line in a crontab. Airflow is a scheduler, a metadata database, a worker pool, and a web server. When does the complexity pay off?

                        | cron                  | Airflow                    | Prefect                | Dagster
------------------------+-----------------------+----------------------------+------------------------+------------------------
Task dependencies       | None                  | Yes, declared in DAG       | Yes, Pythonic flow     | Yes, asset-based
Retries                 | None                  | Per-task, configurable     | Per-task, configurable | Per-op, configurable
Logs & monitoring       | A file, if you set it | Web UI, per-task           | Cloud UI, per-run      | UI focused on assets
Backfill                | Manual                | Built-in                   | Built-in               | Built-in via partitions
Mental model            | Time-based            | Task graph                 | Functional Python flow | Data assets
Best fit                | 1-3 simple jobs       | 50+ pipelines, multi-team  | Python-heavy startups  | Analytics teams
Hosted option           | No                    | Astronomer, MWAA, Composer | Prefect Cloud          | Dagster+
Year-1 ops cost (rough) | $0                    | $25k-$60k self-hosted      | $5k-$30k cloud         | $10k-$40k

Cron works for one or two simple scripts. Past five jobs with real dependencies, cron's silent failures cost more than an orchestrator. That tipping point shows up earlier than most teams expect.

Prefect is Python-native with less boilerplate, popular at smaller startups. Dagster flips the model — you describe the assets you want (tables, files, ML models) rather than the tasks producing them, which fits analytics-first teams. dbt Cloud isn't a general orchestrator but pairs neatly with Airflow: Airflow handles the E and L, dbt handles the T.

When an analyst actually touches Airflow

Most analysts don't write DAGs from a blank page, because data engineering owns the infrastructure. But four scenarios show up constantly:

"Why is yesterday's data missing?" Did the ETL run? Open the Airflow UI, find the DAG, look at the run grid. Red means a task failed and you need the log. No run at all for that date means the DAG never triggered, so check whether it is paused. Knowing this saves asking an engineer who already has a ticket open.

"We need a new mart table." You wrote the SQL for a new aggregated table. Adding a PostgresOperator task is about fifteen lines. Submit that PR rather than waiting two sprints and you ship in days instead of weeks.

"The numbers came in late." Airflow shows the duration of every task. If extract_orders ran for 3 hours instead of the usual 10 minutes, something upstream was slow, most likely the production replica under load. You see it in the Gantt chart, not by guessing.

Interview signal. Senior analyst and analytics engineer loops lean on system understanding. "Walk me through how a number on a dashboard gets there" expects you to mention ingestion, warehouse, transformations, scheduling, monitoring. "Airflow runs a DAG nightly that ingests from Postgres, transforms with dbt, and refreshes the BI extract" is the kind of answer that closes loops.

Interview questions

The Airflow questions that come up at Snowflake, Databricks, Stripe, and the analytics engineering tracks at Meta and Netflix are consistent.

1. What is Airflow and what problem does it solve? A workflow orchestrator managing order, schedule, retries, and observability of data pipelines. It replaces cron-script tangles with one system that knows dependencies, alerts on failure, and provides a UI.

2. What is a DAG? A directed acyclic graph of tasks: directed because execution flows one way, acyclic because no task can depend on itself, even indirectly. In Airflow, the DAG is defined in a Python file that the scheduler re-parses continuously.

3. How is Airflow better than cron? Cron has no dependencies, retries, or observability. Airflow gives task-level dependencies, automatic retries with backoff, a metadata database, a UI for logs, and built-in backfill.

4. Empty data on yesterday's dashboard — what do you check? Find the DAG, check whether the last run succeeded. If a task is red, read the log. If the DAG never ran, check schedule, pause state, and upstream data arrival.

5. What is backfill? Re-running a DAG for past intervals. If a pipeline didn't run for three days, backfill replays the DAG once per missed interval with the correct {{ ds }} each time.

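6. execution_date vs actual start time? execution_date (renamed logical_date in Airflow 2.2) marks the start of the data interval the run is responsible for; the start time is when the scheduler actually kicked the run off, normally just after that interval ends. They differ by design, so a backfill processes the same data a live run would have.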

Common pitfalls

The most expensive Airflow mistake is putting logic in the DAG file itself. Anything at the top level of a DAG module runs every time Airflow re-parses the file, which happens continuously (every 30 seconds by default in Airflow 2.x). Open a database connection up there and you've built a slow denial-of-service attack against that database. Keep DAG files declarative; computation, queries, and API calls belong inside a task callable.
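
The cost of top-level code is easiest to see by counting. The sketch below is a plain-Python simulation, not Airflow code: open_connection stands in for any expensive call, and the two parse_* functions are hypothetical stand-ins for what happens each time a DAG file is re-parsed.

```python
counter = {"opened": 0}


def open_connection():
    """Hypothetical stand-in for an expensive call, e.g. a database connect."""
    counter["opened"] += 1
    return object()


def parse_bad_dag_file():
    # Anti-pattern: the connection is opened while the file is being parsed.
    conn = open_connection()
    return {"task": lambda: conn}


def parse_good_dag_file():
    # Declarative: parsing only defines the callable; the work waits for run time.
    def task():
        return open_connection()
    return {"task": task}


# Simulate the scheduler re-parsing each file 100 times.
for _ in range(100):
    parse_bad_dag_file()
bad_count = counter["opened"]

counter["opened"] = 0
for _ in range(100):
    dag = parse_good_dag_file()
dag["task"]()  # the task itself runs once
good_count = counter["opened"]

print(bad_count, good_count)
```

The first version opens a connection on every parse even though no task ever ran. The second opens one only when the task executes.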

A second trap is misunderstanding execution date. Airflow runs a DAG after the interval it covers: on a daily schedule, the run dated 2026-05-17 actually starts on the 18th. New users wire {{ ds }} expecting "today" and get confused when the run pulls "yesterday's data." Each run owns its data interval, and {{ ds }} is the start of that interval.
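
The date shift is easier to trust once it is written out. This is a minimal sketch for a daily schedule only, in plain Python; daily_run is a hypothetical helper that mimics how a run's date label relates to when it can start. It is not Airflow's own timetable logic.

```python
from datetime import datetime, timedelta


def daily_run(logical_date: datetime) -> dict:
    """For a daily schedule, derive the data interval and earliest start of a run."""
    interval_start = logical_date
    interval_end = logical_date + timedelta(days=1)
    return {
        "ds": interval_start.strftime("%Y-%m-%d"),  # what {{ ds }} renders to
        "data_interval": (interval_start, interval_end),
        "earliest_start": interval_end,  # the run waits for its interval to close
    }


run = daily_run(datetime(2026, 5, 17, 3, 0))
print(run["ds"], run["earliest_start"])
```

The run labeled 2026-05-17 cannot start before 03:00 on the 18th, because only then is the data it is responsible for complete.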

Third, catchup left on has bricked more proof-of-concepts than any other foot-gun. Deploy a daily DAG with start_date=datetime(2024, 1, 1) and catchup=True (the default in Airflow 2.x; Airflow 3 changed the default to False), and Airflow tries to run 500+ historical executions the moment the scheduler picks it up. Set catchup=False explicitly unless you are deliberately backfilling.
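
The size of that surprise is simple date arithmetic. The sketch below counts the daily intervals that have fully elapsed between a start date and a deploy time. The deploy date is hypothetical, chosen only for illustration, and the helper assumes for simplicity that the first interval begins exactly at start_date.

```python
from datetime import datetime, timedelta


def missed_daily_runs(first_interval_start: datetime, deploy_time: datetime) -> int:
    """Count daily intervals that have fully elapsed by deploy_time."""
    return max(0, (deploy_time - first_interval_start) // timedelta(days=1))


# Hypothetical deploy date: midday on 1 June 2025.
runs = missed_daily_runs(datetime(2024, 1, 1), datetime(2025, 6, 1, 12, 0))
print(runs)
```

With catchup enabled, each of those intervals becomes a run that gets queued as soon as the DAG is unpaused.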

A fourth pitfall is treating Airflow as a data movement engine. It is a scheduler, not a Spark cluster. Loading 10 GB into a Python operator and transforming in pandas inside the worker will OOM the worker. Push the work to where the data lives.

Finally, alert fatigue. Wiring every task to Slack-on-failure means nobody reads the channel within a month. Use SLAs for runs that matter, route critical failures to PagerDuty, and let the rest land in a digest. A dashboard nobody watches is worse than no dashboard.

If you want to drill the SQL and data-pipeline questions that come up in analyst and analytics-engineer interviews, NAILDD is launching with 500+ problems mapped to exactly this kind of system-design loop.

FAQ

Do analysts need to write DAGs from scratch?

For junior or mid analyst roles, no — understanding the concepts (DAG, task, operator, schedule, retries, backfill, logs) is enough. For senior, staff, or analytics engineering roles, yes: the ability to ship a PR that adds a new SQL task to an existing DAG separates analysts from analytics engineers in compensation bands. The pay gap between the two titles at major US tech companies is often $30k-$60k total comp.

Where do I practice without a production environment?

Run Airflow locally with the official docker-compose.yaml from the Apache Airflow docs — it spins up scheduler, webserver, and a Postgres metadata DB in about ten minutes. Build one DAG that reads a CSV and writes to a local Postgres, then a second DAG that depends on the first. That covers ~80% of what an interview probes.

Airflow or dbt — which matters more for an analyst?

Different problems. Airflow orchestrates when and in what order things run. dbt transforms what the data looks like once in the warehouse. Most modern stacks use both. If you can only learn one, learn dbt first — it is closer to daily SQL work.

Is Airflow still the standard in 2026?

Yes, with caveats. Airflow remains the most widely deployed orchestrator at companies above ~200 employees. Prefect and Dagster have real traction, especially Dagster at analytics-first teams. Interviewers still default to Airflow because its abstractions generalize to every other orchestrator on the market.

Can Airflow handle real-time pipelines?

Not really. Cron-style schedules bottom out at one-minute granularity, and even that strains the scheduler. For sub-minute or event-driven streams, Kafka plus a stream processor (Flink, Spark Structured Streaming, Materialize) is a better fit. Airflow is for batch and micro-batch.