Apache Airflow in 60 minutes: schedule your first data pipeline
Install Apache Airflow 3 locally with uv, write a DAG with the TaskFlow API, wire Python plus Bash plus SQL operators, schedule it, and monitor in the webserver.
What this gets you in 60 minutes
A working local Apache Airflow 3 install, a single DAG that pulls a CSV, transforms it with a Python task, writes the result back to SQLite via a SQL task, and runs on a daily schedule that you watch in the webserver UI at http://localhost:8080. The official Quick Start ships exactly one command (airflow standalone) that bootstraps the metadata database, creates an admin user, and starts the scheduler + webserver together. 2 By the end you can swap the demo CSV for a real source (your warehouse, an S3 bucket, a public API) without rewriting the DAG’s structure.
This tutorial targets Apache Airflow 3.2.1, the current stable release as of 2026-05-19. 9 Airflow 3 introduced a stable DAG-authoring namespace at airflow.sdk; the import patterns here use it directly. 6 Pre-3.0 DAGs that import DAG and task from internal modules still work but are deprecated. 10 Reader prerequisites: Python 3.10 or newer, a terminal, and 20 minutes of patience for the first install. No prior Airflow experience assumed.
Why Airflow, not cron + a shell script
Cron schedules a script. Airflow schedules a graph of tasks with dependencies, retries, backfills, and a UI you can hand to a teammate. The pitch from the project: “Airflow is a platform created by the community to programmatically author, schedule and monitor workflows.” 1 That sentence carries weight only if your pipeline has more than one step and you care about what happens when step two fails.
A two-line cron entry runs a Python script every night at 2am. That works until the script fails halfway through, leaves the database in a partial state, and no one knows until morning. Airflow’s value shows up exactly there: each task is a re-runnable unit with its own log, the DAG declares which tasks must finish before which others start, the UI shows which task failed on which date, and retries happen automatically with backoff. The cost is operational complexity: a metadata database, a scheduler process, and a webserver to keep running.
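To make the "graph of tasks" idea concrete, here is a minimal standard-library sketch. It is not Airflow code: it only uses Python's graphlib module to work out an order in which three placeholder tasks (extract, transform, load are hypothetical names) can run so that each one starts after its upstream task has finished.

```python
from graphlib import TopologicalSorter

# Map each (hypothetical) task to the set of tasks that must finish first.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

# static_order() yields tasks so that predecessors always come first.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

An orchestrator adds scheduling, retries, and logging on top of this ordering, which is the part cron cannot give you.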
Alternatives worth naming: Prefect and Dagster are the two modern competitors. Prefect leans toward Python-first ergonomics for data scientists; Dagster leans toward asset-graph modelling for data engineers who want to think in tables rather than tasks. Airflow remains the incumbent and the one most production data teams encounter first, because it has the longest connector inventory, the largest community, and an enormous installed base on managed offerings (AWS MWAA, Google Cloud Composer, Astronomer). Pick Airflow when “what does the rest of the team already know” outweighs “what’s cleanest on a greenfield project”.
Time required
About 60 minutes if you copy-paste as you read. Roughly: 10 minutes for Step 1 (install via uv or pip), 5 for Step 2 (airflow standalone and the first webserver login), 10 for Step 3 (writing a Hello-World DAG), 10 for Step 4 (TaskFlow API), 10 for Step 5 (BashOperator + SQL), 10 for Step 6 (scheduling and the webserver tour), and a buffer for the inevitable port conflict on 8080.
Steps
1. Install Airflow with uv (preferred) or pip
Apache Airflow has unusual install requirements because it pins a constraint file per Python + Airflow version, to keep the dependency graph reproducible. The Quick Start documents this with a constraints URL passed to pip. 2 The same constraint file works with uv, Astral’s fast package manager, which is what we use here because it cuts the install from minutes to seconds on most machines. 11
# 1a. Install uv if you don't already have it.
curl -LsSf https://astral.sh/uv/install.sh | sh
# 1b. Create a fresh project directory.
mkdir airflow-tutorial && cd airflow-tutorial
export AIRFLOW_HOME="$(pwd)/airflow"
# 1c. Create a virtual environment with uv.
uv venv --python 3.11
source .venv/bin/activate
# 1d. Install Airflow with the constraint file for your Python version.
AIRFLOW_VERSION=3.2.1
PYTHON_VERSION="$(python -c 'import sys; print(f"{sys.version_info.major}.{sys.version_info.minor}")')"
CONSTRAINT_URL="https://raw.githubusercontent.com/apache/airflow/constraints-${AIRFLOW_VERSION}/constraints-${PYTHON_VERSION}.txt"
uv pip install "apache-airflow==${AIRFLOW_VERSION}" --constraint "${CONSTRAINT_URL}"
Three details to flag. First, AIRFLOW_HOME is the directory Airflow uses for its config (airflow.cfg), DAG folder (dags/), and metadata SQLite file (airflow.db). Pointing it at the project folder keeps everything self-contained; without that export, Airflow defaults to ~/airflow which is fine but mixes state from any other Airflow project on your machine. Second, the constraint file pins every transitive dependency for this specific Airflow + Python combo; skipping it works most of the time and breaks in odd ways the day a sub-dependency releases a backwards-incompatible patch. Third, the install pulls 200+ packages, so expect 30 seconds with uv on a fast connection, several minutes with pip on a slow one.
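If you script the install from Python rather than the shell, the constraint URL from step 1d can be assembled the same way. This is a small sketch; constraint_url is a helper name made up for this tutorial, and the URL pattern is the one used in the shell snippet above.

```python
import sys
from typing import Optional


def constraint_url(airflow_version: str, python_version: Optional[str] = None) -> str:
    """Build the constraints URL for an Airflow + Python combination."""
    if python_version is None:
        # Default to the interpreter running this script, e.g. "3.11".
        python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return (
        "https://raw.githubusercontent.com/apache/airflow/"
        f"constraints-{airflow_version}/constraints-{python_version}.txt"
    )


print(constraint_url("3.2.1", "3.11"))
```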
If you prefer pip, the only change is pip install instead of uv pip install; everything else stays identical.
2. Boot the standalone Airflow instance
airflow standalone is the single command the Quick Start recommends for local development. It initialises the metadata DB, creates an admin user, and launches the scheduler, the webserver, and their supporting processes from a single command. 2
airflow standalone
On first run you see a wall of log output. Two lines matter. One reads Airflow is ready followed by the webserver URL (http://localhost:8080). The other is the admin password, printed once to the terminal, format admin:<a-random-string>. The same password is written to $AIRFLOW_HOME/simple_auth_manager_passwords.json.generated if you scroll past it. The username is always admin for the standalone profile.
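If the password has scrolled out of view, a few lines of Python can read it back. Treat this as a sketch under one assumption: that the generated file is a JSON object mapping usernames to passwords. Open the file in an editor first if you want to confirm its layout; read_standalone_password is a hypothetical helper, not part of Airflow.

```python
import json
from pathlib import Path


def read_standalone_password(airflow_home: str, username: str = "admin") -> str:
    """Return the generated password for `username`.

    ASSUMPTION: the file holds a JSON object of the form {"admin": "<password>"}.
    """
    path = Path(airflow_home) / "simple_auth_manager_passwords.json.generated"
    passwords = json.loads(path.read_text())
    return passwords[username]


# Usage, once airflow standalone has run at least once:
#   import os
#   print(read_standalone_password(os.environ["AIRFLOW_HOME"]))
```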
Open http://localhost:8080 in a browser, log in with admin plus the generated password, and you land on the DAG list. Out of the box the list shows the example DAGs Airflow ships with: example_bash_operator, tutorial_taskflow_api_etl, a handful more. These are convenient as references but pollute the UI for a real project. Edit $AIRFLOW_HOME/airflow.cfg, find the load_examples = True line, change it to load_examples = False, then restart airflow standalone (Ctrl-C, run again). The example DAGs disappear from the UI.
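Editing airflow.cfg is not the only route. Airflow also reads any config option from an environment variable named AIRFLOW__{SECTION}__{KEY}, and load_examples sits in the [core] section. The sketch below only builds that variable name; airflow_env_var is a helper invented for illustration.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Return the environment variable name that overrides [section] key."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


print(airflow_env_var("core", "load_examples"))
```

Exporting that variable with the value False in the shell before running airflow standalone has the same effect as the config-file edit, and it keeps the setting visible in your project's setup script.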
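Leave the airflow standalone terminal running for the rest of the tutorial. Changes to an existing DAG file are picked up on the next parse, about 30 seconds by default (min_file_process_interval in airflow.cfg). A brand-new file can take longer, because the dags folder is rescanned on a separate interval (refresh_interval in Airflow 3, called dag_dir_list_interval in 2.x, five minutes by default). No restart is needed unless you change config or install a new provider package.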
3. Write a Hello-World DAG with the DAG context manager
A DAG (directed acyclic graph) is a Python file that defines a graph of tasks. Airflow scans $AIRFLOW_HOME/dags/ for .py files containing the substrings airflow and dag (case-insensitive), and tries to import each one. Anything that defines a DAG object gets registered.
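The substring rule is easy to check for yourself. The function below is a rough stand-in for that heuristic written with the standard library only; it is not Airflow's own implementation, and looks_like_dag_file is a made-up name.

```python
import tempfile
from pathlib import Path


def looks_like_dag_file(path: Path) -> bool:
    """True if the file contains both 'airflow' and 'dag', ignoring case."""
    content = path.read_bytes().lower()
    return b"airflow" in content and b"dag" in content


# Demo with two throwaway files.
demo_dir = Path(tempfile.mkdtemp())
candidate = demo_dir / "candidate.py"
candidate.write_text("from airflow.sdk import DAG\n")
plain = demo_dir / "plain.py"
plain.write_text("print('just a helper script')\n")

print(looks_like_dag_file(candidate), looks_like_dag_file(plain))
```

A helper module that never mentions either word is skipped before import, which is why a DAG built entirely through an aliased import can go missing from the UI.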
Create the dags folder and the first DAG file.
mkdir -p $AIRFLOW_HOME/dags
Save the following as $AIRFLOW_HOME/dags/hello_dag.py.
from datetime import datetime, timedelta

from airflow.sdk import DAG, task

default_args = {
    "owner": "data-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=2),
}

with DAG(
    dag_id="hello_world",
    description="First DAG: prints a message and a count",
    schedule="@daily",
    start_date=datetime(2026, 5, 1),
    catchup=False,
    default_args=default_args,
    tags=["tutorial"],
) as dag:

    @task
    def say_hello() -> str:
        message = "Hello from Airflow"
        print(message)
        return message

    @task
    def count_words(text: str) -> int:
        n = len(text.split())
        print(f"Word count: {n}")
        return n

    count_words(say_hello())
What each parameter does. dag_id is the unique name Airflow uses in the UI and CLI. schedule="@daily" is one of Airflow’s cron preset shortcuts (others: @hourly, @weekly, @once, plus full cron expressions like "0 2 * * *"). 3 start_date is the earliest logical date the scheduler considers; combined with catchup=False it means “start running from now, do not backfill historical runs”. default_args apply to every task in the DAG unless individually overridden, with retries=1 and retry_delay giving you one automatic retry two minutes after a failure.
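The retry settings are easier to reason about with a toy model. The sketch below is plain Python, not how Airflow's scheduler is implemented: run_with_retries is a hypothetical helper that calls a function, waits retry_delay after a failure, and gives up once the extra attempts are used, so retries=1 means at most two attempts in total.

```python
import time
from datetime import timedelta


def run_with_retries(func, retries: int, retry_delay: timedelta, sleep=time.sleep):
    """Call func; on failure wait retry_delay and try again, up to `retries` extra times.

    Returns (result, attempts_used). Re-raises the last error when retries run out.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return func(), attempts
        except Exception:
            if attempts > retries:
                raise
            sleep(retry_delay.total_seconds())


# Demo: a task that fails on its first call and succeeds on the second.
state = {"calls": 0}


def flaky_task() -> str:
    state["calls"] += 1
    if state["calls"] == 1:
        raise RuntimeError("transient failure")
    return "ok"


# The sleep is stubbed out so the demo does not actually wait two minutes.
result, attempts = run_with_retries(
    flaky_task, retries=1, retry_delay=timedelta(minutes=2), sleep=lambda s: None
)
print(result, attempts)
```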
The @task decorator is the TaskFlow API: each decorated function becomes a task in the DAG, and the return value of one task can be passed as an argument to another. The line count_words(say_hello()) defines the dependency: count_words runs after say_hello, and its text argument is the return value of say_hello. The plumbing (Airflow’s XCom mechanism for inter-task data passing) is handled invisibly. 4
Save the file. The DAG appears in the UI under the name hello_world after the next scan of the dags folder, usually within a few minutes at most. Click into it, hit the Play button (top right), pick “Trigger DAG”, and watch the two tasks turn green. Click on either task to see its logs; the print statements are captured.
4. Use the TaskFlow API for a real data flow
The TaskFlow API is Airflow’s modern authoring pattern: write tasks as functions, return values as Python objects, and let Airflow handle serialisation through XCom. The classic pattern (explicit PythonOperator(...) calls with op_kwargs and xcom_push/xcom_pull) still works but is verbose by comparison. 4
Replace hello_dag.py with a small ETL example: extract a CSV from a URL, transform it (filter rows, compute totals), load it into a SQLite table.
from datetime import datetime
import sqlite3

import pandas as pd
from airflow.sdk import dag, task


@dag(
    dag_id="daily_sales_etl",
    description="Daily sales pipeline: fetch CSV, aggregate, load into SQLite",
    schedule="@daily",
    start_date=datetime(2026, 5, 1),
    catchup=False,
    tags=["tutorial", "etl"],
)
def daily_sales_etl():

    @task
    def extract() -> str:
        url = "https://people.sc.fsu.edu/~jburkardt/data/csv/airtravel.csv"
        # The source file puts a space after each comma; skipinitialspace
        # keeps that space (and stray quotes) out of the column names.
        df = pd.read_csv(url, skipinitialspace=True)
        local_path = "/tmp/airflow_extract.csv"
        df.to_csv(local_path, index=False)
        return local_path

    @task
    def transform(input_path: str) -> dict:
        df = pd.read_csv(input_path)
        df.columns = [c.strip() for c in df.columns]
        # Sum every column except the month label: one total per year.
        totals = {
            col: int(df[col].sum())
            for col in df.columns if col != "Month"
        }
        return totals

    @task
    def load(totals: dict) -> int:
        db_path = "/tmp/airflow_sales.db"
        with sqlite3.connect(db_path) as conn:
            conn.execute(
                "CREATE TABLE IF NOT EXISTS yearly_totals "
                "(year TEXT PRIMARY KEY, passengers INTEGER)"
            )
            # INSERT OR REPLACE keeps the load safe to rerun.
            for year, total in totals.items():
                conn.execute(
                    "INSERT OR REPLACE INTO yearly_totals VALUES (?, ?)",
                    (year, total),
                )
            conn.commit()
        return len(totals)

    rows_written = load(transform(extract()))


daily_sales_etl()
The @dag decorator (rather than the with DAG(...) as dag: context manager) is the TaskFlow-style way to declare a DAG, mirroring @task for tasks. The three functions wire into a linear pipeline extract → transform → load by passing return values as arguments. Airflow serialises the return value of extract (the file path) and the return value of transform (the totals dict) through XCom automatically. For small payloads (paths, counts, small dicts) this is convenient and free; for large payloads, write to S3 or a shared filesystem and pass the path, never the dataframe itself.
Trigger the DAG from the UI as before. Each task has its own log, and its return value appears on the task’s XCom tab: the file path for extract, the column totals for transform, the number of rows written for load. Add print() calls inside the functions if you want those values in the logs as well. The SQLite file at /tmp/airflow_sales.db is the pipeline’s output.
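To confirm the load step did what you expect, query the output file directly. This is a small standard-library sketch; read_yearly_totals is a helper name invented here, and the table and column names are the ones the load task creates.

```python
import sqlite3
from contextlib import closing


def read_yearly_totals(db_path: str = "/tmp/airflow_sales.db") -> list:
    """Return (year, passengers) rows from the pipeline's output table."""
    # closing() makes sure the connection is released after the query.
    with closing(sqlite3.connect(db_path)) as conn:
        cursor = conn.execute(
            "SELECT year, passengers FROM yearly_totals ORDER BY year"
        )
        return cursor.fetchall()


# Usage, after the DAG has run at least once:
#   for year, passengers in read_yearly_totals():
#       print(year, passengers)
```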
The example URL is a public CSV from a Florida State University teaching set; swap it for your own data source when you adapt the DAG. The pattern (Python function, return value, next task consumes it) extends to any extract-transform-load shape.
5. Mix BashOperator and SQL into the same DAG
The TaskFlow @task decorator is the cleanest pattern when tasks are pure Python. Real pipelines also call shell scripts (a dbt run, a gsutil copy, a custom CLI tool) and execute SQL against a warehouse. Airflow’s classic operator pattern handles both, and you can mix it with @task in the same DAG.
The BashOperator runs an arbitrary shell command. Add a task that prints a timestamp after the load step finishes. This is a trivial example, but the operator is exactly the same one you would use for dbt run --profiles-dir ... or aws s3 sync s3://... /tmp/....
from airflow.providers.standard.operators.bash import BashOperator

# inside the @dag function, after the existing tasks:
echo_done = BashOperator(
    task_id="echo_done",
    # Double quotes inside the command let the shell expand $(date ...);
    # single quotes would print the text literally.
    bash_command='echo "Pipeline finished at: $(date -u +%Y-%m-%dT%H:%M:%SZ)"',
)
rows_written >> echo_done
The >> operator (right-shift) sets the dependency: echo_done runs after rows_written. The bash command is a regular shell snippet; templated fields like {{ ds }} (the DAG’s logical date in YYYY-MM-DD format) work directly. Airflow 3 moved many of the previously-bundled operators (BashOperator, PythonOperator, EmptyOperator) into the apache-airflow-providers-standard provider package, which ships pre-installed with the standard pip install. 5
For SQL, connectivity lives in database-specific provider packages (apache-airflow-providers-postgres, apache-airflow-providers-mysql, apache-airflow-providers-snowflake, etc.), each of which supplies the hook and connection type for its database. SQLite has its own provider as well (apache-airflow-providers-sqlite). Install the Postgres provider if your warehouse is Postgres:
uv pip install "apache-airflow-providers-postgres"
Then a SQL task looks like this. The conn_id references a connection you create in the Airflow UI under Admin → Connections, storing the host / port / user / password / database name as a single named record. Tasks reference the connection by ID; credentials never appear in the DAG file.
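from airflow.providers.common.sql.operators.sql import SQLExecuteQueryOperator

# SQLExecuteQueryOperator is the generic SQL operator; the Postgres provider
# supplies the connection type. Recent Postgres provider releases no longer
# ship the older PostgresOperator.
aggregate_yesterday = SQLExecuteQueryOperator(
    task_id="aggregate_yesterday",
    conn_id="my_warehouse",
    sql="""
        INSERT INTO daily_summary (run_date, total_rows)
        SELECT '{{ ds }}', count(*) FROM raw_events
        WHERE event_date = '{{ ds }}';
    """,
)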
Two things to flag. The {{ ds }} template variable resolves to the DAG’s logical date at run-time, so the same SQL works whether the task runs today or backfills last Tuesday. And the sql= field accepts either a string (as above) or a path to a .sql file (sql="my_query.sql"), which scales better when queries get long. The query files live in the dags folder alongside the DAG.
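To see what the template substitution amounts to, here is a deliberately tiny stand-in. Airflow renders templated fields with Jinja; this sketch replaces only the {{ ds }} placeholder with plain string handling, and render_ds is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timezone


def render_ds(template: str, logical_date: datetime) -> str:
    """Toy stand-in for Jinja: substitute only the {{ ds }} placeholder."""
    ds = logical_date.strftime("%Y-%m-%d")
    return template.replace("{{ ds }}", ds)


sql = "SELECT count(*) FROM raw_events WHERE event_date = '{{ ds }}';"
rendered = render_ds(sql, datetime(2026, 5, 19, tzinfo=timezone.utc))
print(rendered)
```

Because the date comes from the run rather than from the wall clock, rerunning the task for an older logical date produces the query for that older date.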
6. Schedule it, monitor it, debug it
The schedule field on the DAG is what makes Airflow Airflow rather than a fancier script runner. The canonical patterns:
| Pattern | Meaning |
|---|---|
| "@daily" | Once per day at midnight UTC |
| "@hourly" | Once per hour at the top of the hour |
| "@weekly" | Once per week on Sunday at midnight UTC |
| "@once" | Run exactly once when first triggered, then never |
| None | Manual triggers only, no automatic schedule |
| "0 9 * * 1-5" | Cron: 9am UTC on weekdays |
| "0 */6 * * *" | Cron: every six hours |
| timedelta(hours=4) | Every four hours after the last run |
The start_date defines the earliest logical date the scheduler considers. With catchup=False, the scheduler only runs from “now” forward, ignoring missed historical dates. With catchup=True, the scheduler creates one DAG run for every interval between start_date and now, useful for backfills, dangerous if your DAG isn’t idempotent. Default to catchup=False until you have a reason to flip it.
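Idempotent here means that running the same logical date twice leaves the data in the same state as running it once. One common way to get there is delete-then-insert keyed on the run date. The sketch below shows the pattern against an in-memory SQLite database; load_daily_summary is a made-up helper and the daily_summary table mirrors the one in the Step 5 SQL.

```python
import sqlite3


def load_daily_summary(conn: sqlite3.Connection, run_date: str, total_rows: int) -> None:
    """Replace the summary row for run_date so reruns do not duplicate data."""
    conn.execute("DELETE FROM daily_summary WHERE run_date = ?", (run_date,))
    conn.execute(
        "INSERT INTO daily_summary (run_date, total_rows) VALUES (?, ?)",
        (run_date, total_rows),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_summary (run_date TEXT, total_rows INTEGER)")

load_daily_summary(conn, "2026-05-19", 120)
load_daily_summary(conn, "2026-05-19", 120)  # simulated rerun or backfill

row_count = conn.execute("SELECT count(*) FROM daily_summary").fetchone()[0]
print(row_count)
```

A plain INSERT in the same position would leave two rows after the rerun, which is exactly the kind of damage a surprise backfill causes.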
The webserver UI surfaces the runtime view. The Grid view (the successor to the Tree view from early 2.x releases) is a 2-D matrix of tasks (rows) by DAG runs (columns), colour-coded by status: green for success, red for failure, and further colours for queued, running, and retrying states. 8 Click any cell to see that task instance’s logs, XCom values, and rendered template fields. The Graph view shows the DAG’s task dependencies for a single run; useful for visualising “what happens after what”. The Calendar view is a heatmap of DAG-run success by date, useful for “did this pipeline run every day last month”.
The CLI is the other essential interface. Common commands:
# list all registered DAGs
airflow dags list
# trigger a DAG run manually (logical date defaults to now)
airflow dags trigger daily_sales_etl
# show task status for a specific DAG run
airflow tasks states-for-dag-run daily_sales_etl 2026-05-19
# test a single task without writing to the DB
airflow tasks test daily_sales_etl extract 2026-05-19
The last command is the one to remember: airflow tasks test <dag_id> <task_id> <logical_date> runs a task end-to-end in the foreground, prints logs to your terminal, does NOT record the run in the metadata database. It’s the fastest debug loop for a task that’s failing: tweak the code, rerun airflow tasks test, repeat. No scheduler involved.
Common pitfalls
The first pitfall is mistaking a DAG file for a long-running script. The scheduler parses every DAG file in the dags folder on a regular interval, by default every 30 seconds. Code at the top level of the file (outside @dag / @task functions) runs on every parse, including network calls and database queries. A line like df = pd.read_csv("https://...") at the top of a DAG file fetches that URL every 30 seconds, forever, until you move it inside a task. Keep top-level code to imports and DAG definition only.
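You can watch the difference between parse-time and run-time code without Airflow at all. The sketch below simulates three parse cycles by executing a small source string three times; the source string and the calls list are stand-ins for a DAG file and for an expensive side effect such as a network fetch.

```python
# A miniature "DAG file": one statement at module level, one inside a function.
DAG_FILE_SOURCE = """
calls.append("top-level")          # module level: runs on every parse

def extract():
    calls.append("inside task")    # function body: runs only when called
"""

calls = []
for _ in range(3):  # simulate three parse cycles
    namespace = {"calls": calls}
    exec(DAG_FILE_SOURCE, namespace)

print(calls)            # the top-level statement ran once per parse
namespace["extract"]()  # the task body runs only when the task executes
print(calls)
```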
The second pitfall is large XComs. The TaskFlow API serialises return values through XCom (a metadata-DB table) by default. Returning a 50 MB pandas DataFrame from one task to the next overwhelms the metadata DB and slows the entire scheduler. The rule: return small things (paths, IDs, counts, small dicts). For large data, write to disk or object storage and pass the path. Airflow’s “XCom backends” let you swap the default in-DB store for an S3 bucket; that’s the production answer once payloads get above a few MB.
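A cheap guard is to measure a return value before handing it to the next task. This is a sketch, not an Airflow feature: check_xcom_size is a hypothetical helper, and the 48 KB default is an arbitrary threshold chosen for illustration rather than an Airflow setting.

```python
import json


def check_xcom_size(value, limit_bytes: int = 48 * 1024) -> int:
    """Return the JSON-encoded size of value in bytes; raise if it exceeds limit_bytes."""
    size = len(json.dumps(value).encode("utf-8"))
    if size > limit_bytes:
        raise ValueError(
            f"payload is {size} bytes; write it to storage and return the path instead"
        )
    return size


print(check_xcom_size({"rows": 3}))
```

Calling it at the end of a task function, just before the return, turns a slow metadata-DB problem into an immediate, readable task failure.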
The third pitfall is timezone confusion. Airflow runs in UTC by default. A schedule of "0 9 * * *" runs at 9am UTC, which is 2:30pm IST year-round and, during standard (winter) time, 4am US Eastern or 10am Central European Time; daylight saving shifts the last two by an hour. The start_date is also UTC. Set a timezone value in airflow.cfg (Asia/Kolkata, America/New_York, Europe/London, etc.) if you want UI display in local time, but the scheduler still reasons internally in UTC; your schedule expressions are still UTC, only the UI rendering changes.
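Before committing to a cron expression, it helps to print what a UTC run time means on local clocks. The sketch below uses the standard zoneinfo module (which needs the system or tzdata time zone database to be available); local_times is a helper name made up for this example.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def local_times(run_utc: datetime, zone_names) -> dict:
    """Map each IANA zone name to the local HH:MM of a UTC run time."""
    return {
        name: run_utc.astimezone(ZoneInfo(name)).strftime("%H:%M")
        for name in zone_names
    }


zones = ("Asia/Kolkata", "America/New_York", "Europe/Berlin")
winter = local_times(datetime(2026, 1, 15, 9, 0, tzinfo=timezone.utc), zones)
summer = local_times(datetime(2026, 7, 15, 9, 0, tzinfo=timezone.utc), zones)
print(winter)
print(summer)
```

Comparing the two dictionaries shows which of your stakeholders will see the pipeline land an hour later once daylight saving starts.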
The fourth pitfall is editing a DAG without the scheduler picking it up. There are two common causes: either the file does not contain both airflow and dag as substrings (Airflow’s safe-mode optimisation skips files that lack them), or the file has a syntax error and silently fails to import. Run airflow dags list-import-errors to see every DAG file that failed to parse, with the traceback. Once you fix the error, the DAG appears in the UI within 30 seconds.
The fifth pitfall is leaving catchup=True (the historical default). Pre-3.0 Airflow defaulted catchup to True; Airflow 3 changes the default to False for new DAGs. 8 If you copy older snippets that set catchup=True explicitly and the DAG’s start_date is a month back, the scheduler triggers ~30 backfill runs immediately on first parse. Verify the flag before deploying.
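To estimate how many runs a catchup would create, count the schedule intervals between start_date and now. The function below is a simplified model that counts one run per fully elapsed interval; the exact boundary rules depend on the timetable Airflow uses for your schedule, so treat the result as an estimate. catchup_logical_dates is a hypothetical helper.

```python
from datetime import datetime, timedelta


def catchup_logical_dates(start_date: datetime, now: datetime,
                          interval: timedelta = timedelta(days=1)) -> list:
    """List the logical dates whose interval has fully elapsed by `now`."""
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


runs = catchup_logical_dates(datetime(2026, 4, 19), datetime(2026, 5, 19))
print(len(runs), runs[0].date(), runs[-1].date())
```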
Where to go next
A short list of directions readers commonly take this from. Provider packages are the connector inventory: there is an official provider for nearly every database, cloud, and SaaS surface you can name (Snowflake, BigQuery, Databricks, Slack, GitHub, Salesforce). Each ships with operators and hooks; install with uv pip install "apache-airflow-providers-<name>". The full inventory is on PyPI under the apache-airflow-providers- prefix. 7
Sensors are tasks that wait for a condition before completing: a file landing in S3, a SQL query returning a row, an external API returning 200. S3KeySensor, SqlSensor, HttpSensor are the common ones. Use mode="reschedule" rather than the default mode="poke" for long-running waits; reschedule mode releases the worker slot between polls instead of blocking it.
Datasets and data-aware scheduling are an Airflow 2.4+ feature that lets DAG B trigger automatically when DAG A produces a “dataset” (typically a URI like s3://... or postgres://.../table). It replaces the older pattern of chaining DAGs via ExternalTaskSensor. Airflow 3 renames datasets to assets and expands the feature with a richer event-driven model that integrates with cloud-native triggers. 8
Production deployment. airflow standalone is local-only. Production uses the LocalExecutor (single machine), the CeleryExecutor (Redis or RabbitMQ for distributed workers), or the KubernetesExecutor (each task as a pod). Managed offerings, including Amazon MWAA, Google Cloud Composer, and Astronomer, sidestep the executor choice and operate Airflow as a service. For teams that don’t want to operate the metadata DB and scheduler themselves, managed is usually the right answer.
Upgrading from Airflow 2.x. The official upgrading guide lists every breaking change; the major ones are imports moving to airflow.sdk, the REST API changing from /api/v1 to /api/v2, the removal of SubDAGs and SLAs, and changes to context variables. 10 The official upgrade-check utility (Ruff with AIR301 / AIR302 rules) flags everything that needs to change in your DAG code before you bump the version.
Sources
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. Apache Airflow homepage: official project positioning, feature inventory, and the current "platform created by the community to programmatically author, schedule and monitor workflows" framing this tutorial opens with
- 2. Apache Airflow Quick Start: the official `airflow standalone` bootstrap procedure, constraint-file install pattern, and admin password handling this tutorial follows in Steps 1 and 2
- 3. Apache Airflow fundamentals tutorial: DAG declaration patterns, schedule presets (`@daily`, `@hourly`, `@once`), and `start_date` / `catchup` semantics referenced in Steps 3 and 6
- 4. Apache Airflow TaskFlow API tutorial: the `@dag` and `@task` decorator pattern, automatic XCom serialisation, and the rationale for preferring TaskFlow over the classic `PythonOperator` pattern in Step 4
- 5. Apache Airflow PythonOperator howto: the operator that the `@task` decorator wraps, the `op_kwargs` and `op_args` parameters, and the relationship to TaskFlow this section relies on
- 6. Apache Airflow Task SDK reference: the `airflow.sdk` namespace introduced in Airflow 3 that exposes the stable DAG-authoring interface (`DAG`, `dag`, `task`) used throughout this tutorial
- 7. apache-airflow on PyPI: current release metadata, Python compatibility matrix, and the provider-package naming convention referenced in "Where to go next"
- 8. Apache Airflow 3.0 release notes: the new `airflow.sdk` namespace, `catchup=False` default for new DAGs, Grid-view replacing Tree-view, REST API v2, and the broader breaking-change list referenced in Steps 3 and 6
- 9. Apache Airflow 3.2.1 release notes: current stable version this tutorial targets, with the latest patch-level changes since 3.0 GA
- 10. Apache Airflow Upgrading to Airflow 3 guide: full breaking-change inventory, the `airflow.sdk` import migration, the Ruff `AIR301`/`AIR302` upgrade-check utility, and the minimum Python 3.9 / Airflow 2.7 prerequisites referenced in the intro and "Where to go next"
- 11. uv documentation (Astral): the fast Python package manager and resolver this tutorial uses in Step 1 as the recommended alternative to plain pip for the constraint-file install