dbt SQL Transformation Tutorial: From Raw Warehouse to Clean Marts
A SQL-fluent analyst's walkthrough of dbt-core 1.11: install, project structure, sources, models, seeds, tests, docs, macros, and materialisations end-to-end.
What you’ll need
dbt is the SQL transformation layer that sits between raw warehouse tables and the clean marts your dashboards and ML pipelines actually read from. You write SELECT statements; dbt handles dependency resolution, materialisation strategy, tests, and documentation. This tutorial walks a SQL-fluent analyst from pip install to a working three-layer project (staging, intermediate, marts) with sources, seeds, tests, macros, and docs, using dbt-core 1.11.10, the current stable release dated 14 May 2026. 1
The cited dbt documentation frames a working project as six discrete pieces: project structure, sources, models with materialisations, seeds, tests, and Jinja macros. We build each piece in order against DuckDB (the easiest adapter to spin up locally because it requires no separate database server) and call out the swaps you’d make against Postgres, Snowflake, or BigQuery.
Prerequisites:
- Python 3.9 or newer for the dbt-core package and the adapter.
- Working SQL (joins, aggregates, window functions, CTEs).
- A virtualenv. We use python -m venv; uv and Conda work identically.
- A warehouse. This tutorial uses DuckDB so you can run the whole thing offline; the same project structure works against Postgres, Snowflake, BigQuery, Redshift, and the other adapters listed in dbt’s connect-to-adapters docs. 2
Step 1: install dbt-core and an adapter
dbt-core is installed alongside an adapter package; the adapter holds the warehouse-specific connection logic and SQL dialect translations. The dbt docs document the pip install path explicitly: install dbt-core plus exactly one adapter for the warehouse you’re targeting. 3
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install dbt-core dbt-duckdb
dbt --version
Substitute the adapter for your warehouse:
- pip install dbt-postgres for Postgres
- pip install dbt-snowflake for Snowflake
- pip install dbt-bigquery for BigQuery
- pip install dbt-redshift for Redshift
The adapter packages are listed and maintained on the dbt Labs adapters page. 4 dbt --version should print the dbt-core version and the installed adapter version; mismatched majors here are the most common source of “command works in CI but not locally” tickets, so confirm both lines before moving on.
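If you want to script that check, here is a minimal sketch using only the Python standard library. The helper names (`major_minor`, `installed_version`, `majors_match`) are our own, not part of dbt, and comparing only the major version is an assumption about what counts as a mismatch.

```python
from importlib import metadata
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a version string such as '1.11.10'."""
    major, minor, *_ = version.split(".")
    return int(major), int(minor)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def majors_match(core: str, adapter: str) -> bool:
    """True when both version strings share the same major version."""
    return major_minor(core)[0] == major_minor(adapter)[0]


# Report what is installed in the active virtualenv.
for name in ("dbt-core", "dbt-duckdb"):
    print(name, installed_version(name) or "not installed")
```

Run it inside the same virtualenv you installed dbt into; a different interpreter will report different packages.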
Step 2: initialise a project
dbt init analytics
cd analytics
dbt init is interactive: it asks for a project name and which adapter profile to scaffold. Pick duckdb for the rest of this tutorial. The command writes a project skeleton with the following layout:
analytics/
  dbt_project.yml   # project config: name, version, paths, model defaults
  models/           # SQL files; one per model
    example/        # the starter models init writes; delete after Step 4
  seeds/            # CSV files materialised as warehouse tables
  snapshots/        # slowly-changing-dimension snapshots (out of scope here)
  macros/           # reusable Jinja
  tests/            # custom singular tests
  analyses/         # one-off analytical SQL, not built by `dbt run`
  README.md
Set up the connection profile dbt uses to find your warehouse. Profiles live in ~/.dbt/profiles.yml (not in the project) so credentials are never committed to version control. The DuckDB profile is only a few lines:
analytics:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: 'dev.duckdb'
      threads: 4
For Postgres or Snowflake, the profile shape is the same but the fields differ (host, port, user, password for Postgres; account, warehouse, role, database, schema for Snowflake). The dbt adapter docs list every field for every supported warehouse. 5
Test the connection.
dbt debug
A clean dbt debug ends with All checks passed!. If it doesn’t, the printed error tells you whether the problem is the profile file, the credentials, or the adapter install.
Step 3: declare sources
A source is a raw warehouse table dbt didn’t create but knows about. Declaring sources gives you three things: a place to attach freshness checks, a place to attach tests against raw data, and a source() Jinja function that lets downstream models reference raw tables without hard-coding fully-qualified names. 6
Create models/staging/_sources.yml:
version: 2
sources:
  - name: raw
    description: "Raw landed data from upstream loaders."
    schema: raw
    tables:
      - name: orders
        description: "One row per order at the moment of checkout."
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
          - name: customer_id
            tests:
              - not_null
      - name: customers
        description: "One row per customer record."
        columns:
          - name: customer_id
            tests:
              - unique
              - not_null
For the tutorial to run end-to-end, seed two sample tables into the raw schema. With DuckDB the simplest path is a seed-style CSV (covered in Step 6) or a one-off CREATE TABLE against the dev.duckdb file using the duckdb CLI. With a hosted warehouse this is the upstream loader’s job.
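As a starting point for that sample data, the sketch below builds two small CSV payloads in memory. The column names are the ones the staging models in Step 4 select from; every value in the rows is made up for illustration, and the helper name `to_csv` is our own.

```python
import csv
import io

# Column names taken from what stg_orders and stg_customers select.
ORDERS_COLUMNS = [
    "order_id", "customer_id", "order_ts", "country_code", "category",
    "unit_price_cents", "quantity", "discount_pct", "payment_method",
]
CUSTOMERS_COLUMNS = [
    "customer_id", "full_name", "email", "signup_ts", "country_code", "segment",
]

# Made-up sample rows; an empty string stands in for a missing discount.
SAMPLE_ORDERS = [
    [1, 101, "2026-01-05 09:30:00", "US", "books", 1999, 3, 10, "card"],
    [2, 102, "2026-01-06 14:12:00", "GB", "apparel", 4500, 1, "", "paypal"],
]
SAMPLE_CUSTOMERS = [
    [101, "Ada Example", "ada@example.com", "2025-11-01 08:00:00", "US", "retail"],
    [102, "Bo Example", "bo@example.com", "2025-12-15 17:45:00", "GB", "wholesale"],
]


def to_csv(columns, rows) -> str:
    """Serialise a header plus rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


orders_csv = to_csv(ORDERS_COLUMNS, SAMPLE_ORDERS)
customers_csv = to_csv(CUSTOMERS_COLUMNS, SAMPLE_CUSTOMERS)
print(orders_csv)
```

Write the two strings to files and load them into the raw schema with whichever tool you prefer for your warehouse.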
Step 4: write your first model
A dbt model is a single SQL file under models/ whose contents are a SELECT statement. dbt wraps the SELECT in a CREATE TABLE AS (or CREATE VIEW AS, depending on materialisation) and writes the result to the warehouse using the model’s file name as the table name. 7
Create models/staging/stg_orders.sql:
with source as (
    select * from {{ source('raw', 'orders') }}
),
renamed as (
    select
        order_id,
        customer_id,
        cast(order_ts as date) as order_date,
        country_code as country,
        category,
        unit_price_cents / 100.0 as unit_price,
        quantity,
        coalesce(discount_pct, 0) as discount_pct,
        payment_method
    from source
)
select * from renamed
Three patterns to internalise. First, {{ source('raw', 'orders') }} resolves at compile time to the fully-qualified raw table name based on the _sources.yml declaration; you never hard-code raw.orders in model SQL. Second, the staging-layer convention is one model per source table that does light cleaning only (rename, cast, coalesce) and nothing else. Third, the CTE-then-final-select pattern keeps logic readable; dbt’s style guide and the broader analytics-engineering community converged on this shape for a reason.
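To make the first pattern concrete, here is a toy Python sketch of compile-time resolution. It is not how dbt is implemented (dbt renders real Jinja and quotes names per adapter); the `SOURCES` lookup and the `render_sources` helper are hypothetical stand-ins for what _sources.yml provides.

```python
import re

# Hypothetical lookup derived from _sources.yml:
# (source name, table name) -> relation name in the warehouse.
SOURCES = {
    ("raw", "orders"): "raw.orders",
    ("raw", "customers"): "raw.customers",
}

# Matches {{ source('name', 'table') }} with flexible whitespace.
SOURCE_CALL = re.compile(
    r"\{\{\s*source\(\s*'([^']+)'\s*,\s*'([^']+)'\s*\)\s*\}\}"
)


def render_sources(sql: str) -> str:
    """Replace each source() call with its declared relation name."""
    def replace(match):
        key = (match.group(1), match.group(2))
        if key not in SOURCES:
            # An undeclared source is an error rather than a silent pass-through.
            raise KeyError(f"source {key} is not declared")
        return SOURCES[key]

    return SOURCE_CALL.sub(replace, sql)


compiled = render_sources("select * from {{ source('raw', 'orders') }}")
print(compiled)  # select * from raw.orders
```

The point of the indirection: if the raw schema moves, you change one YAML entry instead of every model that reads from it.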
Create models/staging/stg_customers.sql:
with source as (
    select * from {{ source('raw', 'customers') }}
),
renamed as (
    select
        customer_id,
        full_name,
        email,
        cast(signup_ts as date) as signup_date,
        country_code as country,
        segment
    from source
)
select * from renamed
Build them.
dbt run --select staging
dbt prints a table summarising what built, in what time, and what materialisation. By default models build as views: fast to create, no warehouse storage cost, recomputed every time they’re queried.
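The "recomputed every time" behaviour is easy to see in a small experiment. The sketch below uses Python's built-in sqlite3 as a stand-in warehouse (an assumption made for portability; it is not DuckDB and not dbt), and the `materialise` helper is a hypothetical simplification of the wrapping dbt does.

```python
import sqlite3


def materialise(conn, name, select_sql, materialized="view"):
    """Wrap a SELECT in CREATE VIEW AS or CREATE TABLE AS."""
    if materialized not in ("view", "table"):
        raise ValueError(f"unsupported materialisation: {materialized}")
    conn.execute(f"drop {materialized} if exists {name}")
    conn.execute(f"create {materialized} {name} as {select_sql}")


conn = sqlite3.connect(":memory:")
conn.execute("create table raw_orders (order_id integer, quantity integer)")
conn.execute("insert into raw_orders values (1, 2)")

select_sql = "select order_id, quantity from raw_orders"
materialise(conn, "stg_orders_view", select_sql, "view")
materialise(conn, "stg_orders_table", select_sql, "table")

# A new raw row arrives after both objects were built.
conn.execute("insert into raw_orders values (2, 5)")

view_rows = conn.execute("select count(*) from stg_orders_view").fetchone()[0]
table_rows = conn.execute("select count(*) from stg_orders_table").fetchone()[0]
print(view_rows, table_rows)  # 2 1
```

The view sees the new row immediately because it re-runs the SELECT on every query; the table stays at its build-time snapshot until the next run.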
Step 5: write a mart model that references staging
A mart model is the consumer-facing layer: denormalised, business-friendly, the thing dashboards and analysts actually query. It references staging models via the ref() function, never via raw table names. ref() is the function that builds dbt’s dependency graph; when dbt sees {{ ref('stg_orders') }} in a mart, it knows the mart depends on stg_orders and orders the build accordingly. 8
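A rough sketch of that ordering step, using only the standard library (graphlib needs Python 3.9+). The regex-based parsing and the `MODELS` dictionary are simplifications of our own; dbt itself renders Jinja rather than pattern-matching text.

```python
import re
from graphlib import TopologicalSorter

# Matches {{ ref('model_name') }} with flexible whitespace.
REF_CALL = re.compile(r"\{\{\s*ref\(\s*'([^']+)'\s*\)\s*\}\}")

# Abbreviated model SQL keyed by model name (illustrative only).
MODELS = {
    "fct_orders": (
        "select * from {{ ref('stg_orders') }} o "
        "left join {{ ref('stg_customers') }} c using (customer_id)"
    ),
    "stg_orders": "select * from {{ source('raw', 'orders') }}",
    "stg_customers": "select * from {{ source('raw', 'customers') }}",
}


def build_order(models):
    """Return model names ordered so every model follows its refs."""
    # Map each model to the set of models it depends on.
    graph = {name: set(REF_CALL.findall(sql)) for name, sql in models.items()}
    return list(TopologicalSorter(graph).static_order())


order = build_order(MODELS)
print(order)
```

Whatever order the files are listed in, both staging models come out ahead of fct_orders, which is the guarantee ref() buys you.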
Create models/marts/fct_orders.sql:
{{ config(materialized='table') }}
with orders as (
    select * from {{ ref('stg_orders') }}
),
customers as (
    select * from {{ ref('stg_customers') }}
),
joined as (
    select
        o.order_id,
        o.order_date,
        o.customer_id,
        c.segment,
        c.country as customer_country,
        o.country as order_country,
        o.category,
        o.unit_price,
        o.quantity,
        o.discount_pct,
        round(o.unit_price * o.quantity * (1 - o.discount_pct / 100.0), 2) as revenue
    from orders o
    left join customers c using (customer_id)
)
select * from joined
The {{ config(materialized='table') }} at the top overrides the project default (view) for this single model. dbt run --select fct_orders will now CREATE TABLE fct_orders AS SELECT ... rather than creating a view.
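A quick worked check of the revenue expression in that model, with made-up numbers. The Python function mirrors the SQL term by term; note that discount_pct is a whole-number percentage (10 means 10%), and that warehouse rounding of decimal types may differ slightly from Python floats.

```python
def revenue(unit_price: float, quantity: int, discount_pct: float) -> float:
    """Mirror of: round(unit_price * quantity * (1 - discount_pct / 100.0), 2)."""
    return round(unit_price * quantity * (1 - discount_pct / 100.0), 2)


# 19.99 * 3 = 59.97, less a 10% discount = 53.973, rounded to 53.97.
print(revenue(19.99, 3, 10))  # 53.97

# No discount: the coalesce(discount_pct, 0) in staging makes this case safe.
print(revenue(45.00, 1, 0))   # 45.0
```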
dbt’s documented materialisations are view, table, incremental, ephemeral, and materialized_view (warehouse-dependent). 9 The choice matrix:
| Materialisation | When to use |
|---|---|
| view | Default. Cheap. Fine for staging and intermediate layers and for marts under ~1M rows. |
| table | Marts that get queried many times per day. Trades build-time storage for query-time speed. |
| incremental | Append-only or slowly-mutating large tables. dbt only processes new rows on each run. |
| ephemeral | Compiled inline as a CTE in downstream models. No warehouse object created. |
| materialized_view | Warehouse-native MV semantics on adapters that support it (for example Postgres, Redshift, and BigQuery; on Snowflake the comparable option is a dynamic table). |
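The incremental row is the least intuitive of the five, so here is a minimal sketch of the idea using sqlite3 as a stand-in warehouse. It is an illustration under our own assumptions (a date high-water mark, a hypothetical `incremental_run` helper), not dbt's implementation; in dbt you would express the filter with the is_incremental() macro and {{ this }}.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table stg_orders (order_id integer, order_date text)")
conn.execute("create table fct_orders (order_id integer, order_date text)")


def incremental_run(conn) -> int:
    """Append only rows newer than the latest order_date already loaded."""
    before = conn.execute("select count(*) from fct_orders").fetchone()[0]
    conn.execute(
        "insert into fct_orders "
        "select order_id, order_date from stg_orders "
        # On the first run the target is empty, so coalesce lets every row through.
        "where order_date > coalesce((select max(order_date) from fct_orders), '')"
    )
    after = conn.execute("select count(*) from fct_orders").fetchone()[0]
    return after - before


conn.executemany(
    "insert into stg_orders values (?, ?)",
    [(1, "2026-01-05"), (2, "2026-01-06")],
)
first = incremental_run(conn)   # loads both existing rows

conn.execute("insert into stg_orders values (3, '2026-01-07')")
second = incremental_run(conn)  # loads only the new row

print(first, second)  # 2 1
```

One caveat with a strict "greater than" filter: a late-arriving row that shares the current maximum date is skipped, which is why real incremental models often reprocess a short lookback window.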
Step 6: seeds
A seed is a CSV file in the seeds/ directory that dbt loads into the warehouse as a table when you run dbt seed. Seeds are for small, mostly-static reference data: country code lookups, segment definitions, currency conversion tables, ML feature flag mappings. The dbt docs are explicit that seeds are not for source data and not for anything that changes often. 10
Create seeds/country_codes.csv:
country_code,country_name,region
IN,India,APAC
US,United States,AMER
GB,United Kingdom,EMEA
DE,Germany,EMEA
JP,Japan,APAC
Build it.
dbt seed
Reference it from a model the same way you reference any other dbt-built object: {{ ref('country_codes') }}. dbt infers column types from the CSV; for explicit control, declare them in seeds/_seeds.yml with a column_types block.
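Conceptually, loading a seed amounts to "read the CSV, create a table, insert the rows". The sketch below shows that with the standard library, using sqlite3 as a stand-in warehouse and typing every column as text; the `load_seed` helper is hypothetical and skips the type inference dbt performs.

```python
import csv
import io
import sqlite3

COUNTRY_CODES_CSV = """country_code,country_name,region
IN,India,APAC
US,United States,AMER
GB,United Kingdom,EMEA
DE,Germany,EMEA
JP,Japan,APAC
"""


def load_seed(conn, table, csv_text) -> int:
    """Recreate `table` from CSV text and return the number of rows loaded."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    columns = ", ".join(f"{name} text" for name in header)
    placeholders = ", ".join("?" for _ in header)
    # Seeds are rebuilt from scratch on each load in this sketch.
    conn.execute(f"drop table if exists {table}")
    conn.execute(f"create table {table} ({columns})")
    conn.executemany(f"insert into {table} values ({placeholders})", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
loaded = load_seed(conn, "country_codes", COUNTRY_CODES_CSV)
region = conn.execute(
    "select region from country_codes where country_code = 'JP'"
).fetchone()[0]
print(loaded, region)  # 5 APAC
```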
Step 7: tests
Tests are SQL assertions about your models. dbt ships four generic tests out of the box (unique, not_null, accepted_values, relationships) and lets you write custom singular tests as .sql files in tests/. 11
Add tests to models/marts/_marts.yml:
version: 2
models:
  - name: fct_orders
    description: "Order fact table; one row per order, joined to customer attributes."
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
      - name: revenue
        tests:
          - not_null
      - name: category
        tests:
          - accepted_values:
              values: ['electronics', 'apparel', 'home', 'books', 'grocery']
Run them.
dbt test --select fct_orders
A failed test prints the number of failing rows and the path to the compiled test SQL, which you can run directly to inspect what didn’t match; with the --store-failures flag, dbt also saves the failing rows to a table in the warehouse. The relationships test is especially valuable: it asserts referential integrity (every customer_id in fct_orders exists in stg_customers) without needing actual foreign key constraints in the warehouse.
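It helps to see that each generic test boils down to a query that returns the offending rows. The sketch below writes three such queries by hand against sqlite3; the SQL is our approximation of the idea, not the exact SQL dbt compiles, and the sample rows are deliberately broken.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    create table stg_customers (customer_id integer);
    create table fct_orders (order_id integer, customer_id integer);
    insert into stg_customers values (101), (102);
    -- order 2 is duplicated, customer 999 does not exist, order 4 has no customer
    insert into fct_orders values (1, 101), (2, 102), (2, 101), (3, 999), (4, null);
    """
)

# Each test is a query whose result set is the set of failures.
TESTS = {
    "unique_order_id": (
        "select order_id from fct_orders "
        "where order_id is not null "
        "group by order_id having count(*) > 1"
    ),
    "not_null_customer_id": (
        "select * from fct_orders where customer_id is null"
    ),
    "relationships_customer_id": (
        "select o.customer_id from fct_orders o "
        "left join stg_customers c on o.customer_id = c.customer_id "
        "where o.customer_id is not null and c.customer_id is null"
    ),
}

failures = {name: len(conn.execute(sql).fetchall()) for name, sql in TESTS.items()}
print(failures)
```

Zero rows means the assertion holds; here each of the three queries returns exactly one row, one per planted defect.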
For tests dbt’s generic set can’t express, write a singular test as a .sql file in tests/ whose query returns failing rows. Example: tests/no_future_orders.sql:
select * from {{ ref('fct_orders') }}
where order_date > current_date
Any row the query returns is a failure. dbt test picks the file up automatically.
Step 8: Jinja macros
Jinja is the templating layer that lets dbt SQL do things plain SQL can’t: looping, conditional logic, parameter substitution, abstraction over warehouse-specific dialects. A macro is a named Jinja function defined in macros/ and called from model SQL with {{ macro_name(args) }}. 12
Create macros/cents_to_currency.sql:
{% macro cents_to_currency(column_name) %}
    round({{ column_name }} / 100.0, 2)
{% endmacro %}
Use it in a model:
select
    order_id,
    {{ cents_to_currency('unit_price_cents') }} as unit_price
from {{ source('raw', 'orders') }}
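A macro is text substitution: it returns a SQL fragment that gets spliced into the model before the warehouse ever sees it. The Python analogue below makes that explicit, then runs the generated fragment against sqlite3 (a stand-in for the warehouse) with a made-up value.

```python
import sqlite3


def cents_to_currency(column_name: str) -> str:
    """Python analogue of the macro: return a SQL fragment, not a value."""
    return f"round({column_name} / 100.0, 2)"


# The fragment is spliced into the surrounding SQL text.
model_sql = (
    "select\n"
    "    order_id,\n"
    f"    {cents_to_currency('unit_price_cents')} as unit_price\n"
    "from raw_orders"
)
print(model_sql)

# Run the compiled SQL to confirm the arithmetic.
conn = sqlite3.connect(":memory:")
conn.execute("create table raw_orders (order_id integer, unit_price_cents integer)")
conn.execute("insert into raw_orders values (1, 1999)")
unit_price = conn.execute(model_sql).fetchone()[1]
print(unit_price)  # 19.99
```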
dbt Labs maintains a utility package called dbt_utils that bundles dozens of useful macros: generate_surrogate_key, date_spine, pivot, deduplicate, union_relations. It is installed separately from dbt-core, via packages.yml:
packages:
  - package: dbt-labs/dbt_utils
    version: ['>=1.3.0', '<2.0.0']
Then run dbt deps to install it, and the macros become available as {{ dbt_utils.generate_surrogate_key(['col_a', 'col_b']) }} (the older surrogate_key name was replaced in dbt_utils 1.0). The dbt_utils package is the single biggest force-multiplier in a real dbt project; check it for a macro before writing your own.
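For readers new to the term, a surrogate key is a single deterministic identifier derived from several columns, typically by hashing them. The sketch below shows the idea in plain Python; the separator and null placeholder are arbitrary choices of ours, so do not expect the output to match what the dbt_utils macro produces in your warehouse.

```python
import hashlib


def surrogate_key(values, null_placeholder="_null_", separator="-") -> str:
    """Hash a list of column values into one stable hex identifier."""
    # Replace NULLs with a placeholder so (1, None) and (1, '') differ.
    parts = [null_placeholder if value is None else str(value) for value in values]
    return hashlib.md5(separator.join(parts).encode("utf-8")).hexdigest()


key_a = surrogate_key([101, "2026-01-05"])
key_b = surrogate_key([101, "2026-01-06"])
print(key_a)
print(key_b)
```

The same inputs always give the same key, which is what lets you join or deduplicate on it across runs.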
Step 9: docs
dbt auto-generates a documentation site from the .yml files you’ve been writing alongside models. The descriptions, columns, tests, and dependency graph render as a navigable HTML site.
dbt docs generate
dbt docs serve
The second command starts a local web server (default port 8080) with the rendered docs. The lineage graph view (the icon in the bottom-right corner of the page) is worth showing to non-engineer stakeholders the first time you ship a dbt project; it makes the staging → intermediate → marts story legible in a way that prose can’t.
The descriptions you wrote in _sources.yml and _marts.yml are what render in the docs site. Treat description fields as the canonical place to document the business meaning of every table and column; the cost of doing it inline at model-write time is far less than the cost of writing a separate data dictionary that drifts.
Step 10: the build command
dbt run builds models. dbt test runs tests. dbt seed loads seeds. dbt snapshot captures snapshots. dbt build does all four in dependency order in a single invocation, which is what you want for any scheduled production run. 13
dbt build
By default, dbt build skips everything downstream of a failed model or test but keeps building unrelated branches of the graph, so a broken upstream model won’t waste compute on dependents that would also fail. For CI, dbt build --fail-fast goes further and stops the whole run at the first failure. For development against a subset of the project, dbt build --select marts+ builds the models under marts and everything downstream of them.
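How a failure propagates through the graph can be sketched in a few lines. This is a toy model of the scheduling logic under our own assumptions (the `DAG` dictionary and `build` function are hypothetical); the fail_fast argument is passed explicitly so both behaviours can be compared side by side.

```python
from graphlib import TopologicalSorter

# node -> set of upstream nodes it depends on
DAG = {
    "stg_orders": set(),
    "stg_customers": set(),
    "country_codes": set(),
    "fct_orders": {"stg_orders", "stg_customers"},
}


def build(dag, failing, fail_fast):
    """Walk the graph in dependency order and record each node's outcome."""
    results = {}
    stopped = False
    for node in TopologicalSorter(dag).static_order():
        upstream_ok = all(results[up] == "success" for up in dag[node])
        if stopped or not upstream_ok:
            results[node] = "skipped"
        elif node in failing:
            results[node] = "error"
            # With fail_fast, nothing after the first error runs at all.
            stopped = fail_fast
        else:
            results[node] = "success"
    return results


lenient = build(DAG, failing={"stg_orders"}, fail_fast=False)
strict = build(DAG, failing={"stg_orders"}, fail_fast=True)
print(lenient)
print(strict)
```

Without fail_fast, only fct_orders is skipped and the unrelated nodes still build; with it, every node scheduled after the failure is skipped too.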
Common pitfalls
Three failures show up repeatedly in dbt projects past the toy-project size:
source() vs ref() confusion. Source-declared raw tables are referenced via source('source_name', 'table_name'); dbt-built models are referenced via ref('model_name'). Bypassing both with a hard-coded table name breaks the dependency graph silently (dbt won’t error, but the run order can be wrong on a fresh build). Lint for it: every model should have zero hard-coded fully-qualified table names.
Mart materialisation drift. Marts default to view, get slow under heavy querying, and the team flips one to table without thinking through partitioning or incremental. Decide materialisation per mart based on row count, update frequency, and query patterns; document the choice in the model’s YAML description so reviewers don’t have to reconstruct the reasoning.
Tests-as-decoration. Adding unique and not_null to columns where the assertions are trivially true is busywork; adding them to columns where the assertions are load-bearing (order_id unique, customer_id not_null, referential integrity from facts to dims) is the entire point. Treat tests as production invariants, not as completeness theatre.
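The lint suggested in the first pitfall can start as a simple regular expression. The sketch below flags schema-qualified names that directly follow FROM or JOIN; it is a heuristic of our own, not a dbt feature, and it will produce false positives on constructs such as extract(year from o.order_date).

```python
import re
from typing import List

# A dotted identifier (schema.table or db.schema.table) right after FROM or JOIN.
HARD_CODED = re.compile(
    r"\b(?:from|join)\s+"
    r"([A-Za-z_]\w*\.[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)?)",
    re.IGNORECASE,
)


def hard_coded_relations(sql: str) -> List[str]:
    """Return qualified table names that bypass source() and ref()."""
    return HARD_CODED.findall(sql)


bad_sql = (
    "select * from raw.orders o "
    "join {{ ref('stg_customers') }} c using (customer_id)"
)
good_sql = "select * from {{ source('raw', 'orders') }}"

print(hard_coded_relations(bad_sql))   # ['raw.orders']
print(hard_coded_relations(good_sql))  # []
```

Run it over every file under models/ in CI and fail the job when the list is non-empty; unqualified names such as CTE aliases pass through untouched.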
Where to go next
The dbt documentation site is the canonical reference; the sections we walked through have deep dives on snapshots (slowly-changing dimensions), exposures (downstream consumers like dashboards), metrics (the semantic-layer surface), and the metadata API for programmatic project introspection. For working analysts, the practical next step is wiring the project into CI: a GitHub Actions or GitLab CI workflow that runs dbt build against a CI warehouse on every PR catches breaking changes before they hit production.
Two ecosystem notes worth tracking. dbt Cloud is the hosted offering from dbt Labs with a web IDE, scheduler, and observability surface; dbt Fusion is the newer engine targeting performance and incremental dbt-core compatibility. Both are worth evaluating once the project has more than a couple of developers committing models. dbt-core itself remains the foundation either way; the project structure, model SQL, and tests you wrote in this tutorial port directly to either hosted surface.
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. dbt-core GitHub releases: most recent stable release is v1.11.10, dated 14 May 2026.
- 2. dbt documentation, connect to adapters: supported warehouses include Postgres, Snowflake, BigQuery, Redshift, DuckDB.
- 3. dbt documentation, install dbt with pip: install dbt-core alongside one adapter package.
- 4. dbt documentation: adapter packages dbt-postgres, dbt-snowflake, dbt-bigquery, dbt-redshift, dbt-duckdb installable from PyPI.
- 5. dbt documentation: adapter-specific profile fields for each supported warehouse.
- 6. dbt documentation: sources are raw warehouse tables declared in .yml; the source() Jinja function references them in model SQL.
- 7. dbt documentation: models are SQL SELECT statements in models/; dbt wraps them in CREATE TABLE AS / CREATE VIEW AS per materialisation.
- 8. dbt documentation: the ref function builds the dependency graph and resolves to the fully-qualified built object name.
- 9. dbt documentation, model configurations: supported materializations are view, table, incremental, ephemeral, materialized_view.
- 10. dbt documentation: seeds are CSV files in seeds/ loaded as warehouse tables via dbt seed; intended for small static reference data.
- 11. dbt documentation: generic tests unique, not_null, accepted_values, relationships; custom singular tests as .sql files in tests/.
- 12. dbt documentation, Jinja and macros: macro definition syntax and the dbt_utils package.
- 13. dbt documentation: dbt build runs models, tests, seeds, and snapshots in dependency order.