Base models make raw source data usable without adding interpretation. A base model should be readable by anyone unfamiliar with the business domain — if it requires explanation, it is doing too much.
The dbt Three-Layer Architecture makes base the first modeled layer in the lineage: source → base → intermediate → marts. Base models act as a contract between your dbt project and the raw data. When a source schema changes, only base models break — not everything downstream. When you need to debug a calculation, tracing it back through intermediate to a trustworthy base layer is straightforward.
What Belongs in Base Models
Renaming columns to your naming standards. Source systems use inconsistent naming. Your project should have one convention. If you use the double-underscore pattern, user_id becomes customer__id, created_at becomes order__created_at. This is mechanical work, not interpretation.
Casting data types. Strings to timestamps, integers to booleans, varchars to proper numeric types. Source systems are often sloppy about types. Fix that here so downstream models can trust the data types they receive.
Deduplicating source records. Many ETL tools produce duplicate rows — CDC replays, overlapping incremental loads, retry logic. Use ROW_NUMBER() with QUALIFY (on BigQuery) to keep only the most recent version of each record. This is a window function pattern you’ll use constantly.
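In isolation, that pattern looks roughly like this. It mirrors the full example later in this section, and the source and column names are borrowed from it:

SELECT
    id,
    user_id,
    _loaded_at
FROM {{ source('shopify', 'orders') }}
-- BigQuery expects a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
WHERE TRUE
-- keep only the most recently loaded version of each order
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id
    ORDER BY _loaded_at DESC
) = 1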
Unnesting nested or repeated fields. Common with GA4 event data, JSON payloads, and API responses that store arrays inside rows. Flatten these in base so downstream models work with standard tabular data.
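As a sketch, flattening GA4-style event parameters on BigQuery could look like the following; the ga4 source name and the shape of the event_params array are assumptions based on the standard GA4 export schema:

SELECT
    event_date,
    event_name,
    -- each element of the event_params array becomes its own row
    param.key AS param_key,
    param.value.string_value AS param_string_value
FROM {{ source('ga4', 'events') }}
CROSS JOIN UNNEST(event_params) AS param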
Filtering out invalid records. Test data, soft-deleted rows, records that should never have been loaded. Remove them here so no downstream model has to remember to filter them out.
What Doesn’t Belong
Joins. Base models have a 1-to-1 relationship with a source table. One source, one base model. Joins are an intermediate-layer concern.
Business logic or calculations. No CASE WHEN statements categorizing orders. No margin calculations. No customer tiers. If you’re interpreting the data, you’re in the wrong layer. Move it to intermediate.
Aggregations of any kind. A base model's final SELECT never uses GROUP BY. Every row in the source should produce at most one row in the base model (fewer only when you're filtering or deduplicating).
The One Exception to No-Joins
Some ETL tools split what’s logically one API endpoint into multiple tables. If your source has orders and order_metadata that should have always been one table, joining them in base is fine. You’re correcting an ETL artifact, not adding business logic.
The test: would the source system’s API return these as one object? If yes, merging them in base is acceptable. If no — if you’re joining orders to customers, for instance — that belongs in intermediate.
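A sketch of that exception, assuming the ETL tool split one orders endpoint into orders and order_metadata tables sharing the same key (the metadata column names here are illustrative):

WITH orders AS (
    SELECT id, user_id, created_at, total, status
    FROM {{ source('shopify', 'orders') }}
),

order_metadata AS (
    SELECT order_id, channel, notes
    FROM {{ source('shopify', 'order_metadata') }}
)

-- reassemble what the source API treats as one object; no business logic added
SELECT
    orders.id,
    orders.user_id,
    orders.created_at,
    orders.total,
    orders.status,
    order_metadata.channel,
    order_metadata.notes
FROM orders
LEFT JOIN order_metadata
    ON orders.id = order_metadata.order_id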
A Typical Base Model
This example shows all the standard patterns: selecting from a source, deduplicating, renaming, casting, and filtering.
{{ config(
    materialized='table',
    tags=['base', 'shopify']
) }}

WITH source AS (
    SELECT
        id,
        user_id,
        created_at,
        total,
        status,
        is_deleted,
        _loaded_at
    FROM {{ source('shopify', 'orders') }}
),
deduplicated AS (
    SELECT
        id,
        user_id,
        created_at,
        total,
        status,
        is_deleted,
        _loaded_at
    FROM source
    -- WHERE TRUE satisfies BigQuery's requirement that QUALIFY be paired
    -- with a WHERE, GROUP BY, or HAVING clause
    WHERE TRUE
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY id
        ORDER BY _loaded_at DESC
    ) = 1
),
renamed AS (
    SELECT
        id AS order__id,
        user_id AS customer__id,
        created_at AS order__created_at,
        CAST(total AS FLOAT64) AS order__amount_usd,
        status AS order__status,
        CAST(is_deleted AS BOOL) AS order__is_deleted,
        _loaded_at
    FROM deduplicated
)

SELECT
    order__id,
    customer__id,
    order__created_at,
    order__amount_usd,
    order__status,
    order__is_deleted,
    _loaded_at
FROM renamed
WHERE order__is_deleted = FALSE

Notice the CTE structure: source → deduplicated → renamed → final SELECT. This is a pattern, not a mandate, but it keeps every base model readable and consistent across your project. The final WHERE clause filters out soft-deleted records — you never want those downstream.
There’s no business logic here. No margin calculations, no order categorization, no customer segmentation. Just clean, consistently named, correctly typed data.
Base Model CTE Conventions
A consistent CTE structure across all base models makes the project navigable:
- source — Select from the {{ source() }} macro. List columns explicitly rather than using SELECT *, so schema changes don't silently add unwanted columns.
- deduplicated — Remove duplicate records using ROW_NUMBER() partitioned by the primary key, ordered by a metadata timestamp like _loaded_at.
- renamed — Rename columns to project conventions and cast data types.
- Final SELECT — Apply any filters (soft deletes, test data) and select the final column list.
Not every base model needs all four steps. A source with no duplicates skips the deduplication CTE. A source with clean column names might combine renaming into the source CTE. But when the steps are present, they should follow this order.
Materialization
Base models should almost always be materialized as table. The dbt Three-Layer Architecture benefits from base models being queryable for debugging — if you materialize as view, every query against the base layer recomputes the transformation from raw source. If you materialize as ephemeral, the model doesn't exist in your warehouse at all, making debugging impossible.
The exception: very high-volume sources (GA4 events, clickstream data) where base models handle billions of rows. These benefit from incremental materialization to avoid rebuilding the entire table on every run. Configure this at the folder level in dbt_project.yml rather than per-model.
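A sketch of that folder-level override in dbt_project.yml, assuming a project named analytics with high-volume GA4 base models under models/base/ga4:

models:
  analytics:
    base:
      +materialized: table
      ga4:
        # assumption: billions-of-rows sources live in this folder
        +materialized: incremental

The models in that folder still need their own incremental filters (dbt's is_incremental() macro) in the SQL; the folder-level config only changes how they are built.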
Complexity Indicator
If a base model needs inline comments to explain what it is doing, it is probably too complex. Base models should be mechanical enough that any developer can read the SQL and understand the transformation without explanation. Business logic — anything that categorizes, scores, or interprets data — belongs in intermediate, not here.