Base models make raw source data usable without adding interpretation. A base model should be readable by anyone unfamiliar with the business domain — if it requires explanation, it is doing too much.
The dbt Three-Layer Architecture makes base the first modeled layer in the lineage: source → base → intermediate → marts. Base models act as a contract between your dbt project and the raw data. When a source schema changes, only base models break — not everything downstream. When you need to debug a calculation, tracing it back through intermediate to a trustworthy base layer is straightforward.
What Belongs in Base Models
Renaming columns to your naming standards. Source systems use inconsistent naming. Your project should have one convention. If you use the double-underscore pattern, user_id becomes customer__id, created_at becomes order__created_at. This is mechanical work, not interpretation.
Casting data types. Strings to timestamps, integers to booleans, varchars to proper numeric types. Source systems are often sloppy about types. Fix that here so downstream models can trust the data types they receive.
Deduplicating source records. Many ETL tools produce duplicate rows — CDC replays, overlapping incremental loads, retry logic. Use ROW_NUMBER() with QUALIFY (on BigQuery) to keep only the most recent version of each record. This is a window function pattern you’ll use constantly.
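In isolation, that pattern looks roughly like this. It mirrors the full example later in this section, and the source and column names are borrowed from it:

SELECT
    id,
    user_id,
    _loaded_at
FROM {{ source('shopify', 'orders') }}
-- BigQuery expects a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
WHERE TRUE
-- keep only the most recently loaded version of each order
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id
    ORDER BY _loaded_at DESC
) = 1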
Unnesting nested or repeated fields. Common with GA4 event data, JSON payloads, and API responses that store arrays inside rows. Flatten these in base so downstream models work with standard tabular data.
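As a sketch, flattening GA4-style event parameters on BigQuery could look like the following; the ga4 source name and the shape of the event_params array are assumptions based on the standard GA4 export schema:

SELECT
    event_date,
    event_name,
    -- each element of the event_params array becomes its own row
    param.key AS param_key,
    param.value.string_value AS param_string_value
FROM {{ source('ga4', 'events') }}
CROSS JOIN UNNEST(event_params) AS param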
Filtering out invalid records. Test data, soft-deleted rows, records that should never have been loaded. Remove them here so no downstream model has to remember to filter them out.
What Doesn’t Belong
Joins. Base models have a 1-to-1 relationship with a source table. One source, one base model. Joins are an intermediate-layer concern.
Business logic or calculations. No CASE WHEN statements categorizing orders. No margin calculations. No customer tiers. If you’re interpreting the data, you’re in the wrong layer. Move it to intermediate.
Aggregations of any kind. A base model's final SELECT never uses GROUP BY. Every row in the source should produce at most one row in the base model (fewer only when you're filtering or deduplicating).
The One Exception to No-Joins
Some ETL tools split what’s logically one API endpoint into multiple tables. If your source has orders and order_metadata that should have always been one table, joining them in base is fine. You’re correcting an ETL artifact, not adding business logic.
The test: would the source system’s API return these as one object? If yes, merging them in base is acceptable. If no — if you’re joining orders to customers, for instance — that belongs in intermediate.
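A sketch of that exception, assuming the ETL tool split one orders endpoint into orders and order_metadata tables sharing the same key (the metadata column names here are illustrative):

WITH orders AS (
    SELECT id, user_id, created_at, total, status
    FROM {{ source('shopify', 'orders') }}
),

order_metadata AS (
    SELECT order_id, channel, notes
    FROM {{ source('shopify', 'order_metadata') }}
)

-- reassemble what the source API treats as one object; no business logic added
SELECT
    orders.id,
    orders.user_id,
    orders.created_at,
    orders.total,
    orders.status,
    order_metadata.channel,
    order_metadata.notes
FROM orders
LEFT JOIN order_metadata
    ON orders.id = order_metadata.order_id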
A Typical Base Model
This example shows all the standard patterns: selecting from a source, deduplicating, renaming, casting, and filtering.
{{ config(
    materialized='table',
    tags=['base', 'shopify']
) }}

WITH source AS (
    SELECT
        id,
        user_id,
        created_at,
        total,
        status,
        is_deleted,
        _loaded_at
    FROM {{ source('shopify', 'orders') }}
),
deduplicated AS (
    SELECT
        id,
        user_id,
        created_at,
        total,
        status,
        is_deleted,
        _loaded_at
    FROM source
    -- WHERE TRUE satisfies BigQuery's requirement that QUALIFY be paired
    -- with a WHERE, GROUP BY, or HAVING clause
    WHERE TRUE
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY id
        ORDER BY _loaded_at DESC
    ) = 1
),
renamed AS (
    SELECT
        id AS order__id,
        user_id AS customer__id,
        created_at AS order__created_at,
        CAST(total AS FLOAT64) AS order__amount_usd,
        status AS order__status,
        CAST(is_deleted AS BOOL) AS order__is_deleted,
        _loaded_at
    FROM deduplicated
)

SELECT
    order__id,
    customer__id,
    order__created_at,
    order__amount_usd,
    order__status,
    order__is_deleted,
    _loaded_at
FROM renamed
WHERE order__is_deleted = FALSE

Notice the CTE structure: source → deduplicated → renamed → final SELECT. This is a pattern, not a mandate, but it keeps every base model readable and consistent across your project. The final WHERE clause filters out soft-deleted records — you never want those downstream.
There’s no business logic here. No margin calculations, no order categorization, no customer segmentation. Just clean, consistently named, correctly typed data.
Base Model CTE Conventions
A consistent CTE structure across all base models makes the project navigable:
- source — Select from the {{ source() }} macro. List columns explicitly rather than using SELECT *, so schema changes don't silently add unwanted columns.
- deduplicated — Remove duplicate records using ROW_NUMBER() partitioned by the primary key, ordered by a metadata timestamp like _loaded_at.
- renamed — Rename columns to project conventions and cast data types.
- Final SELECT — Apply any filters (soft deletes, test data) and select the final column list.
Not every base model needs all four steps. A source with no duplicates skips the deduplication CTE. A source with clean column names might combine renaming into the source CTE. But when the steps are present, they should follow this order.
Materialization
Base models should almost always be materialized as table. The dbt Three-Layer Architecture benefits from base models being queryable for debugging — if you materialize as view, every query against the base layer recomputes the transformation from raw source. If you materialize as ephemeral, the model doesn't exist in your warehouse at all, making debugging impossible.
The exception: very high-volume sources (GA4 events, clickstream data) where base models handle billions of rows. These benefit from incremental materialization to avoid rebuilding the entire table on every run. Configure this at the folder level in dbt_project.yml rather than per-model.
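A sketch of that folder-level override in dbt_project.yml, assuming a project named analytics with high-volume GA4 base models under models/base/ga4:

models:
  analytics:
    base:
      +materialized: table
      ga4:
        # assumption: billions-of-rows sources live in this folder
        +materialized: incremental

The models in that folder still need their own incremental filters (dbt's is_incremental() macro) in the SQL; the folder-level config only changes how they are built.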
Complexity Indicator
If a base model needs inline comments to explain what it is doing, it is probably too complex. Base models should be mechanical enough that any developer can read the SQL and understand the transformation without explanation. Business logic — anything that categorizes, scores, or interprets data — belongs in intermediate, not here.