Note

dbt Project Structure and Naming

How to organize a dbt project — folder structure, model naming conventions, layer responsibilities, and dbt_project.yml configuration patterns

Planted
dbt, data modeling, data engineering

A dbt project’s folder structure is one of the first decisions made when starting a project and one of the hardest to change later. The notes below describe folder layout, model naming conventions, layer responsibilities, and dbt_project.yml configuration patterns.

Directory Layout

The models/ directory uses three top-level folders that mirror the three-layer architecture: base/, intermediate/, and marts/. The naming is intentional: alphabetical order matches lineage order, so the folder listing reads top to bottom in the same order that data flows through the DAG.

my_project/
├── dbt_project.yml
├── packages.yml
├── macros/
│   ├── _macros.yml
│   ├── generate_schema_name.sql
│   └── marketing/
│       ├── channel_grouping.sql
│       └── attribution_weight.sql
├── models/
│   ├── base/
│   │   ├── stripe/
│   │   │   ├── _stripe__sources.yml
│   │   │   ├── _stripe__models.yml
│   │   │   ├── base__stripe__payment.sql
│   │   │   └── base__stripe__customer.sql
│   │   └── ga4/
│   │       ├── _ga4__sources.yml
│   │       ├── _ga4__models.yml
│   │       └── base__ga4__event.sql
│   ├── intermediate/
│   │   ├── _int__models.yml
│   │   ├── session/
│   │   │   ├── int__session.sql
│   │   │   └── int__session__session_lj_conversion.sql
│   │   └── customer/
│   │       └── int__customer__customer_lj_order.sql
│   └── marts/
│       ├── finance/
│       │   ├── _finance__models.yml
│       │   └── mrt__finance__order.sql
│       └── marketing/
│           ├── _marketing__models.yml
│           ├── mrt__marketing__session.sql
│           └── mrt__marketing__campaign_performance.sql
├── seeds/
│   ├── _seeds.yml
│   └── channel_mapping.csv
├── snapshots/
│   └── snap__customer.sql
└── tests/
    └── assert_attribution_sums_to_one.sql

Each layer uses a different organizing principle:

  • Base is organized by source system (stripe/, ga4/, hubspot/). Name folders after the source, not the loader — use stripe/, not fivetran/. Your loader might change; your source system is more stable.
  • Intermediate is organized by entity (session/, customer/). Add subfolders when you have 3+ models for the same entity. Never organize by business domain here — int__customer__customer_lj_order serves both finance and marketing.
  • Marts is organized by business domain (finance/, marketing/). This is where consumers look for data, and they think in business terms, not source systems.

Keep folder depth at three levels or fewer. Deep nesting like models/staging/external/stripe/payments/v2/ is a navigation nightmare.

The Double Underscore Naming Convention

The double underscore (__) creates unambiguous visual separation between components of a model name. Compare:

  • base__google_analytics__campaign — clearly base layer, google_analytics source, campaign entity
  • base_google_analytics_campaign — is this google + analytics_campaign? or google_analytics + campaign?

With multi-word source or entity names, the ambiguity compounds. Double underscores eliminate it.

Naming by Layer

Layer                    Pattern                                    Example
Base                     base__[source]__[entity]                   base__stripe__payment
Intermediate (pure)      int__[entity]                              int__session
Intermediate (enriched)  int__[entity]__[entity1]_[join]_[entity2]  int__customer__customer_lj_order
Marts                    mrt__[department]__[entity]                mrt__marketing__campaign_performance
Snapshots                snap__[entity]                             snap__customer

Base models have a 1-to-1 relationship with a source table. The name encodes the source system and the entity: base__ga4__event, base__stripe__customer.

Intermediate pure models (int__session, int__customer) apply business logic to a single entity — sessionization, deduplication, complex calculations. Only create these when you are adding value beyond the base model.

Intermediate enriched models encode join information directly in the name. int__customer__customer_lj_order tells you: customer grain, LEFT JOIN to order. The join abbreviations are lj (LEFT JOIN), ij (INNER JOIN), cj (CROSS JOIN). For multiple joins, chain them: int__customer__customer_lj_order_lj_session. Verbose, but completely self-documenting — you know the grain, the joined entities, and the join types without opening the SQL.

Mart models include the business domain: mrt__marketing__session, mrt__finance__revenue. Marketing doesn’t care that session data comes from GA4; they care that it’s marketing data.

Singular Entity Names

Use singular names: customer, order, session, campaign. Each row represents one instance of the entity. This also keeps naming consistent across layers: base__stripe__customer, int__customer, and mrt__finance__customer all refer to the same entity.

YAML Organization

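Use the per-directory pattern: keep each folder's YAML in that folder, in files prefixed with an underscore so they sort to the top. Most folders need a single models file; base folders also carry a sources file.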

base/stripe/
├── _stripe__sources.yml # Source definitions
├── _stripe__models.yml # Model configs, tests, docs
├── base__stripe__payment.sql
└── base__stripe__customer.sql

The underscore prefix makes YAML files appear before SQL files. Including the directory name (_stripe__models rather than _models) speeds up fuzzy-finding in editors.

Keep source definitions and model definitions in separate files. _stripe__sources.yml holds sources: blocks, freshness tests, and source documentation. _stripe__models.yml holds model configurations, column tests, and descriptions. Mixing them creates confusion.
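As a rough sketch of that split, the two files for the stripe/ folder might look like the following. The raw schema name, the _loaded_at column, the freshness thresholds, and the payment_id column are placeholders assumed for illustration, not values from a real project.

```yaml
# models/base/stripe/_stripe__sources.yml
version: 2

sources:
  - name: stripe
    schema: raw_stripe              # placeholder: wherever the loader lands raw data
    loaded_at_field: _loaded_at     # placeholder: loader timestamp column
    # Recent dbt releases prefer nesting freshness settings under config:,
    # so check the docs for the version you run.
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payment
      - name: customer
```

```yaml
# models/base/stripe/_stripe__models.yml
version: 2

models:
  - name: base__stripe__payment
    description: One row per Stripe payment, renamed and cast from the raw table.
    columns:
      - name: payment_id            # placeholder primary key
        description: Primary key of the payment.
  - name: base__stripe__customer
    description: One row per Stripe customer.
```

With this layout, a change to how the source is loaded touches only the sources file, and a change to model documentation touches only the models file.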

Never use a single monolithic schema.yml for the whole project. A 2000-line YAML file is hard to search, hard to maintain, and a constant source of merge conflicts.

dbt_project.yml Configuration

The project file controls defaults. Set materialization to table globally, then override where needed:

name: my_project
version: '1.0.0'

vars:
  session_timeout_minutes: 30

models:
  my_project:
    +materialized: table
    base:
      +schema: base
      ga4:
        +materialized: incremental
        +incremental_strategy: insert_overwrite
    intermediate:
      +schema: intermediate
    marts:
      +schema: marts
      marketing:
        +group: marketing
        +access: public

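This configuration does several things. The +schema property assigns each layer its own custom schema (base, intermediate, marts), making it clear in the warehouse which layer a table belongs to. With dbt's default generate_schema_name macro the custom schema is appended to the target schema (<target_schema>_base and so on); the generate_schema_name.sql override in macros/ is where you change that if you want the bare layer names. High-volume sources like GA4 override the default to use incremental materialization; the insert_overwrite strategy shown here is adapter-dependent, so confirm your warehouse supports it. Groups and access modifiers document ownership and enforce boundaries between domains.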
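The +group setting only resolves if a group with that name is defined in one of the project's YAML files. A minimal sketch, assuming the definition lives in the marketing folder's existing YAML file and using a placeholder owner:

```yaml
# models/marts/marketing/_marketing__models.yml
version: 2

groups:
  - name: marketing
    owner:
      name: Marketing Analytics             # placeholder team name
      email: marketing-data@example.com     # placeholder address

models:
  - name: mrt__marketing__session
    description: One row per session, prepared for marketing reporting.
  - name: mrt__marketing__campaign_performance
    description: Campaign-level performance metrics.
```

Because the folder is configured with +access: public in dbt_project.yml, models outside the marketing group can still ref these marts. Switching a model to private would restrict references to members of the group.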

Tables everywhere is the default recommended here. Storage is cheap; debugging visibility is not. Views recompute on every query, and an upstream schema change breaks them immediately rather than at the next run. Ephemeral models are never materialized in the warehouse, so you cannot query them directly while debugging. Reserve incremental for tables large enough that full rebuilds become slow or costly (typically many millions of rows), and view for the rare case where data must be fresh within minutes.
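When a single model needs one of those exceptions, the override can sit next to the model's documentation instead of in dbt_project.yml. A small sketch, assuming (purely for illustration) that mrt__marketing__session is the model that must stay fresh within minutes:

```yaml
# models/marts/marketing/_marketing__models.yml
version: 2

models:
  - name: mrt__marketing__session
    config:
      materialized: view    # overrides the project-wide table default for this model only
```

Model-level config takes precedence over the folder-level defaults in dbt_project.yml, so the rest of the marts layer stays materialized as tables.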

Macros, Seeds, Snapshots, and Tests

Macros group by domain or function. Override macros (generate_schema_name.sql) live at the macros/ root. Utility macros go in utils/. Domain-specific macros go in subfolders (marketing/, finance/). One macro per file, filename matching macro name — when you need channel_grouping, you know it lives in channel_grouping.sql. Document every macro in _macros.yml with purpose, arguments, and a usage example.
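A macro entry in _macros.yml could look roughly like this. The argument names and the described behaviour of channel_grouping are assumptions made for the example; the real macro's signature is not given in these notes.

```yaml
# macros/_macros.yml
version: 2

macros:
  - name: channel_grouping
    description: >
      Maps a traffic source and medium to a marketing channel.
      Usage: call channel_grouping('source', 'medium') inside a select
      statement and alias the result as channel.
    arguments:
      - name: source_column       # hypothetical argument
        type: string
        description: Name of the column holding the traffic source.
      - name: medium_column       # hypothetical argument
        type: string
        description: Name of the column holding the traffic medium.
```

The usage example is written as plain text rather than wrapped in Jinja braces, because dbt renders description fields and would otherwise try to evaluate the call.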

Seeds are CSV files for static lookup tables that don’t exist in any source system: UTM-to-channel mappings, country codes, internal IP addresses to exclude. Don’t use seeds for loading actual data or large datasets.
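Seeds can be documented and tested like models. A brief sketch for the channel_mapping seed, assuming hypothetical utm_source, utm_medium, and channel columns:

```yaml
# seeds/_seeds.yml
version: 2

seeds:
  - name: channel_mapping
    description: Static lookup from UTM source and medium to marketing channel.
    columns:
      - name: utm_source          # hypothetical column
        data_tests:               # use `tests:` on dbt versions before 1.8
          - not_null
      - name: utm_medium          # hypothetical column
      - name: channel             # hypothetical column
        data_tests:
          - not_null
```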

Snapshots create Type 2 slowly changing dimension records. Name them snap__[entity]. Since dbt 1.9, you can define snapshots in YAML instead of SQL.
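A YAML-defined snapshot might look like the sketch below. It assumes dbt 1.9 or later, and the customer_id and updated_at columns are placeholders for whatever key and change timestamp the base model actually exposes.

```yaml
# snapshots/snap__customer.yml
snapshots:
  - name: snap__customer
    relation: ref('base__stripe__customer')   # model (or source) to snapshot
    config:
      unique_key: customer_id     # placeholder primary key
      strategy: timestamp
      updated_at: updated_at      # placeholder last-modified column
```

The timestamp strategy needs a reliable last-modified column. Where none exists, the check strategy with an explicit check_cols list is the usual alternative.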

Tests split into generic tests (declared in YAML alongside models) and singular tests (SQL files in tests/). At minimum, test every primary key for unique and not_null. For a deeper look at the full range of test types — unit tests, contracts, dbt-expectations — see the dbt testing taxonomy.
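The minimum primary key check is a two-line addition per model. A sketch for the finance mart, assuming a hypothetical order_id key:

```yaml
# models/marts/finance/_finance__models.yml
version: 2

models:
  - name: mrt__finance__order
    description: One row per order, prepared for finance reporting.
    columns:
      - name: order_id            # hypothetical primary key
        description: Primary key of the order.
        data_tests:               # use `tests:` on dbt versions before 1.8
          - unique
          - not_null
```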

Enforcing Conventions

Conventions only work if they are enforced. The dbt-project-evaluator package audits your project against best practices automatically: missing primary key tests, models without descriptions, direct source references in marts, naming violations. Add it to packages.yml and run it in CI. You can configure it to match your conventions (for example, base__ prefixes instead of the default stg_).
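A possible setup is sketched below. The version range is an assumption, so pin to the release currently listed on the dbt package hub. The variable names follow the package's documented naming-convention overrides as best recalled here; verify them against the package documentation before relying on them.

```yaml
# packages.yml
packages:
  - package: dbt-labs/dbt_project_evaluator
    version: [">=1.0.0", "<2.0.0"]    # assumed range, check the package hub
```

```yaml
# dbt_project.yml (excerpt)
vars:
  dbt_project_evaluator:
    # Point the evaluator's "staging" model type at the base layer used here.
    staging_folder_name: 'base'
    staging_prefixes: ['base__']
    intermediate_prefixes: ['int__']
    marts_prefixes: ['mrt__']
```

After dbt deps, running dbt build --select package:dbt_project_evaluator in CI executes the package's checks alongside the rest of the pipeline.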

For teams using Claude Code, documenting your naming conventions in CLAUDE.md at the project root gives the AI persistent memory of your structure. This prevents it from generating stg_ prefixes when your project uses base__, or creating models in the wrong directory.

Common Structural Mistakes

Organizing base by loader instead of source. fivetran/ and airbyte/ are implementation details. stripe/ and ga4/ are actual data origins.

Business logic in base models. If you see CASE statements with business rules in a base model, that logic belongs in intermediate. Base should be mechanical: rename, cast, filter, deduplicate.

Skipping intermediate. Marts with 10+ JOINs and duplicated logic across multiple models are a sign you need to centralize shared joins and transformations in intermediate.

Over-nesting folders. If you have to click through five directories to find a model, your structure is working against you.

Ephemeral everything. You cannot SELECT * FROM int__session LIMIT 100 if it’s ephemeral. Tables are cheap. Debugging visibility is priceless.

Monolithic YAML. Keep YAML files scoped to a single directory. Always.