Adrienne Vermorel

dbt Project Structure: The Definitive Guide

dbt project structure is one of the first decisions you make when starting a new project, and one of the hardest to change later. Get it wrong and six months from now you’ll be staring at a tangled DAG, wondering which of your 47 base models actually feeds the dashboard that just broke.

The challenge is that much of what you’ll find online is either outdated (referencing patterns that dbt Labs has since walked back) or too theoretical (explaining what to do without explaining why it matters). This guide is different. It’s based on auditing projects with 400+ models and building marketing analytics pipelines from scratch. These patterns have been stress-tested on real client projects.

This article covers a complete, opinionated framework for structuring dbt projects. It starts with foundations for beginners, then takes clear stances on the debates that experienced practitioners actually care about. It goes beyond just the models/ folder to cover macros, tests, seeds, snapshots, and YAML organization. Everything is grounded in practical examples from marketing analytics: GA4, advertising platforms, and attribution modeling.

The thesis is simple: consistency matters more than any specific convention, but you need a convention to be consistent with. The goal is a structure where anyone on your team can find any model in under 10 seconds.

Time to build that structure.


The Three-Layer Architecture: base → intermediate → marts

The foundation of any well-structured dbt project is a clear layered architecture. Data should flow from source-conformed (shaped by external systems you don’t control) to business-conformed (shaped by your organization’s needs and definitions).

The layers I use are base, intermediate, and marts. You’ll see “staging” used instead of “base” in much of the dbt documentation, but I prefer base/intermediate/marts for a simple reason: alphabetical order matches lineage order. When you open your models/ folder, you see base/ first, then intermediate/, then marts/. That’s exactly the order data flows through your project. It’s a small thing, but it reduces cognitive load every single day.

The base layer

The base layer is where source data enters your dbt project. Each base model has a 1-to-1 relationship with a source table. No exceptions.

What belongs in base models:

  • Renaming columns to consistent conventions (snake_case, removing prefixes like c_ or tbl_)
  • Type casting (strings to dates, integers to booleans)
  • Basic unit conversions (cents to dollars, milliseconds to seconds)
  • Timezone conversions
  • Filtering out rows you never want (test data, hard deletes via WHERE NOT _fivetran_deleted)
  • Handling source-specific quirks (unnesting repeated fields, deduplicating on primary key)

What does NOT belong in base models:

  • Joins. Never. If you need data from another source, that’s an intermediate model.
  • Aggregations. Never. Base models should maintain the same grain as the source.
  • Business logic. If you’re writing CASE statements with business rules, you’re in the wrong layer.

The base layer should be mechanical. Someone unfamiliar with your business domain should be able to understand every transformation just by reading the SQL.

Naming convention: base__[source]__[entity]

The double underscore between each component is intentional. It creates unambiguous visual separation. base__google_analytics__event is immediately parseable; base_google_analytics_event could be base_google + analytics_event or base + google_analytics_event.

Folder structure:

models/
└── base/
    ├── stripe/
    │   ├── _stripe__sources.yml
    │   ├── _stripe__models.yml
    │   ├── base__stripe__payment.sql
    │   └── base__stripe__customer.sql
    └── ga4/
        ├── _ga4__sources.yml
        ├── _ga4__models.yml
        └── base__ga4__event.sql

Organize base models by source system, not by business domain or data loader. Use stripe/, not fivetran/ or payment/. The source system is the natural organizing principle at this layer because that’s where the data actually comes from.

Marketing analytics example: A base__ga4__event model takes the raw GA4 export, unnests the event_params repeated field into proper columns, casts event_timestamp from microseconds to a proper timestamp, filters out debug_mode = 1 events, and renames user_pseudo_id to ga_client_id. No sessionization, no attribution, no business logic.
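
Here's a minimal sketch of what that model might look like on BigQuery. The GA4 export columns (event_date, event_timestamp, event_params, user_pseudo_id) are standard, but which parameters you extract and the exact filters are assumptions you'd adapt:

-- base__ga4__event.sql (illustrative sketch, BigQuery syntax)
SELECT
    PARSE_DATE('%Y%m%d', event_date)                         AS event_date,
    TIMESTAMP_MICROS(event_timestamp)                        AS event_timestamp,
    event_name,
    user_pseudo_id                                           AS ga_client_id,
    -- pull selected keys out of the repeated event_params field
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
    (SELECT value.int_value    FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
FROM {{ source('ga4', 'event') }}
-- drop debug traffic
WHERE COALESCE(
        (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'debug_mode'),
        0
      ) != 1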

The intermediate layer

The intermediate layer is where base models combine to form coherent entities. This is where your project’s intellectual work happens.

My opinionated take: intermediate models should be entity-centric, not transformation-centric.

The conventional dbt guidance suggests naming intermediate models after the transformation they perform: int_payments_pivoted_to_order, int_events_sessionized, int_customers_aggregated_by_account. I think this is wrong.

Instead, intermediate models should represent clean, well-defined entities at a specific grain. There are two types:

  1. Pure entity models (int__session, int__customer) when you need to apply business logic to a single base model (deduplication, sessionization, complex calculations)
  2. Enriched entity models (int__customer__customer_lj_order) when you’re joining entities together

The enriched model naming convention is self-documenting: int__[primary_entity]__[entity1]_[join_type]_[entity2]. The join type abbreviations are:

  • lj = LEFT JOIN
  • ij = INNER JOIN
  • cj = CROSS JOIN

For multiple joins, chain them: int__customer__customer_lj_order_lj_session.

Why does this matter?

  1. Self-documenting. You can read the model name and know exactly what’s joined and how. No need to open the SQL to understand the grain and composition.
  2. Reusability. Entity-based models like int__session can be referenced by multiple marts. The enriched versions make the dependencies explicit.
  3. Debuggability. When something breaks, the name tells you where to look. int__customer__customer_lj_order failing? Check the customer-to-order join logic.

What belongs in intermediate models:

  • Joining base models to create enriched entities
  • Sessionization logic (grouping events into sessions)
  • Deduplication across sources (merging customer records from CRM and billing)
  • Identity resolution
  • Complex business logic that shouldn’t clutter marts

Important: Don’t create an int__customer model if it’s identical to base__crm__customer. Only create intermediate models when you’re adding value through joins, business logic, or transformations. Duplicating base models wastes compute and clutters your DAG.

Unlike base models (organized by source) or marts (organized by business domain), intermediate is organized by entity. If you have enough models, you can create subfolders by entity (intermediate/customer/, intermediate/session/) but never by business domain like intermediate/marketing/ or intermediate/finance/. Intermediate models often span domains: int__customer__customer_lj_order serves both finance and marketing. Forcing it into one domain’s folder creates artificial boundaries.

Folder structure:

models/
└── intermediate/
    ├── _int__models.yml
    ├── session/
    │   ├── int__session.sql
    │   └── int__session__session_lj_conversion.sql
    ├── customer/
    │   ├── int__customer__customer_lj_order.sql
    │   └── int__customer__customer_lj_order_lj_session.sql
    ├── int__ad__union.sql
    └── int__attribution_touchpoint.sql

For smaller projects, keep it flat. Add entity subfolders when you have 3+ models for the same entity.

Marketing analytics example: An int__session model takes base__ga4__event, applies your 30-minute session timeout logic, calculates session duration, and assigns each session to a traffic source. Then int__session__session_lj_conversion enriches sessions with conversion data (the name tells you exactly what you’re getting).
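
Here's a minimal sketch of that sessionization logic, assuming base__ga4__event exposes ga_client_id and event_timestamp, and leaving out the traffic-source assignment (BigQuery syntax, using the session_timeout_minutes variable from dbt_project.yml):

-- int__session.sql (illustrative sketch of 30-minute timeout sessionization)
WITH events AS (
    SELECT
        ga_client_id,
        event_timestamp
    FROM {{ ref('base__ga4__event') }}
),

flagged AS (
    SELECT
        ga_client_id,
        event_timestamp,
        -- a new session starts when the gap to the previous event exceeds the timeout
        CASE
            WHEN TIMESTAMP_DIFF(
                     event_timestamp,
                     LAG(event_timestamp) OVER (PARTITION BY ga_client_id ORDER BY event_timestamp),
                     MINUTE
                 ) >= {{ var('session_timeout_minutes', 30) }}
                 OR LAG(event_timestamp) OVER (PARTITION BY ga_client_id ORDER BY event_timestamp) IS NULL
            THEN 1 ELSE 0
        END AS is_new_session
    FROM events
),

numbered AS (
    SELECT
        ga_client_id,
        event_timestamp,
        -- running count of session starts = per-user session number
        SUM(is_new_session) OVER (PARTITION BY ga_client_id ORDER BY event_timestamp) AS session_number
    FROM flagged
)

SELECT
    ga_client_id || '-' || CAST(session_number AS STRING)              AS session_id,
    ga_client_id,
    MIN(event_timestamp)                                               AS session_start_at,
    MAX(event_timestamp)                                               AS session_end_at,
    TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), SECOND) AS session_duration_seconds
FROM numbered
GROUP BY ga_client_id, session_number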

The marts layer

The marts layer produces tables ready for consumption by end users: analysts, dashboards, reverse ETL, ML models.

Another opinionated stance: I don’t use fct_ and dim_ prefixes.

Dimensional modeling (facts and dimensions, star schemas, Kimball methodology) was designed for OLAP cubes and the constraints of 1990s data warehouses. Those constraints no longer exist. Modern BI tools handle wide, denormalized tables perfectly well. Your analysts don’t care whether customer is technically a dimension; they just want to query customer data.

The prefixes add ceremony without adding clarity. fct_order tells you nothing that order doesn’t. And the mental overhead of deciding “is this a fact or a dimension?” is a distraction from the actual work of modeling your data well.

When fct_/dim_ might still make sense: If you’re in an enterprise environment with strict data governance requirements, existing Kimball conventions, or a team that thinks in dimensional modeling terms, keep the prefixes. Consistency with your organization’s norms beats my preferences.

What belongs in marts:

  • Wide, denormalized tables ready for dashboards
  • Pre-aggregated metrics
  • Domain-specific business logic
  • Final business naming (customer-friendly column names)

Naming convention: mrt__[department]__[entity]. For example: mrt__marketing__campaign_performance, mrt__finance__order, mrt__marketing__session.

Folder structure (by business domain):

models/
└── marts/
    ├── finance/
    │   ├── _finance__models.yml
    │   ├── mrt__finance__order.sql
    │   └── mrt__finance__revenue.sql
    └── marketing/
        ├── _marketing__models.yml
        ├── mrt__marketing__session.sql
        ├── mrt__marketing__campaign_performance.sql
        └── mrt__marketing__channel_attribution.sql

At the marts layer, organize by business domain (the teams and use cases that will consume the data). Marketing doesn’t care that session data comes from GA4; they care that it’s marketing data.

Marketing analytics example: A mrt__marketing__campaign_performance mart joins int__ad__union with conversion data from int__conversion, calculates ROAS, CPA, and other marketing metrics, and produces a table with one row per campaign per day. It’s ready to plug directly into a dashboard or export to a spreadsheet.
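
A sketch of what that mart could look like (BigQuery syntax; the conversion_date, campaign_id, and revenue columns on int__conversion are assumptions about your conversion model):

-- mrt__marketing__campaign_performance.sql (illustrative sketch)
WITH spend AS (
    SELECT
        date,
        platform,
        campaign_id,
        campaign_name,
        SUM(spend)       AS spend,
        SUM(clicks)      AS clicks,
        SUM(impressions) AS impressions
    FROM {{ ref('int__ad__union') }}
    GROUP BY date, platform, campaign_id, campaign_name
),

conversions AS (
    SELECT
        conversion_date AS date,
        campaign_id,
        COUNT(*)        AS conversions,
        SUM(revenue)    AS revenue
    FROM {{ ref('int__conversion') }}
    GROUP BY conversion_date, campaign_id
)

SELECT
    spend.date,
    spend.platform,
    spend.campaign_id,
    spend.campaign_name,
    spend.spend,
    spend.clicks,
    spend.impressions,
    COALESCE(conversions.conversions, 0)                       AS conversions,
    COALESCE(conversions.revenue, 0)                           AS revenue,
    SAFE_DIVIDE(COALESCE(conversions.revenue, 0), spend.spend) AS roas,
    SAFE_DIVIDE(spend.spend, conversions.conversions)          AS cpa
FROM spend
LEFT JOIN conversions
    ON spend.date = conversions.date
    AND spend.campaign_id = conversions.campaign_id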

When do you need all three layers?

Not every project needs all three layers.

Simple projects (< 20 models): Base + marts may suffice. If your marts are straightforward aggregations of single base models, an intermediate layer adds unnecessary complexity.

Complex projects: Once you’re joining 3+ base models in a single mart, extract the shared logic into intermediate. Once you have multiple marts that need the same joined/transformed data, intermediate becomes essential for maintainability.

Rule of thumb: If you’re copying and pasting JOIN logic between marts, you need an intermediate model.


Naming Conventions: The Double Underscore Standard

Naming conventions are one of those topics where consistency matters far more than which specific convention you choose. That said, here’s what I recommend:

Layer                   | Pattern                                    | Example
Base                    | base__[source]__[entity]                   | base__stripe__payment
Intermediate (pure)     | int__[entity]                              | int__session
Intermediate (enriched) | int__[entity]__[entity1]_[join]_[entity2]  | int__customer__customer_lj_order
Marts                   | mrt__[department]__[entity]                | mrt__marketing__campaign_performance

Why double underscores?

The double underscore (__) creates unambiguous visual separation between components. Consider these alternatives:

  • base__google_analytics__campaign — clearly: base layer, google_analytics source, campaign entity
  • base_google_analytics_campaign — is this google + analytics_campaign? or google_analytics + campaign?

With multi-word source or entity names, the ambiguity compounds. Double underscores eliminate it.

This convention is also consistent with dbt’s source syntax ({{ source('stripe', 'payment') }}) and the broader dbt community’s practices.

The enriched model naming pattern

For intermediate models that join multiple entities, the naming convention is:

int__[primary_entity]__[entity1]_[join_type]_[entity2]_[join_type]_[entity3]

Join type abbreviations:

  • lj = LEFT JOIN
  • ij = INNER JOIN
  • cj = CROSS JOIN

Examples:

  • int__customer__customer_lj_order (customer left joined with order)
  • int__session__session_lj_conversion (session left joined with conversion)
  • int__customer__customer_lj_order_lj_session (customer → order → session chain)

This convention is verbose, but it’s self-documenting. You know the grain (primary entity), what data is included (joined entities), and how it’s joined (join types). No need to open the SQL file to understand what you’re getting.

Singular entity names

Use singular names: customer, order, session, campaign.

The rationale: entities represent a single concept. A customer table contains customer records. Each row is one customer. The entity is “customer”, not “customers”. This also keeps naming consistent across all layers: base__stripe__customer, int__customer, and mrt__finance__customer all refer to the same entity.

Source naming: use the system, not the loader

Name your sources after the source system, not the data loader:

  • Good: stripe, ga4, hubspot, shopify
  • Avoid: fivetran_stripe, stitch_hubspot, airbyte_shopify

Your loader might change. Your source system is (probably more) stable.


Materialization Strategy: Tables All the Way Down

The conventional wisdom for dbt materialization goes something like this:

Layer        | Conventional | Rationale
Staging/Base | view         | Fresh data, save storage
Intermediate | ephemeral    | Just CTEs, not real tables
Marts        | table        | Performance for end users

I think this is wrong. My approach instead:

Layer        | My approach | Rationale
Base         | table       | Debuggable, stable
Intermediate | table       | Debuggable, queryable
Marts        | table       | Performance

Everything is a table. The only variations are:

  • Incremental when volume is high (millions of rows, daily appends)
  • View when data must be fresh within minutes (rare)

The case for tables everywhere

Debugging. When something breaks (and something will break), you need to query intermediate results. With ephemeral models, those results don’t exist in your warehouse. You can’t SELECT * FROM int__session LIMIT 100 to check your sessionization logic. You’re flying blind.

Stability. Views are re-evaluated on every query. If your upstream source changes schema, every downstream view breaks immediately. Tables act as a buffer: you have time to notice and fix issues before they cascade.

Cost. Storage is cheap. Absurdly cheap. BigQuery charges $0.02 per GB per month. A table with 10 million rows might cost you $0.50/month to store. Meanwhile, recomputing views on every query costs compute, and compute is not cheap.

Simplicity. One materialization strategy is easier to understand, teach, and maintain than three.

When to use incremental

Switch to incremental materialization when:

  • Tables exceed millions of rows
  • Daily appends are the dominant pattern
  • Full refresh takes more than 5 minutes

For marketing analytics, this typically means:

  • GA4 events tables (incremental with 3-day lookback for late-arriving events)
  • Ad platform daily extracts (incremental by date, with a 30-day lookback)
  • Transaction logs

Important: Incremental models are more complex. They require careful thought about your incremental strategy, late-arriving data, and edge cases. Don’t use incremental until table materialization becomes a bottleneck.
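
When you do make the switch, here's a sketch of what the in-model config and lookback filter can look like on BigQuery (the column names and partition field are assumptions):

-- illustrative incremental model with a 3-day lookback (BigQuery syntax)
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'}
    )
}}

SELECT
    event_date,
    event_name,
    ga_client_id
    -- ... remaining columns
FROM {{ ref('base__ga4__event') }}

{% if is_incremental() %}
-- reprocess the trailing days so late-arriving events are picked up
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}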

The one exception: views for real-time freshness

Sometimes data genuinely needs to be fresh within minutes, not hours. Live campaign dashboards during a product launch. Real-time inventory levels. Fraud detection.

In these cases, views make sense because you’re accepting the query-time compute cost in exchange for freshness.

But this is rare. Most analytics use cases tolerate hourly or daily latency just fine. Don’t optimize for real-time unless you actually need it.

Configuration in dbt_project.yml

models:
  my_project:
    +materialized: table              # Default everything to table
    base:
      +schema: base
      ga4:
        +materialized: incremental    # High volume
        +incremental_strategy: insert_overwrite
    intermediate:
      +schema: intermediate
    marts:
      +schema: marts

Set the default to table, then override only where needed.
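
For the rare model that genuinely needs view freshness (the exception above), override the default in the model file itself. A minimal sketch, with an assumed session_start_at column:

-- illustrative: per-model override of the project-wide table default
{{ config(materialized='view') }}

SELECT *
FROM {{ ref('int__session') }}
WHERE session_start_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)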


YAML Organization: One File Per Directory

Three approaches exist for organizing your .yml files:

  1. Per-directory: _stripe__sources.yml, _stripe__models.yml in each folder
  2. Per-model: One YAML file per SQL file
  3. Monolithic: One giant schema.yml at the project root

Option 3 is a nightmare. Don’t do it. A 2000-line schema.yml is unsearchable, unmaintainable, and generates merge conflicts constantly.

Option 2 (per-model) is gaining popularity in large projects. It scales well because each model is self-contained. The downside: more files, and you can’t use YAML anchors to reduce repetition.

I recommend option 1: per-directory. It’s the official dbt Labs recommendation, it balances discoverability with maintainability, and it works well for projects up to several hundred models.

The per-directory pattern

base/stripe/
├── _stripe__sources.yml        # Source definitions
├── _stripe__models.yml         # Model configs, tests, docs
├── base__stripe__payment.sql
└── base__stripe__customer.sql

Why the underscore prefix? Files starting with _ sort to the top of the directory. Your YAML files appear before your SQL files.

Why include the directory name? Faster fuzzy-finding in editors. Searching for stripe__models is faster than searching for _models and scanning results.

What goes where

_sources.yml:

  • Source definitions (sources:)
  • Source freshness tests
  • Source-level documentation

_models.yml:

  • Model-level configurations
  • Column tests (unique, not_null, relationships, accepted_values)
  • Column descriptions
  • Contracts (if using)

Keep them separate. Mixing source and model definitions in one file creates confusion.
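
Here's a minimal sketch of a sources file with freshness checks; the landing schema and the audit column are placeholders you'd adapt to your loader:

# base/stripe/_stripe__sources.yml (illustrative)
version: 2

sources:
  - name: stripe
    description: Raw Stripe data loaded by our EL tool.
    schema: stripe                  # placeholder landing schema
    loaded_at_field: _loaded_at     # placeholder audit column
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payment
      - name: customer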


Beyond models/: The Full Project Structure

A dbt project is more than just models. The complete structure:

my_project/
├── analyses/
│   └── migration_validation.sql
├── dbt_project.yml
├── macros/
│   ├── _macros.yml
│   ├── generate_schema_name.sql
│   └── marketing/
│       ├── channel_grouping.sql
│       └── attribution_weight.sql
├── models/
│   ├── base/
│   ├── intermediate/
│   └── marts/
├── packages.yml
├── seeds/
│   ├── _seeds.yml
│   ├── channel_mapping.csv
│   └── utm_campaign_mapping.csv
├── snapshots/
│   ├── _snapshots.yml
│   └── snap__customer.sql
└── tests/
    └── assert_attribution_sums_to_one.sql

macros/

Macros are reusable SQL snippets (functions you can call from any model).

Organization: Group by domain or function.

macros/
├── _macros.yml                 # Document all macros
├── generate_schema_name.sql
├── grant_permissions.sql
└── marketing/
    ├── channel_grouping.sql
    └── session_timeout.sql

Critical: Always document your macros in _macros.yml. Include the purpose, arguments, and an example. Future you will thank present you.

macros/_macros.yml
macros:
  - name: channel_grouping
    description: |
      Maps UTM source/medium combinations to standardized channel groups.
      Returns: 'Paid Search', 'Organic Search', 'Paid Social', etc.
    arguments:
      - name: source
        type: string
        description: The traffic source (utm_source or ga source)
      - name: medium
        type: string
        description: The traffic medium (utm_medium or ga medium)
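
The macro body itself might look something like this; the mapping rules here are placeholders, not a complete channel definition:

-- macros/marketing/channel_grouping.sql (illustrative sketch)
{% macro channel_grouping(source, medium) %}
    CASE
        WHEN {{ medium }} IN ('cpc', 'ppc', 'paid_search')  THEN 'Paid Search'
        WHEN {{ medium }} = 'organic'                        THEN 'Organic Search'
        WHEN {{ source }} IN ('facebook', 'instagram', 'linkedin')
             AND {{ medium }} IN ('paid', 'cpc')             THEN 'Paid Social'
        WHEN {{ medium }} = 'email'                          THEN 'Email'
        ELSE 'Other'
    END
{% endmacro %}

In a model, you'd call it as {{ channel_grouping('utm_source', 'utm_medium') }} AS channel_group, passing the column names as strings.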

seeds/

Seeds are CSV files that dbt loads into your warehouse. Use them for static lookup tables that don’t exist in any source system.

Good use cases:

  • UTM → campaign mappings
  • Channel grouping rules
  • Country/region codes
  • Internal IP addresses to exclude from analytics
  • Employee email → customer ID mappings (for filtering internal usage)

Bad use cases:

  • Loading actual data. If it exists in a system you have access to, use a proper EL tool.
  • Large datasets. Seeds are loaded entirely on every dbt seed run.

Marketing examples:

seeds/channel_mapping.csv
source,medium,channel_group
google,cpc,Paid Search
google,organic,Organic Search
facebook,paid,Paid Social
facebook,referral,Organic Social
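
Seeds are referenced with ref() just like models. A sketch of applying the mapping in an intermediate model (the source and medium columns on the session side are assumptions):

-- illustrative: joining the channel_mapping seed onto sessions
SELECT
    s.session_id,
    s.source,
    s.medium,
    COALESCE(m.channel_group, 'Other') AS channel_group
FROM {{ ref('int__session') }} AS s
LEFT JOIN {{ ref('channel_mapping') }} AS m
    ON s.source = m.source
    AND s.medium = m.medium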

snapshots/

Snapshots create Type 2 slowly changing dimension records, preserving historical states when source data changes.

Use cases:

  • Customer attributes that change over time (tier, status, owner)
  • Campaign settings history
  • Product pricing history
  • Any dimension where you need to answer “what was the value at time X?”

Naming: snap__[entity], e.g. snap__customer.sql

As of dbt 1.9+, you can define snapshots in YAML files instead of SQL, which simplifies the syntax considerably.
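
A sketch of the YAML form, assuming a base__crm__customer model with customer_id and updated_at columns (check the snapshot properties your dbt version supports):

# snapshots/_snapshots.yml (illustrative, dbt 1.9+)
snapshots:
  - name: snap__customer
    relation: ref('base__crm__customer')
    config:
      unique_key: customer_id
      strategy: timestamp
      updated_at: updated_at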

tests/

dbt has two types of tests:

Generic tests (defined in YAML) are your primary testing mechanism:

_models.yml
models:
  - name: int__session
    columns:
      - name: session_id
        tests:
          - unique
          - not_null
      - name: ga_client_id
        tests:
          - not_null
          - relationships:
              to: ref('base__ga4__event')
              field: ga_client_id

Singular tests (SQL files in tests/) are for complex validations spanning multiple models:

-- tests/assert_attribution_sums_to_one.sql
-- Each conversion's attribution weights should sum to 1.0
SELECT conversion_id
FROM {{ ref('mrt__marketing__channel_attribution') }}
GROUP BY conversion_id
HAVING ABS(SUM(attribution_weight) - 1.0) > 0.001

Minimum standard: Every model should have its primary key tested for unique and not_null. This catches a surprising number of issues.

analyses/

The analyses/ folder stores SQL queries that benefit from Jinja templating and version control but don’t create warehouse objects.

Use cases:

  • Migration validation queries
  • One-off audit queries
  • Ad-hoc analysis you want to preserve
  • Queries you run manually but want to track in git

Modern dbt Features (1.7+): What to Adopt

dbt has evolved significantly in recent versions. What’s worth adopting in 2026:

Contracts: Yes, for marts

Model contracts define explicit guarantees about a model’s schema (column names, data types, and constraints). When enforced, dbt validates these at build time.

models:
  - name: mrt__marketing__campaign_performance
    config:
      contract:
        enforced: true
    columns:
      - name: campaign_id
        data_type: string
        constraints:
          - type: not_null
      - name: date
        data_type: date
        constraints:
          - type: not_null
      - name: spend
        data_type: float64

When to use: Public-facing marts consumed by dashboards, downstream systems, or other teams. Contracts act as an API schema that communicates what downstream consumers can rely on.

When to skip: Base and intermediate layers. The ceremony isn’t worth it for internal models that only you consume.

Unit testing: Yes, for complex logic

Native unit testing (dbt 1.8+) lets you test SQL logic against static inputs before materialization.

unit_tests:
  - name: test_channel_grouping_paid_search
    model: int__session
    given:
      - input: ref('base__ga4__event')
        rows:
          - {session_id: 'abc', source: 'google', medium: 'cpc'}
    expect:
      rows:
        - {session_id: 'abc', channel_group: 'Paid Search'}

When to use:

  • Complex CASE statements
  • Date/time calculations
  • Attribution logic
  • Channel grouping rules
  • Any business logic where edge cases matter

When to skip: Simple transformations where the logic is obvious.

Groups and access modifiers: Yes, even for single projects

Groups organize models under common ownership and enable access control.

dbt_project.yml
groups:
  - name: marketing
    owner:
      email: marketing-data@company.com
  - name: finance
    owner:
      email: finance-data@company.com

models:
  my_project:
    marts:
      marketing:
        +group: marketing
        +access: public
      finance:
        +group: finance
        +access: public

Why bother in a single project?

  1. Documentation. Groups explicitly record who owns what.
  2. Future-proofing. If you ever migrate to dbt Mesh, the structure is already in place.
  3. Boundaries. Access modifiers (private, protected, public) enforce that teams don’t create unintended dependencies.

dbt Mesh: Know when you need it

dbt Mesh enables splitting a monolithic project into multiple interconnected projects. It’s powerful, but it’s also complex.

Signs you might need it:

  • 500+ models with slow CI
  • Multiple teams stepping on each other
  • Security requirements for data isolation
  • Different deployment cadences for different domains

For most projects: A single project with groups is sufficient. Don’t adopt Mesh complexity until you have Mesh problems.

dbt-project-evaluator: Yes, for enforcement

The dbt-project-evaluator package automatically audits your project against best practices. It checks for:

  • Missing primary key tests
  • Models without descriptions
  • Direct source references in marts (should go through base)
  • Naming convention violations
  • And dozens more rules

Add it to your packages.yml:

packages:
  - package: dbt-labs/dbt_project_evaluator
    version: 0.8.1

Run dbt build --select package:dbt_project_evaluator to see violations.

Pro tip: Customize the rules to match your conventions. The package expects stg_ prefixes by default; you can configure it for base__ patterns.


Marketing Analytics: Practical Patterns

Time to ground everything in concrete marketing analytics examples.

GA4 modeling structure

models/
├── base/ga4/
│   ├── _ga4__sources.yml
│   ├── _ga4__models.yml
│   └── base__ga4__event.sql                       # Unnest, cast, filter debug
├── intermediate/
│   ├── int__session.sql                           # Sessionization logic
│   ├── int__session__session_lj_conversion.sql    # Session enriched
│   └── int__conversion.sql                        # Conversion events
└── marts/marketing/
    ├── mrt__marketing__session.sql                # Session-level metrics
    ├── mrt__marketing__user_journey.sql           # Full user paths
    └── mrt__marketing__conversion_funnel.sql      # Funnel analysis

Key considerations:

  • GA4 exports sharded tables (events_YYYYMMDD). Use {{ source('ga4', 'event') }} with wildcard table configuration.
  • Event parameters are nested. Unnest them in the base layer.
  • Sessionization requires a timeout rule (typically 30 minutes). Implement in intermediate.
  • Use incremental models with 3-day lookback because GA4 events can arrive late.

Multi-platform ad spend consolidation

The challenge: Google Ads, Meta, LinkedIn, TikTok all have different schemas and different naming, but the same business need.

Pattern: Normalize in base, unify in intermediate.

models/
├── base/
│   ├── google_ads/
│   │   └── base__google_ads__campaign_performance.sql
│   ├── meta_ads/
│   │   └── base__meta_ads__campaign_insights.sql
│   └── linkedin_ads/
│       └── base__linkedin_ads__campaign_analytics.sql
└── intermediate/
    └── int__ad__union.sql

The int__ad__union model UNIONs all platforms into a standardized schema:

-- int__ad__union.sql
SELECT
    date,
    'google_ads' AS platform,
    account_id,
    campaign_id,
    campaign_name,
    impressions,
    clicks,
    spend,
    conversions
FROM {{ ref('base__google_ads__campaign_performance') }}

UNION ALL

SELECT
    date,
    'meta_ads' AS platform,
    account_id,
    campaign_id,
    campaign_name,
    impressions,
    clicks,
    spend,
    conversions
FROM {{ ref('base__meta_ads__campaign_insights') }}

-- ... additional platforms

Now every mart that needs ad spend data references this single unified source.

Attribution modeling

Start simple: first-touch and last-touch attribution are table stakes. Add multi-touch when (and if) you need it.

Structure:

models/
├── intermediate/
│   ├── int__attribution_touchpoint.sql                  # All touchpoints with timestamps
│   ├── int__attribution_path.sql                        # Ordered paths per conversion
│   └── int__conversion__conversion_lj_touchpoint.sql    # Conversion enriched with paths
└── marts/marketing/
    ├── mrt__marketing__first_touch_attribution.sql
    ├── mrt__marketing__last_touch_attribution.sql
    └── mrt__marketing__multi_touch_attribution.sql

The int__attribution_touchpoint model gathers all touchpoints (sessions, ad clicks, email opens) with timestamps and user identifiers. The int__attribution_path model orders them into sequences. The int__conversion__conversion_lj_touchpoint model joins conversions with their attribution paths.

Marts then apply different attribution logic to the same underlying data, making it easy to compare models.
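
A sketch of how a first-touch mart can sit on top of that enriched model (BigQuery syntax; the column names are assumptions about what the intermediate model exposes):

-- mrt__marketing__first_touch_attribution.sql (illustrative sketch)
SELECT
    conversion_id,
    conversion_at,
    touchpoint_channel AS attributed_channel
FROM {{ ref('int__conversion__conversion_lj_touchpoint') }}
-- keep only the earliest touchpoint for each conversion
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY conversion_id
    ORDER BY touchpoint_at ASC
) = 1

Last-touch is the same query with the ordering reversed; multi-touch replaces the single-row filter with a weight spread across every touchpoint in the path.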

Handling incrementality in marketing data

Different data types need different incremental strategies:

Data                | Strategy                            | Lookback | Why
GA4 events          | insert_overwrite by date            | 3 days   | Late-arriving events
Ad platform metrics | insert_overwrite by date            | 30 days  | Attribution windows update retroactively
Conversions         | Full refresh or careful incremental | N/A      | Attribution can change for weeks

The lookback window matters. Ad platforms recalculate attribution as more data arrives: a conversion attributed to Campaign A on day 1 might shift to Campaign B by day 30. Build this into your incremental logic.


Common Mistakes and How to Avoid Them

Mistake: Skipping the intermediate layer

Symptom: Marts with 10+ JOINs, repeated logic across models, impossible to debug.

Fix: Extract shared logic into entity-based intermediate models. If multiple marts need session data, create int__session once.

Mistake: Organizing base by loader, not source

Wrong: fivetran/, stitch/, airbyte/

Right: stripe/, ga4/, hubspot/

Your loader is an implementation detail. Your source is the actual data.

Mistake: Business logic in the base layer

Symptom: Complex CASE statements, hardcoded channel mappings, business rules in base models.

Fix: Base should be mechanical transformations only. Business logic belongs in intermediate or marts.

Mistake: Monolithic YAML files

Symptom: One 2000-line schema.yml at the project root.

Fix: One YAML file per directory. _stripe__models.yml in base/stripe/, _marketing__models.yml in marts/marketing/.

Mistake: Over-nesting folders

Wrong: models/staging/external/stripe/payments/v2/current/

Right: Keep folder depth ≤ 3 levels. Flat is better than nested.

Mistake: Ephemeral everything

Symptom: Can’t debug, can’t query intermediate results, no visibility into transformation logic.

Fix: Tables are cheap. Visibility is priceless. Materialize as tables by default.

Mistake: Skipping tests

Symptom: Duplicate primary keys, null values in NOT NULL columns, broken relationships, all discovered in production.

Fix: At minimum, test every primary key for unique and not_null. Add relationship tests for foreign keys.


Putting It All Together: A Complete Example

A complete marketing analytics project structure applying everything covered above:

marketing_analytics/
├── dbt_project.yml
├── packages.yml
├── macros/
│   ├── _macros.yml
│   └── marketing/
│       ├── channel_grouping.sql
│       └── session_timeout_minutes.sql
├── models/
│   ├── base/
│   │   ├── ga4/
│   │   │   ├── _ga4__sources.yml
│   │   │   ├── _ga4__models.yml
│   │   │   └── base__ga4__event.sql
│   │   ├── google_ads/
│   │   │   ├── _google_ads__sources.yml
│   │   │   ├── _google_ads__models.yml
│   │   │   └── base__google_ads__campaign_performance.sql
│   │   └── meta_ads/
│   │       ├── _meta_ads__sources.yml
│   │       ├── _meta_ads__models.yml
│   │       └── base__meta_ads__campaign_insight.sql
│   ├── intermediate/
│   │   ├── _int__models.yml
│   │   ├── session/
│   │   │   ├── int__session.sql
│   │   │   └── int__session__session_lj_conversion.sql
│   │   ├── int__conversion.sql
│   │   ├── int__ad__union.sql
│   │   └── int__attribution_touchpoint.sql
│   └── marts/
│       └── marketing/
│           ├── _marketing__models.yml
│           ├── mrt__marketing__session.sql
│           ├── mrt__marketing__campaign_performance.sql
│           ├── mrt__marketing__channel_attribution.sql
│           └── mrt__marketing__daily_ad_spend.sql
├── seeds/
│   ├── _seeds.yml
│   └── channel_mapping.csv
├── snapshots/
│   └── snap__campaign_setting.sql
└── tests/
    └── assert_attribution_weights_valid.sql

And the key configuration in dbt_project.yml:

name: marketing_analytics
version: '1.0.0'

vars:
  session_timeout_minutes: 30
  attribution_lookback_days: 30

models:
  marketing_analytics:
    +materialized: table
    base:
      +schema: base
      ga4:
        +materialized: incremental
        +incremental_strategy: insert_overwrite
    intermediate:
      +schema: intermediate
    marts:
      +schema: marts
      marketing:
        +group: marketing
        +access: public

groups:
  - name: marketing
    owner:
      email: marketing-data@company.com

Conclusion

The best dbt project structure is one where anyone can find any model in under 10 seconds. Everything else (naming conventions, layer definitions, materialization strategies) serves that goal.

Key takeaways:

  1. base → intermediate → marts. Three layers. Alphabetical order matches lineage order. Each layer has clear rules about what belongs there.
  2. Entity-based naming with self-documenting joins. int__session for pure entities, int__customer__customer_lj_order for enriched ones. The name tells you what’s inside.
  3. No fct_/dim_ prefixes. We’ve moved past dimensional modeling. Mart names like mrt__marketing__session are clearer.
  4. Tables everywhere. Storage is cheap. Debugging visibility is expensive. Materialize as tables by default, incremental when volume demands it.
  5. Consistency is king. Pick conventions, document them, enforce them with dbt-project-evaluator. The specific conventions matter less than applying them consistently.

If you’re starting a new project, adopt these patterns from day one. If you’re inheriting an existing project, pick your battles and don’t try to refactor everything at once. Start with naming conventions in new models, then gradually migrate the rest.

Whatever conventions you choose, write them down. A CONTRIBUTING.md in your dbt project root that documents your structure decisions is worth more than any best practices guide, including this one.