Adrienne Vermorel
dbt Project Structure: The Definitive Guide
dbt project structure is one of the first decisions you make when starting a new project, and one of the hardest to change later. Get it wrong and six months from now you’ll be staring at a tangled DAG, wondering which of your 47 base models actually feeds the dashboard that just broke.
The challenge is that much of what you’ll find online is either outdated (referencing patterns that dbt Labs has since walked back) or too theoretical (explaining what to do without explaining why it matters). This guide is different. It’s based on auditing projects with 400+ models and building marketing analytics pipelines from scratch. These patterns have been stress-tested on real client projects.
This article covers a complete, opinionated framework for structuring dbt projects. It starts with foundations for beginners, then takes clear stances on the debates that experienced practitioners actually care about. It goes beyond just the models/ folder to cover macros, tests, seeds, snapshots, and YAML organization. Everything is grounded in practical examples from marketing analytics: GA4, advertising platforms, and attribution modeling.
The thesis is simple: consistency matters more than any specific convention, but you need a convention to be consistent with. The goal is a structure where anyone on your team can find any model in under 10 seconds.
Time to build that structure.
The Three-Layer Architecture: base → intermediate → marts
The foundation of any well-structured dbt project is a clear layered architecture. Data should flow from source-conformed (shaped by external systems you don’t control) to business-conformed (shaped by your organization’s needs and definitions).
The layers I use are base, intermediate, and marts. You’ll see “staging” used instead of “base” in much of the dbt documentation, but I prefer base/intermediate/marts for a simple reason: alphabetical order matches lineage order. When you open your models/ folder, you see base/ first, then intermediate/, then marts/. That’s exactly the order data flows through your project. It’s a small thing, but it reduces cognitive load every single day.
The base layer
The base layer is where source data enters your dbt project. Each base model has a 1-to-1 relationship with a source table. No exceptions.
What belongs in base models:
- Renaming columns to consistent conventions (snake_case, removing prefixes like c_ or tbl_)
- Type casting (strings to dates, integers to booleans)
- Basic unit conversions (cents to dollars, milliseconds to seconds)
- Timezone conversions
- Filtering out rows you never want (test data, hard deletes via WHERE NOT _fivetran_deleted)
- Handling source-specific quirks (unnesting repeated fields, deduplicating on primary key)
What does NOT belong in base models:
- Joins. Never. If you need data from another source, that’s an intermediate model.
- Aggregations. Never. Base models should maintain the same grain as the source.
- Business logic. If you’re writing CASE statements with business rules, you’re in the wrong layer.
The base layer should be mechanical. Someone unfamiliar with your business domain should be able to understand every transformation just by reading the SQL.
Naming convention: base__[source]__[entity]
The double underscore between each component is intentional. It creates unambiguous visual separation. base__google_analytics__event is immediately parseable; base_google_analytics_event could be base_google + analytics_event or base + google_analytics_event.
Folder structure:
```
models/
└── base/
    ├── stripe/
    │   ├── _stripe__sources.yml
    │   ├── _stripe__models.yml
    │   ├── base__stripe__payment.sql
    │   └── base__stripe__customer.sql
    └── ga4/
        ├── _ga4__sources.yml
        ├── _ga4__models.yml
        └── base__ga4__event.sql
```
Organize base models by source system, not by business domain or data loader. Use stripe/, not fivetran/ or payment/. The source system is the natural organizing principle at this layer because that’s where the data actually comes from.
Marketing analytics example: A base__ga4__event model takes the raw GA4 export, unnests the event_params repeated field into proper columns, casts event_timestamp from microseconds to a proper timestamp, filters out debug_mode = 1 events, and renames user_pseudo_id to ga_client_id. No sessionization, no attribution, no business logic.
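To make that concrete, here is a minimal sketch of what such a base model could look like on BigQuery. The unnested parameters and column names are illustrative assumptions, not a complete model:

```sql
-- base__ga4__event.sql (illustrative sketch, BigQuery syntax)
with source as (

    select * from {{ source('ga4', 'event') }}

),

renamed as (

    select
        -- consistent identifiers
        user_pseudo_id as ga_client_id,
        event_name,

        -- microseconds since epoch -> proper timestamp
        timestamp_micros(event_timestamp) as event_timestamp,

        -- unnest selected event_params into proper columns
        (select value.string_value from unnest(event_params) where key = 'page_location') as page_location,
        (select value.int_value    from unnest(event_params) where key = 'ga_session_id') as ga_session_id,
        (select value.int_value    from unnest(event_params) where key = 'debug_mode')    as debug_mode

    from source

)

select * from renamed
-- drop debug traffic
where coalesce(debug_mode, 0) != 1
```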
The intermediate layer
The intermediate layer is where base models combine to form coherent entities. This is where your project’s intellectual work happens.
My opinionated take: intermediate models should be entity-centric, not transformation-centric.
The conventional dbt guidance suggests naming intermediate models after the transformation they perform: int_payments_pivoted_to_order, int_events_sessionized, int_customers_aggregated_by_account. I think this is wrong.
Instead, intermediate models should represent clean, well-defined entities at a specific grain. There are two types:
- Pure entity models (int__session, int__customer) when you need to apply business logic to a single base model (deduplication, sessionization, complex calculations)
- Enriched entity models (int__customer__customer_lj_order) when you’re joining entities together
The enriched model naming convention is self-documenting: int__[primary_entity]__[entity1]_[join_type]_[entity2]. The join type abbreviations are:
- lj = LEFT JOIN
- ij = INNER JOIN
- cj = CROSS JOIN
For multiple joins, chain them: int__customer__customer_lj_order_lj_session.
Why does this matter?
- Self-documenting. You can read the model name and know exactly what’s joined and how. No need to open the SQL to understand the grain and composition.
- Reusability. Entity-based models like int__session can be referenced by multiple marts. The enriched versions make the dependencies explicit.
- Debuggability. When something breaks, the name tells you where to look. int__customer__customer_lj_order failing? Check the customer-to-order join logic.
What belongs in intermediate models:
- Joining base models to create enriched entities
- Sessionization logic (grouping events into sessions)
- Deduplication across sources (merging customer records from CRM and billing)
- Identity resolution
- Complex business logic that shouldn’t clutter marts
Important: Don’t create an int__customer model if it’s identical to base__crm__customer. Only create intermediate models when you’re adding value through joins, business logic, or transformations. Duplicating base models wastes compute and clutters your DAG.
Unlike base models (organized by source) or marts (organized by business domain), intermediate is organized by entity. If you have enough models, you can create subfolders by entity (intermediate/customer/, intermediate/session/) but never by business domain like intermediate/marketing/ or intermediate/finance/. Intermediate models often span domains: int__customer__customer_lj_order serves both finance and marketing. Forcing it into one domain’s folder creates artificial boundaries.
Folder structure:
```
models/
└── intermediate/
    ├── _int__models.yml
    ├── session/
    │   ├── int__session.sql
    │   └── int__session__session_lj_conversion.sql
    ├── customer/
    │   ├── int__customer__customer_lj_order.sql
    │   └── int__customer__customer_lj_order_lj_session.sql
    ├── int__unified_ad_spend.sql
    └── int__attribution_touchpoint.sql
```
For smaller projects, keep it flat. Add entity subfolders when you have 3+ models for the same entity.
Marketing analytics example: An int__session model takes base__ga4__event, applies your 30-minute session timeout logic, calculates session duration, and assigns each session to a traffic source. Then int__session__session_lj_conversion enriches sessions with conversion data (the name tells you exactly what you’re getting).
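Here is a sketch of that sessionization logic, assuming the columns produced by the base model above and the session_timeout_minutes variable shown later in dbt_project.yml (traffic source assignment omitted for brevity):

```sql
-- int__session.sql (illustrative sketch of the 30-minute timeout logic)
with events as (

    select * from {{ ref('base__ga4__event') }}

),

flagged as (

    select
        *,
        -- a new session starts when the gap since the previous event exceeds the timeout
        case
            when lag(event_timestamp) over (partition by ga_client_id order by event_timestamp) is null
              or timestamp_diff(
                    event_timestamp,
                    lag(event_timestamp) over (partition by ga_client_id order by event_timestamp),
                    minute
                 ) >= {{ var('session_timeout_minutes', 30) }}
            then 1 else 0
        end as is_new_session
    from events

),

sessionized as (

    select
        *,
        -- cumulative count of session starts per client = session number
        sum(is_new_session) over (
            partition by ga_client_id order by event_timestamp
        ) as session_number
    from flagged

)

select
    concat(ga_client_id, '-', cast(session_number as string)) as session_id,
    ga_client_id,
    min(event_timestamp) as session_start_at,
    max(event_timestamp) as session_end_at,
    timestamp_diff(max(event_timestamp), min(event_timestamp), second) as session_duration_seconds
from sessionized
group by ga_client_id, session_number
```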
The marts layer
The marts layer produces tables ready for consumption by end users: analysts, dashboards, reverse ETL, ML models.
Another opinionated stance: I don’t use fct_ and dim_ prefixes.
Dimensional modeling (facts and dimensions, star schemas, Kimball methodology) was designed for OLAP cubes and the constraints of 1990s data warehouses. Those constraints no longer exist. Modern BI tools handle wide, denormalized tables perfectly well. Your analysts don’t care whether customer is technically a dimension; they just want to query customer data.
The prefixes add ceremony without adding clarity. fct_order tells you nothing that order doesn’t. And the mental overhead of deciding “is this a fact or a dimension?” is a distraction from the actual work of modeling your data well.
When fct_/dim_ might still make sense: If you’re in an enterprise environment with strict data governance requirements, existing Kimball conventions, or a team that thinks in dimensional modeling terms, keep the prefixes. Consistency with your organization’s norms beats my preferences.
What belongs in marts:
- Wide, denormalized tables ready for dashboards
- Pre-aggregated metrics
- Domain-specific business logic
- Final business naming (customer-friendly column names)
Naming convention: mrt__[department]__[entity]. For example: mrt__marketing__campaign_performance, mrt__finance__order, mrt__marketing__session.
Folder structure (by business domain):
```
models/
└── marts/
    ├── finance/
    │   ├── _finance__models.yml
    │   ├── mrt__finance__order.sql
    │   └── mrt__finance__revenue.sql
    └── marketing/
        ├── _marketing__models.yml
        ├── mrt__marketing__session.sql
        ├── mrt__marketing__campaign_performance.sql
        └── mrt__marketing__channel_attribution.sql
```
At the marts layer, organize by business domain (the teams and use cases that will consume the data). Marketing doesn’t care that session data comes from GA4; they care that it’s marketing data.
Marketing analytics example: A mrt__marketing__campaign_performance mart joins int__unified_ad_spend with conversion data from int__conversion, calculates ROAS, CPA, and other marketing metrics, and produces a table with one row per campaign per day. It’s ready to plug directly into a dashboard or export to a spreadsheet.
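A hedged sketch of what that mart could look like, assuming int__unified_ad_spend and int__conversion are both at the campaign/day grain (column names are illustrative):

```sql
-- mrt__marketing__campaign_performance.sql (illustrative sketch)
with spend as (

    select * from {{ ref('int__unified_ad_spend') }}

),

conversions as (

    select * from {{ ref('int__conversion') }}

),

joined as (

    select
        spend.date,
        spend.platform,
        spend.campaign_id,
        spend.campaign_name,
        spend.impressions,
        spend.clicks,
        spend.spend,
        coalesce(conversions.conversion_count, 0) as conversions,
        coalesce(conversions.conversion_value, 0) as conversion_value
    from spend
    left join conversions
        on spend.campaign_id = conversions.campaign_id
        and spend.date = conversions.date

)

-- one row per campaign per day, with the headline marketing metrics
select
    *,
    safe_divide(conversion_value, spend) as roas,
    safe_divide(spend, conversions) as cpa
from joined
```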
When do you need all three layers?
Not every project needs all three layers.
Simple projects (< 20 models): Base + marts may suffice. If your marts are straightforward aggregations of single base models, an intermediate layer adds unnecessary complexity.
Complex projects: Once you’re joining 3+ base models in a single mart, extract the shared logic into intermediate. Once you have multiple marts that need the same joined/transformed data, intermediate becomes essential for maintainability.
Rule of thumb: If you’re copying and pasting JOIN logic between marts, you need an intermediate model.
Naming Conventions: The Double Underscore Standard
Naming conventions are one of those topics where consistency matters far more than which specific convention you choose. That said, here’s what I recommend:
| Layer | Pattern | Example |
|---|---|---|
| Base | base__[source]__[entity] | base__stripe__payment |
| Intermediate (pure) | int__[entity] | int__session |
| Intermediate (enriched) | int__[entity]__[entity1]_[join]_[entity2] | int__customer__customer_lj_order |
| Marts | mrt__[department]__[entity] | mrt__marketing__campaign_performance |
Why double underscores?
The double underscore (__) creates unambiguous visual separation between components. Consider these alternatives:
- base__google_analytics__campaign — clearly: base layer, google_analytics source, campaign entity
- base_google_analytics_campaign — is this google + analytics_campaign? or google_analytics + campaign?
With multi-word source or entity names, the ambiguity compounds. Double underscores eliminate it.
This convention is also consistent with dbt’s source syntax ({{ source('stripe', 'payment') }}) and the broader dbt community’s practices.
The enriched model naming pattern
For intermediate models that join multiple entities, the naming convention is:
int__[primary_entity]__[entity1]_[join_type]_[entity2]_[join_type]_[entity3]
Join type abbreviations:
- lj = LEFT JOIN
- ij = INNER JOIN
- cj = CROSS JOIN
Examples:
- int__customer__customer_lj_order (customer left joined with order)
- int__session__session_lj_conversion (session left joined with conversion)
- int__customer__customer_lj_order_lj_session (customer → order → session chain)
This convention is verbose, but it’s self-documenting. You know the grain (primary entity), what data is included (joined entities), and how it’s joined (join types). No need to open the SQL file to understand what you’re getting.
Singular entity names
Use singular names: customer, order, session, campaign.
The rationale: entities represent a single concept. A customer table contains customer records. Each row is one customer. The entity is “customer”, not “customers”. This also keeps naming consistent across all layers: base__stripe__customer, int__customer, and mrt__finance__customer all refer to the same entity.
Source naming: use the system, not the loader
Name your sources after the source system, not the data loader:
- ✓ stripe, ga4, hubspot, shopify
- ✗ fivetran_stripe, stitch_hubspot, airbyte_shopify
Your loader might change. Your source system is (probably more) stable.
Materialization Strategy: Tables All the Way Down
The conventional wisdom for dbt materialization goes something like this:
| Layer | Conventional | Rationale |
|---|---|---|
| Staging/Base | view | Fresh data, save storage |
| Intermediate | ephemeral | Just CTEs, not real tables |
| Marts | table | Performance for end users |
I think this is wrong. My approach instead:
| Layer | My approach | Rationale |
|---|---|---|
| Base | table | Debuggable, stable |
| Intermediate | table | Debuggable, queryable |
| Marts | table | Performance |
Everything is a table. The only variations are:
- Incremental when volume is high (millions of rows, daily appends)
- View when data must be fresh within minutes (rare)
The case for tables everywhere
Debugging. When something breaks (and something will break), you need to query intermediate results. With ephemeral models, those results don’t exist in your warehouse. You can’t SELECT * FROM int__session LIMIT 100 to check your sessionization logic. You’re flying blind.
Stability. Views are re-evaluated on every query. If your upstream source changes schema, every downstream view breaks immediately. Tables act as a buffer: you have time to notice and fix issues before they cascade.
Cost. Storage is cheap. Absurdly cheap. BigQuery charges $0.02 per GB per month. A table with 10 million rows might cost you $0.50/month to store. Meanwhile, recomputing views on every query costs compute, and compute is not cheap.
Simplicity. One materialization strategy is easier to understand, teach, and maintain than three.
When to use incremental
Switch to incremental materialization when:
- Tables exceed millions of rows
- Daily appends are the dominant pattern
- Full refresh takes more than 5 minutes
For marketing analytics, this typically means:
- GA4 events tables (incremental with 3-day lookback for late-arriving events)
- Ad platform daily extracts (incremental by date, with a 30-day lookback)
- Transaction logs
Important: Incremental models are more complex. They require careful thought about your incremental strategy, late-arriving data, and edge cases. Don’t use incremental until table materialization becomes a bottleneck.
The one exception: views for real-time freshness
Sometimes data genuinely needs to be fresh within minutes, not hours. Live campaign dashboards during a product launch. Real-time inventory levels. Fraud detection.
In these cases, views make sense because you’re accepting the query-time compute cost in exchange for freshness.
But this is rare. Most analytics use cases tolerate hourly or daily latency just fine. Don’t optimize for real-time unless you actually need it.
Configuration in dbt_project.yml
```yaml
models:
  my_project:
    +materialized: table        # Default everything to table
    base:
      +schema: base
      ga4:
        +materialized: incremental          # High volume
        +incremental_strategy: insert_overwrite
    intermediate:
      +schema: intermediate
    marts:
      +schema: marts
```
Set the default to table, then override only where needed.
YAML Organization: One File Per Directory
Three approaches exist for organizing your .yml files:
1. Per-directory: _stripe__sources.yml and _stripe__models.yml in each folder
2. Per-model: one YAML file per SQL file
3. Monolithic: one giant schema.yml at the project root
Option 3 is a nightmare. Don’t do it. A 2000-line schema.yml is unsearchable, unmaintainable, and generates merge conflicts constantly.
Option 2 (per-model) is gaining popularity in large projects. It scales well because each model is self-contained. The downside: more files, and you can’t use YAML anchors to reduce repetition.
I recommend option 1: per-directory. It’s the official dbt Labs recommendation, it balances discoverability with maintainability, and it works well for projects up to several hundred models.
The per-directory pattern
```
base/stripe/
├── _stripe__sources.yml     # Source definitions
├── _stripe__models.yml      # Model configs, tests, docs
├── base__stripe__payment.sql
└── base__stripe__customer.sql
```
Why the underscore prefix? Files starting with _ sort to the top of the directory. Your YAML files appear before your SQL files.
Why include the directory name? Faster fuzzy-finding in editors. Searching for stripe__models is faster than searching for _models and scanning results.
What goes where
_sources.yml:
- Source definitions (sources:)
- Source freshness tests
- Source-level documentation
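For example, a minimal _stripe__sources.yml with a freshness check might look like this (the database, schema, and loaded_at_field are assumptions for a Fivetran-style load):

```yaml
version: 2

sources:
  - name: stripe
    database: raw              # assumed landing database
    schema: stripe
    loaded_at_field: _fivetran_synced
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: payment
      - name: customer
```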
_models.yml:
- Model-level configurations
- Column tests (unique, not_null, relationships, accepted_values)
- Column descriptions
- Contracts (if using)
Keep them separate. Mixing source and model definitions in one file creates confusion.
Beyond models/: The Full Project Structure
A dbt project is more than just models. The complete structure:
```
my_project/
├── analyses/
│   └── migration_validation.sql
├── dbt_project.yml
├── macros/
│   ├── _macros.yml
│   ├── generate_schema_name.sql
│   └── marketing/
│       ├── channel_grouping.sql
│       └── attribution_weight.sql
├── models/
│   ├── base/
│   ├── intermediate/
│   └── marts/
├── packages.yml
├── seeds/
│   ├── _seeds.yml
│   ├── channel_mapping.csv
│   └── utm_campaign_mapping.csv
├── snapshots/
│   ├── _snapshots.yml
│   └── snap__customer.sql
└── tests/
    └── assert_attribution_sums_to_one.sql
```
macros/
Macros are reusable SQL snippets (functions you can call from any model).
Organization: Group by domain or function.
```
macros/
├── _macros.yml               # Document all macros
├── generate_schema_name.sql
├── grant_permissions.sql
└── marketing/
    ├── channel_grouping.sql
    └── session_timeout.sql
```
Critical: Always document your macros in _macros.yml. Include the purpose, arguments, and an example. Future you will thank present you.
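For context, here’s a hedged sketch of what a channel_grouping macro body might look like; the mapping rules are illustrative, not a definitive implementation:

```sql
-- macros/marketing/channel_grouping.sql (illustrative sketch)
{% macro channel_grouping(source, medium) %}
    case
        when {{ medium }} = 'cpc' and {{ source }} in ('google', 'bing') then 'Paid Search'
        when {{ medium }} = 'organic' then 'Organic Search'
        when {{ source }} in ('facebook', 'instagram', 'linkedin') and {{ medium }} = 'paid' then 'Paid Social'
        when {{ source }} in ('facebook', 'instagram', 'linkedin') then 'Organic Social'
        when {{ medium }} = 'email' then 'Email'
        else 'Other'
    end
{% endmacro %}
```

The corresponding entry in _macros.yml documents its interface: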
```yaml
macros:
  - name: channel_grouping
    description: |
      Maps UTM source/medium combinations to standardized channel groups.
      Returns: 'Paid Search', 'Organic Search', 'Paid Social', etc.
    arguments:
      - name: source
        type: string
        description: The traffic source (utm_source or ga source)
      - name: medium
        type: string
        description: The traffic medium (utm_medium or ga medium)
```
seeds/
Seeds are CSV files that dbt loads into your warehouse. Use them for static lookup tables that don’t exist in any source system.
Good use cases:
- UTM → campaign mappings
- Channel grouping rules
- Country/region codes
- Internal IP addresses to exclude from analytics
- Employee email → customer ID mappings (for filtering internal usage)
Bad use cases:
- Loading actual data. If it exists in a system you have access to, use a proper EL tool.
- Large datasets. Seeds are loaded entirely on every dbt seed run.
Marketing examples:
```csv
source,medium,channel_group
google,cpc,Paid Search
google,organic,Organic Search
facebook,paid,Paid Social
facebook,referral,Organic Social
```
snapshots/
Snapshots create Type 2 slowly changing dimension records, preserving historical states when source data changes.
Use cases:
- Customer attributes that change over time (tier, status, owner)
- Campaign settings history
- Product pricing history
- Any dimension where you need to answer “what was the value at time X?”
Naming: snap__[entity] → snap__customer.sql
As of dbt 1.9+, you can define snapshots in YAML files instead of SQL, which simplifies the syntax considerably.
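For example, a YAML-defined snapshot might look like this (source, schema, and column names are illustrative):

```yaml
# snapshots/_snapshots.yml (dbt 1.9+ YAML syntax)
snapshots:
  - name: snap__customer
    relation: source('crm', 'customer')
    config:
      schema: snapshots
      unique_key: customer_id
      strategy: timestamp
      updated_at: updated_at
```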
tests/
dbt has two types of tests:
Generic tests (defined in YAML) are your primary testing mechanism:
```yaml
models:
  - name: int__session
    columns:
      - name: session_id
        tests:
          - unique
          - not_null
      - name: ga_client_id
        tests:
          - not_null
          - relationships:
              to: ref('base__ga4__event')
              field: ga_client_id
```
Singular tests (SQL files in tests/) are for complex validations spanning multiple models:
```sql
-- tests/assert_attribution_sums_to_one.sql
-- Each conversion's attribution weights should sum to 1.0

SELECT conversion_id
FROM {{ ref('mrt__marketing__channel_attribution') }}
GROUP BY conversion_id
HAVING ABS(SUM(attribution_weight) - 1.0) > 0.001
```
Minimum standard: Every model should have its primary key tested for unique and not_null. This catches a surprising number of issues.
analyses/
The analyses/ folder stores SQL queries that benefit from Jinja templating and version control but don’t create warehouse objects.
Use cases:
- Migration validation queries
- One-off audit queries
- Ad-hoc analysis you want to preserve
- Queries you run manually but want to track in git
Modern dbt Features (1.7+): What to Adopt
dbt has evolved significantly in recent versions. What’s worth adopting in 2026:
Contracts: Yes, for marts
Model contracts define explicit guarantees about a model’s schema (column names, data types, and constraints). When enforced, dbt validates these at build time.
```yaml
models:
  - name: mrt__marketing__campaign_performance
    config:
      contract:
        enforced: true
    columns:
      - name: campaign_id
        data_type: string
        constraints:
          - type: not_null
      - name: date
        data_type: date
        constraints:
          - type: not_null
      - name: spend
        data_type: float64
```
When to use: Public-facing marts consumed by dashboards, downstream systems, or other teams. Contracts act as an API schema that communicates what downstream consumers can rely on.
When to skip: Base and intermediate layers. The ceremony isn’t worth it for internal models that only you consume.
Unit testing: Yes, for complex logic
Native unit testing (dbt 1.8+) lets you test SQL logic against static inputs before materialization.
```yaml
unit_tests:
  - name: test_channel_grouping_paid_search
    model: int__session
    given:
      - input: ref('base__ga4__event')
        rows:
          - {session_id: 'abc', source: 'google', medium: 'cpc'}
    expect:
      rows:
        - {session_id: 'abc', channel_group: 'Paid Search'}
```
When to use:
- Complex CASE statements
- Date/time calculations
- Attribution logic
- Channel grouping rules
- Any business logic where edge cases matter
When to skip: Simple transformations where the logic is obvious.
Groups and access modifiers: Yes, even for single projects
Groups organize models under common ownership and enable access control.
```yaml
groups:
  - name: marketing
    owner:
      email: marketing-data@company.com
  - name: finance
    owner:
      email: finance-data@company.com

models:
  my_project:
    marts:
      marketing:
        +group: marketing
        +access: public
      finance:
        +group: finance
        +access: public
```
Why bother in a single project?
- Documentation. Groups explicitly record who owns what.
- Future-proofing. If you ever migrate to dbt Mesh, the structure is already in place.
- Boundaries. Access modifiers (private, protected, public) enforce that teams don’t create unintended dependencies.
dbt Mesh: Know when you need it
dbt Mesh enables splitting a monolithic project into multiple interconnected projects. It’s powerful, but it’s also complex.
Signs you might need it:
- 500+ models with slow CI
- Multiple teams stepping on each other
- Security requirements for data isolation
- Different deployment cadences for different domains
For most projects: A single project with groups is sufficient. Don’t adopt Mesh complexity until you have Mesh problems.
dbt-project-evaluator: Yes, for enforcement
The dbt-project-evaluator package automatically audits your project against best practices. It checks for:
- Missing primary key tests
- Models without descriptions
- Direct source references in marts (should go through base)
- Naming convention violations
- And dozens more rules
Add it to your packages.yml:
```yaml
packages:
  - package: dbt-labs/dbt_project_evaluator
    version: 0.8.1
```
Run dbt build --select package:dbt_project_evaluator to see violations.
Pro tip: Customize the rules to match your conventions. The package expects stg_ prefixes by default; you can configure it for base__ patterns.
Marketing Analytics: Practical Patterns
Time to ground everything in concrete marketing analytics examples.
GA4 modeling structure
```
models/
├── base/ga4/
│   ├── _ga4__sources.yml
│   ├── _ga4__models.yml
│   └── base__ga4__event.sql                          # Unnest, cast, filter debug
├── intermediate/
│   ├── int__session.sql                              # Sessionization logic
│   ├── int__session__session_lj_conversion.sql       # Session enriched
│   └── int__conversion.sql                           # Conversion events
└── marts/marketing/
    ├── mrt__marketing__session.sql                   # Session-level metrics
    ├── mrt__marketing__user_journey.sql              # Full user paths
    └── mrt__marketing__conversion_funnel.sql         # Funnel analysis
```
Key considerations:
- GA4 exports sharded tables (events_YYYYMMDD). Use {{ source('ga4', 'event') }} with wildcard table configuration.
- Event parameters are nested. Unnest them in the base layer.
- Sessionization requires a timeout rule (typically 30 minutes). Implement in intermediate.
- Use incremental models with a 3-day lookback because GA4 events can arrive late (see the sketch below).
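A sketch of how that incremental setup might look for the GA4 base model, assuming the ga4 source is defined against the sharded events_* tables via a wildcard (columns and the exact lookback are illustrative):

```sql
-- Illustrative incremental configuration for base__ga4__event (BigQuery)
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={'field': 'event_date', 'data_type': 'date'}
    )
}}

select
    parse_date('%Y%m%d', event_date) as event_date,
    timestamp_micros(event_timestamp) as event_timestamp,
    event_name,
    user_pseudo_id as ga_client_id
from {{ source('ga4', 'event') }}

{% if is_incremental() %}
-- reprocess a 3-day window so late-arriving events are picked up
where _table_suffix >= format_date('%Y%m%d', date_sub(current_date(), interval 3 day))
{% endif %}
```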
Multi-platform ad spend consolidation
The challenge: Google Ads, Meta, LinkedIn, TikTok all have different schemas and different naming, but the same business need.
Pattern: Normalize in base, unify in intermediate.
```
models/
├── base/
│   ├── google_ads/
│   │   └── base__google_ads__campaign_performance.sql
│   ├── meta_ads/
│   │   └── base__meta_ads__campaign_insights.sql
│   └── linkedin_ads/
│       └── base__linkedin_ads__campaign_analytics.sql
└── intermediate/
    └── int__ad__union.sql
```
The int__ad__union model UNIONs all platforms into a standardized schema:
```sql
-- int__ad__union.sql

SELECT
    date,
    'google_ads' AS platform,
    account_id,
    campaign_id,
    campaign_name,
    impressions,
    clicks,
    spend,
    conversions
FROM {{ ref('base__google_ads__campaign_performance') }}

UNION ALL

SELECT
    date,
    'meta_ads' AS platform,
    account_id,
    campaign_id,
    campaign_name,
    impressions,
    clicks,
    spend,
    conversions
FROM {{ ref('base__meta_ads__campaign_insights') }}

-- ... additional platforms
```
Now every mart that needs ad spend data references this single unified source.
Attribution modeling
Start simple: first-touch and last-touch attribution are table stakes. Add multi-touch when (and if) you need it.
Structure:
```
models/
├── intermediate/
│   ├── int__attribution_touchpoint.sql                   # All touchpoints with timestamps
│   ├── int__attribution_path.sql                         # Ordered paths per conversion
│   └── int__conversion__conversion_lj_touchpoint.sql     # Conversion enriched with paths
└── marts/marketing/
    ├── mrt__marketing__first_touch_attribution.sql
    ├── mrt__marketing__last_touch_attribution.sql
    └── mrt__marketing__multi_touch_attribution.sql
```
The int__attribution_touchpoint model gathers all touchpoints (sessions, ad clicks, email opens) with timestamps and user identifiers. The int__attribution_path model orders them into sequences. The int__conversion__conversion_lj_touchpoint model joins conversions with their attribution paths.
Marts then apply different attribution logic to the same underlying data, making it easy to compare models.
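As an illustration, first-touch attribution over that structure could be as simple as ranking touchpoints per conversion; the column names here are assumptions:

```sql
-- mrt__marketing__first_touch_attribution.sql (illustrative sketch)
with touchpoints as (

    select * from {{ ref('int__conversion__conversion_lj_touchpoint') }}

),

ranked as (

    select
        *,
        row_number() over (
            partition by conversion_id
            order by touchpoint_at asc
        ) as touchpoint_rank
    from touchpoints

)

-- keep only the earliest touchpoint per conversion, then roll up
select
    channel_group,
    campaign_id,
    count(*) as conversions,
    sum(conversion_value) as attributed_revenue
from ranked
where touchpoint_rank = 1
group by channel_group, campaign_id
```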
Handling incrementality in marketing data
Different data types need different incremental strategies:
| Data | Strategy | Lookback | Why |
|---|---|---|---|
| GA4 events | insert_overwrite by date | 3 days | Late-arriving events |
| Ad platform metrics | insert_overwrite by date | 30 days | Attribution windows update retroactively |
| Conversions | Full refresh or careful incremental | N/A | Attribution can change for weeks |
The lookback window matters. Ad platforms recalculate attribution as more data arrives: a conversion attributed to Campaign A on day 1 might shift to Campaign B by day 30. Build this into your incremental logic.
Common Mistakes and How to Avoid Them
Mistake: Skipping the intermediate layer
Symptom: Marts with 10+ JOINs, repeated logic across models, impossible to debug.
Fix: Extract shared logic into entity-based intermediate models. If multiple marts need session data, create int__session once.
Mistake: Organizing base by loader, not source
Wrong: fivetran/, stitch/, airbyte/
Right: stripe/, ga4/, hubspot/
Your loader is an implementation detail. Your source is the actual data.
Mistake: Business logic in the base layer
Symptom: Complex CASE statements, hardcoded channel mappings, business rules in staging models.
Fix: Base should be mechanical transformations only. Business logic belongs in intermediate or marts.
Mistake: Monolithic YAML files
Symptom: One 2000-line schema.yml at the project root.
Fix: One YAML file per directory. _stripe__models.yml in base/stripe/, _marketing__models.yml in marts/marketing/.
Mistake: Over-nesting folders
Wrong: models/staging/external/stripe/payments/v2/current/
Right: Keep folder depth ≤ 3 levels. Flat is better than nested.
Mistake: Ephemeral everything
Symptom: Can’t debug, can’t query intermediate results, no visibility into transformation logic.
Fix: Tables are cheap. Visibility is priceless. Materialize as tables by default.
Mistake: Skipping tests
Symptom: Duplicate primary keys, null values in NOT NULL columns, broken relationships, all discovered in production.
Fix: At minimum, test every primary key for unique and not_null. Add relationship tests for foreign keys.
Putting It All Together: A Complete Example
A complete marketing analytics project structure applying everything covered above:
```
marketing_analytics/
├── dbt_project.yml
├── packages.yml
├── macros/
│   ├── _macros.yml
│   └── marketing/
│       ├── channel_grouping.sql
│       └── session_timeout_minutes.sql
├── models/
│   ├── base/
│   │   ├── ga4/
│   │   │   ├── _ga4__sources.yml
│   │   │   ├── _ga4__models.yml
│   │   │   └── base__ga4__event.sql
│   │   ├── google_ads/
│   │   │   ├── _google_ads__sources.yml
│   │   │   ├── _google_ads__models.yml
│   │   │   └── base__google_ads__campaign_performance.sql
│   │   └── meta_ads/
│   │       ├── _meta_ads__sources.yml
│   │       ├── _meta_ads__models.yml
│   │       └── base__meta_ads__campaign_insight.sql
│   ├── intermediate/
│   │   ├── _int__models.yml
│   │   ├── session/
│   │   │   ├── int__session.sql
│   │   │   └── int__session__session_lj_conversion.sql
│   │   ├── int__conversion.sql
│   │   ├── int__ad__union.sql
│   │   └── int__attribution_touchpoint.sql
│   └── marts/
│       └── marketing/
│           ├── _marketing__models.yml
│           ├── mrt__marketing__session.sql
│           ├── mrt__marketing__campaign_performance.sql
│           ├── mrt__marketing__channel_attribution.sql
│           └── mrt__marketing__daily_ad_spend.sql
├── seeds/
│   ├── _seeds.yml
│   └── channel_mapping.csv
├── snapshots/
│   └── snap__campaign_setting.sql
└── tests/
    └── assert_attribution_weights_valid.sql
```
And the key configuration in dbt_project.yml:
```yaml
name: marketing_analytics
version: '1.0.0'

vars:
  session_timeout_minutes: 30
  attribution_lookback_days: 30

models:
  marketing_analytics:
    +materialized: table
    base:
      +schema: base
      ga4:
        +materialized: incremental
        +incremental_strategy: insert_overwrite
    intermediate:
      +schema: intermediate
    marts:
      +schema: marts
      marketing:
        +group: marketing
        +access: public

groups:
  - name: marketing
    owner:
      email: marketing-data@company.com
```
Conclusion
The best dbt project structure is one where anyone can find any model in under 10 seconds. Everything else (naming conventions, layer definitions, materialization strategies) serves that goal.
Key takeaways:
- base → intermediate → marts. Three layers. Alphabetical order matches lineage order. Each layer has clear rules about what belongs there.
- Entity-based naming with self-documenting joins. int__session for pure entities, int__customer__customer_lj_order for enriched ones. The name tells you what’s inside.
- No fct_/dim_ prefixes. We’ve moved past dimensional modeling. Mart names like mrt__marketing__session are clearer.
- Tables everywhere. Storage is cheap. Debugging visibility is expensive. Materialize as tables by default, incremental when volume demands it.
- Consistency is king. Pick conventions, document them, enforce them with dbt-project-evaluator. The specific conventions matter less than applying them consistently.
If you’re starting a new project, adopt these patterns from day one. If you’re inheriting an existing project, pick your battles and don’t try to refactor everything at once. Start with naming conventions in new models, then gradually migrate the rest.
Whatever conventions you choose, write them down. A CONTRIBUTING.md in your dbt project root that documents your structure decisions is worth more than any best practices guide, including this one.