ServicesAboutNotesContact Get in touch →
EN FR
Note

Customer 360 dbt DAG Architecture

How to structure a dbt project for Customer 360 models — the identity resolution layer between base and mart, the wide customer table, and materialization choices.

Planted
dbtbigqueryga4data modelingdata engineering

A Customer 360 model unifies data about a single customer from multiple source systems into one row. The challenge isn’t the final wide table — it’s building the identity bridge that connects systems with no shared key. The standard three-layer pattern (base, intermediate, mart) needs an explicit identity resolution layer to make this work.

The DAG structure

The architecture adds an identity layer between base and mart models:

Sources: ga4.events → base__ga4__events
crm.contacts → base__crm__contacts
crm.deals → base__crm__deals
Identity: base__ga4 + base__crm → int__identity_resolved
→ int__identity_device_bridged
Marts: int__identity + base__ga4 → mrt__marketing__web_sessions
mrt__web_sessions + base__deals → mrt__marketing__attributed_touchpoints
all sources + identity → mrt__core__customer_360

This is the same three layers, not four. The identity models are intermediate models — they join base models together and add computed fields without changing the fundamental grain (one row per identity pair). They just happen to sit in a subdirectory that makes their special purpose obvious.

Base models

Base models do the standard work: flatten the GA4 event_params (UNNEST patterns for UTMs, page locations, session IDs), rename CRM columns to business-friendly names, and filter _fivetran_deleted records. One model per source table, materialized as views or tables.

For GA4 events specifically, the base model extracts the key fields needed for identity resolution:

-- base__ga4__events.sql
SELECT
event_date,
event_timestamp,
event_name,
user_pseudo_id,
user_id,
(SELECT value.string_value
FROM UNNEST(event_params)
WHERE key = 'page_location') AS page_location,
(SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'ga_session_id') AS ga_session_id,
privacy_info.analytics_storage AS consent__analytics_storage,
-- ... other fields
FROM {{ source('ga4', 'events') }}
WHERE (SELECT value.int_value
FROM UNNEST(event_params)
WHERE key = 'ga_session_id') IS NOT NULL

Note the privacy_info.analytics_storage extraction. This becomes critical when your models need to respect consent requirements — only events with granted consent should flow into identity resolution that links to CRM data.

The identity layer

This is the new piece that standard dbt projects don’t have. Two models handle it:

int__identity_resolved joins form submission data (where user_pseudo_id was captured alongside an email), login events (where user_id was set), and CRM contact records. The output is a mapping table: one row per known identity pair.

-- int__identity_resolved.sql
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key=['identifier_type', 'identifier_value']
) }}
WITH form_identities AS (
-- user_pseudo_id → email → crm_contact_id
SELECT DISTINCT
user_pseudo_id,
crm.contact_id AS crm_contact_id,
'form_submission' AS identity_source,
form.submitted_at AS identified_at
FROM {{ ref('base__form_submissions') }} form
INNER JOIN {{ ref('base__crm__contacts') }} crm
ON LOWER(TRIM(form.email)) = LOWER(TRIM(crm.email))
WHERE form.user_pseudo_id IS NOT NULL
),
login_identities AS (
-- user_id → user_pseudo_id, linked via backstitching
SELECT DISTINCT
events.user_pseudo_id,
crm.contact_id AS crm_contact_id,
'login' AS identity_source,
events.event_timestamp AS identified_at
FROM {{ ref('base__ga4__events') }} events
INNER JOIN {{ ref('base__crm__contacts') }} crm
ON events.user_id = crm.external_id
WHERE events.user_id IS NOT NULL
),
all_identities AS (
SELECT * FROM form_identities
UNION ALL
SELECT * FROM login_identities
)
SELECT
user_pseudo_id,
crm_contact_id,
identity_source,
identified_at
FROM all_identities
QUALIFY ROW_NUMBER() OVER (
PARTITION BY user_pseudo_id
ORDER BY identified_at DESC
) = 1

int__identity_device_bridged fans this out so you can go from any identifier to the canonical customer_id. If you use dbt_utils.generate_surrogate_key(), hash a combination of source system and source ID:

{{ dbt_utils.generate_surrogate_key(['source_system', 'source_id']) }} AS customer__surrogate_key

These models use the merge incremental strategy since identity mappings are relatively small (at most one row per known identity pair) but update frequently as new form submissions and logins create new mappings.

Mart models

Mart models consume the identity bridge. The key model is mrt__core__customer_360 — the wide table that combines contact information from the CRM, deal aggregates, engagement metrics, and web analytics summaries into a single row per customer.

-- mrt__core__customer_360.sql
{{ config(
materialized='incremental',
incremental_strategy='merge',
unique_key='customer_id'
) }}
SELECT
identity.crm_contact_id AS customer_id,
-- CRM contact fields (priority-based resolution)
COALESCE(crm.email, product.email) AS customer__email,
COALESCE(crm.company_name, product.company_name) AS customer__company_name,
crm.job_title AS customer__job_title,
-- Deal aggregates
deals.total_deal_value,
deals.deals_won,
deals.latest_deal_closed_at,
-- Web analytics summaries
web.total_sessions,
web.total_page_views,
web.first_seen_at,
web.last_seen_at,
web.days_active
FROM {{ ref('int__identity_device_bridged') }} identity
LEFT JOIN {{ ref('base__crm__contacts') }} crm
ON identity.crm_contact_id = crm.contact_id
LEFT JOIN {{ ref('int__deal_aggregates') }} deals
ON identity.crm_contact_id = deals.contact_id
LEFT JOIN {{ ref('int__web_analytics_summary') }} web
ON identity.user_pseudo_id = web.user_pseudo_id

The conflict resolution patterns (COALESCE for priority-based, recency comparisons, source-specific fields) all happen in this mart or in the intermediate models that feed it.

mrt__core__customer_360 also uses merge, keyed on the canonical customer_id. It’s a wide table but rarely exceeds a few million rows, so full refreshes are an option if merge costs become a concern.

Materialization choices

The different model types in the DAG have different materialization needs:

ModelStrategyRationale
base__ga4__eventsinsert_overwrite, 3-day lookbackGA4 events arrive late; re-processing last 3 partitions catches stragglers
int__identity_resolvedmerge on identifier pairSmall table, frequent updates, needs deduplication
int__identity_device_bridgedmerge on identifier pairSame pattern as identity_resolved
mrt__core__customer_360merge on customer_idWide but small; full refresh is viable fallback
mrt__marketing__web_sessionsinsert_overwrite on session dateTime-partitioned fact table

For CRM mart tables, partition by date fields (created_at, closed_at) and cluster by frequently filtered columns (account_id, owner_id). Clustering alone can reduce scanned data by 88% or more for incremental models on BigQuery.

The JaffleGaggle reference pattern

The dbt Labs “JaffleGaggle” reference implementation follows a similar architecture: define entities (contacts, accounts), extract email domains, flag personal vs. corporate emails, use seed files for edge cases that need human judgment (like shared email domains that aren’t really corporate), then aggregate usage data per entity into wide mart tables.

The seed file pattern is worth noting. Some identity resolution edge cases — gmail.com is personal, acme.com is corporate, but freelancer.com could be either — require human judgment that doesn’t belong in SQL. A seed file with these classifications keeps the logic auditable and maintainable.

Connecting to attribution

With identity resolution in place, you can build a touchpoint timeline per user from GA4 session data (with source, medium, and campaign from the traffic source fields), then join to CRM deal conversions using the resolved_user_id from your identity bridge and a conversion window.

The mrt__marketing__attributed_touchpoints model bridges web sessions and CRM deals:

SELECT
sessions.resolved_user_id,
sessions.session_date,
sessions.session_source,
sessions.session_medium,
deals.deal_id,
deals.deal_amount,
deals.closed_at
FROM {{ ref('mrt__marketing__web_sessions') }} sessions
INNER JOIN {{ ref('int__identity_device_bridged') }} identity
ON sessions.user_pseudo_id = identity.user_pseudo_id
INNER JOIN {{ ref('base__crm__deals') }} deals
ON identity.crm_contact_id = deals.contact_id
WHERE sessions.session_date
BETWEEN DATE_SUB(deals.closed_at, INTERVAL 90 DAY)
AND deals.closed_at

Apply attribution models (first-touch, last-touch, linear, time-decay) on the touchpoint table this produces, aggregate by channel or campaign, and you get ROI by source.