This note covers the production dbt DAG structure for GA4 identity resolution. Two models are involved: an incremental identity mapping table (maintained via merge) and a stitched events table (maintained via insert_overwrite). Both use a 3-day lookback window for late-arriving data.
DAG structure
The identity resolution layer sits between your raw GA4 source and your session/attribution marts:
sources/ga4/events │ ├── int__ga4__identity_mapping (incremental, merge) │ └── int__ga4__events_stitched (incremental, insert_overwrite) │ ├── int__ga4__sessions │ └── mrt__analytics__user_journeyTwo models, two different incremental strategies, for two different reasons:
- Identity mapping is small (one row per device that ever authenticated) and updates by merging new or updated rows.
mergestrategy is the right choice. - Stitched events is large (every event in your history) and updates by reprocessing recent date partitions.
insert_overwriteis the right choice.
The identity mapping model
-- models/intermediate/ga4/int__ga4__identity_mapping.sql
{{ config( materialized='incremental', unique_key='user_pseudo_id', incremental_strategy='merge', partition_by={ 'field': 'last_seen_date', 'data_type': 'date', 'granularity': 'day' }, cluster_by=['user_pseudo_id'] )}}
WITH source_events AS ( SELECT user_pseudo_id, user_id, event_timestamp, PARSE_DATE('%Y%m%d', event_date) AS event_date FROM {{ source('ga4', 'events') }} WHERE user_pseudo_id IS NOT NULL AND user_id IS NOT NULL AND user_id NOT IN ('null', 'undefined', 'none', '') {% if is_incremental() %} AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) {% endif %}),
aggregated AS ( SELECT user_pseudo_id, ARRAY_AGG(DISTINCT user_id IGNORE NULLS) AS all_user_ids, ARRAY_AGG(user_id ORDER BY event_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS latest_user_id, MAX(event_timestamp) AS last_seen_timestamp, MAX(event_date) AS last_seen_date FROM source_events GROUP BY user_pseudo_id)
SELECT user_pseudo_id, latest_user_id AS resolved_user_id, all_user_ids, ARRAY_LENGTH(all_user_ids) AS user_id_count, ARRAY_LENGTH(all_user_ids) > 1 AS has_multiple_users, TIMESTAMP_MICROS(last_seen_timestamp) AS last_seen_at, last_seen_dateFROM aggregatedThe 3-day lookback window in is_incremental() handles late-arriving GA4 events. GA4 can deliver events up to 72 hours after they occur; the 3-day window ensures they get captured in the identity mapping.
The user_id filter removes known garbage values before they pollute the mapping. Extend this list based on your own user_id audit.
resolved_user_id takes the most recent user_id as the canonical identity for the device. all_user_ids preserves the full history for auditability. has_multiple_users flags shared devices for downstream handling.
The stitched events model
-- models/intermediate/ga4/int__ga4__events_stitched.sql
{{ config( materialized='incremental', incremental_strategy='insert_overwrite', partition_by={ 'field': 'event_date', 'data_type': 'date', 'granularity': 'day' }, cluster_by=['resolved_user_id', 'session_id'] )}}
WITH events AS ( SELECT user_pseudo_id, user_id, event_name, event_timestamp, event_params, PARSE_DATE('%Y%m%d', event_date) AS event_date, (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id FROM {{ source('ga4', 'events') }} WHERE TRUE {% if is_incremental() %} AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) {% endif %}),
identity_mapping AS ( SELECT user_pseudo_id, resolved_user_id, has_multiple_users FROM {{ ref('int__ga4__identity_mapping') }})
SELECT e.user_pseudo_id, e.user_id, e.event_name, e.event_timestamp, e.event_date,
-- Identity resolution COALESCE(m.resolved_user_id, e.user_pseudo_id) AS resolved_user_id, m.resolved_user_id IS NOT NULL AS is_identified, COALESCE(m.has_multiple_users, FALSE) AS is_shared_device,
-- Session identifier (composite to prevent collisions) e.ga_session_id, CONCAT( COALESCE(m.resolved_user_id, e.user_pseudo_id), '_', CAST(e.ga_session_id AS STRING) ) AS session_id
FROM events eLEFT JOIN identity_mapping m USING (user_pseudo_id)The COALESCE(m.resolved_user_id, e.user_pseudo_id) fallback ensures every event has a non-null resolved_user_id. Anonymous events use their device ID as their identity rather than becoming null — which would make all anonymous events look like they belong to one user.
The session_id construction uses the stitched identity, not the raw user_pseudo_id. This means sessions from the same person on the same device before and after authentication share a session namespace, which matters for accurate session counting.
is_shared_device surfaces the shared-device flag so downstream marts can handle these events appropriately — either filtering them or weighting their contribution lower.
Schema tests
Tests for the identity mapping model focus on uniqueness and the upper bound guard:
models: - name: int__ga4__identity_mapping columns: - name: user_pseudo_id tests: - unique - not_null - name: resolved_user_id tests: - not_null - name: user_id_count tests: - dbt_utils.accepted_range: min_value: 1 max_value: 100 # Alert if a device maps to implausibly many users
- name: int__ga4__events_stitched columns: - name: session_id tests: - not_null - name: resolved_user_id tests: - not_nullThe accepted_range test on user_id_count is the automated guard against the tracking bugs that cause identity explosion. A device mapping to more than 100 user_id values is almost certainly a string null bug or test account issue, not a real user. The test failing in CI catches this before it corrupts your production mapping.
Why merge for mapping, insert_overwrite for events
Incremental strategy choice matters here because the two models have different update patterns.
The identity mapping has user_pseudo_id as its unique key. When a device that mapped to user_id = A last month now maps to user_id = B (account change, shared device), you want the row to be updated in place. merge on unique_key = 'user_pseudo_id' does exactly this.
The stitched events table has no single unique key — it’s a fact table with billions of rows. Merging at that scale is prohibitively expensive. Instead, insert_overwrite replaces recent date partitions wholesale. When new identity mappings resolve events that were previously anonymous, the 3-day lookback window reprocesses those partitions with the updated mapping — correcting the stitching retroactively.
This is why the identity mapping model must run before the stitched events model in your DAG. The ref('int__ga4__identity_mapping') dependency enforces this in dbt automatically.