ServicesAboutNotesContact Get in touch →
EN FR
Note

dbt Identity Resolution Pipeline

Production dbt DAG structure for GA4 identity resolution — the incremental identity mapping model, stitched events model, schema tests, and the 3-day lookback window for late-arriving data.

Planted
ga4bigquerydbtdata modelingdata engineeringincremental processing

This note covers the production dbt DAG structure for GA4 identity resolution. Two models are involved: an incremental identity mapping table (maintained via merge) and a stitched events table (maintained via insert_overwrite). Both use a 3-day lookback window for late-arriving data.

DAG structure

The identity resolution layer sits between your raw GA4 source and your session/attribution marts:

sources/ga4/events
├── int__ga4__identity_mapping (incremental, merge)
└── int__ga4__events_stitched (incremental, insert_overwrite)
├── int__ga4__sessions
└── mrt__analytics__user_journey

Two models, two different incremental strategies, for two different reasons:

  • Identity mapping is small (one row per device that ever authenticated) and updates by merging new or updated rows. merge strategy is the right choice.
  • Stitched events is large (every event in your history) and updates by reprocessing recent date partitions. insert_overwrite is the right choice.

The identity mapping model

-- models/intermediate/ga4/int__ga4__identity_mapping.sql
{{
config(
materialized='incremental',
unique_key='user_pseudo_id',
incremental_strategy='merge',
partition_by={
'field': 'last_seen_date',
'data_type': 'date',
'granularity': 'day'
},
cluster_by=['user_pseudo_id']
)
}}
WITH source_events AS (
SELECT
user_pseudo_id,
user_id,
event_timestamp,
PARSE_DATE('%Y%m%d', event_date) AS event_date
FROM {{ source('ga4', 'events') }}
WHERE user_pseudo_id IS NOT NULL
AND user_id IS NOT NULL
AND user_id NOT IN ('null', 'undefined', 'none', '')
{% if is_incremental() %}
AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}
),
aggregated AS (
SELECT
user_pseudo_id,
ARRAY_AGG(DISTINCT user_id IGNORE NULLS) AS all_user_ids,
ARRAY_AGG(user_id ORDER BY event_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS latest_user_id,
MAX(event_timestamp) AS last_seen_timestamp,
MAX(event_date) AS last_seen_date
FROM source_events
GROUP BY user_pseudo_id
)
SELECT
user_pseudo_id,
latest_user_id AS resolved_user_id,
all_user_ids,
ARRAY_LENGTH(all_user_ids) AS user_id_count,
ARRAY_LENGTH(all_user_ids) > 1 AS has_multiple_users,
TIMESTAMP_MICROS(last_seen_timestamp) AS last_seen_at,
last_seen_date
FROM aggregated

The 3-day lookback window in is_incremental() handles late-arriving GA4 events. GA4 can deliver events up to 72 hours after they occur; the 3-day window ensures they get captured in the identity mapping.

The user_id filter removes known garbage values before they pollute the mapping. Extend this list based on your own user_id audit.

resolved_user_id takes the most recent user_id as the canonical identity for the device. all_user_ids preserves the full history for auditability. has_multiple_users flags shared devices for downstream handling.

The stitched events model

-- models/intermediate/ga4/int__ga4__events_stitched.sql
{{
config(
materialized='incremental',
incremental_strategy='insert_overwrite',
partition_by={
'field': 'event_date',
'data_type': 'date',
'granularity': 'day'
},
cluster_by=['resolved_user_id', 'session_id']
)
}}
WITH events AS (
SELECT
user_pseudo_id,
user_id,
event_name,
event_timestamp,
event_params,
PARSE_DATE('%Y%m%d', event_date) AS event_date,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
FROM {{ source('ga4', 'events') }}
WHERE TRUE
{% if is_incremental() %}
AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}
),
identity_mapping AS (
SELECT
user_pseudo_id,
resolved_user_id,
has_multiple_users
FROM {{ ref('int__ga4__identity_mapping') }}
)
SELECT
e.user_pseudo_id,
e.user_id,
e.event_name,
e.event_timestamp,
e.event_date,
-- Identity resolution
COALESCE(m.resolved_user_id, e.user_pseudo_id) AS resolved_user_id,
m.resolved_user_id IS NOT NULL AS is_identified,
COALESCE(m.has_multiple_users, FALSE) AS is_shared_device,
-- Session identifier (composite to prevent collisions)
e.ga_session_id,
CONCAT(
COALESCE(m.resolved_user_id, e.user_pseudo_id),
'_',
CAST(e.ga_session_id AS STRING)
) AS session_id
FROM events e
LEFT JOIN identity_mapping m USING (user_pseudo_id)

The COALESCE(m.resolved_user_id, e.user_pseudo_id) fallback ensures every event has a non-null resolved_user_id. Anonymous events use their device ID as their identity rather than becoming null — which would make all anonymous events look like they belong to one user.

The session_id construction uses the stitched identity, not the raw user_pseudo_id. This means sessions from the same person on the same device before and after authentication share a session namespace, which matters for accurate session counting.

is_shared_device surfaces the shared-device flag so downstream marts can handle these events appropriately — either filtering them or weighting their contribution lower.

Schema tests

Tests for the identity mapping model focus on uniqueness and the upper bound guard:

models/intermediate/ga4/_int__ga4__models.yml
models:
- name: int__ga4__identity_mapping
columns:
- name: user_pseudo_id
tests:
- unique
- not_null
- name: resolved_user_id
tests:
- not_null
- name: user_id_count
tests:
- dbt_utils.accepted_range:
min_value: 1
max_value: 100 # Alert if a device maps to implausibly many users
- name: int__ga4__events_stitched
columns:
- name: session_id
tests:
- not_null
- name: resolved_user_id
tests:
- not_null

The accepted_range test on user_id_count is the automated guard against the tracking bugs that cause identity explosion. A device mapping to more than 100 user_id values is almost certainly a string null bug or test account issue, not a real user. The test failing in CI catches this before it corrupts your production mapping.

Why merge for mapping, insert_overwrite for events

Incremental strategy choice matters here because the two models have different update patterns.

The identity mapping has user_pseudo_id as its unique key. When a device that mapped to user_id = A last month now maps to user_id = B (account change, shared device), you want the row to be updated in place. merge on unique_key = 'user_pseudo_id' does exactly this.

The stitched events table has no single unique key — it’s a fact table with billions of rows. Merging at that scale is prohibitively expensive. Instead, insert_overwrite replaces recent date partitions wholesale. When new identity mappings resolve events that were previously anonymous, the 3-day lookback window reprocesses those partitions with the updated mapping — correcting the stitching retroactively.

This is why the identity mapping model must run before the stitched events model in your DAG. The ref('int__ga4__identity_mapping') dependency enforces this in dbt automatically.