dbt Identity Resolution Pipeline

This note covers the production dbt DAG structure for GA4 identity resolution. Two models are involved: an incremental identity mapping table (maintained via merge) and a stitched events table (maintained via insert_overwrite). Both use a 3-day lookback window for late-arriving data.

DAG structure

The identity resolution layer sits between your raw GA4 source and your session/attribution marts:

sources/ga4/events
    │
    ├── int__ga4__identity_mapping     (incremental, merge)
    │
    └── int__ga4__events_stitched      (incremental, insert_overwrite)
            │
            ├── int__ga4__sessions
            │
            └── mrt__analytics__user_journey

Two models, two different incremental strategies, for two different reasons:

Identity mapping is small (one row per device that ever authenticated) and updates by merging new or updated rows. merge strategy is the right choice.
Stitched events is large (every event in your history) and updates by reprocessing recent date partitions. insert_overwrite is the right choice.

The identity mapping model

-- models/intermediate/ga4/int__ga4__identity_mapping.sql

{{
    config(
        materialized='incremental',
        unique_key='user_pseudo_id',
        incremental_strategy='merge',
        partition_by={
            'field': 'last_seen_date',
            'data_type': 'date',
            'granularity': 'day'
        },
        cluster_by=['user_pseudo_id']
    )
}}

WITH source_events AS (
    SELECT
        user_pseudo_id,
        user_id,
        event_timestamp,
        PARSE_DATE('%Y%m%d', event_date) AS event_date
    FROM {{ source('ga4', 'events') }}
    WHERE user_pseudo_id IS NOT NULL
      AND user_id IS NOT NULL
      AND user_id NOT IN ('null', 'undefined', 'none', '')
    {% if is_incremental() %}
      AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
    {% endif %}
),

aggregated AS (
    SELECT
        user_pseudo_id,
        ARRAY_AGG(DISTINCT user_id IGNORE NULLS) AS all_user_ids,
        ARRAY_AGG(user_id ORDER BY event_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS latest_user_id,
        MAX(event_timestamp) AS last_seen_timestamp,
        MAX(event_date) AS last_seen_date
    FROM source_events
    GROUP BY user_pseudo_id
)

SELECT
    user_pseudo_id,
    latest_user_id AS resolved_user_id,
    all_user_ids,
    ARRAY_LENGTH(all_user_ids) AS user_id_count,
    ARRAY_LENGTH(all_user_ids) > 1 AS has_multiple_users,
    TIMESTAMP_MICROS(last_seen_timestamp) AS last_seen_at,
    last_seen_date
FROM aggregated

The 3-day lookback window in is_incremental() handles late-arriving GA4 events. GA4 can deliver events up to 72 hours after they occur; the 3-day window ensures they get captured in the identity mapping.

The user_id filter removes known garbage values before they pollute the mapping. Extend this list based on your own user_id audit.

resolved_user_id takes the most recent user_id as the canonical identity for the device. all_user_ids preserves the full history for auditability. has_multiple_users flags shared devices for downstream handling.

The stitched events model

-- models/intermediate/ga4/int__ga4__events_stitched.sql

{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            'field': 'event_date',
            'data_type': 'date',
            'granularity': 'day'
        },
        cluster_by=['resolved_user_id', 'session_id']
    )
}}

WITH events AS (
    SELECT
        user_pseudo_id,
        user_id,
        event_name,
        event_timestamp,
        event_params,
        PARSE_DATE('%Y%m%d', event_date) AS event_date,
        (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
    FROM {{ source('ga4', 'events') }}
    WHERE TRUE
    {% if is_incremental() %}
      AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
    {% endif %}
),

identity_mapping AS (
    SELECT
        user_pseudo_id,
        resolved_user_id,
        has_multiple_users
    FROM {{ ref('int__ga4__identity_mapping') }}
)

SELECT
    e.user_pseudo_id,
    e.user_id,
    e.event_name,
    e.event_timestamp,
    e.event_date,

    -- Identity resolution
    COALESCE(m.resolved_user_id, e.user_pseudo_id) AS resolved_user_id,
    m.resolved_user_id IS NOT NULL AS is_identified,
    COALESCE(m.has_multiple_users, FALSE) AS is_shared_device,

    -- Session identifier (composite to prevent collisions)
    e.ga_session_id,
    CONCAT(
        COALESCE(m.resolved_user_id, e.user_pseudo_id),
        '_',
        CAST(e.ga_session_id AS STRING)
    ) AS session_id

FROM events e
LEFT JOIN identity_mapping m USING (user_pseudo_id)

The COALESCE(m.resolved_user_id, e.user_pseudo_id) fallback ensures every event has a non-null resolved_user_id. Anonymous events use their device ID as their identity rather than becoming null — which would make all anonymous events look like they belong to one user.

The session_id construction uses the stitched identity, not the raw user_pseudo_id. This means sessions from the same person on the same device before and after authentication share a session namespace, which matters for accurate session counting.

is_shared_device surfaces the shared-device flag so downstream marts can handle these events appropriately — either filtering them or weighting their contribution lower.

Schema tests

Tests for the identity mapping model focus on uniqueness and the upper bound guard:

models:
  - name: int__ga4__identity_mapping
    columns:
      - name: user_pseudo_id
        tests:
          - unique
          - not_null
      - name: resolved_user_id
        tests:
          - not_null
      - name: user_id_count
        tests:
          - dbt_utils.accepted_range:
              min_value: 1
              max_value: 100  # Alert if a device maps to implausibly many users

  - name: int__ga4__events_stitched
    columns:
      - name: session_id
        tests:
          - not_null
      - name: resolved_user_id
        tests:
          - not_null

The accepted_range test on user_id_count is the automated guard against the tracking bugs that cause identity explosion. A device mapping to more than 100 user_id values is almost certainly a string null bug or test account issue, not a real user. The test failing in CI catches this before it corrupts your production mapping.

Why `merge` for mapping, `insert_overwrite` for events

Incremental strategy choice matters here because the two models have different update patterns.

The identity mapping has user_pseudo_id as its unique key. When a device that mapped to user_id = A last month now maps to user_id = B (account change, shared device), you want the row to be updated in place. merge on unique_key = 'user_pseudo_id' does exactly this.

The stitched events table has no single unique key — it’s a fact table with billions of rows. Merging at that scale is prohibitively expensive. Instead, insert_overwrite replaces recent date partitions wholesale. When new identity mappings resolve events that were previously anonymous, the 3-day lookback window reprocesses those partitions with the updated mapping — correcting the stitching retroactively.

This is why the identity mapping model must run before the stitched events model in your DAG. The ref('int__ga4__identity_mapping') dependency enforces this in dbt automatically.

DAG structure

The identity mapping model

The stitched events model

Schema tests

Why merge for mapping, insert_overwrite for events

Why `merge` for mapping, `insert_overwrite` for events