ServicesAboutNotesContact Get in touch →
EN FR
Note

GA4 Identity Stitching Techniques

The four SQL patterns for resolving GA4 anonymous-to-known user identity — last-touch, first-touch, full backstitch, and session-scoped — with a decision framework for choosing between them.

Planted
ga4bigqueryanalyticsdata modeling

GA4’s BigQuery export ships raw event-level data: user_pseudo_id (the device cookie) and user_id (your business identifier, when you’ve set it). When a user browses anonymously then logs in on a different device days later, GA4 treats them as separate people. Building your own stitching layer in BigQuery is the only way to recover the connected picture.

The appropriate technique depends on whether backward-in-time resolution is needed, how conservative the false-match tolerance is, and what downstream analysis will consume the stitched data.

The four techniques

Last-touch attribution (window function approach)

Last-touch assigns the most recent known user_id to all historical events from the same user_pseudo_id. This handles the typical flow: anonymous browsing followed by a login event.

WITH events_with_last_touch AS (
SELECT
event_date,
user_pseudo_id,
user_id,
event_name,
event_timestamp,
LAST_VALUE(user_id IGNORE NULLS) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS stitched_user_id
FROM `project.analytics_XXXXX.events_*`
WHERE _table_suffix BETWEEN '20250101' AND '20250131'
)
SELECT
event_date,
user_pseudo_id,
COALESCE(stitched_user_id, user_pseudo_id) AS resolved_user_id,
event_name,
TIMESTAMP_MICROS(event_timestamp) AS event_time
FROM events_with_last_touch

The IGNORE NULLS clause is essential. Without it, LAST_VALUE returns the current row’s value — null whenever user_id is absent. The window frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ensures you’re looking at all prior events in the partition.

The critical limitation: this only stitches events before the identification moment, within the date range you’re querying. If a user authenticated in December and you’re querying January, their January anonymous events won’t get stitched unless December data is in the same query. For true cross-session backstitching, you need the mapping table approach.

Full backstitching via mapping table

For complete identity resolution — applying a user_id to all events from a device regardless of when authentication occurred — build a mapping table first, then join it to events. This is what GA4 User Backstitching covers in depth.

The pattern:

  1. Extract all (user_pseudo_id, user_id) pairs observed in your data where user_id IS NOT NULL
  2. Resolve conflicts (multiple user_id values per device) with a consistent decision rule
  3. Join the resolved mapping back to your full events table

The result: every event from a device that ever authenticated carries the resolved user_id, regardless of whether that event happened before or after the login. Past, present, and future sessions all get connected.

First-touch attribution

First-touch uses FIRST_VALUE instead of LAST_VALUE, assigning the earliest known user_id to all events. The window frame must span the entire partition:

FIRST_VALUE(user_id IGNORE NULLS) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_known_user_id

Use first-touch when you’re analyzing acquisition: you want to attribute a user to their original authenticated identity, not a later account change. If someone changes their email (and therefore their user_id) six months in, first-touch ties all their events to the original identity, which is usually more useful for cohort analysis.

Last-touch is better when current identity matters — personalization, support case context, current account status.

Session-scoped stitching

Session-scoped stitching limits identity resolution to events within the same session. It’s the most conservative approach and a good starting point for validating your logic before committing to full backstitching.

WITH session_events AS (
SELECT
*,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
FROM `project.analytics_XXXXX.events_*`
),
stitched AS (
SELECT
*,
FIRST_VALUE(user_id IGNORE NULLS) OVER (
PARTITION BY user_pseudo_id, ga_session_id
ORDER BY event_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS session_user_id
FROM session_events
)
SELECT
*,
COALESCE(session_user_id, user_pseudo_id) AS resolved_user_id,
CONCAT(
COALESCE(session_user_id, user_pseudo_id),
'_',
CAST(ga_session_id AS STRING)
) AS unique_session_id
FROM stitched

This handles the case where a user logs in partway through a session — the login event provides the user_id, and it’s applied to all events in that session including those that came before. Note the unique_session_id construction: ga_session_id alone is not globally unique (it’s a Unix timestamp), so it must always be combined with the user identifier. See GA4 Session Key Construction for why this matters.

Choosing the right technique

TechniqueScopeRisk levelBest for
Last-touch (window)Forward-looking within query windowLowBasic analytics, recent behavior, ad hoc analysis
First-touchAll eventsMediumAcquisition attribution, cohort analysis
Full backstitchAll eventsHigherCross-device analysis, LTV calculations, Customer 360
Session-scopedSingle sessionLowestTesting stitching logic, conservative reporting

Start with session-scoped stitching to validate your SQL and check for shared device issues. Once you’ve confirmed data quality, move to full backstitching for production.

The risk rating reflects data integrity risk, not implementation complexity. Full backstitching is “higher risk” not because it’s harder to write, but because an incorrect mapping table (from logout bugs, shared devices, or garbage user_id values) silently corrupts every downstream model that uses the stitched identity. The user_id data quality checks belong before you build the mapping table, not after.

The COALESCE fallback pattern

All four techniques share a common finishing step: when stitching fails (no user_id was ever observed for a device), fall back to user_pseudo_id as the identity:

COALESCE(resolved_user_id, user_pseudo_id) AS final_user_id

This ensures every event has a non-null identity. Anonymous users remain separate identities — tracked by their device ID rather than a business ID — rather than being collapsed into a single null group. Don’t force probabilistic matching to close this gap; the false positives from fingerprinting with GA4’s available data (device category, browser, country) are worse than accepting a lower identification rate. See Probabilistic Matching Limitations in GA4 for why.

Where stitching lives in the dbt DAG

All four techniques produce a resolved_user_id column that replaces raw user_id in downstream models. In a production dbt pipeline, the stitching layer sits between your base GA4 events model and your session/attribution marts:

base__ga4__events
int__ga4__identity_mapping (the mapping table — build this first)
int__ga4__events_stitched (apply mapping to events)
int__ga4__sessions
mrt__analytics__user_journey

The identity mapping model uses merge incremental strategy (small table, updates frequently). The stitched events model uses insert_overwrite by date partition (large table, reprocess recent dates).