ServicesAboutNotesContact Get in touch →
EN FR
Note

GA4 User Backstitching

How to retroactively apply GA4 user_id to anonymous sessions in the warehouse — the SQL pattern, shared device handling, and when backstitching is worth the complexity.

Planted
ga4bigquerydbtdata modelinganalytics

GA4 applies user_id to all events within the current session when it’s set, but it does not retroactively apply it to previous sessions. If a visitor browses your site anonymously for three sessions before logging in on session four, sessions one through three remain anonymous in GA4’s data — even though you now know who that person is.

This means your pre-login browsing data is disconnected from the person who eventually converted. For Customer 360 models and marketing attribution, this gap is significant. The visitor’s entire discovery journey — the blog posts they read, the pricing page visits, the feature comparisons — stays attached to an anonymous user_pseudo_id that your CRM knows nothing about.

User backstitching fixes this in the warehouse by using a simple principle: if we ever observed a user_id associated with a given user_pseudo_id, apply it to all events from that device.

The core SQL pattern

The technique requires two steps: build a lookup of known identity pairs, then join it back to all events.

WITH known_identities AS (
SELECT DISTINCT
user_id,
user_pseudo_id
FROM {{ ref('base__ga4__events') }}
WHERE user_id IS NOT NULL
)
SELECT
e.event_date,
e.event_timestamp,
e.event_name,
e.user_pseudo_id,
-- Prefer the event's own user_id; fall back to the backstitched value
COALESCE(e.user_id, k.user_id) AS resolved_user_id
FROM {{ ref('base__ga4__events') }} e
LEFT JOIN known_identities k
ON k.user_pseudo_id = e.user_pseudo_id

The COALESCE ensures that if an event already has a user_id set (because the user was logged in at the time), the original value is preserved. The backstitched value only fills in events where user_id was null.

The resolved_user_id column becomes the bridge to your CRM data. Once you know the user_id for a user_pseudo_id, you can trace that device’s entire browsing history — including the anonymous sessions that preceded login.

Handling shared devices

The main edge case is shared devices. If two people use the same browser, the same user_pseudo_id could map to multiple user_id values. Without handling this, the known_identities CTE produces a fan-out: one user_pseudo_id row joins to multiple identity rows, duplicating events.

The safest approach is to take the most recent mapping:

WITH known_identities AS (
SELECT
user_id,
user_pseudo_id
FROM {{ ref('base__ga4__events') }}
WHERE user_id IS NOT NULL
QUALIFY ROW_NUMBER() OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp DESC
) = 1
)

This says: for each device, the most recent person to log in is the one we associate with all events. It’s a heuristic, not a perfect solution. Events from the other person get misattributed. But shared devices are relatively rare in practice (especially on mobile), and the alternative — not backstitching at all — loses far more data than the shared device edge case corrupts.

For teams that need higher accuracy, a time-bounded approach assigns events to the user who was logged in closest in time:

WITH known_identities AS (
SELECT
user_id,
user_pseudo_id,
event_timestamp AS identified_at,
LEAD(event_timestamp) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
) AS next_identification_at
FROM {{ ref('base__ga4__events') }}
WHERE user_id IS NOT NULL
)
SELECT
e.*,
COALESCE(
e.user_id,
k.user_id
) AS resolved_user_id
FROM {{ ref('base__ga4__events') }} e
LEFT JOIN known_identities k
ON k.user_pseudo_id = e.user_pseudo_id
AND e.event_timestamp <= k.identified_at
AND (k.next_identification_at IS NULL
OR e.event_timestamp > COALESCE(
LAG(k.identified_at) OVER (
PARTITION BY k.user_pseudo_id
ORDER BY k.identified_at
),
TIMESTAMP('1970-01-01')
))

This is more complex and harder to maintain. The simple most-recent approach works for the vast majority of use cases.

Where backstitching fits in the dbt DAG

Backstitching belongs in the identity resolution layer, after base models and before mart models. In the Customer 360 DAG, it typically lives in or feeds into the int__identity_resolved model:

base__ga4__events
int__ga4__events_with_resolved_identity (backstitching happens here)
int__identity_resolved (combines all identity sources)
mrt__core__customer_360

The backstitched events model adds resolved_user_id to every GA4 event. Downstream models — sessionized events, touchpoint timelines, attribution models — all consume this resolved identity rather than the raw user_id field.

Materialization

The known_identities lookup is small (at most one row per device that ever had a logged-in user), but the join against the full events table is expensive. Materialize the backstitched events model as an incremental model using insert_overwrite with a lookback window:

{{ config(
materialized='incremental',
incremental_strategy='insert_overwrite',
partition_by={
'field': 'event_date',
'data_type': 'date'
},
partitions=['CURRENT_DATE()',
'DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)',
'DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)']
) }}

A 3-day lookback handles late-arriving GA4 events. The identity lookup should still scan the full table (or at minimum, the last 90 days) to catch identities established weeks before the current processing window.

When backstitching matters — and when it doesn’t

Backstitching is most valuable when:

  • You have a significant gap between a visitor’s first touchpoint and their first login (common in B2B SaaS with long evaluation cycles)
  • Your attribution model needs the pre-login browsing history to credit awareness channels
  • You’re building a Customer 360 that should include the full journey, not just post-login behavior

Backstitching adds less value when:

  • Users log in on first visit (e.g., a product with mandatory authentication)
  • Your analytics focus is on post-login product usage, not marketing attribution
  • Cookie lifespans are very short due to ITP/browser restrictions, so user_pseudo_id values don’t persist long enough to be worth stitching

The technique is also limited by the cross-device problem. Backstitching connects anonymous sessions to a known user on the same device. It cannot link sessions across devices — that requires the user to log in on each device independently. For cross-device identity, you need the full identity resolution layer combining form submissions, logins, and potentially transaction IDs.