A single authenticated user can have three, four, or more user_pseudo_id values in your GA4 data: one per phone, one from their work laptop, one from a shared family computer. The identity graph captures all of these device relationships and maps them back to a single business identity.
The graph has two shapes you need to maintain: user-centric (all devices for a given user_id) and device-centric (the user_id for a given user_pseudo_id). Queries need both directions.
The user-centric graph
The optimal production schema stores devices as a STRUCT array nested inside each user row. This avoids expensive joins for the common question “what devices does this user have?”:
CREATE OR REPLACE TABLE `project.dataset.identity_graph` ASWITH device_user_pairs AS ( SELECT user_id, user_pseudo_id, MIN(event_timestamp) AS first_seen, MAX(event_timestamp) AS last_seen, COUNT(*) AS event_count FROM `project.analytics_XXXXX.events_*` WHERE user_id IS NOT NULL AND user_pseudo_id IS NOT NULL AND user_id NOT IN ('null', 'undefined', 'none', '') GROUP BY user_id, user_pseudo_id)
SELECT user_id, ARRAY_AGG( STRUCT( user_pseudo_id, TIMESTAMP_MICROS(first_seen) AS first_seen, TIMESTAMP_MICROS(last_seen) AS last_seen, event_count ) ORDER BY last_seen DESC LIMIT 100 -- Guard against snowballing from tracking bugs ) AS devices, COUNT(DISTINCT user_pseudo_id) AS device_count, MIN(TIMESTAMP_MICROS(first_seen)) AS user_first_seen, MAX(TIMESTAMP_MICROS(last_seen)) AS user_last_activeFROM device_user_pairsGROUP BY user_idThe LIMIT 100 on the array aggregation is not arbitrary — it’s a safety constraint. If a user_id accumulates thousands of devices, something is wrong: the implementation team set a reset device ID on every page load, a test account is being reused, or a bug is generating fresh user_pseudo_id values continuously. Capping at 100 prevents those bugs from ballooning your identity graph and degrading query performance. Monitoring for high device_count values is how you catch those bugs. See Identity Resolution Monitoring for the SQL.
The device-centric reverse mapping
For event-level stitching, you need the reverse direction: given a user_pseudo_id, what is the resolved user_id? The device-centric mapping is a flat table you join against your events:
CREATE OR REPLACE TABLE `project.dataset.device_to_user_mapping` ASSELECT device.user_pseudo_id, user_id, device.first_seen, device.last_seenFROM `project.dataset.identity_graph`,UNNEST(devices) AS deviceThis is the table that powers full backstitching. Join it to your events on user_pseudo_id, and every event from a device that ever authenticated carries the resolved identity:
SELECT e.event_date, e.user_pseudo_id, COALESCE(m.user_id, e.user_pseudo_id) AS resolved_user_id, CASE WHEN m.user_id IS NOT NULL THEN 'identified' ELSE 'anonymous' END AS identity_status, e.event_name, e.event_timestampFROM `project.analytics_XXXXX.events_*` eLEFT JOIN `project.dataset.device_to_user_mapping` m USING (user_pseudo_id)The identity_status column is worth persisting in your stitched events model. It lets you understand what fraction of events have resolved identities and track changes over time.
Handling multiple user_id values per device
Real users sometimes appear under multiple user_id values from the same device: account merges after acquisition, email changes, separate work and personal accounts on the same browser, or test accounts sharing a session. Your graph needs a clear decision rule for which user_id wins.
Three common approaches:
-- Most recent: the current identity takes precedenceARRAY_AGG(user_id ORDER BY last_seen DESC LIMIT 1)[SAFE_OFFSET(0)]
-- First seen: stable identity, useful for acquisition cohortsARRAY_AGG(user_id ORDER BY first_seen ASC LIMIT 1)[SAFE_OFFSET(0)]
-- Most active: the identity with the most observed eventsARRAY_AGG(user_id ORDER BY event_count DESC LIMIT 1)[SAFE_OFFSET(0)]Most recent is the most defensible default — it reflects the current state of the account. Document your choice; it’s a business decision, not a technical one. Future analysts need to know why the stitching works the way it does.
Detecting shared devices
A user_pseudo_id that maps to multiple user_id values indicates a shared device: a family computer, a retail kiosk, or a team machine where multiple people log into your application. Naive backstitching on shared devices misattributes one user’s anonymous sessions to whoever last logged in.
Detecting them is straightforward:
SELECT user_pseudo_id, COUNT(DISTINCT user_id) AS distinct_users, ARRAY_AGG(DISTINCT user_id IGNORE NULLS ORDER BY user_id) AS user_idsFROM `project.analytics_XXXXX.events_*`WHERE user_id IS NOT NULLGROUP BY user_pseudo_idHAVING COUNT(DISTINCT user_id) > 1ORDER BY distinct_users DESCLIMIT 100Once detected, add a flag to your mapping table:
SELECT user_pseudo_id, COUNT(DISTINCT user_id) AS user_count, COUNT(DISTINCT user_id) > 3 AS is_shared_deviceFROM `project.analytics_XXXXX.events_*`WHERE user_id IS NOT NULLGROUP BY user_pseudo_idThe threshold of 3 is a pragmatic choice — a real shared device might have 2 users, but a genuinely broken implementation typically generates dozens. In your stitched events model, surface is_shared_device as a column so downstream models can filter or weight these events appropriately.
Cookie resets and orphaned sessions
Users who clear cookies, switch to incognito, or encounter Safari ITP restrictions generate fresh user_pseudo_id values. If they later authenticate on the new device identifier, the mapping table reconnects them. But anonymous sessions before the cookie reset remain permanently orphaned — there’s no identifier to link them back.
Browser privacy restrictions mean this isn’t rare. ITP caps cookies at 7 days in many configurations. You can track fragmentation rates to understand how much identity loss you’re experiencing:
SELECT device_count, COUNT(*) AS users, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pctFROM ( SELECT user_id, COUNT(DISTINCT user_pseudo_id) AS device_count FROM `project.analytics_XXXXX.events_*` WHERE user_id IS NOT NULL GROUP BY user_id)GROUP BY device_countORDER BY device_countA user with 5 user_pseudo_id values could mean cross-device usage (normal) or aggressive cookie clearing (lost pre-reset sessions). The distribution tells you how widespread the pattern is; it doesn’t tell you which cause. Use it alongside your stitch rate metric as a signal of overall identity health.
Limits of the identity graph
The graph only captures what GA4 observed. Anonymous users who never authenticate stay anonymous. Sessions before a user’s first login on a given device stay orphaned unless the user authenticates and the mapping table connects them retroactively.
For connecting anonymous users to CRM records or business outcomes, the Customer 360 identity resolution layer goes further — capturing form submissions with hidden user_pseudo_id fields, matching on transaction IDs, and building deterministic linkages that don’t require a GA4 login event.