ServicesAboutNotesContact Get in touch →
EN FR
Note

GA4 Identity Graph in BigQuery

How to build a production identity graph from GA4 BigQuery data — mapping user_id to all associated devices, detecting shared devices and anomalies, and structuring forward and reverse lookups.

Planted
ga4bigquerydbtanalyticsdata modeling

A single authenticated user can have three, four, or more user_pseudo_id values in your GA4 data: one per phone, one from their work laptop, one from a shared family computer. The identity graph captures all of these device relationships and maps them back to a single business identity.

The graph has two shapes you need to maintain: user-centric (all devices for a given user_id) and device-centric (the user_id for a given user_pseudo_id). Queries need both directions.

The user-centric graph

The optimal production schema stores devices as a STRUCT array nested inside each user row. This avoids expensive joins for the common question “what devices does this user have?”:

CREATE OR REPLACE TABLE `project.dataset.identity_graph` AS
WITH device_user_pairs AS (
SELECT
user_id,
user_pseudo_id,
MIN(event_timestamp) AS first_seen,
MAX(event_timestamp) AS last_seen,
COUNT(*) AS event_count
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
AND user_pseudo_id IS NOT NULL
AND user_id NOT IN ('null', 'undefined', 'none', '')
GROUP BY user_id, user_pseudo_id
)
SELECT
user_id,
ARRAY_AGG(
STRUCT(
user_pseudo_id,
TIMESTAMP_MICROS(first_seen) AS first_seen,
TIMESTAMP_MICROS(last_seen) AS last_seen,
event_count
)
ORDER BY last_seen DESC
LIMIT 100 -- Guard against snowballing from tracking bugs
) AS devices,
COUNT(DISTINCT user_pseudo_id) AS device_count,
MIN(TIMESTAMP_MICROS(first_seen)) AS user_first_seen,
MAX(TIMESTAMP_MICROS(last_seen)) AS user_last_active
FROM device_user_pairs
GROUP BY user_id

The LIMIT 100 on the array aggregation is not arbitrary — it’s a safety constraint. If a user_id accumulates thousands of devices, something is wrong: the implementation team set a reset device ID on every page load, a test account is being reused, or a bug is generating fresh user_pseudo_id values continuously. Capping at 100 prevents those bugs from ballooning your identity graph and degrading query performance. Monitoring for high device_count values is how you catch those bugs. See Identity Resolution Monitoring for the SQL.

The device-centric reverse mapping

For event-level stitching, you need the reverse direction: given a user_pseudo_id, what is the resolved user_id? The device-centric mapping is a flat table you join against your events:

CREATE OR REPLACE TABLE `project.dataset.device_to_user_mapping` AS
SELECT
device.user_pseudo_id,
user_id,
device.first_seen,
device.last_seen
FROM `project.dataset.identity_graph`,
UNNEST(devices) AS device

This is the table that powers full backstitching. Join it to your events on user_pseudo_id, and every event from a device that ever authenticated carries the resolved identity:

SELECT
e.event_date,
e.user_pseudo_id,
COALESCE(m.user_id, e.user_pseudo_id) AS resolved_user_id,
CASE
WHEN m.user_id IS NOT NULL THEN 'identified'
ELSE 'anonymous'
END AS identity_status,
e.event_name,
e.event_timestamp
FROM `project.analytics_XXXXX.events_*` e
LEFT JOIN `project.dataset.device_to_user_mapping` m
USING (user_pseudo_id)

The identity_status column is worth persisting in your stitched events model. It lets you understand what fraction of events have resolved identities and track changes over time.

Handling multiple user_id values per device

Real users sometimes appear under multiple user_id values from the same device: account merges after acquisition, email changes, separate work and personal accounts on the same browser, or test accounts sharing a session. Your graph needs a clear decision rule for which user_id wins.

Three common approaches:

-- Most recent: the current identity takes precedence
ARRAY_AGG(user_id ORDER BY last_seen DESC LIMIT 1)[SAFE_OFFSET(0)]
-- First seen: stable identity, useful for acquisition cohorts
ARRAY_AGG(user_id ORDER BY first_seen ASC LIMIT 1)[SAFE_OFFSET(0)]
-- Most active: the identity with the most observed events
ARRAY_AGG(user_id ORDER BY event_count DESC LIMIT 1)[SAFE_OFFSET(0)]

Most recent is the most defensible default — it reflects the current state of the account. Document your choice; it’s a business decision, not a technical one. Future analysts need to know why the stitching works the way it does.

Detecting shared devices

A user_pseudo_id that maps to multiple user_id values indicates a shared device: a family computer, a retail kiosk, or a team machine where multiple people log into your application. Naive backstitching on shared devices misattributes one user’s anonymous sessions to whoever last logged in.

Detecting them is straightforward:

SELECT
user_pseudo_id,
COUNT(DISTINCT user_id) AS distinct_users,
ARRAY_AGG(DISTINCT user_id IGNORE NULLS ORDER BY user_id) AS user_ids
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_pseudo_id
HAVING COUNT(DISTINCT user_id) > 1
ORDER BY distinct_users DESC
LIMIT 100

Once detected, add a flag to your mapping table:

SELECT
user_pseudo_id,
COUNT(DISTINCT user_id) AS user_count,
COUNT(DISTINCT user_id) > 3 AS is_shared_device
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_pseudo_id

The threshold of 3 is a pragmatic choice — a real shared device might have 2 users, but a genuinely broken implementation typically generates dozens. In your stitched events model, surface is_shared_device as a column so downstream models can filter or weight these events appropriately.

Users who clear cookies, switch to incognito, or encounter Safari ITP restrictions generate fresh user_pseudo_id values. If they later authenticate on the new device identifier, the mapping table reconnects them. But anonymous sessions before the cookie reset remain permanently orphaned — there’s no identifier to link them back.

Browser privacy restrictions mean this isn’t rare. ITP caps cookies at 7 days in many configurations. You can track fragmentation rates to understand how much identity loss you’re experiencing:

SELECT
device_count,
COUNT(*) AS users,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct
FROM (
SELECT
user_id,
COUNT(DISTINCT user_pseudo_id) AS device_count
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_id
)
GROUP BY device_count
ORDER BY device_count

A user with 5 user_pseudo_id values could mean cross-device usage (normal) or aggressive cookie clearing (lost pre-reset sessions). The distribution tells you how widespread the pattern is; it doesn’t tell you which cause. Use it alongside your stitch rate metric as a signal of overall identity health.

Limits of the identity graph

The graph only captures what GA4 observed. Anonymous users who never authenticate stay anonymous. Sessions before a user’s first login on a given device stay orphaned unless the user authenticates and the mapping table connects them retroactively.

For connecting anonymous users to CRM records or business outcomes, the Customer 360 identity resolution layer goes further — capturing form submissions with hidden user_pseudo_id fields, matching on transaction IDs, and building deterministic linkages that don’t require a GA4 login event.