Adrienne Vermorel

GA4 User Stitching: Handling Anonymous to Known Users

This is part 4 of a 5-part series on GA4 and BigQuery. Previous articles covered the schema reference, unnesting patterns, and session tables.


When users browse your site anonymously then log in on a different device days later, GA4’s BigQuery export treats them as entirely separate people. Unlike the GA4 interface, which applies identity resolution behind the scenes, BigQuery receives raw event data without any stitching. Your user counts are inflated, attribution models are fractured, and customer lifetime value calculations miss the complete picture.

The solution is building your own identity resolution pipeline. Done well, stitching can recover part of the sessions that would otherwise appear fragmented. Done poorly, it creates data integrity nightmares that erode trust in your analytics.

This article walks through the complete implementation: understanding GA4’s identity model, SQL patterns for different stitching approaches, handling edge cases, and building production-ready dbt models.


Understanding GA4’s Two Identity Systems

GA4 tracks users through two fundamentally different identifiers that appear in every BigQuery event row. Understanding their behavior is essential before writing any stitching logic.

user_pseudo_id: The Device Identifier

The user_pseudo_id is GA4’s device-level identifier (essentially the same concept as Universal Analytics’ Client ID). On web, it’s derived from the _ga first-party cookie with a format like 1197596843.1673515099 (random number + timestamp). On mobile apps, it’s the Firebase App Instance ID, unique per installation.

This identifier persists for up to two years by default, but browser privacy features increasingly limit this:

  • Safari’s ITP limits JavaScript-set cookies to roughly 7 days
  • Chrome now caps cookies at 400 days
  • Incognito sessions generate ephemeral IDs

The user_pseudo_id is always present in BigQuery. It’s GA4’s fallback identifier when no authenticated identity exists.

user_id: The Authenticated Identifier

The user_id is your business-provided identifier, sent only when you explicitly implement it:

// gtag.js implementation
gtag('config', 'G-XXXXXXX', {
'user_id': 'USER_12345'
});
// Or via GTM dataLayer
dataLayer.push({
'user_id': 'USER_12345'
});

Unlike user_pseudo_id, GA4 doesn’t generate this automatically. You must capture it from your authentication system. When users aren’t logged in, this field is NULL.

Important: never send PII like email addresses directly. Use hashed or anonymized database IDs instead.

-- Examining identity fields in your data
SELECT
user_pseudo_id,
user_id,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
event_name,
TIMESTAMP_MICROS(event_timestamp) AS event_time
FROM `project.analytics_XXXXX.events_*`
WHERE _table_suffix = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)
LIMIT 100

Why BigQuery Differs from GA4 UI Reports

GA4’s interface applies identity resolution automatically through “Reporting Identity.” The platform offers three modes:

ModeResolution Order
BlendedUser ID → Device ID → Modeling
ObservedUser ID → Device ID
Device-basedDevice ID only

For BigQuery work, the key point is that none of this stitching logic transfers to your export. BigQuery receives raw, event-level data with whatever user_id was present at collection time. No modeling, no cross-device resolution, no Google Signals data.

When comparing GA4 UI to BigQuery queries, expect discrepancies from:

  • Behavioral modeling in the UI (BigQuery doesn’t export modeled data)
  • Google Signals data (never appears in BigQuery)
  • HyperLogLog++ estimation for large cardinality counts in the UI

For analytics engineering work, treat BigQuery as your source of truth and accept that UI numbers will differ.


Stitching Techniques with BigQuery SQL

There’s no single “correct” stitching approach. The right choice depends on your use case, data quality, and tolerance for false matches. Here are the four main patterns.

Last-Touch Attribution

Last-touch attribution assigns the most recent known user_id to all historical events from the same user_pseudo_id. This handles the typical flow: anonymous browsing followed by authentication.

WITH events_with_last_touch AS (
SELECT
event_date,
user_pseudo_id,
user_id,
event_name,
event_timestamp,
LAST_VALUE(user_id IGNORE NULLS) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS stitched_user_id
FROM `project.analytics_XXXXX.events_*`
WHERE _table_suffix BETWEEN '20250101' AND '20250131'
)
SELECT
event_date,
user_pseudo_id,
user_id AS original_user_id,
COALESCE(stitched_user_id, user_pseudo_id) AS resolved_user_id,
event_name,
TIMESTAMP_MICROS(event_timestamp) AS event_time
FROM events_with_last_touch

The IGNORE NULLS clause is essential. Without it, LAST_VALUE returns NULL whenever the current row’s user_id is null. The window frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ensures you’re looking at all prior events in the partition.

This only stitches events before the identification moment. Events after authentication already have user_id populated. If you need to connect a user’s activity across multiple authentication sessions, you need full backstitching.

Full Backstitching via Mapping Table

For complete identity resolution (applying a user_id to all events from a device, regardless of when authentication occurred), build a mapping table first:

-- Step 1: Build the identity mapping
CREATE OR REPLACE TABLE `project.dataset.user_identity_mapping` AS
WITH identity_pairs AS (
SELECT
user_pseudo_id,
user_id,
MAX(event_timestamp) AS last_seen_timestamp
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
AND user_pseudo_id IS NOT NULL
GROUP BY user_pseudo_id, user_id
)
SELECT
user_pseudo_id,
-- Take the most recent user_id if multiple exist
ARRAY_AGG(user_id ORDER BY last_seen_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS resolved_user_id,
ARRAY_AGG(DISTINCT user_id) AS all_user_ids,
MIN(TIMESTAMP_MICROS(last_seen_timestamp)) AS first_identified,
MAX(TIMESTAMP_MICROS(last_seen_timestamp)) AS last_identified
FROM identity_pairs
GROUP BY user_pseudo_id
-- Step 2: Apply stitching via JOIN
SELECT
e.event_date,
e.user_pseudo_id,
e.user_id AS original_user_id,
COALESCE(m.resolved_user_id, e.user_pseudo_id) AS stitched_user_id,
e.event_name
FROM `project.analytics_XXXXX.events_*` e
LEFT JOIN `project.dataset.user_identity_mapping` m
USING (user_pseudo_id)
WHERE e._table_suffix BETWEEN '20250101' AND '20250131'

This applies the identified user_id to every event from that device (past, present, and future), enabling true cross-session analysis.

First-Touch Attribution

First-touch attribution uses FIRST_VALUE instead of LAST_VALUE, assigning the earliest known user_id to all events. This approach makes sense for user acquisition analysis where you want to attribute the user to their original authenticated identity, not account changes later.

SELECT
event_date,
user_pseudo_id,
user_id,
event_name,
event_timestamp,
FIRST_VALUE(user_id IGNORE NULLS) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS first_known_user_id
FROM `project.analytics_XXXXX.events_*`

Use first-touch when analyzing conversion paths from anonymous to known user. Use last-touch when current identity matters most (e.g., personalization, current account status).

Session-Scoped Stitching

Session-scoped stitching limits identity resolution to events within the same session. This conservative approach respects the natural boundary GA4 draws around user behavior.

WITH session_events AS (
SELECT
event_date,
user_pseudo_id,
user_id,
event_name,
event_timestamp,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
FROM `project.analytics_XXXXX.events_*`
WHERE _table_suffix = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)
),
stitched AS (
SELECT
*,
FIRST_VALUE(user_id IGNORE NULLS) OVER (
PARTITION BY user_pseudo_id, ga_session_id
ORDER BY event_timestamp
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS session_user_id
FROM session_events
)
SELECT
event_date,
user_pseudo_id,
user_id,
event_name,
event_timestamp,
ga_session_id,
session_user_id,
COALESCE(session_user_id, user_pseudo_id) AS resolved_user_id,
-- Create truly unique session identifier
CONCAT(
COALESCE(session_user_id, user_pseudo_id),
'_',
CAST(ga_session_id AS STRING)
) AS unique_session_id
FROM stitched

ga_session_id is not globally unique. It’s the Unix timestamp of session start. Always concatenate it with user_pseudo_id (or your stitched ID) to create a truly unique session identifier.

Choosing the Right Approach

TechniqueScopeRisk LevelBest For
Last-touchForward-lookingLowBasic analytics, recent behavior
First-touchAll eventsMediumAcquisition attribution
Full backstitchAll eventsHigherCross-device, LTV analysis
Session-scopedSingle sessionLowestConservative approach, testing

Start with session-scoped stitching to validate your logic, then graduate to full backstitching once you’ve confirmed data quality.


Cross-Device Identity Resolution

User stitching becomes particularly valuable when connecting journeys across devices. A single authenticated user might have three user_pseudo_id values: one from their phone, one from their laptop, and one from a work computer. Your identity mapping table is the foundation for unifying these.

Building a Production Identity Mapping Table

The optimal schema stores the relationship from user_id to all associated devices:

CREATE OR REPLACE TABLE `project.dataset.identity_graph` AS
WITH device_user_pairs AS (
SELECT
user_id,
user_pseudo_id,
MIN(event_timestamp) AS first_seen,
MAX(event_timestamp) AS last_seen,
COUNT(*) AS event_count
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
AND user_pseudo_id IS NOT NULL
GROUP BY user_id, user_pseudo_id
)
SELECT
user_id,
ARRAY_AGG(
STRUCT(
user_pseudo_id,
TIMESTAMP_MICROS(first_seen) AS first_seen,
TIMESTAMP_MICROS(last_seen) AS last_seen,
event_count
)
ORDER BY last_seen DESC
LIMIT 100 -- Prevent snowballing from tracking bugs
) AS devices,
COUNT(DISTINCT user_pseudo_id) AS device_count,
MIN(TIMESTAMP_MICROS(first_seen)) AS user_first_seen,
MAX(TIMESTAMP_MICROS(last_seen)) AS user_last_active
FROM device_user_pairs
GROUP BY user_id

Set a limit on identifiers per user (around 100). If a single user_id accumulates thousands of user_pseudo_id values, something is wrong. Possible causes include a reset device ID on every page load, or a test account being reused.

For querying, you’ll often want the reverse mapping (device → user):

CREATE OR REPLACE TABLE `project.dataset.device_to_user_mapping` AS
SELECT
device.user_pseudo_id,
user_id,
device.first_seen,
device.last_seen
FROM `project.dataset.identity_graph`,
UNNEST(devices) AS device

Handling Users Who Never Authenticate

Not every visitor will log in. Your stitching logic must gracefully handle anonymous users:

SELECT
COALESCE(m.user_id, e.user_pseudo_id) AS resolved_user_id,
CASE
WHEN m.user_id IS NOT NULL THEN 'identified'
ELSE 'anonymous'
END AS identity_status,
e.event_date,
e.user_pseudo_id,
e.user_id,
e.event_name,
e.event_timestamp
FROM `project.analytics_XXXXX.events_*` e
LEFT JOIN `project.dataset.device_to_user_mapping` m
USING (user_pseudo_id)

Accept that anonymous users remain separate identities. Don’t force probabilistic matching to inflate your identified user percentage. The false positives probably aren’t worth it.


Why Probabilistic Matching Fails in GA4

You might consider increasing match rates through fingerprinting (matching users by device characteristics, location, and behavior patterns). This approach has severe limitations with GA4 data.

Missing Data Points

GA4’s BigQuery export intentionally excludes the signals typically used for probabilistic matching:

  • No IP addresses
  • No user agent strings
  • No canvas fingerprints
  • No hardware identifiers

What you have access to:

SELECT DISTINCT
device.category, -- mobile, desktop, tablet
device.operating_system, -- Android, iOS, Windows, etc.
device.browser, -- Chrome, Safari, etc.
device.language,
geo.country,
geo.city,
geo.metro
FROM `project.analytics_XXXXX.events_*`
WHERE _table_suffix = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)

These coarse signals create far too many false positives for production use. How many users in an average city use Chrome on a MacBook? Thousands.

The Cost of False Matches

Incorrectly merged profiles compound over time. When you merge User A’s browsing history with User B’s purchase history:

  • Recommendations become nonsensical
  • Email personalization references products they never viewed
  • Attribution models credit the wrong touchpoints
  • Customer support sees a confusing activity history

Stick to deterministic matching via user_id. The lower match rate is worth the data integrity.


GA4’s Consent Mode V2 fundamentally changes what data reaches BigQuery. When analytics_storage is denied, the behavior depends on your implementation mode.

With Basic Consent Mode, denied consent means no data collection at all. Nothing reaches BigQuery.

With Advanced Consent Mode, GA4 sends “cookieless pings” (events without identifiers). In BigQuery, these appear with:

  • NULL for user_pseudo_id
  • No ga_session_id in event_params
  • privacy_info.analytics_storage = 'No'

These events cannot be stitched or sessionized.

-- Check consent distribution in your data
SELECT
privacy_info.analytics_storage AS consent_status,
COUNT(*) AS events,
COUNT(DISTINCT user_pseudo_id) AS unique_devices,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct_of_total
FROM `project.analytics_XXXXX.events_*`
WHERE _table_suffix >= FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 30)
GROUP BY privacy_info.analytics_storage
ORDER BY events DESC

The Same-Page Backstitch Nuance

If a user grants consent on the same page where they initially denied it, GA4 retroactively applies the Client ID to previously denied hits from that page. This appears correctly in daily tables but may show inconsistencies in intraday tables.

This same-page backstitching doesn’t work across page loads. Once someone navigates away with denied consent, those events stay permanently anonymous.

Practical Approach

For clean analysis, filter to consented data:

SELECT
event_date,
user_pseudo_id,
user_id,
event_name,
event_timestamp
FROM `project.analytics_XXXXX.events_*`
WHERE privacy_info.analytics_storage = 'Yes'
OR privacy_info.analytics_storage IS NULL -- Pre-consent-mode data

Accept that BigQuery numbers will be lower than GA4 UI reports. The UI applies behavioral modeling to estimate what the denied-consent users would have done.


Edge Cases That Break Naive Implementations

Production identity resolution must handle scenarios that simple SQL patterns miss.

Shared Devices

When multiple users authenticate on the same browser, such as a family computer, shared work machine, or retail kiosk, a single user_pseudo_id maps to multiple user_id values. Naive last-touch stitching incorrectly attributes one user’s anonymous browsing to another user.

-- Detect shared devices
SELECT
user_pseudo_id,
COUNT(DISTINCT user_id) AS distinct_users,
ARRAY_AGG(DISTINCT user_id IGNORE NULLS ORDER BY user_id) AS user_ids
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_pseudo_id
HAVING COUNT(DISTINCT user_id) > 1
ORDER BY distinct_users DESC
LIMIT 100

Set a threshold, commonly 3 distinct users, and either:

  • Exclude these devices from cross-session stitching entirely
  • Flag them and only apply session-scoped stitching
  • Weight their contribution lower in aggregate metrics
-- Create a shared device flag for your mapping table
SELECT
user_pseudo_id,
COUNT(DISTINCT user_id) AS user_count,
COUNT(DISTINCT user_id) > 3 AS is_shared_device
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_pseudo_id

Logout Implementation Errors

A common bug is setting user_id to the string 'null', '-', 'none', or 'anonymous' instead of actual null. This makes GA4 treat all logged-out sessions as belonging to a single user named “null.”

-- Find suspicious user_id values
SELECT
user_id,
COUNT(*) AS events,
COUNT(DISTINCT user_pseudo_id) AS devices
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_id
HAVING COUNT(DISTINCT user_pseudo_id) > 1000 -- Suspiciously high
ORDER BY devices DESC

If you find this in your data, work with your implementation team to fix the tagging. In the meantime, filter these values:

WHERE user_id IS NOT NULL
AND user_id NOT IN ('null', 'undefined', 'none', '-', 'anonymous', '')
AND LENGTH(user_id) > 3 -- Filter obvious garbage

Multiple user_id Values per Person

Real users sometimes have multiple authenticated identities:

  • Account merges after acquisition
  • Email address changes
  • Separate work and personal accounts
  • Test accounts used alongside real accounts

Your stitching logic needs a clear decision rule:

-- Option A: Most recent user_id wins
ARRAY_AGG(user_id ORDER BY last_seen DESC LIMIT 1)[SAFE_OFFSET(0)]
-- Option B: First user_id wins (stable identity)
ARRAY_AGG(user_id ORDER BY first_seen ASC LIMIT 1)[SAFE_OFFSET(0)]
-- Option C: Most active user_id wins
ARRAY_AGG(user_id ORDER BY event_count DESC LIMIT 1)[SAFE_OFFSET(0)]

Document your choice and apply it consistently.

Users who clear cookies, use incognito mode, or have Safari ITP restrictions generate new user_pseudo_id values unpredictably. If they later authenticate, your mapping table reconnects these identities. But anonymous sessions before the reset remain orphaned. This is unavoidable without probabilistic matching.

Track fragmentation rates to understand your data quality:

-- Users with multiple devices (potential fragmentation or cross-device)
SELECT
device_count,
COUNT(*) AS users,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct
FROM (
SELECT
user_id,
COUNT(DISTINCT user_pseudo_id) AS device_count
FROM `project.analytics_XXXXX.events_*`
WHERE user_id IS NOT NULL
GROUP BY user_id
)
GROUP BY device_count
ORDER BY device_count

Production Implementation with dbt

For production pipelines, structure your identity resolution as a proper dbt DAG with incremental processing.

The Identity Mapping Model

-- models/intermediate/ga4/int__ga4__identity_mapping.sql
{{
config(
materialized='incremental',
unique_key='user_pseudo_id',
incremental_strategy='merge',
partition_by={
'field': 'last_seen_date',
'data_type': 'date',
'granularity': 'day'
},
cluster_by=['user_pseudo_id']
)
}}
WITH source_events AS (
SELECT
user_pseudo_id,
user_id,
event_timestamp,
PARSE_DATE('%Y%m%d', event_date) AS event_date
FROM {{ source('ga4', 'events') }}
WHERE user_pseudo_id IS NOT NULL
AND user_id IS NOT NULL
AND user_id NOT IN ('null', 'undefined', 'none', '')
{% if is_incremental() %}
AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}
),
aggregated AS (
SELECT
user_pseudo_id,
ARRAY_AGG(DISTINCT user_id IGNORE NULLS) AS all_user_ids,
ARRAY_AGG(user_id ORDER BY event_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS latest_user_id,
MAX(event_timestamp) AS last_seen_timestamp,
MAX(event_date) AS last_seen_date
FROM source_events
GROUP BY user_pseudo_id
)
SELECT
user_pseudo_id,
latest_user_id AS resolved_user_id,
all_user_ids,
ARRAY_LENGTH(all_user_ids) AS user_id_count,
ARRAY_LENGTH(all_user_ids) > 1 AS has_multiple_users,
TIMESTAMP_MICROS(last_seen_timestamp) AS last_seen_at,
last_seen_date
FROM aggregated

The 3-day lookback window handles late-arriving data while keeping incremental runs efficient.

Stitched Events Model

-- models/intermediate/ga4/int__ga4__events_stitched.sql
{{
config(
materialized='incremental',
incremental_strategy='insert_overwrite',
partition_by={
'field': 'event_date',
'data_type': 'date',
'granularity': 'day'
},
cluster_by=['resolved_user_id', 'session_id']
)
}}
WITH events AS (
SELECT
user_pseudo_id,
user_id,
event_name,
event_timestamp,
event_params,
PARSE_DATE('%Y%m%d', event_date) AS event_date_parsed,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
FROM {{ source('ga4', 'events') }}
WHERE TRUE
{% if is_incremental() %}
AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}
),
identity_mapping AS (
SELECT
user_pseudo_id,
resolved_user_id,
has_multiple_users
FROM {{ ref('int__ga4__identity_mapping') }}
)
SELECT
e.user_pseudo_id,
e.user_id,
e.event_name,
e.event_timestamp,
e.event_date_parsed AS event_date,
-- Identity resolution
COALESCE(m.resolved_user_id, e.user_pseudo_id) AS resolved_user_id,
m.resolved_user_id IS NOT NULL AS is_identified,
COALESCE(m.has_multiple_users, FALSE) AS is_shared_device,
-- Session identifiers
e.ga_session_id,
CONCAT(
COALESCE(m.resolved_user_id, e.user_pseudo_id),
'_',
CAST(e.ga_session_id AS STRING)
) AS session_id
FROM events e
LEFT JOIN identity_mapping m
USING (user_pseudo_id)

DAG Structure

sources/ga4/events
├── int__ga4__identity_mapping (incremental, merge)
└── int__ga4__events_stitched (incremental, insert_overwrite)
├── int__ga4__sessions
└── mrt__analytics__user_journey

Schema Tests

models/intermediate/ga4/_int__ga4__models.yml
models:
- name: int__ga4__identity_mapping
columns:
- name: user_pseudo_id
tests:
- unique
- not_null
- name: resolved_user_id
tests:
- not_null
- name: user_id_count
tests:
- dbt_utils.accepted_range:
min_value: 1
max_value: 100 # Alert on snowballing
- name: int__ga4__events_stitched
columns:
- name: session_id
tests:
- not_null
- name: resolved_user_id
tests:
- not_null

Validation and Monitoring

Identity resolution requires ongoing monitoring. Bugs in your stitching logic can silently corrupt months of data.

Key Metrics Dashboard

-- Daily identity resolution health metrics
WITH daily_stats AS (
SELECT
event_date,
COUNT(*) AS total_events,
COUNTIF(is_identified) AS identified_events,
COUNTIF(is_shared_device) AS shared_device_events,
COUNT(DISTINCT resolved_user_id) AS unique_users,
COUNT(DISTINCT user_pseudo_id) AS unique_devices,
COUNT(DISTINCT session_id) AS unique_sessions
FROM {{ ref('int__ga4__events_stitched') }}
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY event_date
)
SELECT
event_date,
total_events,
unique_users,
unique_devices,
unique_sessions,
-- Stitch rate: % of events with resolved identity
ROUND(identified_events * 100.0 / total_events, 2) AS stitch_rate_pct,
-- Device consolidation: how much stitching reduces user count
ROUND((unique_devices - unique_users) * 100.0 / unique_devices, 2) AS consolidation_rate_pct,
-- Shared device exposure
ROUND(shared_device_events * 100.0 / total_events, 2) AS shared_device_pct
FROM daily_stats
ORDER BY event_date DESC

Anomaly Detection

Set up alerts for sudden changes that indicate bugs or implementation issues:

-- Alert: stitch rate dropped more than 10% week-over-week
WITH weekly AS (
SELECT
DATE_TRUNC(event_date, WEEK) AS week,
COUNTIF(is_identified) * 100.0 / COUNT(*) AS stitch_rate
FROM {{ ref('int__ga4__events_stitched') }}
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 WEEK)
GROUP BY 1
)
SELECT
week,
stitch_rate,
LAG(stitch_rate) OVER (ORDER BY week) AS prev_week_rate,
stitch_rate - LAG(stitch_rate) OVER (ORDER BY week) AS change
FROM weekly
HAVING ABS(change) > 10

Data Quality Checks

-- Identify potential data quality issues
SELECT
'Excessive devices per user' AS issue,
COUNT(*) AS affected_users
FROM {{ ref('int__ga4__identity_mapping') }}
WHERE user_id_count > 50
UNION ALL
SELECT
'Suspiciously common user_id' AS issue,
COUNT(DISTINCT user_pseudo_id) AS affected_devices
FROM {{ source('ga4', 'events') }}
WHERE user_id IN (
SELECT user_id
FROM {{ source('ga4', 'events') }}
WHERE user_id IS NOT NULL
GROUP BY user_id
HAVING COUNT(DISTINCT user_pseudo_id) > 500
)

When NOT to Stitch

Identity resolution isn’t always appropriate. Know when to use raw identifiers instead.

Data Integrity Concerns

Don’t stitch when your authentication system has unresolved edge cases:

  • Guest checkout with temporary/recycled IDs
  • Demo accounts used repeatedly by sales teams
  • Employee testing on production
  • Partner or affiliate tracking that reuses IDs

Validate your user_id implementation before building stitching logic on top of it.

Device-Specific Analysis

Some analyses should use raw user_pseudo_id:

  • Performance optimization: Page load times by device/browser
  • Browser compatibility: Feature support and error rates
  • Device category analysis: Mobile vs desktop conversion rates
  • Bot detection: Identifying non-human traffic patterns

Identity resolution creates new personal data linkages. Before implementing:

  • Review whether your privacy policy covers cross-device tracking
  • Confirm consent mechanisms address identity resolution
  • Consider data retention implications of the identity graph
  • Document the stitching logic for compliance purposes

This isn’t legal advice. Involve your legal and privacy teams.


Conclusion

User stitching transforms fragmented GA4 data into coherent customer journeys. The key principles:

  1. Start conservative: Begin with session-scoped stitching, graduate to cross-session only after validating data quality.
  2. Use deterministic matching only: Resist the temptation to inflate match rates with probabilistic approaches. The false positives aren’t worth it.
  3. Handle edge cases explicitly: Shared devices, logout bugs, and consent gaps need specific logic, not assumptions.
  4. Monitor continuously: Stitch rates, device counts, and anomaly detection should be part of your data quality infrastructure.
  5. Document everything: Which marts use stitched vs raw identities? What’s your decision rule for multiple user_id values? Future you will thank present you.

The identity mapping table you build here becomes foundational for downstream analysis: session tables need it for accurate session attribution, user-level aggregations need it for LTV calculations, and attribution models need it for cross-device journey reconstruction.

In the final article of this series, we’ll bring everything together into a production-ready dbt project template, including the identity resolution patterns from this article integrated with the schema, unnesting, and session logic from previous installments.