Adrienne Vermorel
GA4 User Stitching: Handling Anonymous to Known Users
This is part 4 of a 5-part series on GA4 and BigQuery. Previous articles covered the schema reference, unnesting patterns, and session tables.
When users browse your site anonymously then log in on a different device days later, GA4’s BigQuery export treats them as entirely separate people. Unlike the GA4 interface, which applies identity resolution behind the scenes, BigQuery receives raw event data without any stitching. Your user counts are inflated, attribution models are fractured, and customer lifetime value calculations miss the complete picture.
The solution is building your own identity resolution pipeline. Done well, stitching can recover part of the sessions that would otherwise appear fragmented. Done poorly, it creates data integrity nightmares that erode trust in your analytics.
This article walks through the complete implementation: understanding GA4’s identity model, SQL patterns for different stitching approaches, handling edge cases, and building production-ready dbt models.
Understanding GA4’s Two Identity Systems
GA4 tracks users through two fundamentally different identifiers that appear in every BigQuery event row. Understanding their behavior is essential before writing any stitching logic.
user_pseudo_id: The Device Identifier
The user_pseudo_id is GA4’s device-level identifier (essentially the same concept as Universal Analytics’ Client ID). On web, it’s derived from the _ga first-party cookie with a format like 1197596843.1673515099 (random number + timestamp). On mobile apps, it’s the Firebase App Instance ID, unique per installation.
This identifier persists for up to two years by default, but browser privacy features increasingly limit this:
- Safari’s ITP limits JavaScript-set cookies to roughly 7 days
- Chrome now caps cookies at 400 days
- Incognito sessions generate ephemeral IDs
The user_pseudo_id is always present in BigQuery. It’s GA4’s fallback identifier when no authenticated identity exists.
user_id: The Authenticated Identifier
The user_id is your business-provided identifier, sent only when you explicitly implement it:
// gtag.js implementationgtag('config', 'G-XXXXXXX', { 'user_id': 'USER_12345'});
// Or via GTM dataLayerdataLayer.push({ 'user_id': 'USER_12345'});Unlike user_pseudo_id, GA4 doesn’t generate this automatically. You must capture it from your authentication system. When users aren’t logged in, this field is NULL.
Important: never send PII like email addresses directly. Use hashed or anonymized database IDs instead.
-- Examining identity fields in your dataSELECT user_pseudo_id, user_id, (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id, event_name, TIMESTAMP_MICROS(event_timestamp) AS event_timeFROM `project.analytics_XXXXX.events_*`WHERE _table_suffix = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)LIMIT 100Why BigQuery Differs from GA4 UI Reports
GA4’s interface applies identity resolution automatically through “Reporting Identity.” The platform offers three modes:
| Mode | Resolution Order |
|---|---|
| Blended | User ID → Device ID → Modeling |
| Observed | User ID → Device ID |
| Device-based | Device ID only |
For BigQuery work, the key point is that none of this stitching logic transfers to your export. BigQuery receives raw, event-level data with whatever user_id was present at collection time. No modeling, no cross-device resolution, no Google Signals data.
When comparing GA4 UI to BigQuery queries, expect discrepancies from:
- Behavioral modeling in the UI (BigQuery doesn’t export modeled data)
- Google Signals data (never appears in BigQuery)
- HyperLogLog++ estimation for large cardinality counts in the UI
For analytics engineering work, treat BigQuery as your source of truth and accept that UI numbers will differ.
Stitching Techniques with BigQuery SQL
There’s no single “correct” stitching approach. The right choice depends on your use case, data quality, and tolerance for false matches. Here are the four main patterns.
Last-Touch Attribution
Last-touch attribution assigns the most recent known user_id to all historical events from the same user_pseudo_id. This handles the typical flow: anonymous browsing followed by authentication.
WITH events_with_last_touch AS ( SELECT event_date, user_pseudo_id, user_id, event_name, event_timestamp, LAST_VALUE(user_id IGNORE NULLS) OVER ( PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ) AS stitched_user_id FROM `project.analytics_XXXXX.events_*` WHERE _table_suffix BETWEEN '20250101' AND '20250131')
SELECT event_date, user_pseudo_id, user_id AS original_user_id, COALESCE(stitched_user_id, user_pseudo_id) AS resolved_user_id, event_name, TIMESTAMP_MICROS(event_timestamp) AS event_timeFROM events_with_last_touchThe IGNORE NULLS clause is essential. Without it, LAST_VALUE returns NULL whenever the current row’s user_id is null. The window frame ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW ensures you’re looking at all prior events in the partition.
This only stitches events before the identification moment. Events after authentication already have user_id populated. If you need to connect a user’s activity across multiple authentication sessions, you need full backstitching.
Full Backstitching via Mapping Table
For complete identity resolution (applying a user_id to all events from a device, regardless of when authentication occurred), build a mapping table first:
-- Step 1: Build the identity mappingCREATE OR REPLACE TABLE `project.dataset.user_identity_mapping` ASWITH identity_pairs AS ( SELECT user_pseudo_id, user_id, MAX(event_timestamp) AS last_seen_timestamp FROM `project.analytics_XXXXX.events_*` WHERE user_id IS NOT NULL AND user_pseudo_id IS NOT NULL GROUP BY user_pseudo_id, user_id)
SELECT user_pseudo_id, -- Take the most recent user_id if multiple exist ARRAY_AGG(user_id ORDER BY last_seen_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS resolved_user_id, ARRAY_AGG(DISTINCT user_id) AS all_user_ids, MIN(TIMESTAMP_MICROS(last_seen_timestamp)) AS first_identified, MAX(TIMESTAMP_MICROS(last_seen_timestamp)) AS last_identifiedFROM identity_pairsGROUP BY user_pseudo_id-- Step 2: Apply stitching via JOINSELECT e.event_date, e.user_pseudo_id, e.user_id AS original_user_id, COALESCE(m.resolved_user_id, e.user_pseudo_id) AS stitched_user_id, e.event_nameFROM `project.analytics_XXXXX.events_*` eLEFT JOIN `project.dataset.user_identity_mapping` m USING (user_pseudo_id)WHERE e._table_suffix BETWEEN '20250101' AND '20250131'This applies the identified user_id to every event from that device (past, present, and future), enabling true cross-session analysis.
First-Touch Attribution
First-touch attribution uses FIRST_VALUE instead of LAST_VALUE, assigning the earliest known user_id to all events. This approach makes sense for user acquisition analysis where you want to attribute the user to their original authenticated identity, not account changes later.
SELECT event_date, user_pseudo_id, user_id, event_name, event_timestamp, FIRST_VALUE(user_id IGNORE NULLS) OVER ( PARTITION BY user_pseudo_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS first_known_user_idFROM `project.analytics_XXXXX.events_*`Use first-touch when analyzing conversion paths from anonymous to known user. Use last-touch when current identity matters most (e.g., personalization, current account status).
Session-Scoped Stitching
Session-scoped stitching limits identity resolution to events within the same session. This conservative approach respects the natural boundary GA4 draws around user behavior.
WITH session_events AS ( SELECT event_date, user_pseudo_id, user_id, event_name, event_timestamp, (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id FROM `project.analytics_XXXXX.events_*` WHERE _table_suffix = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)),
stitched AS ( SELECT *, FIRST_VALUE(user_id IGNORE NULLS) OVER ( PARTITION BY user_pseudo_id, ga_session_id ORDER BY event_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING ) AS session_user_id FROM session_events)
SELECT event_date, user_pseudo_id, user_id, event_name, event_timestamp, ga_session_id, session_user_id, COALESCE(session_user_id, user_pseudo_id) AS resolved_user_id, -- Create truly unique session identifier CONCAT( COALESCE(session_user_id, user_pseudo_id), '_', CAST(ga_session_id AS STRING) ) AS unique_session_idFROM stitchedga_session_id is not globally unique. It’s the Unix timestamp of session start. Always concatenate it with user_pseudo_id (or your stitched ID) to create a truly unique session identifier.
Choosing the Right Approach
| Technique | Scope | Risk Level | Best For |
|---|---|---|---|
| Last-touch | Forward-looking | Low | Basic analytics, recent behavior |
| First-touch | All events | Medium | Acquisition attribution |
| Full backstitch | All events | Higher | Cross-device, LTV analysis |
| Session-scoped | Single session | Lowest | Conservative approach, testing |
Start with session-scoped stitching to validate your logic, then graduate to full backstitching once you’ve confirmed data quality.
Cross-Device Identity Resolution
User stitching becomes particularly valuable when connecting journeys across devices. A single authenticated user might have three user_pseudo_id values: one from their phone, one from their laptop, and one from a work computer. Your identity mapping table is the foundation for unifying these.
Building a Production Identity Mapping Table
The optimal schema stores the relationship from user_id to all associated devices:
CREATE OR REPLACE TABLE `project.dataset.identity_graph` ASWITH device_user_pairs AS ( SELECT user_id, user_pseudo_id, MIN(event_timestamp) AS first_seen, MAX(event_timestamp) AS last_seen, COUNT(*) AS event_count FROM `project.analytics_XXXXX.events_*` WHERE user_id IS NOT NULL AND user_pseudo_id IS NOT NULL GROUP BY user_id, user_pseudo_id)
SELECT user_id, ARRAY_AGG( STRUCT( user_pseudo_id, TIMESTAMP_MICROS(first_seen) AS first_seen, TIMESTAMP_MICROS(last_seen) AS last_seen, event_count ) ORDER BY last_seen DESC LIMIT 100 -- Prevent snowballing from tracking bugs ) AS devices, COUNT(DISTINCT user_pseudo_id) AS device_count, MIN(TIMESTAMP_MICROS(first_seen)) AS user_first_seen, MAX(TIMESTAMP_MICROS(last_seen)) AS user_last_activeFROM device_user_pairsGROUP BY user_idSet a limit on identifiers per user (around 100). If a single user_id accumulates thousands of user_pseudo_id values, something is wrong. Possible causes include a reset device ID on every page load, or a test account being reused.
For querying, you’ll often want the reverse mapping (device → user):
CREATE OR REPLACE TABLE `project.dataset.device_to_user_mapping` ASSELECT device.user_pseudo_id, user_id, device.first_seen, device.last_seenFROM `project.dataset.identity_graph`,UNNEST(devices) AS deviceHandling Users Who Never Authenticate
Not every visitor will log in. Your stitching logic must gracefully handle anonymous users:
SELECT COALESCE(m.user_id, e.user_pseudo_id) AS resolved_user_id, CASE WHEN m.user_id IS NOT NULL THEN 'identified' ELSE 'anonymous' END AS identity_status, e.event_date, e.user_pseudo_id, e.user_id, e.event_name, e.event_timestampFROM `project.analytics_XXXXX.events_*` eLEFT JOIN `project.dataset.device_to_user_mapping` m USING (user_pseudo_id)Accept that anonymous users remain separate identities. Don’t force probabilistic matching to inflate your identified user percentage. The false positives probably aren’t worth it.
Why Probabilistic Matching Fails in GA4
You might consider increasing match rates through fingerprinting (matching users by device characteristics, location, and behavior patterns). This approach has severe limitations with GA4 data.
Missing Data Points
GA4’s BigQuery export intentionally excludes the signals typically used for probabilistic matching:
- No IP addresses
- No user agent strings
- No canvas fingerprints
- No hardware identifiers
What you have access to:
SELECT DISTINCT device.category, -- mobile, desktop, tablet device.operating_system, -- Android, iOS, Windows, etc. device.browser, -- Chrome, Safari, etc. device.language, geo.country, geo.city, geo.metroFROM `project.analytics_XXXXX.events_*`WHERE _table_suffix = FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 1)These coarse signals create far too many false positives for production use. How many users in an average city use Chrome on a MacBook? Thousands.
The Cost of False Matches
Incorrectly merged profiles compound over time. When you merge User A’s browsing history with User B’s purchase history:
- Recommendations become nonsensical
- Email personalization references products they never viewed
- Attribution models credit the wrong touchpoints
- Customer support sees a confusing activity history
Stick to deterministic matching via user_id. The lower match rate is worth the data integrity.
Consent Mode’s Impact on Identity
GA4’s Consent Mode V2 fundamentally changes what data reaches BigQuery. When analytics_storage is denied, the behavior depends on your implementation mode.
Basic vs Advanced Consent Mode
With Basic Consent Mode, denied consent means no data collection at all. Nothing reaches BigQuery.
With Advanced Consent Mode, GA4 sends “cookieless pings” (events without identifiers). In BigQuery, these appear with:
NULLforuser_pseudo_id- No
ga_session_idin event_params privacy_info.analytics_storage = 'No'
These events cannot be stitched or sessionized.
-- Check consent distribution in your dataSELECT privacy_info.analytics_storage AS consent_status, COUNT(*) AS events, COUNT(DISTINCT user_pseudo_id) AS unique_devices, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct_of_totalFROM `project.analytics_XXXXX.events_*`WHERE _table_suffix >= FORMAT_DATE('%Y%m%d', CURRENT_DATE() - 30)GROUP BY privacy_info.analytics_storageORDER BY events DESCThe Same-Page Backstitch Nuance
If a user grants consent on the same page where they initially denied it, GA4 retroactively applies the Client ID to previously denied hits from that page. This appears correctly in daily tables but may show inconsistencies in intraday tables.
This same-page backstitching doesn’t work across page loads. Once someone navigates away with denied consent, those events stay permanently anonymous.
Practical Approach
For clean analysis, filter to consented data:
SELECT event_date, user_pseudo_id, user_id, event_name, event_timestampFROM `project.analytics_XXXXX.events_*`WHERE privacy_info.analytics_storage = 'Yes' OR privacy_info.analytics_storage IS NULL -- Pre-consent-mode dataAccept that BigQuery numbers will be lower than GA4 UI reports. The UI applies behavioral modeling to estimate what the denied-consent users would have done.
Edge Cases That Break Naive Implementations
Production identity resolution must handle scenarios that simple SQL patterns miss.
Shared Devices
When multiple users authenticate on the same browser, such as a family computer, shared work machine, or retail kiosk, a single user_pseudo_id maps to multiple user_id values. Naive last-touch stitching incorrectly attributes one user’s anonymous browsing to another user.
-- Detect shared devicesSELECT user_pseudo_id, COUNT(DISTINCT user_id) AS distinct_users, ARRAY_AGG(DISTINCT user_id IGNORE NULLS ORDER BY user_id) AS user_idsFROM `project.analytics_XXXXX.events_*`WHERE user_id IS NOT NULLGROUP BY user_pseudo_idHAVING COUNT(DISTINCT user_id) > 1ORDER BY distinct_users DESCLIMIT 100Set a threshold, commonly 3 distinct users, and either:
- Exclude these devices from cross-session stitching entirely
- Flag them and only apply session-scoped stitching
- Weight their contribution lower in aggregate metrics
-- Create a shared device flag for your mapping tableSELECT user_pseudo_id, COUNT(DISTINCT user_id) AS user_count, COUNT(DISTINCT user_id) > 3 AS is_shared_deviceFROM `project.analytics_XXXXX.events_*`WHERE user_id IS NOT NULLGROUP BY user_pseudo_idLogout Implementation Errors
A common bug is setting user_id to the string 'null', '-', 'none', or 'anonymous' instead of actual null. This makes GA4 treat all logged-out sessions as belonging to a single user named “null.”
-- Find suspicious user_id valuesSELECT user_id, COUNT(*) AS events, COUNT(DISTINCT user_pseudo_id) AS devicesFROM `project.analytics_XXXXX.events_*`WHERE user_id IS NOT NULLGROUP BY user_idHAVING COUNT(DISTINCT user_pseudo_id) > 1000 -- Suspiciously highORDER BY devices DESCIf you find this in your data, work with your implementation team to fix the tagging. In the meantime, filter these values:
WHERE user_id IS NOT NULL AND user_id NOT IN ('null', 'undefined', 'none', '-', 'anonymous', '') AND LENGTH(user_id) > 3 -- Filter obvious garbageMultiple user_id Values per Person
Real users sometimes have multiple authenticated identities:
- Account merges after acquisition
- Email address changes
- Separate work and personal accounts
- Test accounts used alongside real accounts
Your stitching logic needs a clear decision rule:
-- Option A: Most recent user_id winsARRAY_AGG(user_id ORDER BY last_seen DESC LIMIT 1)[SAFE_OFFSET(0)]
-- Option B: First user_id wins (stable identity)ARRAY_AGG(user_id ORDER BY first_seen ASC LIMIT 1)[SAFE_OFFSET(0)]
-- Option C: Most active user_id winsARRAY_AGG(user_id ORDER BY event_count DESC LIMIT 1)[SAFE_OFFSET(0)]Document your choice and apply it consistently.
Cookie Resets and Fragmentation
Users who clear cookies, use incognito mode, or have Safari ITP restrictions generate new user_pseudo_id values unpredictably. If they later authenticate, your mapping table reconnects these identities. But anonymous sessions before the reset remain orphaned. This is unavoidable without probabilistic matching.
Track fragmentation rates to understand your data quality:
-- Users with multiple devices (potential fragmentation or cross-device)SELECT device_count, COUNT(*) AS users, ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pctFROM ( SELECT user_id, COUNT(DISTINCT user_pseudo_id) AS device_count FROM `project.analytics_XXXXX.events_*` WHERE user_id IS NOT NULL GROUP BY user_id)GROUP BY device_countORDER BY device_countProduction Implementation with dbt
For production pipelines, structure your identity resolution as a proper dbt DAG with incremental processing.
The Identity Mapping Model
-- models/intermediate/ga4/int__ga4__identity_mapping.sql
{{ config( materialized='incremental', unique_key='user_pseudo_id', incremental_strategy='merge', partition_by={ 'field': 'last_seen_date', 'data_type': 'date', 'granularity': 'day' }, cluster_by=['user_pseudo_id'] )}}
WITH source_events AS ( SELECT user_pseudo_id, user_id, event_timestamp, PARSE_DATE('%Y%m%d', event_date) AS event_date FROM {{ source('ga4', 'events') }} WHERE user_pseudo_id IS NOT NULL AND user_id IS NOT NULL AND user_id NOT IN ('null', 'undefined', 'none', '') {% if is_incremental() %} AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) {% endif %}),
aggregated AS ( SELECT user_pseudo_id, ARRAY_AGG(DISTINCT user_id IGNORE NULLS) AS all_user_ids, ARRAY_AGG(user_id ORDER BY event_timestamp DESC LIMIT 1)[SAFE_OFFSET(0)] AS latest_user_id, MAX(event_timestamp) AS last_seen_timestamp, MAX(event_date) AS last_seen_date FROM source_events GROUP BY user_pseudo_id)
SELECT user_pseudo_id, latest_user_id AS resolved_user_id, all_user_ids, ARRAY_LENGTH(all_user_ids) AS user_id_count, ARRAY_LENGTH(all_user_ids) > 1 AS has_multiple_users, TIMESTAMP_MICROS(last_seen_timestamp) AS last_seen_at, last_seen_dateFROM aggregatedThe 3-day lookback window handles late-arriving data while keeping incremental runs efficient.
Stitched Events Model
-- models/intermediate/ga4/int__ga4__events_stitched.sql
{{ config( materialized='incremental', incremental_strategy='insert_overwrite', partition_by={ 'field': 'event_date', 'data_type': 'date', 'granularity': 'day' }, cluster_by=['resolved_user_id', 'session_id'] )}}
WITH events AS ( SELECT user_pseudo_id, user_id, event_name, event_timestamp, event_params, PARSE_DATE('%Y%m%d', event_date) AS event_date_parsed, (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id FROM {{ source('ga4', 'events') }} WHERE TRUE {% if is_incremental() %} AND PARSE_DATE('%Y%m%d', event_date) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY) {% endif %}),
identity_mapping AS ( SELECT user_pseudo_id, resolved_user_id, has_multiple_users FROM {{ ref('int__ga4__identity_mapping') }})
SELECT e.user_pseudo_id, e.user_id, e.event_name, e.event_timestamp, e.event_date_parsed AS event_date,
-- Identity resolution COALESCE(m.resolved_user_id, e.user_pseudo_id) AS resolved_user_id, m.resolved_user_id IS NOT NULL AS is_identified, COALESCE(m.has_multiple_users, FALSE) AS is_shared_device,
-- Session identifiers e.ga_session_id, CONCAT( COALESCE(m.resolved_user_id, e.user_pseudo_id), '_', CAST(e.ga_session_id AS STRING) ) AS session_id
FROM events eLEFT JOIN identity_mapping m USING (user_pseudo_id)DAG Structure
sources/ga4/events │ ├── int__ga4__identity_mapping (incremental, merge) │ └── int__ga4__events_stitched (incremental, insert_overwrite) │ ├── int__ga4__sessions │ └── mrt__analytics__user_journeySchema Tests
models: - name: int__ga4__identity_mapping columns: - name: user_pseudo_id tests: - unique - not_null - name: resolved_user_id tests: - not_null - name: user_id_count tests: - dbt_utils.accepted_range: min_value: 1 max_value: 100 # Alert on snowballing
- name: int__ga4__events_stitched columns: - name: session_id tests: - not_null - name: resolved_user_id tests: - not_nullValidation and Monitoring
Identity resolution requires ongoing monitoring. Bugs in your stitching logic can silently corrupt months of data.
Key Metrics Dashboard
-- Daily identity resolution health metricsWITH daily_stats AS ( SELECT event_date, COUNT(*) AS total_events, COUNTIF(is_identified) AS identified_events, COUNTIF(is_shared_device) AS shared_device_events, COUNT(DISTINCT resolved_user_id) AS unique_users, COUNT(DISTINCT user_pseudo_id) AS unique_devices, COUNT(DISTINCT session_id) AS unique_sessions FROM {{ ref('int__ga4__events_stitched') }} WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) GROUP BY event_date)
SELECT event_date, total_events, unique_users, unique_devices, unique_sessions,
-- Stitch rate: % of events with resolved identity ROUND(identified_events * 100.0 / total_events, 2) AS stitch_rate_pct,
-- Device consolidation: how much stitching reduces user count ROUND((unique_devices - unique_users) * 100.0 / unique_devices, 2) AS consolidation_rate_pct,
-- Shared device exposure ROUND(shared_device_events * 100.0 / total_events, 2) AS shared_device_pct
FROM daily_statsORDER BY event_date DESCAnomaly Detection
Set up alerts for sudden changes that indicate bugs or implementation issues:
-- Alert: stitch rate dropped more than 10% week-over-weekWITH weekly AS ( SELECT DATE_TRUNC(event_date, WEEK) AS week, COUNTIF(is_identified) * 100.0 / COUNT(*) AS stitch_rate FROM {{ ref('int__ga4__events_stitched') }} WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 WEEK) GROUP BY 1)
SELECT week, stitch_rate, LAG(stitch_rate) OVER (ORDER BY week) AS prev_week_rate, stitch_rate - LAG(stitch_rate) OVER (ORDER BY week) AS changeFROM weeklyHAVING ABS(change) > 10Data Quality Checks
-- Identify potential data quality issuesSELECT 'Excessive devices per user' AS issue, COUNT(*) AS affected_usersFROM {{ ref('int__ga4__identity_mapping') }}WHERE user_id_count > 50
UNION ALL
SELECT 'Suspiciously common user_id' AS issue, COUNT(DISTINCT user_pseudo_id) AS affected_devicesFROM {{ source('ga4', 'events') }}WHERE user_id IN ( SELECT user_id FROM {{ source('ga4', 'events') }} WHERE user_id IS NOT NULL GROUP BY user_id HAVING COUNT(DISTINCT user_pseudo_id) > 500)When NOT to Stitch
Identity resolution isn’t always appropriate. Know when to use raw identifiers instead.
Data Integrity Concerns
Don’t stitch when your authentication system has unresolved edge cases:
- Guest checkout with temporary/recycled IDs
- Demo accounts used repeatedly by sales teams
- Employee testing on production
- Partner or affiliate tracking that reuses IDs
Validate your user_id implementation before building stitching logic on top of it.
Device-Specific Analysis
Some analyses should use raw user_pseudo_id:
- Performance optimization: Page load times by device/browser
- Browser compatibility: Feature support and error rates
- Device category analysis: Mobile vs desktop conversion rates
- Bot detection: Identifying non-human traffic patterns
Legal and Privacy Considerations
Identity resolution creates new personal data linkages. Before implementing:
- Review whether your privacy policy covers cross-device tracking
- Confirm consent mechanisms address identity resolution
- Consider data retention implications of the identity graph
- Document the stitching logic for compliance purposes
This isn’t legal advice. Involve your legal and privacy teams.
Conclusion
User stitching transforms fragmented GA4 data into coherent customer journeys. The key principles:
- Start conservative: Begin with session-scoped stitching, graduate to cross-session only after validating data quality.
- Use deterministic matching only: Resist the temptation to inflate match rates with probabilistic approaches. The false positives aren’t worth it.
- Handle edge cases explicitly: Shared devices, logout bugs, and consent gaps need specific logic, not assumptions.
- Monitor continuously: Stitch rates, device counts, and anomaly detection should be part of your data quality infrastructure.
- Document everything: Which marts use stitched vs raw identities? What’s your decision rule for multiple
user_idvalues? Future you will thank present you.
The identity mapping table you build here becomes foundational for downstream analysis: session tables need it for accurate session attribution, user-level aggregations need it for LTV calculations, and attribution models need it for cross-device journey reconstruction.
In the final article of this series, we’ll bring everything together into a production-ready dbt project template, including the identity resolution patterns from this article integrated with the schema, unnesting, and session logic from previous installments.