GA4 User Identity

GA4 exports raw event data to BigQuery with two user identifiers: user_pseudo_id (the device cookie, always present) and user_id (your business identifier, present only when you implement it). When users browse anonymously and then authenticate, GA4’s interface applies identity resolution automatically — but none of that logic reaches BigQuery. Building your own stitching pipeline is the only way to connect the complete user journey in the warehouse.

This map of content covers the full identity resolution problem: why it exists, the SQL patterns to solve it, the edge cases that break naive implementations, and the production infrastructure to keep it reliable.

Foundation

GA4 Event Data Structure — The schema fundamentals: what user_pseudo_id and user_id are, how they appear in the event-level export, and why BigQuery requires a different mental model than the GA4 interface.

GA4 Reporting Identity Modes — How GA4’s Blended/Observed/Device-based reporting modes apply identity resolution in the interface, and why none of that processing carries over to BigQuery exports. Explains the structural source of GA4 vs BigQuery user count discrepancies.

Data quality before stitching

GA4 User ID Data Quality — The implementation bugs that corrupt identity data before it reaches your pipeline: the string 'null' logout bug, PII in user_id fields, recycled temporary IDs, and employee/test traffic. Run these checks before building any mapping table.
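As a rough illustration of the kind of check described above, here is a minimal Python sketch that normalizes a raw user_id before it enters any mapping table. The placeholder list, the email pattern, and the function name clean_user_id are assumptions made for this example, not part of the linked note. Treating an email-shaped value as unusable is one possible policy among several.

```python
import re
from typing import Optional

# Hypothetical placeholder values; extend with whatever your own audit finds.
PLACEHOLDER_IDS = {"", "null", "undefined", "none"}

# Deliberately loose pattern: anything shaped like an email is treated as PII.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_user_id(raw: Optional[str]) -> Optional[str]:
    """Return a usable user_id, or None if the value should be ignored."""
    if raw is None:
        return None
    value = raw.strip()
    # The string 'null' (sent on logout by some implementations) is not a user.
    if value.lower() in PLACEHOLDER_IDS:
        return None
    # Assumption: email-shaped IDs are dropped rather than hashed.
    if EMAIL_PATTERN.match(value):
        return None
    return value
```

In a warehouse pipeline the same rules would normally live in SQL at the staging layer; the point here is only that the string 'null' must be converted to a real NULL before stitching.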

Stitching techniques

GA4 Identity Stitching Techniques — The four SQL approaches: last-touch (window function), full backstitching (mapping table), first-touch, and session-scoped stitching. Includes a decision framework for choosing between them based on scope, risk, and use case.
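To make the last-touch idea concrete, the sketch below reads it as "carry the most recent non-null user_id forward within each user_pseudo_id", which is what a window function over the device's ordered events would do. That reading, the stitched_user_id field name, and the function name are assumptions for illustration. The linked note may define the technique differently.

```python
from typing import Any, Dict, List

Event = Dict[str, Any]


def stitch_last_touch(events: List[Event]) -> List[Event]:
    """Carry the latest non-null user_id forward within each device."""
    # Order by device, then time, mirroring PARTITION BY / ORDER BY.
    ordered = sorted(
        events, key=lambda e: (e["user_pseudo_id"], e["event_timestamp"])
    )
    last_seen: Dict[str, str] = {}
    stitched: List[Event] = []
    for event in ordered:
        device = event["user_pseudo_id"]
        if event.get("user_id") is not None:
            last_seen[device] = event["user_id"]
        # Events before the first login on a device stay unstitched (None).
        stitched.append({**event, "stitched_user_id": last_seen.get(device)})
    return stitched


sample = [
    {"user_pseudo_id": "dev_a", "event_timestamp": 1, "user_id": None},
    {"user_pseudo_id": "dev_a", "event_timestamp": 2, "user_id": "u_1"},
    {"user_pseudo_id": "dev_a", "event_timestamp": 3, "user_id": None},
]
last_touch_ids = [e["stitched_user_id"] for e in stitch_last_touch(sample)]
```

Note that the first event stays anonymous under this reading; recovering it is what full backstitching is for.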

GA4 User Backstitching — Deep dive into the full backstitching pattern: the two-step lookup-then-join approach, shared device handling, where it fits in the dbt DAG, and when it adds the most value.
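The two-step lookup-then-join approach can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: when a device has seen several user_id values, the most recent one wins, and devices that never authenticate fall back to their user_pseudo_id. Both rules, along with the names build_device_map, backstitch, and resolved_user_id, are choices made for this example rather than the linked note's implementation.

```python
from typing import Any, Dict, List, Tuple

Event = Dict[str, Any]


def build_device_map(events: List[Event]) -> Dict[str, str]:
    """Step 1 (lookup): latest non-null user_id per user_pseudo_id."""
    latest: Dict[str, Tuple[int, str]] = {}
    for event in events:
        user_id = event.get("user_id")
        if user_id is None:
            continue
        device = event["user_pseudo_id"]
        ts = event["event_timestamp"]
        # Assumption: on a shared device, the most recent login wins.
        if device not in latest or ts >= latest[device][0]:
            latest[device] = (ts, user_id)
    return {device: user_id for device, (_, user_id) in latest.items()}


def backstitch(events: List[Event], device_map: Dict[str, str]) -> List[Event]:
    """Step 2 (join): apply the mapping to every event, past and future."""
    return [
        {
            **event,
            # Assumption: never-authenticated devices keep their cookie ID.
            "resolved_user_id": device_map.get(
                event["user_pseudo_id"], event["user_pseudo_id"]
            ),
        }
        for event in events
    ]


sample = [
    {"user_pseudo_id": "dev_a", "event_timestamp": 1, "user_id": None},
    {"user_pseudo_id": "dev_a", "event_timestamp": 2, "user_id": "u_1"},
    {"user_pseudo_id": "dev_b", "event_timestamp": 1, "user_id": None},
]
device_map = build_device_map(sample)
resolved = [e["resolved_user_id"] for e in backstitch(sample, device_map)]
```

Unlike a forward-only pass, the join also reaches the anonymous event at timestamp 1 on dev_a, which is the behavior that distinguishes backstitching.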

Cross-device resolution

GA4 Identity Graph BigQuery — Building the production identity graph: the user-centric STRUCT array schema, the device-centric reverse mapping, handling multiple user_id values per device, detecting shared devices, and tracking cookie fragmentation.

Probabilistic Matching Limitations in GA4 — Why fingerprinting fails with GA4 data: the signals GA4 intentionally excludes (IP, user agent, canvas fingerprints), what coarse data remains, and the compounding cost of false positives in merged profiles. The case for accepting a lower deterministic match rate.

Consent Mode Impact on Identity Resolution — How Consent Mode V2 changes your BigQuery data: cookieless pings with null identifiers under Advanced mode, the same-page backstitch nuance, filtering consented events for your stitching pipeline, and the architectural requirement to separate consented and non-consented data paths.

Privacy Constraints for Linked Analytics Data — GDPR implications of linking GA4 cookies to CRM records, the CNIL consent exemption that disappears when you build identity-linked models, and the right-to-deletion cascade through your identity graph.

Production infrastructure

dbt Identity Resolution Pipeline — The production dbt DAG: the identity mapping model (incremental merge), the stitched events model (incremental insert_overwrite), schema tests including the device count guard, and why each model uses a different incremental strategy.

Identity Resolution Monitoring — Daily health metrics (stitch rate, consolidation rate, shared device exposure) and week-over-week anomaly detection. What each metric change signals and how to connect monitoring to your broader dbt testing infrastructure.
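As a hedged sketch of how such a health check might look, the Python below computes a stitch rate and flags a week-over-week swing. The definition used here (share of distinct devices that have a mapped user_id), the 20% threshold, and both function names are assumptions for illustration. The linked note may define the metric and its thresholds differently.

```python
from typing import Dict, Iterable


def stitch_rate(device_ids: Iterable[str], device_map: Dict[str, str]) -> float:
    """Share of distinct devices that resolve to a known user_id."""
    devices = set(device_ids)
    if not devices:
        return 0.0
    return sum(1 for d in devices if d in device_map) / len(devices)


def is_anomalous(
    current: float, previous: float, max_relative_change: float = 0.2
) -> bool:
    """Flag a week-over-week change larger than the allowed relative swing."""
    if previous == 0:
        # Any movement away from zero is worth a look.
        return current != 0
    return abs(current - previous) / previous > max_relative_change


this_week = stitch_rate(["a", "b", "c", "d"], {"a": "u_1"})
```

A sudden drop in this number often points upstream (for example, a broken user_id implementation) rather than at the stitching logic itself, so it is worth alerting on separately from model tests.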

Source article

GA4 User Stitching: Handling Anonymous to Known Users — The full implementation walkthrough, including complete dbt model code, edge case patterns, and the decision matrix for choosing between techniques. Part 4 of the GA4 + BigQuery series; the complete series is covered by the GA4 Sessionization Hub.