ServicesAboutNotesContact Get in touch →
EN FR
Note

GA4 BigQuery Schema Hub

Hub note connecting all concepts needed to understand and query the GA4 BigQuery export schema — table types, nested structures, gotchas, and query patterns.

Planted
ga4bigqueryanalyticsdata engineering

The GA4 BigQuery export provides raw, unsampled, event-level analytics data with full SQL access. It uses an event-centric model with deeply nested structures, three distinct traffic source systems, and several schema behaviors that Google’s documentation does not clearly document. This hub organizes the key concepts for working with the schema.

What Gets Exported

GA4 BigQuery Export Table Types — The four table types in an export dataset: daily (events_YYYYMMDD), intraday (events_intraday_YYYYMMDD), and two optional user tables. Their timing, limitations, costs, and when to use each. The intraday vs daily tradeoff (real-time monitoring vs source of truth) and why intraday tables are missing three critical attribution fields.

The Schema

GA4 Event Data Structure — The fundamental model: one row per event, nested RECORDs for device/geo/traffic, REPEATED RECORDs (arrays) for event_params, user_properties, and items. The mental model shift from Universal Analytics. Inline UNNEST vs FROM clause UNNEST. Prerequisite knowledge for everything else here.

GA4 Event Params Type Detection — How GA4 auto-assigns parameter values to string_value, int_value, or double_value fields. The silent null failure when you query the wrong type field. The COALESCE defensive pattern for parameters with uncertain or historically inconsistent types.

GA4 Traffic Source Fields — Three different traffic source structures in the export: traffic_source (user-level first-touch), collected_traffic_source (event-level raw UTMs and click IDs), and session_traffic_source_last_click (session-level last-click, July 2024+). Which one to use for which question.

GA4 Event Ordering with Batch Fields — The three batch fields (batch_event_index, batch_ordering_id, batch_page_id) added July 2024 for deterministic event sequencing. Why event_timestamp alone produces unreliable ordering for funnel and path analysis.

GA4 E-commerce Schema — The ecommerce RECORD (transaction-level) and items REPEATED RECORD (product-level), including the nested item_params array added October 2023. Query patterns for purchase analysis without double-counting revenue.

The Gotchas

GA4 Session Start Event Unreliability — Why counting session_start events produces wrong session counts. GA4’s own interface abandoned this approach. The correct method: count distinct composite session IDs.

GA4 Session Key Constructionga_session_id is a Unix timestamp, not a unique identifier. Multiple users starting sessions at the same second share identical values. The composite user_pseudo_id + ga_session_id key is the correct session identifier.

GA4 BigQuery Timezone Handling — Three timezone contexts that don’t agree: event_timestamp (UTC), event_date (property timezone), and _TABLE_SUFFIX (Pacific Time). The correct pattern for date-range queries that work correctly at day boundaries.

GA4 Consent Mode Orphaned Events — Advanced Consent Mode creates events with null user_pseudo_id and ga_session_id that inflate event counts but can’t be sessionized. Same-page backstitching behavior. Why BigQuery numbers are lower than the GA4 interface for European properties.

GA4 User-Provided Data BigQuery Trap — Enabling “User-provided data” in GA4 Admin permanently disables user_id export to BigQuery. No reversal option. What monitoring to put in place before your team accidentally enables it.

Schema Maintenance

GA4 Schema Evolution Monitoring — GA4 adds new fields without announcements. Fields are never retroactive — historical tables don’t gain columns added after their export date. How to monitor INFORMATION_SCHEMA for changes and handle the transition period when a new field becomes available.

Querying Efficiently

GA4 BigQuery Query Patterns_TABLE_SUFFIX filtering for date-sharded tables, inline UNNEST vs FROM clause UNNEST, reusable dbt macros for parameter extraction, and cost control practices (dry runs, materialization, APPROX_COUNT_DISTINCT).

BigQuery Cost Model — GA4 data at scale can be expensive. Understanding how on-demand and slot-based pricing interact with query patterns shapes how you design base models and intermediate layers.

Why Numbers Don’t Match the GA4 Interface

GA4 BigQuery Number Discrepancies — The 1-5% gap between BigQuery queries and GA4 interface reports is architectural and unavoidable. The causes: HyperLogLog++ probabilistic counting, Consent Mode behavioral modeling, Google Signals cross-device deduplication, and the 72-hour late data processing window.

Sessionization

GA4 Sessionization Hub covers the next layer: event-grain sessionization, window function patterns for session context propagation, incremental model strategies for late-arriving data, and attribution assembly from session-level traffic source fields.