Adrienne Vermorel
GA4 BigQuery Export: The Complete Schema Reference
Part 1 of the GA4 + BigQuery series for analytics engineers
The GA4 BigQuery export delivers raw, unsampled, event-level data piped directly into your warehouse. No more sampling warnings. No more “(other)” dimension bucketing. Full SQL access to every click, scroll, and purchase.
But the reality is messier. You’ll encounter nested arrays requiring UNNEST gymnastics, three different traffic source structures that populate under different conditions, undocumented schema changes that break production queries, and subtle gotchas that Google’s documentation glosses over entirely.
This reference is the practical companion to the official docs. It’s the field guide I wish existed when I first opened an events_ table and wondered why my session counts were wrong. Whether you’re setting up your first export or debugging a pipeline that’s been running for years, this guide covers what you actually need to know.
What Gets Exported
When you link a GA4 property to BigQuery, a dataset named analytics_<property_id> appears in your project. Inside, you’ll find date-sharded tables (not partitioned tables) that capture every event flowing through your property.
The Four Table Types
| Table | Purpose | Key Limitation |
|---|---|---|
| `events_YYYYMMDD` | Daily export, your source of truth | 10+ hour latency |
| `events_intraday_YYYYMMDD` | Streaming near-real-time data | Missing attribution fields |
| `pseudonymous_users_YYYYMMDD` | User export keyed by device ID | Optional, must enable separately |
| `users_YYYYMMDD` | User export keyed by custom `user_id` | Optional, must enable separately |
The daily tables (events_YYYYMMDD) contain fully processed data with complete user attribution. They typically arrive mid-afternoon in your property’s timezone, though there’s no guaranteed SLA for standard properties. These tables continue updating for up to 72 hours to capture late-arriving events (mobile app data and Measurement Protocol hits often arrive delayed).
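You can observe the 72-hour late-data window in your own project using BigQuery's legacy `__TABLES__` metadata view, which records when each shard was last rewritten (the project and dataset names below are placeholders):

```sql
-- Check when each daily shard was last modified (times are epoch millis).
-- Shards still being updated for late events will show recent timestamps.
SELECT
  table_id,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified,
  row_count
FROM `project.analytics_123456789.__TABLES__`
WHERE table_id LIKE 'events_2026%'
ORDER BY table_id DESC
```

If a shard from three days ago was modified hours after its creation date, that gap is the late-arriving mobile and Measurement Protocol data.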
The streaming tables (events_intraday_YYYYMMDD) provide data within minutes of event occurrence. They’re deleted automatically once the corresponding daily table completes. Critical limitation: the traffic_source, user_ltv, and is_active_user fields are never populated in intraday tables.
The user tables export one row per user identifier with audience memberships, lifetime value metrics, and predictive scores. The pseudonymous_users_ table keys on user_pseudo_id (device ID), while users_ keys on your custom user_id. Both require explicit enablement in the BigQuery linking settings.
Daily vs Streaming: When to Use Which
```mermaid
timeline
    title GA4 Export Timeline (Day N)
    section Day N
        Events occur : User activity throughout the day
        Intraday available : Within minutes of each event
    section Day N+1
        Daily export completes : ~10-16 hours after midnight
        Intraday table deleted : After daily export finalizes
    section Day N+3
        Late data window closes : 72 hours for stragglers
```

The choice isn't either/or. Enable both exports and use them for different purposes.
Daily exports are your source of truth for reporting. They include complete user attribution and guaranteed data completeness (after the 72-hour window). The tradeoff is latency: you won’t see yesterday’s data until mid-afternoon today.
Streaming exports enable same-day monitoring and real-time dashboards. Expect 0.5-2% discrepancy between intraday and final daily data, with some edge cases showing up to 20% differences. Don’t use intraday tables for final reporting.
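A common pattern is to combine yesterday's finalized daily table with today's intraday table in one query. One caution, stated here as a sketch you should verify against your own dataset: the `events_*` wildcard also matches `events_intraday_*` tables (its suffix starts with `intraday_`), so filter `_TABLE_SUFFIX` explicitly to avoid double counting:

```sql
-- Yesterday from the daily table, today from the streaming table.
-- The events_* wildcard matches intraday shards too, so the suffix
-- filter below is what prevents overlap.
SELECT event_name, COUNT(*) AS events
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
   OR _TABLE_SUFFIX = CONCAT('intraday_', FORMAT_DATE('%Y%m%d', CURRENT_DATE()))
GROUP BY event_name
```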
Cost differs significantly. Daily batch loading is free—you pay only for storage and queries. Streaming ingestion costs $0.05 per GB, which covers roughly 600,000 events. A site with 1 million daily events pays approximately $1.50/month for streaming.
Standard properties face a 1 million events per day limit on daily exports (no limit on streaming). GA4 360 properties can export up to 20 billion events daily.
The Events Table Schema
Every row in the events table represents a single event. This is the fundamental shift from Universal Analytics, where one row represented one session with events nested inside. Understanding this event-centric model is essential.
```
events_YYYYMMDD (one row = one event)
│
├── Top-level fields
│   ├── event_name, event_timestamp, event_date
│   ├── user_pseudo_id, user_id
│   └── platform, stream_id
│
├── Nested RECORDs (single values)
│   ├── device.category, device.browser, device.os...
│   ├── geo.country, geo.city, geo.region...
│   ├── traffic_source.source, .medium, .name
│   ├── collected_traffic_source.manual_source...
│   └── privacy_info.analytics_storage...
│
└── REPEATED RECORDs (arrays, require UNNEST)
    ├── event_params[] ─── key + value.string_value/int_value/double_value
    ├── user_properties[] ─── key + value + set_timestamp_micros
    └── items[] ─── item_id, item_name, price, quantity...
        └── items.item_params[] (nested array)
```

Core Event Fields
The foundational fields appear at the top level of every row:
| Field | Type | Description |
|---|---|---|
| `event_name` | STRING | Event name (`page_view`, `purchase`, custom events) |
| `event_timestamp` | INTEGER | UTC time in microseconds since the Unix epoch |
| `event_date` | STRING | Date in property timezone (YYYYMMDD format) |
| `event_value_in_usd` | FLOAT | Currency-converted value when the event includes a `value` param |
| `event_bundle_sequence_id` | INTEGER | Sequence within the upload bundle |
Added in July 2024, three batch-ordering fields solve a longstanding problem: events batched together previously shared identical timestamps with no way to determine their original sequence.
| Field | Purpose |
|---|---|
| `batch_event_index` | Order within the batch |
| `batch_page_id` | Groups events from the same page |
| `batch_ordering_id` | Unique ordering identifier |
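Here is a sketch of how the batch fields restore ordering when timestamps tie. The user ID is hypothetical, and the fields only exist on exports generated after their July 2024 rollout:

```sql
-- Reconstruct event order within a visit: sort by timestamp first,
-- then break ties with the batch sequencing fields.
SELECT
  event_timestamp,
  event_name,
  batch_page_id,
  batch_ordering_id,
  batch_event_index
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
  AND user_pseudo_id = '123456.789'  -- hypothetical user_pseudo_id
ORDER BY event_timestamp, batch_page_id, batch_ordering_id, batch_event_index
```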
The Nested event_params Structure
Most of your data lives here. The event_params field is a REPEATED RECORD, meaning it’s an array of key-value pairs attached to each event.
Each element contains a key (STRING) and a value RECORD with four possible fields:
| Value Field | Type | Used For |
|---|---|---|
| `string_value` | STRING | Text values, URLs, categories |
| `int_value` | INTEGER | Counts, IDs, boolean flags (0/1) |
| `float_value` | FLOAT | Currently unused by GA4 |
| `double_value` | FLOAT | Decimal values, prices |
GA4 auto-detects which field to use. The ga_session_id lands in int_value. The page_location lands in string_value. Custom parameters may land anywhere depending on the values you send.
Common parameters you’ll extract frequently:
| Parameter | Value Field | Description |
|---|---|---|
| `ga_session_id` | `int_value` | Session identifier (Unix timestamp of session start) |
| `ga_session_number` | `int_value` | Session count for this user |
| `page_location` | `string_value` | Full page URL |
| `page_title` | `string_value` | Page title |
| `page_referrer` | `string_value` | Previous page URL |
| `source` | `string_value` | Traffic source |
| `medium` | `string_value` | Traffic medium |
| `campaign` | `string_value` | Campaign name |
| `engaged_session_event` | `int_value` | 1 if session is engaged, 0 otherwise |
To extract a parameter, you’ll use UNNEST with a subquery:
```sql
SELECT
  event_name,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```

We'll cover UNNEST patterns in depth in Part 2 of this series.
User Identification
Three fields handle user identity at the top level:
| Field | Description |
|---|---|
| `user_pseudo_id` | Device/browser identifier (the value from the `_ga` cookie on web, Firebase Instance ID on apps) |
| `user_id` | Your custom identifier, set via `gtag('set', 'user_id', '...')` or equivalent |
| `user_first_touch_timestamp` | Microsecond timestamp of the user's first interaction with your property |
The user_properties REPEATED RECORD mirrors the event_params structure, storing user-scoped properties set via setUserProperty. Unlike event parameters, user properties persist across events and include a set_timestamp_micros field tracking when each was last updated.
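Because the structures mirror each other, the extraction pattern is the same as for event parameters. A sketch, where `plan_tier` is a hypothetical custom user property:

```sql
-- Pull a user property and the time it was last set.
-- set_timestamp_micros lives inside the value RECORD.
SELECT
  user_pseudo_id,
  (SELECT value.string_value
     FROM UNNEST(user_properties) WHERE key = 'plan_tier') AS plan_tier,
  (SELECT TIMESTAMP_MICROS(value.set_timestamp_micros)
     FROM UNNEST(user_properties) WHERE key = 'plan_tier') AS plan_tier_set_at
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```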
Device, Geography, and Platform
The device RECORD contains hardware and software details:
| Field | Example Values |
|---|---|
| `device.category` | mobile, tablet, desktop |
| `device.operating_system` | Android, iOS, Windows, Macintosh |
| `device.operating_system_version` | 14.0, 10 |
| `device.browser` | Chrome, Safari, Firefox |
| `device.browser_version` | 120.0.6099.109 |
| `device.mobile_brand_name` | Apple, Samsung, Google |
| `device.mobile_model_name` | iPhone, Pixel 7 |
The geo RECORD provides IP-based location:
| Field | Description |
|---|---|
| `geo.continent` | Americas, Europe, Asia |
| `geo.country` | United States, France, Japan |
| `geo.region` | California, Île-de-France |
| `geo.city` | San Francisco, Paris |
| `geo.metro` | DMA region (US only) |
The platform field (STRING) indicates web, iOS, or Android. The stream_id identifies which data stream generated the event.
Traffic Source Fields: Three Structures, Three Purposes
GA4 exports three different traffic source structures, each serving a distinct attribution purpose:
```mermaid
flowchart LR
    subgraph ts["traffic_source"]
        ts_when["When: Once per user lifetime"]
        ts_what["Attribution: First touch"]
        ts_note["⚠️ Never in intraday tables"]
    end
    subgraph cts["collected_traffic_source"]
        cts_when["When: Every event with UTMs"]
        cts_what["Attribution: Event-level"]
        cts_note["Contains: gclid, dclid, manual_*"]
    end
    subgraph stslc["session_traffic_source_last_click"]
        stslc_when["When: Per session"]
        stslc_what["Attribution: Last click"]
        stslc_note["Added: July 2024"]
    end
    user((User arrives)) --> ts
    event((Event fires)) --> cts
    session((Session starts)) --> stslc
```

traffic_source (RECORD): First-touch attribution showing how the user was originally acquired. Values never change after first set. Contains source, medium, name (campaign). Not populated in intraday tables.
collected_traffic_source (RECORD): Event-level UTM parameters and click IDs captured at the moment of the event. Contains manual_source, manual_medium, manual_campaign_name, plus click IDs like gclid, dclid, and srsltid.
session_traffic_source_last_click (RECORD): Added July 2024. Comprehensive last-click attribution with nested records for platform-specific campaign data: google_ads_campaign, sa360_campaign, dv360_campaign, cm360_campaign, and manual_campaign.
When building attribution models, understand which field answers your question. “How did we acquire this user?” → traffic_source. “What campaign was active when this specific event fired?” → collected_traffic_source. “What drove this session?” → session_traffic_source_last_click.
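To see how the three answers diverge in practice, select all three structures side by side. This is a sketch; in particular, the `manual_campaign` sub-record path is worth verifying against your own export's schema:

```sql
-- Three attribution answers for the same purchase event:
-- first touch, event-level UTMs, and session last click.
SELECT
  traffic_source.source            AS first_touch_source,
  collected_traffic_source.manual_source AS event_level_source,
  session_traffic_source_last_click.manual_campaign.source AS session_source
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
  AND event_name = 'purchase'
```

The three columns frequently disagree for the same row, which is expected: they answer different questions.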
E-commerce Structures
The ecommerce RECORD stores transaction-level data:
| Field | Description |
|---|---|
| `ecommerce.transaction_id` | Order ID |
| `ecommerce.purchase_revenue` | Revenue in local currency |
| `ecommerce.purchase_revenue_in_usd` | Revenue converted to USD |
| `ecommerce.shipping_value` | Shipping cost |
| `ecommerce.tax_value` | Tax amount |
| `ecommerce.total_item_quantity` | Total items in transaction |
The items REPEATED RECORD contains one entry per product:
| Field | Description |
|---|---|
| `items.item_id` | Product SKU |
| `items.item_name` | Product name |
| `items.item_brand` | Brand |
| `items.item_category` through `item_category5` | Category hierarchy |
| `items.price` | Unit price |
| `items.quantity` | Quantity |
| `items.coupon` | Applied coupon |
| `items.item_params` | Custom product parameters (REPEATED RECORD) |
The items.item_params field (added October 2023) is itself a nested REPEATED RECORD within items, enabling custom product dimensions.
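Reaching an item-scoped custom dimension therefore requires unnesting twice: once for `items`, then a correlated subquery over `item_params`. A sketch, where `fabric` is a hypothetical custom item parameter:

```sql
-- One row per purchased item, with a custom item-scoped dimension
-- extracted from the nested item_params array.
SELECT
  ecommerce.transaction_id,
  item.item_id,
  item.price,
  item.quantity,
  (SELECT value.string_value
     FROM UNNEST(item.item_params) WHERE key = 'fabric') AS fabric
FROM `project.analytics_123456789.events_*`,
  UNNEST(items) AS item
WHERE _TABLE_SUFFIX = '20260112'
  AND event_name = 'purchase'
```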
Privacy and Consent
The privacy_info RECORD captures consent state:
| Field | Values | Description |
|---|---|---|
| `privacy_info.ads_storage` | Yes, No, Unset | Ad cookie consent |
| `privacy_info.analytics_storage` | Yes, No, Unset | Analytics cookie consent |
| `privacy_info.uses_transient_token` | TRUE/FALSE | Using cookieless measurement |
When analytics_storage is denied under Consent Mode, events still collect but user_pseudo_id and ga_session_id are stripped. This creates orphaned rows that can’t be attributed to users or sessions.
Critical Gotchas
The event_params Type Problem
GA4 auto-detects parameter data types, storing values in whichever of the four value fields matches. Custom parameters may land in different fields depending on the values sent. Send “123” as a string once and it goes to string_value; send 123 as a number and it goes to int_value.
When uncertain about a parameter’s type, use COALESCE:
```sql
SELECT
  COALESCE(
    value.string_value,
    CAST(value.int_value AS STRING),
    CAST(value.double_value AS STRING)
  ) AS param_value
FROM UNNEST(event_params)
WHERE key = 'custom_param'
```

session_start Events Are Unreliable
⚠️ Warning: Do not count `session_start` events to calculate sessions.
The session_start event is evaluated client-side, triggered when an event includes the _ss parameter. This causes duplicate session_start events within single sessions and missing events in others. The problem is particularly acute for sub-properties where the session’s first event may occur before filters apply.
GA4’s own interface no longer uses session_start for its sessions metric. It estimates from unique session IDs instead. Do the same:
```sql
-- Correct: count distinct session IDs
SELECT
  COUNT(DISTINCT CONCAT(
    user_pseudo_id,
    CAST((SELECT value.int_value FROM UNNEST(event_params)
          WHERE key = 'ga_session_id') AS STRING)
  )) AS sessions
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260112'
```

ga_session_id Is Not Unique Across Users
The ga_session_id is simply the Unix timestamp of when the session started. Multiple users starting sessions at the same second share identical values. Always concatenate with user_pseudo_id to create truly unique session identifiers.
Timezone Traps
Three different timezone contexts coexist:
| Field | Timezone |
|---|---|
| `event_timestamp` | UTC (microseconds) |
| `event_date` | Property timezone (YYYYMMDD string) |
| `_TABLE_SUFFIX` | Pacific Time |
Converting to your local timezone:
```sql
SELECT
  DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Europe/Paris') AS local_time
FROM `project.analytics_123456789.events_*`
```

Consent Mode Creates Orphaned Data
When users deny analytics_storage consent, Advanced Consent Mode still collects events as “cookieless pings” but strips user_pseudo_id and ga_session_id. These orphaned rows inflate event counts while being impossible to attribute.
If consent is granted later on the same page, GA4 “backstitches” consent to previously denied hits. You’ll see this as user_pseudo_id becoming populated on events that previously lacked it.
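A quick way to gauge how much of your data is affected is to count rows with a stripped identifier. A sketch, assuming orphaned rows surface as NULL `user_pseudo_id` in your export:

```sql
-- Share of events that cannot be tied to a user (consent-stripped pings).
SELECT
  COUNTIF(user_pseudo_id IS NULL) AS orphaned_events,
  COUNT(*) AS total_events,
  ROUND(COUNTIF(user_pseudo_id IS NULL) / COUNT(*) * 100, 2) AS orphaned_pct
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```

If this percentage is material, expect your user and session counts to undercount relative to raw event volume.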
BigQuery Data Differs from GA4 UI
Expect discrepancies. BigQuery exports exclude:
- Modeled conversions from Consent Mode
- Google Signals user deduplication
- Behavioral modeling for missing data
The UI uses HyperLogLog++ approximation for large cardinalities while BigQuery provides exact counts. High-cardinality dimensions showing “(other)” in the UI appear with full detail in BigQuery.
The User-Provided Data Trap
⚠️ Critical Warning: Enabling the "User-provided data" feature in GA4 admin permanently disables `user_id` export to BigQuery with no reversal option.
If your pipelines depend on `user_id` in BigQuery, verify this setting before anyone enables it.
Schema Evolution and UA Comparison
How the Schema Has Changed
GA4’s BigQuery export launched around 2019 under the “App+Web” name, inheriting its structure from Firebase’s export schema. Google provides no official changelog, so practitioners must monitor INFORMATION_SCHEMA queries to catch additions.
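A simple monitoring approach is to snapshot the column list of a recent shard and diff it over time. The project and dataset names here are placeholders:

```sql
-- List every column on one daily shard. Store this output and diff
-- it periodically to catch undocumented schema additions.
SELECT column_name, data_type
FROM `project.analytics_123456789.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'events_20260112'
ORDER BY ordinal_position
```

Scheduling this as a daily query into a history table turns silent schema changes into a visible diff.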
Major schema additions by approximate date:
| Timeframe | Addition |
|---|---|
| March 2020 | ecommerce RECORD |
| June 2021 | privacy_info for Consent Mode |
| May 2023 | collected_traffic_source for event-level attribution |
| July 2023 | is_active_user field |
| October 2023 | items.item_params nested RECORD |
| July 2024 | Batch sequencing fields, session_traffic_source_last_click |
New fields are never retroactive. Historical tables won’t contain fields added after their export date.
Key Differences from Universal Analytics
If you’re migrating from UA, the mental model shift is significant:
| Aspect | Universal Analytics | GA4 |
|---|---|---|
| Core unit | Session | Event |
| Table name | ga_sessions_YYYYMMDD | events_YYYYMMDD |
| Row represents | One session | One event |
| Session ID | visitId (direct field) | ga_session_id (requires UNNEST) |
| User ID | fullVisitorId | user_pseudo_id |
| Page data | hits.page.pagePath | page_location in event_params |
| Custom dimensions | Indexed array (dimension1, dimension2) | Key-value pairs |
| Availability | GA360 only | All properties |
What’s gone entirely: hits.isExit, pre-calculated session metrics (totals.bounces, totals.pageviews), session-scoped custom dimensions (GA4 supports only user and event scope).
Migration reality: UA and GA4 data cannot be directly combined. The fundamental model difference means historical UA data must remain in separate tables. Any unified analysis requires custom transformation layers that map concepts between the two schemas.
Querying Efficiently
Always Filter _TABLE_SUFFIX
GA4 uses date-sharded tables, not partitioned tables. Without a _TABLE_SUFFIX filter, you’ll scan your entire dataset, potentially years of data.
```sql
-- Query last 30 days
SELECT event_name, COUNT(*) AS events
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
                        AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY event_name
```

Inline UNNEST vs CROSS JOIN UNNEST
Two patterns for working with nested data:
Inline UNNEST extracts specific parameters while maintaining one row per event. Use this for most queries:
```sql
SELECT
  event_name,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
FROM `project.analytics_123456789.events_*`
```

CROSS JOIN UNNEST expands rows, creating one row per parameter. Use it for exploration:
```sql
SELECT ep.key, COUNT(*) AS occurrences
FROM `project.analytics_123456789.events_*`,
  UNNEST(event_params) AS ep
GROUP BY ep.key
ORDER BY occurrences DESC
```

Reusable Helper Macros
Reduce repetition with dbt macros:
```sql
-- macros/ga4_param_string.sql
{% macro ga4_param_string(column, param_name) %}
  (SELECT value.string_value FROM UNNEST({{ column }}) WHERE key = '{{ param_name }}')
{% endmacro %}

-- macros/ga4_param_int.sql
{% macro ga4_param_int(column, param_name) %}
  (SELECT value.int_value FROM UNNEST({{ column }}) WHERE key = '{{ param_name }}')
{% endmacro %}
```

Usage in a dbt model:
```sql
SELECT
  {{ ga4_param_int('event_params', 'ga_session_id') }} AS session_id,
  {{ ga4_param_string('event_params', 'page_location') }} AS page_location
FROM {{ source('ga4', 'events') }}
WHERE _TABLE_SUFFIX = '20260112'
```

Cost Control Checklist
- Select only needed columns: avoid `SELECT *` on tables with nested structures
- Run dry runs first: estimate cost before executing large queries
- Materialize frequent aggregations: use scheduled queries to create summary tables
- Use `APPROX_COUNT_DISTINCT` when exact precision isn't required for user counts
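As a sketch of the last point, swapping exact distinct counts for the HyperLogLog++-based approximation is a one-line change:

```sql
-- Approximate user count: same scan cost, but far cheaper to compute
-- at high cardinality; typically within ~1% of the exact figure.
SELECT APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_users
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260112'
```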
Key Takeaways
The GA4 BigQuery export provides raw analytics data access, but its event-centric model and nested structures require a deliberate approach:
- Enable both daily and streaming exports. Use daily as your source of truth, streaming for same-day monitoring.
- Master the inline UNNEST pattern. You’ll use it constantly for parameter extraction.
- Always combine
user_pseudo_id+ga_session_idfor session analysis. Neither field is unique on its own. - Never count
session_startevents for session metrics. Count distinct session IDs instead. - Treat the schema as a moving target. Monitor for field additions, test queries after major GA4 releases.
Next in This Series
Now that you understand the schema, the real work begins: flattening those nested structures efficiently. Part 2: Unnesting GA4 Events covers patterns for every use case, from simple parameter extraction to complex multi-level unnesting for e-commerce items with custom dimensions.