Adrienne Vermorel

GA4 BigQuery Export: The Complete Schema Reference

Part 1 of the GA4 + BigQuery series for analytics engineers


The GA4 BigQuery export delivers raw, unsampled, event-level data piped directly into your warehouse. No more sampling warnings. No more “(other)” dimension bucketing. Full SQL access to every click, scroll, and purchase.

But the reality is messier. You’ll encounter nested arrays requiring UNNEST gymnastics, three different traffic source structures that populate under different conditions, undocumented schema changes that break production queries, and subtle gotchas that Google’s documentation glosses over entirely.

This reference is the practical companion to the official docs. It’s the field guide I wish existed when I first opened an events_ table and wondered why my session counts were wrong. Whether you’re setting up your first export or debugging a pipeline that’s been running for years, this guide covers what you actually need to know.


What Gets Exported

When you link a GA4 property to BigQuery, a dataset named analytics_<property_id> appears in your project. Inside, you’ll find date-sharded tables (not partitioned tables) that capture every event flowing through your property.

The Four Table Types

| Table | Purpose | Key Limitation |
| --- | --- | --- |
| events_YYYYMMDD | Daily export, your source of truth | 10+ hour latency |
| events_intraday_YYYYMMDD | Streaming near-real-time data | Missing attribution fields |
| pseudonymous_users_YYYYMMDD | User export keyed by device ID | Optional, must enable separately |
| users_YYYYMMDD | User export keyed by custom user_id | Optional, must enable separately |

The daily tables (events_YYYYMMDD) contain fully processed data with complete user attribution. They typically arrive mid-afternoon in your property’s timezone, though there’s no guaranteed SLA for standard properties. These tables continue updating for up to 72 hours to capture late-arriving events (mobile app data and Measurement Protocol hits often arrive delayed).

The streaming tables (events_intraday_YYYYMMDD) provide data within minutes of event occurrence. They’re deleted automatically once the corresponding daily table completes. Critical limitation: the traffic_source, user_ltv, and is_active_user fields are never populated in intraday tables.

The user tables export one row per user identifier with audience memberships, lifetime value metrics, and predictive scores. The pseudonymous_users_ table keys on user_pseudo_id (device ID), while users_ keys on your custom user_id. Both require explicit enablement in the BigQuery linking settings.
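Audience membership is one of the most immediately useful pieces of the user export. A minimal sketch, assuming the documented audiences REPEATED RECORD (with its name field) and placeholder project, dataset, and shard date:

```sql
-- Count members per audience in a user export shard
-- (audiences is a REPEATED RECORD in the pseudonymous_users_ schema)
SELECT
  audience.name AS audience_name,
  COUNT(DISTINCT pseudo_user_id) AS members
FROM `project.analytics_123456789.pseudonymous_users_*`,
  UNNEST(audiences) AS audience
WHERE _TABLE_SUFFIX = '20260112'
GROUP BY audience_name
ORDER BY members DESC
```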

Daily vs Streaming: When to Use Which

```mermaid
timeline
    title GA4 Export Timeline (Day N)
    section Day N
        Events occur : User activity throughout the day
        Intraday available : Within minutes of each event
    section Day N+1
        Daily export completes : ~10-16 hours after midnight
        Intraday table deleted : After daily export finalizes
    section Day N+3
        Late data window closes : 72 hours for stragglers
```

The choice isn’t either/or. Enable both exports and use them for different purposes.

Daily exports are your source of truth for reporting. They include complete user attribution and guaranteed data completeness (after the 72-hour window). The tradeoff is latency: you won’t see yesterday’s data until mid-afternoon today.

Streaming exports enable same-day monitoring and real-time dashboards. Expect 0.5-2% discrepancy between intraday and final daily data, with some edge cases showing up to 20% differences. Don’t use intraday tables for final reporting.
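A common pattern stitches the two together: finalized days from the daily tables, plus today from the intraday table. A sketch with placeholder names and dates:

```sql
-- Finalized history from the daily export...
SELECT event_date, COUNT(*) AS events
FROM `project.analytics_123456789.events_*`
-- note: the events_* wildcard also matches events_intraday_ tables;
-- the date-range filter on _TABLE_SUFFIX excludes them here
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260111'
GROUP BY event_date

UNION ALL

-- ...plus today's still-accumulating data from the intraday table
SELECT event_date, COUNT(*) AS events
FROM `project.analytics_123456789.events_intraday_*`
WHERE _TABLE_SUFFIX = '20260112'
GROUP BY event_date
```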

Cost differs significantly. Daily batch loading is free; you pay only for storage and queries. Streaming ingestion costs $0.05 per GB, roughly 600,000 events. A site with 1 million daily events streams about 1.7 GB per day, or approximately $2.50/month.

Standard properties face a 1 million events per day limit on daily exports (no limit on streaming). GA4 360 properties can export up to 20 billion events daily.


The Events Table Schema

Every row in the events table represents a single event. This is the fundamental shift from Universal Analytics, where one row represented one session with events nested inside. Understanding this event-centric model is essential.

```
events_YYYYMMDD (one row = one event)
├── Top-level fields
│   ├── event_name, event_timestamp, event_date
│   ├── user_pseudo_id, user_id
│   └── platform, stream_id
├── Nested RECORDs (single values)
│   ├── device.category, device.browser, device.os...
│   ├── geo.country, geo.city, geo.region...
│   ├── traffic_source.source, .medium, .name
│   ├── collected_traffic_source.manual_source...
│   └── privacy_info.analytics_storage...
└── REPEATED RECORDs (arrays, require UNNEST)
    ├── event_params[] ─── key + value.string_value/int_value/double_value
    ├── user_properties[] ─── key + value + set_timestamp_micros
    └── items[] ─── item_id, item_name, price, quantity...
        └── items.item_params[] (nested array)
```

Core Event Fields

The foundational fields appear at the top level of every row:

| Field | Type | Description |
| --- | --- | --- |
| event_name | STRING | Event name (page_view, purchase, custom events) |
| event_timestamp | INTEGER | UTC time in microseconds since Unix epoch |
| event_date | STRING | Date in property timezone (YYYYMMDD format) |
| event_value_in_usd | FLOAT | Currency-converted value when event includes value param |
| event_bundle_sequence_id | INTEGER | Sequence within the upload bundle |

Added in July 2024, three batch-ordering fields solve a longstanding problem: events batched together previously shared identical timestamps with no way to determine their original sequence.

| Field | Purpose |
| --- | --- |
| batch_event_index | Order within the batch |
| batch_page_id | Groups events from the same page |
| batch_ordering_id | Unique ordering identifier |
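Put together, the three fields give you a stable sort order even when batched events share a timestamp. A sketch (project, dataset, shard date, and the user ID are placeholders):

```sql
-- Reconstruct the original firing order of one user's events
SELECT
  event_name,
  event_timestamp,
  batch_page_id,
  batch_ordering_id,
  batch_event_index
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
  AND user_pseudo_id = '1234567890.1234567890'  -- placeholder user
ORDER BY event_timestamp, batch_page_id, batch_ordering_id, batch_event_index
```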

The Nested event_params Structure

Most of your data lives here. The event_params field is a REPEATED RECORD, meaning it’s an array of key-value pairs attached to each event.

Each element contains a key (STRING) and a value RECORD with four possible fields:

| Value Field | Type | Used For |
| --- | --- | --- |
| string_value | STRING | Text values, URLs, categories |
| int_value | INTEGER | Counts, IDs, boolean flags (0/1) |
| float_value | FLOAT | Currently unused by GA4 |
| double_value | FLOAT | Decimal values, prices |

GA4 auto-detects which field to use. The ga_session_id lands in int_value. The page_location lands in string_value. Custom parameters may land anywhere depending on the values you send.

Common parameters you’ll extract frequently:

| Parameter | Value Field | Description |
| --- | --- | --- |
| ga_session_id | int_value | Session identifier (Unix timestamp of session start) |
| ga_session_number | int_value | Session count for this user |
| page_location | string_value | Full page URL |
| page_title | string_value | Page title |
| page_referrer | string_value | Previous page URL |
| source | string_value | Traffic source |
| medium | string_value | Traffic medium |
| campaign | string_value | Campaign name |
| engaged_session_event | int_value | 1 if session is engaged, 0 otherwise |

To extract a parameter, you’ll use UNNEST with a subquery:

```sql
SELECT
  event_name,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```

We’ll cover UNNEST patterns in depth in Part 2 of this series.

User Identification

Three fields handle user identity at the top level:

| Field | Description |
| --- | --- |
| user_pseudo_id | Device/browser identifier (the value from the _ga cookie on web, Firebase Instance ID on apps) |
| user_id | Your custom identifier, set via gtag('set', 'user_id', '...') or equivalent |
| user_first_touch_timestamp | Microsecond timestamp of user’s first interaction with your property |

The user_properties REPEATED RECORD mirrors the event_params structure, storing user-scoped properties set via setUserProperty. Unlike event parameters, user properties persist across events and include a set_timestamp_micros field tracking when each was last updated.
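Extraction works exactly like event_params, with the set timestamp available inside the value record. A sketch using a hypothetical plan_tier user property and placeholder table names:

```sql
-- Pull a user-scoped property and when it was last set
SELECT
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(user_properties) WHERE key = 'plan_tier') AS plan_tier,
  (SELECT value.set_timestamp_micros FROM UNNEST(user_properties) WHERE key = 'plan_tier') AS plan_tier_set_at
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```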

Device, Geography, and Platform

The device RECORD contains hardware and software details:

| Field | Example Values |
| --- | --- |
| device.category | mobile, tablet, desktop |
| device.operating_system | Android, iOS, Windows, Macintosh |
| device.operating_system_version | 14.0, 10 |
| device.browser | Chrome, Safari, Firefox |
| device.browser_version | 120.0.6099.109 |
| device.mobile_brand_name | Apple, Samsung, Google |
| device.mobile_model_name | iPhone, Pixel 7 |

The geo RECORD provides IP-based location:

| Field | Description |
| --- | --- |
| geo.continent | Americas, Europe, Asia |
| geo.country | United States, France, Japan |
| geo.region | California, Île-de-France |
| geo.city | San Francisco, Paris |
| geo.metro | DMA region (US only) |

The platform field (STRING) indicates web, iOS, or Android. The stream_id identifies which data stream generated the event.

Traffic Source Fields: Three Structures, Three Purposes

GA4 exports three different traffic source structures, each serving a distinct attribution purpose:

```mermaid
flowchart LR
    subgraph ts["traffic_source"]
        ts_when["When: Once per user lifetime"]
        ts_what["Attribution: First touch"]
        ts_note["⚠️ Never in intraday tables"]
    end
    subgraph cts["collected_traffic_source"]
        cts_when["When: Every event with UTMs"]
        cts_what["Attribution: Event-level"]
        cts_note["Contains: gclid, dclid, manual_*"]
    end
    subgraph stslc["session_traffic_source_last_click"]
        stslc_when["When: Per session"]
        stslc_what["Attribution: Last click"]
        stslc_note["Added: July 2024"]
    end
    user((User arrives)) --> ts
    event((Event fires)) --> cts
    session((Session starts)) --> stslc
```

traffic_source (RECORD): First-touch attribution showing how the user was originally acquired. Values never change after first set. Contains source, medium, name (campaign). Not populated in intraday tables.

collected_traffic_source (RECORD): Event-level UTM parameters and click IDs captured at the moment of the event. Contains manual_source, manual_medium, manual_campaign_name, plus click IDs like gclid, dclid, and srsltid.

session_traffic_source_last_click (RECORD): Added July 2024. Comprehensive last-click attribution with nested records for platform-specific campaign data: google_ads_campaign, sa360_campaign, dv360_campaign, cm360_campaign, and manual_campaign.

When building attribution models, understand which field answers your question. “How did we acquire this user?” → traffic_source. “What campaign was active when this specific event fired?” → collected_traffic_source. “What drove this session?” → session_traffic_source_last_click.
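Selecting the three side by side makes the differences concrete. A sketch with placeholder names:

```sql
SELECT
  traffic_source.source AS first_touch_source,              -- how the user was acquired
  collected_traffic_source.manual_source AS event_source,   -- UTMs on this specific event
  session_traffic_source_last_click.manual_campaign.source AS session_source  -- last click for the session
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```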

E-commerce Structures

The ecommerce RECORD stores transaction-level data:

| Field | Description |
| --- | --- |
| ecommerce.transaction_id | Order ID |
| ecommerce.purchase_revenue | Revenue in local currency |
| ecommerce.purchase_revenue_in_usd | Revenue converted to USD |
| ecommerce.shipping_value | Shipping cost |
| ecommerce.tax_value | Tax amount |
| ecommerce.total_item_quantity | Total items in transaction |

The items REPEATED RECORD contains one entry per product:

| Field | Description |
| --- | --- |
| items.item_id | Product SKU |
| items.item_name | Product name |
| items.item_brand | Brand |
| items.item_category through items.item_category5 | Category hierarchy |
| items.price | Unit price |
| items.quantity | Quantity |
| items.coupon | Applied coupon |
| items.item_params | Custom product parameters (REPEATED RECORD) |

The items.item_params field (added October 2023) is itself a nested REPEATED RECORD within items, enabling custom product dimensions.
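Reaching item_params takes two UNNEST levels: one for the items array, one for each item’s parameters. A sketch with placeholder names (the COALESCE handles parameters of either type):

```sql
SELECT
  item.item_id,
  item.item_name,
  ip.key AS item_param_key,
  COALESCE(ip.value.string_value, CAST(ip.value.int_value AS STRING)) AS item_param_value
FROM `project.analytics_123456789.events_*`,
  UNNEST(items) AS item,
  UNNEST(item.item_params) AS ip
WHERE _TABLE_SUFFIX = '20260112'
  AND event_name = 'purchase'
```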

The privacy_info RECORD captures consent state:

| Field | Values | Description |
| --- | --- | --- |
| privacy_info.ads_storage | Yes, No, Unset | Ad cookie consent |
| privacy_info.analytics_storage | Yes, No, Unset | Analytics cookie consent |
| privacy_info.uses_transient_token | TRUE/FALSE | Using cookieless measurement |

When analytics_storage is denied under Consent Mode, events still collect but user_pseudo_id and ga_session_id are stripped. This creates orphaned rows that can’t be attributed to users or sessions.
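You can measure how much of your data is affected. A minimal sketch with placeholder names:

```sql
-- Share of events with no user identifier (consent-stripped rows)
SELECT
  COUNTIF(user_pseudo_id IS NULL) AS orphaned_events,
  ROUND(100 * COUNTIF(user_pseudo_id IS NULL) / COUNT(*), 2) AS orphaned_pct
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```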


Critical Gotchas

The event_params Type Problem

GA4 auto-detects parameter data types, storing values in whichever of the four value fields matches. Custom parameters may land in different fields depending on the values sent. Send “123” as a string once and it goes to string_value; send 123 as a number and it goes to int_value.

When uncertain about a parameter’s type, use COALESCE:

```sql
SELECT
  event_name,
  (SELECT COALESCE(
      value.string_value,
      CAST(value.int_value AS STRING),
      CAST(value.double_value AS STRING)
    )
   FROM UNNEST(event_params) WHERE key = 'custom_param') AS param_value
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```

session_start Events Are Unreliable

⚠️ Warning: Do not count session_start events to calculate sessions.

The session_start event is evaluated client-side, triggered when an event includes the _ss parameter. This causes duplicate session_start events within single sessions and missing events in others. The problem is particularly acute for sub-properties where the session’s first event may occur before filters apply.

GA4’s own interface no longer uses session_start for its sessions metric. It estimates from unique session IDs instead. Do the same:

```sql
-- Correct: count distinct session IDs
-- (the '-' separator prevents collisions between different
-- user_pseudo_id/session_id combinations that concatenate identically)
SELECT
  COUNT(DISTINCT CONCAT(
    user_pseudo_id, '-',
    CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
  )) AS sessions
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260112'
```

ga_session_id Is Not Unique Across Users

The ga_session_id is simply the Unix timestamp of when the session started. Multiple users starting sessions at the same second share identical values. Always concatenate with user_pseudo_id to create truly unique session identifiers.

Timezone Traps

Three different timezone contexts coexist:

| Field | Timezone |
| --- | --- |
| event_timestamp | UTC (microseconds) |
| event_date | Property timezone (YYYYMMDD string) |
| _TABLE_SUFFIX | Pacific Time |

Converting to your local timezone:

```sql
SELECT
  DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Europe/Paris') AS local_time
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```

Consent Mode Creates Orphaned Rows

When users deny analytics_storage consent, Advanced Consent Mode still collects events as “cookieless pings” but strips user_pseudo_id and ga_session_id. These orphaned rows inflate event counts while remaining impossible to attribute.

If consent is granted later on the same page, GA4 “backstitches” consent to previously denied hits. You’ll see this as user_pseudo_id becoming populated on events that previously lacked it.

BigQuery Data Differs from GA4 UI

Expect discrepancies. BigQuery exports exclude:

  • Modeled conversions from Consent Mode
  • Google Signals user deduplication
  • Behavioral modeling for missing data

The UI uses HyperLogLog++ approximation for large cardinalities while BigQuery provides exact counts. High-cardinality dimensions showing “(other)” in the UI appear with full detail in BigQuery.

The User-Provided Data Trap

⚠️ Critical Warning: Enabling the “User-provided data” feature in GA4 admin permanently disables user_id export to BigQuery with no reversal option.

If your pipelines depend on user_id in BigQuery, verify this setting before anyone enables it.


Schema Evolution and UA Comparison

How the Schema Has Changed

GA4’s BigQuery export launched around 2019 under the “App+Web” name, inheriting its structure from Firebase’s export schema. Google provides no official changelog, so practitioners must watch for additions themselves, for example by querying INFORMATION_SCHEMA.
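A minimal monitoring sketch (project, dataset, and shard date are placeholders): snapshot the column list and diff it against the previous run.

```sql
-- Current columns of a daily shard; save the output and diff over time
SELECT column_name, data_type
FROM `project.analytics_123456789.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'events_20260112'
ORDER BY ordinal_position
```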

Major schema additions by approximate date:

| Timeframe | Addition |
| --- | --- |
| March 2020 | ecommerce RECORD |
| June 2021 | privacy_info for Consent Mode |
| May 2023 | collected_traffic_source for event-level attribution |
| July 2023 | is_active_user field |
| October 2023 | items.item_params nested RECORD |
| July 2024 | Batch sequencing fields, session_traffic_source_last_click |

New fields are never retroactive. Historical tables won’t contain fields added after their export date.

Key Differences from Universal Analytics

If you’re migrating from UA, the mental model shift is significant:

| Aspect | Universal Analytics | GA4 |
| --- | --- | --- |
| Core unit | Session | Event |
| Table name | ga_sessions_YYYYMMDD | events_YYYYMMDD |
| Row represents | One session | One event |
| Session ID | visitId (direct field) | ga_session_id (requires UNNEST) |
| User ID | fullVisitorId | user_pseudo_id |
| Page data | hits.page.pagePath | page_location in event_params |
| Custom dimensions | Indexed array (dimension1, dimension2) | Key-value pairs |
| Availability | GA360 only | All properties |

What’s gone entirely: hits.isExit, pre-calculated session metrics (totals.bounces, totals.pageviews), session-scoped custom dimensions (GA4 supports only user and event scope).
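The pre-calculated metrics can be rebuilt from raw events. A sketch reconstructing per-session pageviews (the old totals.pageviews), with placeholder names:

```sql
SELECT
  CONCAT(
    user_pseudo_id, '-',
    CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
  ) AS session_key,
  COUNTIF(event_name = 'page_view') AS pageviews
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
GROUP BY session_key
```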

Migration reality: UA and GA4 data cannot be directly combined. The fundamental model difference means historical UA data must remain in separate tables. Any unified analysis requires custom transformation layers that map concepts between the two schemas.


Querying Efficiently

Always Filter _TABLE_SUFFIX

GA4 uses date-sharded tables, not partitioned tables. Without a _TABLE_SUFFIX filter, you’ll scan your entire dataset, potentially years of data.

```sql
-- Query last 30 days
SELECT
  event_name,
  COUNT(*) AS events
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN
  FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY))
  AND FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY event_name
```

Inline UNNEST vs CROSS JOIN UNNEST

Two patterns for working with nested data:

Inline UNNEST extracts specific parameters while maintaining one row per event. Use this for most queries:

```sql
SELECT
  event_name,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20260112'
```

CROSS JOIN UNNEST expands rows, creating one row per parameter. Use for exploration:

```sql
SELECT
  ep.key,
  COUNT(*) AS occurrences
FROM `project.analytics_123456789.events_*`,
  UNNEST(event_params) AS ep
WHERE _TABLE_SUFFIX = '20260112'
GROUP BY ep.key
ORDER BY occurrences DESC
```

Reusable Helper Macros

Reduce repetition with dbt macros:

```sql
-- macros/ga4_param_string.sql
{% macro ga4_param_string(column, param_name) %}
  (SELECT value.string_value FROM UNNEST({{ column }}) WHERE key = '{{ param_name }}')
{% endmacro %}

-- macros/ga4_param_int.sql
{% macro ga4_param_int(column, param_name) %}
  (SELECT value.int_value FROM UNNEST({{ column }}) WHERE key = '{{ param_name }}')
{% endmacro %}
```

Usage in a dbt model:

```sql
SELECT
  {{ ga4_param_int('event_params', 'ga_session_id') }} AS session_id,
  {{ ga4_param_string('event_params', 'page_location') }} AS page_location
FROM {{ source('ga4', 'events') }}
WHERE _TABLE_SUFFIX = '20260112'
```

Cost Control Checklist

  • Select only needed columns: avoid SELECT * on tables with nested structures
  • Run dry runs first: estimate cost before executing large queries
  • Materialize frequent aggregations: scheduled queries creating summary tables (see the sketch after this list)
  • Use APPROX_COUNT_DISTINCT: when exact precision isn’t required for user counts
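A sketch of the materialization pattern, assuming a hypothetical project.reporting dataset; a scheduled query would run it daily:

```sql
-- Summarize yesterday's shard into a small reporting table so
-- dashboards never scan the raw export
-- (a production version would append or MERGE to keep history)
CREATE OR REPLACE TABLE `project.reporting.daily_event_summary` AS
SELECT
  PARSE_DATE('%Y%m%d', event_date) AS export_date,
  event_name,
  COUNT(*) AS events,
  APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_users
FROM `project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
GROUP BY export_date, event_name
```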

Key Takeaways

The GA4 BigQuery export provides raw analytics data access, but its event-centric model and nested structures require a deliberate approach:

  1. Enable both daily and streaming exports. Use daily as your source of truth, streaming for same-day monitoring.
  2. Master the inline UNNEST pattern. You’ll use it constantly for parameter extraction.
  3. Always combine user_pseudo_id + ga_session_id for session analysis. Neither field is unique on its own.
  4. Never count session_start events for session metrics. Count distinct session IDs instead.
  5. Treat the schema as a moving target. Monitor for field additions, test queries after major GA4 releases.

Next in This Series

Now that you understand the schema, the real work begins: flattening those nested structures efficiently. Part 2: Unnesting GA4 Events covers patterns for every use case, from simple parameter extraction to complex multi-level unnesting for e-commerce items with custom dimensions.