GA4 is architecturally different from Universal Analytics. UA gave you sessions as the unit of analysis — one row per session with events nested inside. GA4 inverts this: everything is an event, and sessions are something you build afterward from event-level data.
The Event-Centric Model
When GA4 data arrives in BigQuery, the events_YYYYMMDD table has one row per event — every click, scroll, page view, and purchase fires as a distinct row. There are no pre-aggregated session metrics and no hierarchy. Session definitions, engagement calculations, and session tables must be constructed in the transformation layer from raw event data.
Nested Structures: The Schema Reality
GA4 doesn’t flatten everything into columns. It uses nested RECORD fields and repeated arrays to keep related data together and avoid schema explosion. The top-level schema looks clean:
event_nameevent_timestampevent_dateuser_pseudo_iduser_idBut the actual data lives deeper. Most of your analytics value sits in two repeated fields:
event_params is an array of key-value pairs where GA4 stores parameter data: page_location, page_title, ga_session_id, custom dimensions, engagement metrics—essentially everything that isn’t a top-level field. Each element has a key (the parameter name) and a value record with four possible fields: string_value, int_value, float_value, and double_value. GA4 auto-detects which field gets populated based on your data type.
user_properties mirrors this structure but at user scope—properties you set via setUserProperty that persist across sessions. Unlike event parameters, user properties include a set_timestamp_micros so you can track when they were last updated.
items (for e-commerce) is a separate repeated record. Each product in a purchase generates one item row with item_id, item_name, price, quantity, and optionally item_params (a nested repeated field for custom product dimensions added in October 2023).
Device, geography, and traffic source data are nested RECORDs—single values per event but structured hierarchically. device.category tells you desktop/mobile/tablet. geo.country tells you where the user was. traffic_source, collected_traffic_source, and session_traffic_source_last_click each provide different views of attribution.
Why This Structure Exists
The nesting isn’t accidental. GA4 inherits its export schema from Firebase, and the design prioritizes sparse data. Not every event has every parameter. If GA4 flattened all possible parameters into separate columns, you’d have thousands of columns with 99% NULLs. Nested arrays keep the schema manageable and let you query only what you need.
The downside is that most analytical queries require UNNEST patterns. You can’t write SELECT page_location. You write:
SELECT (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_locationFROM events_*This is the cost of the event-based model. But once you internalize the patterns, it becomes mechanical.
The Two Unnesting Approaches
Correlated subqueries (the pattern above) extract specific parameters without expanding rows. An event with 20 parameters still produces one output row. The subquery reaches into the array, finds the key you want, and returns the value. Use this for most queries.
FROM clause UNNEST expands arrays into separate rows—one row per array element. This is useful for discovery, parameter auditing, or when you actually need to analyze the array structure itself. But it creates a row explosion: 15 parameters become 15 rows per event.
The mental model: use correlated subqueries when you know what parameter you want. Use FROM clause UNNEST when you’re exploring or need to work across the array elements.
Session Identity: The Composite Key Problem
The ga_session_id parameter lives in event_params.int_value. It’s the Unix timestamp of when the session started—a practical identifier because it’s present on every event and tied to GA4’s session definition.
But here’s the trap: ga_session_id is not globally unique. Multiple users starting sessions at the same second share identical values. If you use ga_session_id alone to deduplicate, you’ll accidentally merge events from different users into the same “session.”
The fix is mechanical but critical: combine user_pseudo_id + ga_session_id into a composite session key.
CONCAT(user_pseudo_id, '.', CAST( (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)) AS session_keyThis composite key appears throughout session-building logic—window function PARTITION BY clauses, session-level aggregations, and any analysis that depends on grouping events into cohesive sessions.
Differences from Universal Analytics
| UA | GA4 | |
|---|---|---|
| Row grain | One row per session | One row per event |
| Session IDs | Direct fields | Extracted from event_params arrays |
| Metrics | Pre-calculated (totals.pageviews, totals.bounces) | Computed at query time or transformation layer |
| Custom dimensions | Indexed slots (dimension1, dimension2) | Key-value pairs in event_params |
The structural difference affects where metric computation happens: UA exposed pre-built metrics; GA4 exposes raw events and delegates aggregation to the query or transformation layer.
When to Materialize vs When to Query Live
For one-off exploration, correlated subqueries are fine. For production reporting, materialization is preferable: parameters are extracted once and stored in an enriched events table with session context already attached. Every downstream query runs against flattened data.
The standard pattern: raw events → intermediate sessionized events table (parameters extracted, session context added via window functions) → final marts for dashboards and reporting. UNNEST complexity is concentrated in one place; downstream models inherit flat, typed columns.
Traffic Source Fields: Three Different Structures
GA4 provides three traffic source records, each serving a different purpose. Understanding which to use prevents attribution errors:
traffic_source: User-level, first-touch attribution. The source that originally acquired the user. Never in intraday tables.collected_traffic_source: Event-level raw collection data with UTM parameters and click IDs. Useful for custom attribution modeling.session_traffic_source_last_click(July 2024 onward): Session-level attribution with GA4’s last-non-direct model applied. Matches the GA4 interface.
For data from July 2024 forward, session_traffic_source_last_click is the recommended field for session attribution. Older data requires building session attribution from collected_traffic_source or traffic_source.
Practical Extraction Example
The pattern for extracting multiple parameters while keeping one row per event:
SELECT event_name, event_timestamp, (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location, (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS page_title, (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id, user_pseudo_id, device.category, geo.countryFROM `project.analytics_PROPERTY_ID.events_*`WHERE _TABLE_SUFFIX = '20260326'Each parameter extraction is independent — the subquery reaches into the array for a specific key and returns one value. This pattern scans the event_params array multiple times per event, but BigQuery’s nested query optimizer can collapse multiple correlated subqueries into a single pass.
For high-volume queries, the patterns guide covers performance optimization strategies: when to use CROSS JOIN UNNEST vs correlated subqueries, how to avoid Cartesian products when unnesting multiple arrays, and when materialization is worth the storage cost.
Related
- GA4 BigQuery Export: Complete Schema Reference — Every field explained, critical gotchas, schema evolution
- Unnesting GA4 Events: Patterns for Every Use Case — Production-ready SQL patterns, dbt templates, performance optimization
- Building Session Tables from GA4 Event Data — How to add session context to events using window functions