Adrienne Vermorel
Your First GA4 dbt Models: From Raw Events to Sessions
GA4’s BigQuery export looks nothing like Universal Analytics. Instead of one row per session, you get one row per event. Instead of clean columns, you get nested arrays. Instead of a session table, you get nothing. Sessions don’t exist in the raw data; you construct them yourself.
Most tutorials teach you to aggregate to sessions immediately. That works for basic dashboards, but it kills your flexibility. Want to analyze the event sequence leading to purchase? Can’t do it with session-level data. Need to experiment with different attribution windows? Too late, the aggregation already happened.
This tutorial takes a different approach. We’ll build a wide intermediate model that enriches every event with session context (conversion flags, timing metrics, attribution data) without losing event-level granularity. Aggregation happens only in the final mart layer, when you actually need it.
By the end, you’ll have a reusable pattern that works for any GA4 property.
Prerequisites
Before we start, you’ll need:
- A dbt project connected to BigQuery
- GA4 BigQuery export enabled with at least a few days of data
- Basic SQL comfort with CTEs, JOINs, and window functions
No GA4 export yet? Google provides a public demo dataset that works perfectly for this tutorial:
```
bigquery-public-data.ga4_obfuscated_sample_ecommerce
```

Everything we build works against it.
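If you want to sanity-check access before building anything, a quick query works (note that the demo dataset's shards cover roughly November 2020 through January 2021, so use dates in that window):

```sql
-- Events per day in the public demo dataset
-- (shards cover roughly 20201101–20210131)
SELECT
  event_date,
  COUNT(*) AS event_count
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20210101' AND '20210107'
GROUP BY event_date
ORDER BY event_date
```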
Understanding the GA4 Schema
One row = one event (not one session)
This is the fundamental shift from Universal Analytics. Each row represents a single event: a page view, a button click, a purchase. A user’s session might generate dozens of rows.
Here’s what a few rows look like:
| event_date | event_name | event_timestamp | user_pseudo_id |
|---|---|---|---|
| 20240115 | page_view | 1705312845123456 | 1234567.8901234 |
| 20240115 | scroll | 1705312847234567 | 1234567.8901234 |
| 20240115 | purchase | 1705312892345678 | 1234567.8901234 |
The implication: you construct sessions from events. They’re not handed to you.
The nested event_params array
This is where GA4 stores most of the useful data, and it’s the structure that trips up most newcomers.
event_params is a repeated record (ARRAY of STRUCTs in BigQuery terms). Each element has a key (string) and a value (a nested struct with four type fields):
| event_params.key | .value.string_value | .value.int_value | .value.double_value |
|---|---|---|---|
| page_location | https://example.com/products | | |
| ga_session_id | | 1705312845 | |
| engagement_time_msec | | 5234 | |
Only one type field is populated per parameter; the others are NULL. This means you need to know which type to extract. Key parameters to know:
- ga_session_id: int_value (we'll discuss this more shortly)
- ga_session_number: int_value (sequential session count per user)
- page_location: string_value
- page_title: string_value
- engagement_time_msec: int_value
Note: float_value is essentially unused by GA4. For decimal numbers, look at double_value.
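To make the type rules concrete, here's a minimal extraction sketch using the scalar-subquery pattern we'll formalize later in this tutorial (the project and dataset names are placeholders):

```sql
-- Pull one parameter of each type; reading the wrong type field returns NULL
SELECT
  event_name,
  -- string-typed parameter
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  -- integer-typed parameters
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') AS engagement_time_msec
FROM `your-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20240115'
```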
Table naming: daily vs intraday
GA4 creates two types of tables in your BigQuery dataset (analytics_<property_id>):
- events_YYYYMMDD: Daily export tables. These can update for up to 72 hours as late-arriving events come in.
- events_intraday_YYYYMMDD: Streaming tables for real-time data. Deleted automatically when the daily table finalizes.
For this tutorial, we’ll use daily tables only. They have complete data and include traffic source information that intraday tables lack.
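To see which table types your export currently contains, you can list them via BigQuery's INFORMATION_SCHEMA (a quick sketch; swap in your own project and dataset):

```sql
-- List event tables in the export dataset, newest first
SELECT table_name
FROM `your-project.analytics_123456789.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE 'events%'
ORDER BY table_name DESC
LIMIT 10
```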
Timestamps and timezones
Here’s a subtle trap. GA4 gives you two time fields:
- event_timestamp: Microseconds since Unix epoch, always in UTC
- event_date: A string in your GA4 property's configured timezone
If you compare these without conversion, you’ll get hours of drift. For a property set to Paris time, an event at midnight local time shows as 20240115 in event_date but corresponds to 23:00 on January 14th in UTC.
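You can quantify the drift on your own export with a quick check (a sketch; the placeholder project, dataset, and date are illustrative, and the count will be zero only if your property timezone is UTC):

```sql
-- How many events land on a different UTC date than their property-timezone date?
SELECT
  COUNTIF(DATE(TIMESTAMP_MICROS(event_timestamp)) != PARSE_DATE('%Y%m%d', event_date)) AS drifted_events,
  COUNT(*) AS total_events
FROM `your-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX = '20240115'
```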
To convert:

```sql
-- Microseconds to timestamp
TIMESTAMP_MICROS(event_timestamp) AS ga4__event__timestamp

-- Convert to local timezone
DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Europe/Paris') AS ga4__event__timestamp_local
```

The Session ID Trap
Why ga_session_id alone is dangerous
This is the most common mistake I see. It seems logical: GA4 provides a ga_session_id parameter, so use it to identify sessions. But there’s a problem.
ga_session_id is actually a Unix timestamp (the time when the session started). On a busy site, multiple users can start sessions in the same second and get assigned the same session ID.
Let’s prove it. Run this query against your GA4 data:
```sql
SELECT
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
  COUNT(DISTINCT user_pseudo_id) AS user_count
FROM `your-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260131'
GROUP BY 1
HAVING COUNT(DISTINCT user_pseudo_id) > 1
ORDER BY user_count DESC
LIMIT 10
```

On any site with reasonable traffic, you'll find session IDs shared by multiple users. Using ga_session_id alone for session analysis means you're accidentally merging different users' sessions.
Building the correct session key
Combine user_pseudo_id (which is unique per browser/device) with ga_session_id:
```sql
CONCAT(
  user_pseudo_id,
  '-',
  CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
) AS ga4__event__session_key
```

This becomes the spine of everything we build. Every model will use this pattern.
A side note: ga_session_number is useful for “is this a new or returning user?” analysis (session 1 = new user), but it’s not needed for session identification.
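For example, a quick new-vs-returning split might look like this (a sketch against the raw export; project, dataset, and dates are placeholders):

```sql
-- New vs. returning sessions: session number 1 means the user's first-ever session
SELECT
  IF(
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_number') = 1,
    'new', 'returning'
  ) AS user_type,
  COUNT(DISTINCT CONCAT(
    user_pseudo_id, '-',
    CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
  )) AS session_count
FROM `your-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260131'
GROUP BY user_type
```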
The dbt Model Structure
Our three-layer approach
We’ll build three models:
- Base (base__ga4__events): Extract fields from nested structures, type-cast, create the session key. Still one row per event.
- Intermediate (int__event__sessionized): Enrich each event with session context (conversion flags, timing metrics, attribution). Still one row per event.
- Mart (mrt__reporting__ga4__acquisition_performance): Aggregate to daily × channel grain. One row per day per source/medium.
```
┌─────────────────┐     ┌─────────────────────────┐     ┌─────────────┐
│ Raw GA4         │     │ Intermediate            │     │ Mart        │
│ Events          │     │ Events + Session        │     │ Daily ×     │
│                 │────▶│ Context                 │────▶│ Channel     │
│ 1M rows         │     │ 1M rows (enriched)      │     │ ~500 rows   │
│ (events)        │     │ (events)                │     │ (per day)   │
└─────────────────┘     └─────────────────────────┘     └─────────────┘
```

My take on this
Most GA4 dbt packages (including the popular Velir/dbt-ga4 package) aggregate in the base layer or immediately after. This simplifies things, but you lose capabilities:
- Event-sequence funnels: “What percentage of users who view a product add it to cart within 60 seconds?” Requires event-level data.
- Time-between-events analysis: “What’s the median time from add_to_cart to purchase?” Requires event timestamps.
- Custom attribution experiments: “What if we attributed conversions to the page view 3 events prior instead of last-click?” Requires event-level granularity.
My approach: defer aggregation until you actually need it. The intermediate model is the workhorse, a wide table where every event knows about its session context.
Building the Base Model
Source definition
First, define the source in models/base/ga4/_sources.yml:
```yaml
version: 2

sources:
  - name: ga4
    database: your-project # or bigquery-public-data for the demo dataset
    schema: analytics_123456789 # or ga4_obfuscated_sample_ecommerce
    tables:
      - name: events
        identifier: events_*
        description: GA4 event export tables (date-sharded)
```

The base__ga4__events model
Create models/base/ga4/base__ga4__events.sql:
```sql
{{ config(
    materialized='table',
    partition_by={
      "field": "ga4__event__date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=['ga4__event__user_pseudo_id', 'ga4__event__name'],
    tags=['base', 'ga4']
) }}

SELECT
    -- Session key (the critical identifier)
    CONCAT(
        user_pseudo_id,
        '-',
        CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
    ) AS ga4__event__session_key,

    -- User identifier
    user_pseudo_id AS ga4__event__user_pseudo_id,

    -- Timestamps
    PARSE_DATE('%Y%m%d', event_date) AS ga4__event__date,
    TIMESTAMP_MICROS(event_timestamp) AS ga4__event__timestamp,

    -- Event details
    event_name AS ga4__event__name,

    -- Extracted event parameters
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga4__event__ga_session_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_number') AS ga4__event__session_number,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS ga4__event__page_location,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS ga4__event__page_title,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') AS ga4__event__engagement_time_msec,

    -- Ecommerce (for purchase events)
    ecommerce.transaction_id AS ga4__event__transaction_id,
    ecommerce.purchase_revenue AS ga4__event__purchase_revenue_usd,

    -- Traffic source (user-level first touch)
    traffic_source.source AS ga4__event__user_source,
    traffic_source.medium AS ga4__event__user_medium,
    traffic_source.name AS ga4__event__user_campaign,

    -- Session traffic source (last non-direct click)
    -- Available from July 2024 onwards
    session_traffic_source_last_click.manual_campaign.source AS ga4__event__session_source,
    session_traffic_source_last_click.manual_campaign.medium AS ga4__event__session_medium,
    session_traffic_source_last_click.manual_campaign.campaign_name AS ga4__event__session_campaign,

    -- Device and geo
    device.category AS ga4__event__device_category,
    device.operating_system AS ga4__event__operating_system,
    geo.country AS ga4__event__country

FROM {{ source('ga4', 'events') }}
WHERE _TABLE_SUFFIX >= '20260101' -- Adjust based on your needs
```

A few things to note:
- Column naming follows source__entity__field, so every column clearly identifies its origin (ga4__event__page_location).
- We create ga4__event__session_key immediately: this is our unique session identifier.
- The subquery pattern (SELECT value.x FROM UNNEST(event_params) WHERE key = 'y') extracts parameters without multiplying rows.
- Materialized as table with partitioning for query performance.
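A light set of dbt tests alongside the model helps catch schema drift early. Here's a minimal sketch (the file name and the warn severity are my choices, not a fixed convention); note that ga4__event__session_key is NULL whenever an event lacks a ga_session_id parameter, so a hard not_null test may be too strict:

```yaml
# models/base/ga4/_models.yml (hypothetical file name)
version: 2

models:
  - name: base__ga4__events
    columns:
      - name: ga4__event__name
        tests:
          - not_null
      - name: ga4__event__session_key
        tests:
          # session_key is NULL when ga_session_id is missing
          # (e.g. consent-limited hits), so warn instead of fail
          - not_null:
              config:
                severity: warn
```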
Handling the wildcard table
The events_* pattern matches all date-sharded tables. The _TABLE_SUFFIX pseudo-column lets you filter specific dates efficiently; BigQuery only scans the partitions you need.
For incremental models (covered in a future article), you’d filter like this:
```sql
{% if is_incremental() %}
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
{% endif %}
```

The 3-day lookback handles late-arriving events.
Building the Intermediate Model
We’ll add session-level context to every event without aggregating.
The philosophy: enrich, don’t aggregate
The result is a wide table where every event row contains columns like:
- event__session_has_purchase (boolean)
- event__session_has_add_to_cart (boolean)
- event__time_in_session_seconds (integer)
- event__session_event_number (integer)
This lets you filter all events from converting sessions, calculate time-to-conversion, or sequence events without joins.
The int__event__sessionized model
Create models/intermediate/event/int__event__sessionized.sql:
```sql
{{ config(
    materialized='table',
    partition_by={
      "field": "ga4__event__date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=['ga4__event__session_key', 'ga4__event__user_pseudo_id'],
    tags=['intermediate', 'event', 'ga4']
) }}

-- IMPORT: Bring in base events
WITH events AS (
    SELECT * FROM {{ ref('base__ga4__events') }}
),

-- TRANSFORMATION: Calculate session-level flags
session_flags AS (
    SELECT
        ga4__event__session_key,
        MAX(CASE WHEN ga4__event__name = 'purchase' THEN 1 ELSE 0 END) AS has_purchase,
        MAX(CASE WHEN ga4__event__name = 'add_to_cart' THEN 1 ELSE 0 END) AS has_add_to_cart,
        MAX(CASE WHEN ga4__event__name = 'begin_checkout' THEN 1 ELSE 0 END) AS has_begin_checkout,
        MAX(CASE WHEN ga4__event__name = 'view_item' THEN 1 ELSE 0 END) AS has_view_item
    FROM events
    GROUP BY ga4__event__session_key
),

-- BUSINESS LOGIC: Add timing and sequencing with window functions
enriched AS (
    SELECT
        e.*,

        -- Session flags (as booleans)
        sf.has_purchase = 1 AS event__session_has_purchase,
        sf.has_add_to_cart = 1 AS event__session_has_add_to_cart,
        sf.has_begin_checkout = 1 AS event__session_has_begin_checkout,
        sf.has_view_item = 1 AS event__session_has_view_item,

        -- Session timing
        FIRST_VALUE(e.ga4__event__timestamp) OVER (
            PARTITION BY e.ga4__event__session_key
            ORDER BY e.ga4__event__timestamp
        ) AS event__session_start_timestamp,

        TIMESTAMP_DIFF(
            e.ga4__event__timestamp,
            FIRST_VALUE(e.ga4__event__timestamp) OVER (
                PARTITION BY e.ga4__event__session_key
                ORDER BY e.ga4__event__timestamp
            ),
            SECOND
        ) AS event__time_in_session_seconds,

        -- Event sequencing
        ROW_NUMBER() OVER (
            PARTITION BY e.ga4__event__session_key
            ORDER BY e.ga4__event__timestamp
        ) AS event__session_event_number,

        -- Session event counts (for context)
        COUNT(*) OVER (PARTITION BY e.ga4__event__session_key) AS event__session_event_count

    FROM events e
    LEFT JOIN session_flags sf
        ON e.ga4__event__session_key = sf.ga4__event__session_key
)

SELECT * FROM enriched
```

What the boolean flags enable
With event__session_has_purchase as a column on every event, you can now write:
```sql
-- All page views from sessions that eventually converted
SELECT *
FROM {{ ref('int__event__sessionized') }}
WHERE ga4__event__name = 'page_view'
  AND event__session_has_purchase
```

Without this pattern, you'd need a subquery or CTE to identify converting sessions first, then join back.
Session timing columns
The window functions calculate:
- event__session_start_timestamp: When the session began (first event).
- event__time_in_session_seconds: How far into the session this event occurred.
- event__session_event_number: Position in the session (1st, 5th, 20th event).
Use case: “What’s the average time from first page view to add_to_cart?”
```sql
SELECT AVG(event__time_in_session_seconds) AS avg_seconds_to_add_to_cart
FROM {{ ref('int__event__sessionized') }}
WHERE ga4__event__name = 'add_to_cart'
  AND event__session_event_number > 1 -- Exclude sessions where add_to_cart was first event
```

Session attribution columns
We already extracted ga4__event__session_source, ga4__event__session_medium, and ga4__event__session_campaign in the base model from the session_traffic_source_last_click field. This gives you last non-direct click attribution at the session level.
For data before July 2024 (when this field was introduced), you’d need a fallback:
```sql
COALESCE(
    session_traffic_source_last_click.manual_campaign.source,
    collected_traffic_source.manual_source,
    '(direct)'
) AS ga4__event__session_source
```

A note on attribution models:
- traffic_source: User-level first-touch (the source that originally acquired this user).
- session_traffic_source_last_click: Session-level last non-direct click (matches GA4 UI's Traffic Acquisition report).
- collected_traffic_source: Raw UTM parameters with no attribution logic applied.
For most analyses, session_traffic_source_last_click is what you want.
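If you're curious how often first-touch and last-click disagree on your own data, here's a quick sketch against the base model we built (NULL handling deliberately simplified):

```sql
-- How often does a session's last-click source differ from the user's first-touch source?
SELECT
  COUNT(DISTINCT ga4__event__session_key) AS sessions,
  COUNT(DISTINCT IF(
    ga4__event__user_source != ga4__event__session_source,
    ga4__event__session_key, NULL
  )) AS sessions_where_they_differ
FROM {{ ref('base__ga4__events') }}
WHERE ga4__event__user_source IS NOT NULL
  AND ga4__event__session_source IS NOT NULL
```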
The result
Every event row now has 10+ additional columns providing session context. The model is still event-level (same number of rows as the base) but much more useful for analysis.
Building the Mart Model
Now we aggregate. The intermediate model did the hard work; the mart model is straightforward.
Marts should be light, aggregated to a grain that's useful for dashboards, not one row per session (which, at tens of thousands of sessions per day, is still far more granular than a dashboard needs). For acquisition analysis, we'll aggregate to daily × source × medium.
mrt__reporting__ga4__acquisition_performance
Create models/mart/reporting/ga4/mrt__reporting__ga4__acquisition_performance.sql:
```sql
{{ config(
    materialized='table',
    partition_by={
      "field": "acquisition__date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=['acquisition__source', 'acquisition__medium'],
    tags=['mart', 'reporting', 'ga4']
) }}

-- IMPORT: Bring in sessionized events
WITH sessionized_events AS (
    SELECT * FROM {{ ref('int__event__sessionized') }}
),

-- TRANSFORMATION: Deduplicate to session level first
sessions AS (
    SELECT
        ga4__event__session_key,
        ga4__event__user_pseudo_id,
        MIN(ga4__event__date) AS session_date,
        MAX(ga4__event__session_source) AS session_source,
        MAX(ga4__event__session_medium) AS session_medium,
        MAX(ga4__event__session_campaign) AS session_campaign,
        MAX(CAST(event__session_has_purchase AS INT64)) AS has_purchase,
        MAX(CAST(event__session_has_add_to_cart AS INT64)) AS has_add_to_cart,
        MAX(CAST(event__session_has_begin_checkout AS INT64)) AS has_begin_checkout,
        SUM(ga4__event__purchase_revenue_usd) AS session_revenue_usd,
        COUNT(*) AS event_count
    FROM sessionized_events
    GROUP BY ga4__event__session_key, ga4__event__user_pseudo_id
),

-- AGGREGATION: To daily × source × medium grain
aggregated AS (
    SELECT
        session_date AS acquisition__date,
        COALESCE(session_source, '(direct)') AS acquisition__source,
        COALESCE(session_medium, '(none)') AS acquisition__medium,
        COALESCE(session_campaign, '(not set)') AS acquisition__campaign,

        -- Volume metrics
        COUNT(*) AS acquisition__session_count,
        COUNT(DISTINCT ga4__event__user_pseudo_id) AS acquisition__user_count,

        -- Conversion metrics
        SUM(has_purchase) AS acquisition__purchase_count,
        SUM(has_add_to_cart) AS acquisition__add_to_cart_count,
        SUM(has_begin_checkout) AS acquisition__begin_checkout_count,

        -- Revenue
        SUM(session_revenue_usd) AS acquisition__revenue_usd,

        -- Engagement
        AVG(event_count) AS acquisition__avg_events_per_session

    FROM sessions
    GROUP BY 1, 2, 3, 4
),

-- ENRICHMENT: Calculate rates
enriched AS (
    SELECT
        *,
        SAFE_DIVIDE(acquisition__purchase_count, acquisition__session_count) AS acquisition__conversion_rate,
        SAFE_DIVIDE(acquisition__revenue_usd, acquisition__session_count) AS acquisition__revenue_per_session_usd,
        SAFE_DIVIDE(acquisition__revenue_usd, acquisition__purchase_count) AS acquisition__avg_order_value_usd
    FROM aggregated
)

SELECT * FROM enriched
```

This mart is dashboard-ready:
- ~500 rows per day instead of 50K sessions, so queries are fast and storage is minimal.
- Pre-calculated rates: Conversion rate, revenue per session, AOV.
- Partitioned by date: Filter to last 30 days and BigQuery only scans 30 partitions.
- Clustered by source/medium: Common filter patterns are optimized.
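A typical dashboard query against this mart then looks like the sketch below (the 30-day window and row limit are illustrative):

```sql
-- Top channels by revenue, last 30 days
SELECT
  acquisition__source,
  acquisition__medium,
  SUM(acquisition__session_count) AS sessions,
  SUM(acquisition__revenue_usd) AS revenue_usd,
  -- Recompute the rate from sums rather than averaging pre-calculated rates
  SAFE_DIVIDE(SUM(acquisition__purchase_count), SUM(acquisition__session_count)) AS conversion_rate
FROM {{ ref('mrt__reporting__ga4__acquisition_performance') }}
WHERE acquisition__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) -- prunes to 30 partitions
GROUP BY 1, 2
ORDER BY revenue_usd DESC
LIMIT 20
```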
Why the intermediate layer still matters
The mart gives you aggregate performance metrics. But say you want to find “sessions where the user viewed a product, then added to cart, then purchased, in that order.”
With only the mart: Impossible. Session-level detail is gone.
With int__event__sessionized:
```sql
WITH sequenced AS (
    SELECT
        ga4__event__session_key,
        MAX(CASE WHEN ga4__event__name = 'view_item' THEN event__session_event_number END) AS view_item_position,
        MAX(CASE WHEN ga4__event__name = 'add_to_cart' THEN event__session_event_number END) AS add_to_cart_position,
        MAX(CASE WHEN ga4__event__name = 'purchase' THEN event__session_event_number END) AS purchase_position
    FROM {{ ref('int__event__sessionized') }}
    WHERE event__session_has_purchase
    GROUP BY ga4__event__session_key
)

SELECT COUNT(*) AS proper_funnel_sessions
FROM sequenced
WHERE view_item_position < add_to_cart_position
  AND add_to_cart_position < purchase_position
```

The intermediate model makes event-sequence questions answerable. The mart model makes dashboards fast and cheap. They serve different purposes; use both.
Common Pitfalls Recap
A summary of the traps we’ve covered and avoided:
Using ga_session_id alone
Multiple users can share the same ID. Always combine with user_pseudo_id.
Forgetting timezone conversion
event_timestamp is UTC; event_date uses your property timezone. Comparing without conversion causes hours of drift.
Querying events_intraday_* for historical analysis
Intraday tables are incomplete, may have duplicates, and lack traffic source data. Use daily tables.
Aggregating too early
Session-level aggregation loses funnel and journey flexibility. Our intermediate layer preserves event granularity.
Not filtering with _TABLE_SUFFIX
BigQuery charges by bytes scanned. Without partition filtering, you scan everything. Expensive and slow.
Extracting the wrong value type
Using string_value when the data is in int_value returns NULL. Know your parameter types, or use COALESCE across types.
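For example, a defensive extraction when a custom parameter's type is unknown or inconsistent (some_custom_param is a placeholder):

```sql
-- Coalesce across all value types, casting to a common type
(SELECT COALESCE(
    value.string_value,
    CAST(value.int_value AS STRING),
    CAST(value.double_value AS STRING)
  )
 FROM UNNEST(event_params)
 WHERE key = 'some_custom_param') AS some_custom_param
```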
The pattern we built avoids all of these pitfalls.
What’s Next
We built the foundation: event-level data enriched with session context, plus an aggregated session table for dashboards.
The intermediate model opens up analyses that session-aggregated data can’t support:
- Custom funnel analysis with flexible step definitions
- Time-between-events metrics
- Event sequence pattern matching
- Experimental attribution models
Future articles will cover:
- Incremental GA4 models: Handling late-arriving events and managing costs at scale
- GA4 attribution in dbt: Building custom attribution models beyond last-click
Appendix
Key event_params and their value types
| Parameter | Type | Description |
|---|---|---|
| ga_session_id | int_value | Session identifier (timestamp-based) |
| ga_session_number | int_value | Sequential session count per user |
| page_location | string_value | Full URL |
| page_title | string_value | Page title |
| page_referrer | string_value | Previous page URL |
| engagement_time_msec | int_value | Time page was in foreground |
| entrances | int_value | 1 if first pageview of session |
| session_engaged | string_value | "1" if session was engaged |
Attribution fields by availability date
| Field | Scope | Available From |
|---|---|---|
| traffic_source | User (first touch) | Always |
| collected_traffic_source | Event (raw UTMs) | May 2023 |
| session_traffic_source_last_click | Session (last non-direct) | July 2024 |
| cross_channel_campaign | Session (cross-channel) | October 2024 |