Adrienne Vermorel

Your First GA4 dbt Models: From Raw Events to Sessions

GA4’s BigQuery export looks nothing like Universal Analytics. Instead of one row per session, you get one row per event. Instead of clean columns, you get nested arrays. Instead of a session table, you get nothing. Sessions don’t exist in the raw data; you construct them yourself.

Most tutorials teach you to aggregate to sessions immediately. That works for basic dashboards, but it kills your flexibility. Want to analyze the event sequence leading to purchase? Can’t do it with session-level data. Need to experiment with different attribution windows? Too late, the aggregation already happened.

This tutorial takes a different approach. We’ll build a wide intermediate model that enriches every event with session context (conversion flags, timing metrics, attribution data) without losing event-level granularity. Aggregation happens only in the final mart layer, when you actually need it.

By the end, you’ll have a reusable pattern that works for any GA4 property.


Prerequisites

Before we start, you’ll need:

  • A dbt project connected to BigQuery
  • GA4 BigQuery export enabled with at least a few days of data
  • Basic SQL comfort with CTEs, JOINs, and window functions

No GA4 export yet? Google provides a public demo dataset that works perfectly for this tutorial:

bigquery-public-data.ga4_obfuscated_sample_ecommerce

Everything we build works against it.
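To confirm access, here's a quick sketch against one shard of the demo dataset (I'm assuming its documented date range of November 2020 through January 2021, so the final shard is events_20210131):

-- Count events by name in one day of the demo dataset
SELECT
    event_name,
    COUNT(*) AS event_count
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
GROUP BY event_name
ORDER BY event_count DESC
LIMIT 10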


Understanding the GA4 Schema

One row = one event (not one session)

This is the fundamental shift from Universal Analytics. Each row represents a single event: a page view, a button click, a purchase. A user’s session might generate dozens of rows.

Here’s what a few rows look like:

| event_date | event_name | event_timestamp  | user_pseudo_id  |
| ---------- | ---------- | ---------------- | --------------- |
| 20240115   | page_view  | 1705312845123456 | 1234567.8901234 |
| 20240115   | scroll     | 1705312847234567 | 1234567.8901234 |
| 20240115   | purchase   | 1705312892345678 | 1234567.8901234 |

The implication: you construct sessions from events. They’re not handed to you.

The nested event_params array

This is where GA4 stores most of the useful data, and it’s the structure that trips up most newcomers.

event_params is a repeated record (ARRAY of STRUCTs in BigQuery terms). Each element has a key (string) and a value (a nested struct with four type fields):

| event_params.key     | value.string_value           | value.int_value | value.double_value |
| -------------------- | ---------------------------- | --------------- | ------------------ |
| page_location        | https://example.com/products |                 |                    |
| ga_session_id        |                              | 1705312845      |                    |
| engagement_time_msec |                              | 5234            |                    |

Only one type field is populated per parameter; the others are NULL. This means you need to know which type to extract. Key parameters to know:

  • ga_session_id: int_value (we’ll discuss this more shortly)
  • ga_session_number: int_value (sequential session count per user)
  • page_location: string_value
  • page_title: string_value
  • engagement_time_msec: int_value

Note: float_value is essentially unused by GA4. For decimal numbers, look at double_value.
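To see the type rules in action, here's a minimal extraction sketch against the demo dataset (the subquery pattern is explained in depth later; note how each parameter is read from its matching type field):

-- page_location lives in string_value; ga_session_id lives in int_value
SELECT
    event_name,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
FROM `bigquery-public-data.ga4_obfuscated_sample_ecommerce.events_20210131`
LIMIT 10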

Table naming: daily vs intraday

GA4 creates two types of tables in your BigQuery dataset (analytics_<property_id>):

  • events_YYYYMMDD: Daily export tables. These can update for up to 72 hours as late-arriving events come in.
  • events_intraday_YYYYMMDD: Streaming tables for real-time data. Deleted automatically when the daily table finalizes.

For this tutorial, we’ll use daily tables only. They have complete data and include traffic source information that intraday tables lack.
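If you want to check what's currently sitting in your dataset, a quick sketch against INFORMATION_SCHEMA shows both table types (project and dataset names are placeholders):

-- List event shards; intraday tables appear for the most recent day(s) only
SELECT table_name
FROM `your-project.analytics_123456789.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE 'events%'
ORDER BY table_name DESC
LIMIT 10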

Timestamps and timezones

Here’s a subtle trap. GA4 gives you two time fields:

  • event_timestamp: Microseconds since Unix epoch, always in UTC
  • event_date: A string in your GA4 property’s configured timezone

If you compare these without conversion, you’ll get hours of drift. For a property set to Paris time, an event at midnight local time shows as 20240115 in event_date but corresponds to 23:00 on January 14th in UTC.

To convert:

-- Microseconds to timestamp
TIMESTAMP_MICROS(event_timestamp) AS ga4__event__timestamp
-- Convert to local timezone
DATETIME(TIMESTAMP_MICROS(event_timestamp), 'Europe/Paris') AS ga4__event__timestamp_local
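You can quantify the drift on your own export with a quick check (a sketch; the table name is a placeholder):

-- Events whose UTC calendar date differs from the property-timezone event_date
SELECT
    event_date,
    DATE(TIMESTAMP_MICROS(event_timestamp)) AS utc_date,
    COUNT(*) AS event_count
FROM `your-project.analytics_123456789.events_20240115`
GROUP BY 1, 2
ORDER BY event_count DESC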

The Session ID Trap

Why ga_session_id alone is dangerous

This is the most common mistake I see. It seems logical: GA4 provides a ga_session_id parameter, so use it to identify sessions. But there’s a problem.

ga_session_id is actually a Unix timestamp (the time when the session started). On a busy site, multiple users can start sessions in the same second and get assigned the same session ID.

Let’s prove it. Run this query against your GA4 data:

SELECT
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
    COUNT(DISTINCT user_pseudo_id) AS user_count
FROM `your-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260131'
GROUP BY 1
HAVING COUNT(DISTINCT user_pseudo_id) > 1
ORDER BY user_count DESC
LIMIT 10

On any site with reasonable traffic, you’ll find session IDs shared by multiple users. Using ga_session_id alone for session analysis means you’re accidentally merging different users’ sessions.

Building the correct session key

Combine user_pseudo_id (which is unique per browser/device) with ga_session_id:

CONCAT(
    user_pseudo_id,
    '-',
    CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
) AS ga4__event__session_key

This becomes the spine of everything we build. Every model will use this pattern.

A side note: ga_session_number is useful for “is this a new or returning user?” analysis (session 1 = new user), but it’s not needed for session identification.
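If you want to quantify the collision problem before trusting the fix, compare distinct counts of both identifiers (a sketch, using the same placeholders and date range as the query above):

-- The gap between these two numbers is the number of collided sessions
SELECT
    COUNT(DISTINCT CONCAT(user_pseudo_id, '-', CAST(session_id AS STRING))) AS distinct_session_keys,
    COUNT(DISTINCT session_id) AS distinct_raw_session_ids
FROM (
    SELECT
        user_pseudo_id,
        (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS session_id
    FROM `your-project.analytics_123456789.events_*`
    WHERE _TABLE_SUFFIX BETWEEN '20260101' AND '20260131'
) AS events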


The dbt Model Structure

Our three-layer approach

We’ll build three models:

  1. Base (base__ga4__events): Extract fields from nested structures, type-cast, create the session key. Still one row per event.
  2. Intermediate (int__event__sessionized): Enrich each event with session context (conversion flags, timing metrics, attribution). Still one row per event.
  3. Mart (mrt__reporting__ga4__acquisition_performance): Aggregate to daily × channel grain. One row per day per source/medium.

┌──────────────┐      ┌──────────────────────┐      ┌──────────────┐
│  Raw GA4     │      │  Intermediate        │      │  Mart        │
│  Events      │─────▶│  Events + Session    │─────▶│  Daily ×     │
│              │      │  Context             │      │  Channel     │
│  1M rows     │      │  1M rows (enriched)  │      │  ~500 rows   │
│  (events)    │      │  (events)            │      │  (per day)   │
└──────────────┘      └──────────────────────┘      └──────────────┘

My take on this

Most GA4 dbt packages (including the popular Velir/dbt-ga4 package) aggregate in the base layer or immediately after. This simplifies things, but you lose capabilities:

  • Event-sequence funnels: “What percentage of users who view a product add it to cart within 60 seconds?” Requires event-level data.
  • Time-between-events analysis: “What’s the median time from add_to_cart to purchase?” Requires event timestamps.
  • Custom attribution experiments: “What if we attributed conversions to the page view 3 events prior instead of last-click?” Requires event-level granularity.

My approach: defer aggregation until you actually need it. The intermediate model is the workhorse, a wide table where every event knows about its session context.


Building the Base Model

Source definition

First, define the source in models/base/ga4/_sources.yml:

version: 2

sources:
  - name: ga4
    database: your-project # or bigquery-public-data for the demo dataset
    schema: analytics_123456789 # or ga4_obfuscated_sample_ecommerce
    tables:
      - name: events
        identifier: events_*
        description: GA4 event export tables (date-sharded)

The base__ga4__events model

Create models/base/ga4/base__ga4__events.sql:

{{ config(
    materialized='table',
    partition_by={
        "field": "ga4__event__date",
        "data_type": "date",
        "granularity": "day"
    },
    cluster_by=['ga4__event__user_pseudo_id', 'ga4__event__name'],
    tags=['base', 'ga4']
) }}

SELECT
    -- Session key (the critical identifier)
    CONCAT(
        user_pseudo_id,
        '-',
        CAST((SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS STRING)
    ) AS ga4__event__session_key,

    -- User identifier
    user_pseudo_id AS ga4__event__user_pseudo_id,

    -- Timestamps (ga4__event__date must match the partition field above)
    PARSE_DATE('%Y%m%d', event_date) AS ga4__event__date,
    TIMESTAMP_MICROS(event_timestamp) AS ga4__event__timestamp,

    -- Event details
    event_name AS ga4__event__name,

    -- Extracted event parameters
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga4__event__ga_session_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_number') AS ga4__event__session_number,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS ga4__event__page_location,
    (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS ga4__event__page_title,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') AS ga4__event__engagement_time_msec,

    -- Ecommerce (for purchase events)
    ecommerce.transaction_id AS ga4__event__transaction_id,
    -- purchase_revenue_in_usd matches the _usd alias; use ecommerce.purchase_revenue for property currency
    ecommerce.purchase_revenue_in_usd AS ga4__event__purchase_revenue_usd,

    -- Traffic source (user-level first touch)
    traffic_source.source AS ga4__event__user_source,
    traffic_source.medium AS ga4__event__user_medium,
    traffic_source.name AS ga4__event__user_campaign,

    -- Session traffic source (last non-direct click)
    -- Available from July 2024 onwards
    session_traffic_source_last_click.manual_campaign.source AS ga4__event__session_source,
    session_traffic_source_last_click.manual_campaign.medium AS ga4__event__session_medium,
    session_traffic_source_last_click.manual_campaign.campaign_name AS ga4__event__session_campaign,

    -- Device and geo
    device.category AS ga4__event__device_category,
    device.operating_system AS ga4__event__operating_system,
    geo.country AS ga4__event__country

FROM {{ source('ga4', 'events') }}
WHERE _TABLE_SUFFIX >= '20260101' -- Adjust based on your needs

A few things to note:

  • Column naming follows source__entity__field, so every column clearly identifies its origin (ga4__event__page_location).
  • We create ga4__event__session_key immediately: this is our unique session identifier (worth a schema test; see the sketch after this list).
  • The subquery pattern (SELECT value.x FROM UNNEST(event_params) WHERE key = 'y') extracts parameters without multiplying rows.
  • Materialized as table with partitioning for query performance.
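Because the session key is the spine of every downstream model, it deserves a schema test. A minimal sketch (the file name and warn severity are my choices, not a fixed convention; some GA4 events legitimately arrive without ga_session_id, so their key is NULL and a hard failure would be too strict):

# models/base/ga4/_models.yml (name assumed)
version: 2

models:
  - name: base__ga4__events
    columns:
      - name: ga4__event__session_key
        tests:
          - not_null:
              config:
                severity: warn
      - name: ga4__event__name
        tests:
          - not_null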

Handling the wildcard table

The events_* pattern matches all date-sharded tables. The _TABLE_SUFFIX pseudo-column lets you filter specific dates efficiently; BigQuery only scans the date shards you need.

For incremental models (covered in a future article), you’d filter like this:

{% if is_incremental() %}
    WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY))
{% endif %}

The 3-day lookback handles late-arriving events.


Building the Intermediate Model

We’ll add session-level context to every event without aggregating.

The philosophy: enrich, don’t aggregate

The result is a wide table where every event row contains columns like:

  • event__session_has_purchase (boolean)
  • event__session_has_add_to_cart (boolean)
  • event__time_in_session_seconds (integer)
  • event__session_event_number (integer)

This lets you filter all events from converting sessions, calculate time-to-conversion, or sequence events without joins.

The int__event__sessionized model

Create models/intermediate/event/int__event__sessionized.sql:

{{ config(
    materialized='table',
    partition_by={
        "field": "ga4__event__date",
        "data_type": "date",
        "granularity": "day"
    },
    cluster_by=['ga4__event__session_key', 'ga4__event__user_pseudo_id'],
    tags=['intermediate', 'event', 'ga4']
) }}

-- IMPORT: Bring in base events
WITH events AS (
    SELECT * FROM {{ ref('base__ga4__events') }}
),

-- TRANSFORMATION: Calculate session-level flags
session_flags AS (
    SELECT
        ga4__event__session_key,
        MAX(CASE WHEN ga4__event__name = 'purchase' THEN 1 ELSE 0 END) AS has_purchase,
        MAX(CASE WHEN ga4__event__name = 'add_to_cart' THEN 1 ELSE 0 END) AS has_add_to_cart,
        MAX(CASE WHEN ga4__event__name = 'begin_checkout' THEN 1 ELSE 0 END) AS has_begin_checkout,
        MAX(CASE WHEN ga4__event__name = 'view_item' THEN 1 ELSE 0 END) AS has_view_item
    FROM events
    GROUP BY ga4__event__session_key
),

-- BUSINESS LOGIC: Add timing and sequencing with window functions
enriched AS (
    SELECT
        e.*,

        -- Session flags (as booleans)
        sf.has_purchase = 1 AS event__session_has_purchase,
        sf.has_add_to_cart = 1 AS event__session_has_add_to_cart,
        sf.has_begin_checkout = 1 AS event__session_has_begin_checkout,
        sf.has_view_item = 1 AS event__session_has_view_item,

        -- Session timing
        FIRST_VALUE(e.ga4__event__timestamp) OVER (
            PARTITION BY e.ga4__event__session_key
            ORDER BY e.ga4__event__timestamp
        ) AS event__session_start_timestamp,
        TIMESTAMP_DIFF(
            e.ga4__event__timestamp,
            FIRST_VALUE(e.ga4__event__timestamp) OVER (
                PARTITION BY e.ga4__event__session_key
                ORDER BY e.ga4__event__timestamp
            ),
            SECOND
        ) AS event__time_in_session_seconds,

        -- Event sequencing
        ROW_NUMBER() OVER (
            PARTITION BY e.ga4__event__session_key
            ORDER BY e.ga4__event__timestamp
        ) AS event__session_event_number,

        -- Session event counts (for context)
        COUNT(*) OVER (PARTITION BY e.ga4__event__session_key) AS event__session_event_count

    FROM events e
    LEFT JOIN session_flags sf
        ON e.ga4__event__session_key = sf.ga4__event__session_key
)

SELECT * FROM enriched

What the boolean flags enable

With event__session_has_purchase as a column on every event, you can now write:

-- All page views from sessions that eventually converted
SELECT *
FROM {{ ref('int__event__sessionized') }}
WHERE ga4__event__name = 'page_view'
    AND event__session_has_purchase

Without this pattern, you’d need a subquery or CTE to identify converting sessions first, then join back.
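For contrast, here's roughly what the same question looks like without the flag, sketched against the base model:

-- Find converting sessions first, then semi-join back to their page views
SELECT e.*
FROM {{ ref('base__ga4__events') }} AS e
WHERE e.ga4__event__name = 'page_view'
    AND e.ga4__event__session_key IN (
        SELECT ga4__event__session_key
        FROM {{ ref('base__ga4__events') }}
        WHERE ga4__event__name = 'purchase'
    )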

Session timing columns

The window functions calculate:

  • event__session_start_timestamp: When the session began (first event).
  • event__time_in_session_seconds: How far into the session this event occurred.
  • event__session_event_number: Position in the session (1st, 5th, 20th event).

Use case: “What’s the average time from first page view to add_to_cart?”

SELECT
    AVG(event__time_in_session_seconds) AS avg_seconds_to_add_to_cart
FROM {{ ref('int__event__sessionized') }}
WHERE ga4__event__name = 'add_to_cart'
    AND event__session_event_number > 1 -- Exclude sessions where add_to_cart was the first event

Session attribution columns

We already extracted ga4__event__session_source, ga4__event__session_medium, and ga4__event__session_campaign in the base model from the session_traffic_source_last_click field. This gives you last non-direct click attribution at the session level.

For data before July 2024 (when this field was introduced), you’d need a fallback:

COALESCE(
    session_traffic_source_last_click.manual_campaign.source,
    collected_traffic_source.manual_source,
    '(direct)'
) AS ga4__event__session_source

A note on attribution models:

  • traffic_source: User-level first-touch (the source that originally acquired this user).
  • session_traffic_source_last_click: Session-level last non-direct click (matches GA4 UI’s Traffic Acquisition report).
  • collected_traffic_source: Raw UTM parameters with no attribution logic applied.

For most analyses, session_traffic_source_last_click is what you want.
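If you're curious how often the two scopes disagree on your data, a quick comparison against the base model shows it (a sketch):

-- First-touch (user) vs last non-direct click (session) attribution, side by side
SELECT
    ga4__event__user_source AS first_touch_source,
    ga4__event__session_source AS session_source,
    COUNT(DISTINCT ga4__event__session_key) AS session_count
FROM {{ ref('base__ga4__events') }}
GROUP BY 1, 2
ORDER BY session_count DESC
LIMIT 20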

The result

Every event row now carries eight additional columns of session context. The model is still event-level (same number of rows as the base) but far more useful for analysis.


Building the Mart Model

Now we aggregate. The intermediate model did the hard work; the mart model is straightforward.

Marts should be light, aggregated to a grain that’s useful for dashboards, not one row per session (which would be nearly as large as the intermediate). For acquisition analysis, we’ll aggregate to daily × source × medium.

mrt__reporting__ga4__acquisition_performance

Create models/mart/reporting/ga4/mrt__reporting__ga4__acquisition_performance.sql:

{{ config(
    materialized='table',
    partition_by={
        "field": "acquisition__date",
        "data_type": "date",
        "granularity": "day"
    },
    cluster_by=['acquisition__source', 'acquisition__medium'],
    tags=['mart', 'reporting', 'ga4']
) }}

-- IMPORT: Bring in sessionized events
WITH sessionized_events AS (
    SELECT * FROM {{ ref('int__event__sessionized') }}
),

-- TRANSFORMATION: Deduplicate to session level first
sessions AS (
    SELECT
        ga4__event__session_key,
        ga4__event__user_pseudo_id,
        MIN(ga4__event__date) AS session_date,
        MAX(ga4__event__session_source) AS session_source,
        MAX(ga4__event__session_medium) AS session_medium,
        MAX(ga4__event__session_campaign) AS session_campaign,
        MAX(CAST(event__session_has_purchase AS INT64)) AS has_purchase,
        MAX(CAST(event__session_has_add_to_cart AS INT64)) AS has_add_to_cart,
        MAX(CAST(event__session_has_begin_checkout AS INT64)) AS has_begin_checkout,
        SUM(ga4__event__purchase_revenue_usd) AS session_revenue_usd,
        COUNT(*) AS event_count
    FROM sessionized_events
    GROUP BY ga4__event__session_key, ga4__event__user_pseudo_id
),

-- AGGREGATION: To daily × source × medium grain
aggregated AS (
    SELECT
        session_date AS acquisition__date,
        COALESCE(session_source, '(direct)') AS acquisition__source,
        COALESCE(session_medium, '(none)') AS acquisition__medium,
        COALESCE(session_campaign, '(not set)') AS acquisition__campaign,

        -- Volume metrics
        COUNT(*) AS acquisition__session_count,
        COUNT(DISTINCT ga4__event__user_pseudo_id) AS acquisition__user_count,

        -- Conversion metrics
        SUM(has_purchase) AS acquisition__purchase_count,
        SUM(has_add_to_cart) AS acquisition__add_to_cart_count,
        SUM(has_begin_checkout) AS acquisition__begin_checkout_count,

        -- Revenue
        SUM(session_revenue_usd) AS acquisition__revenue_usd,

        -- Engagement
        AVG(event_count) AS acquisition__avg_events_per_session
    FROM sessions
    GROUP BY 1, 2, 3, 4
),

-- ENRICHMENT: Calculate rates
enriched AS (
    SELECT
        *,
        SAFE_DIVIDE(acquisition__purchase_count, acquisition__session_count) AS acquisition__conversion_rate,
        SAFE_DIVIDE(acquisition__revenue_usd, acquisition__session_count) AS acquisition__revenue_per_session_usd,
        SAFE_DIVIDE(acquisition__revenue_usd, acquisition__purchase_count) AS acquisition__avg_order_value_usd
    FROM aggregated
)

SELECT * FROM enriched

This mart is dashboard-ready:

  • ~500 rows per day instead of 50K sessions, so queries are fast and storage is minimal.
  • Pre-calculated rates: Conversion rate, revenue per session, AOV.
  • Partitioned by date: Filter to last 30 days and BigQuery only scans 30 partitions.
  • Clustered by source/medium: Common filter patterns are optimized.
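As an illustration, a typical dashboard query against this mart stays small and cheap (a sketch; the 30-day window is arbitrary):

-- Top channels by revenue over the last 30 days
SELECT
    acquisition__source,
    acquisition__medium,
    SUM(acquisition__session_count) AS sessions,
    SUM(acquisition__revenue_usd) AS revenue_usd,
    SAFE_DIVIDE(SUM(acquisition__purchase_count), SUM(acquisition__session_count)) AS conversion_rate
FROM {{ ref('mrt__reporting__ga4__acquisition_performance') }}
WHERE acquisition__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1, 2
ORDER BY revenue_usd DESC
LIMIT 20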

Why the intermediate layer still matters

The mart gives you aggregate performance metrics. But say you want to find “sessions where the user viewed a product, then added to cart, then purchased, in that order.”

With only the mart: Impossible. Session-level detail is gone.

With int__event__sessionized:

WITH sequenced AS (
    SELECT
        ga4__event__session_key,
        MAX(CASE WHEN ga4__event__name = 'view_item' THEN event__session_event_number END) AS view_item_position,
        MAX(CASE WHEN ga4__event__name = 'add_to_cart' THEN event__session_event_number END) AS add_to_cart_position,
        MAX(CASE WHEN ga4__event__name = 'purchase' THEN event__session_event_number END) AS purchase_position
    FROM {{ ref('int__event__sessionized') }}
    WHERE event__session_has_purchase
    GROUP BY ga4__event__session_key
)

SELECT COUNT(*) AS proper_funnel_sessions
FROM sequenced
WHERE view_item_position < add_to_cart_position
    AND add_to_cart_position < purchase_position

The intermediate model makes event-sequence questions answerable. The mart model makes dashboards fast and cheap. They serve different purposes; use both.


Common Pitfalls Recap

A summary of the traps we’ve covered and avoided:

Using ga_session_id alone: multiple users can share the same ID. Always combine it with user_pseudo_id.

Forgetting timezone conversion: event_timestamp is UTC; event_date uses your property timezone. Comparing them without conversion causes hours of drift.

Querying events_intraday_* for historical analysis: intraday tables are incomplete, may contain duplicates, and lack traffic source data. Use daily tables.

Aggregating too early: session-level aggregation loses funnel and journey flexibility. Our intermediate layer preserves event granularity.

Not filtering with _TABLE_SUFFIX: BigQuery charges by bytes scanned. Without shard filtering, you scan everything, which is expensive and slow.

Extracting the wrong value type: using string_value when the data is in int_value returns NULL. Know your parameter types, or use COALESCE across types.
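If you genuinely don't know a parameter's type, a defensive extraction coalesces across the type fields and casts everything to STRING (a sketch; 'some_param' is a placeholder):

-- Whichever type field is populated wins; the result is always a STRING
(SELECT COALESCE(
    value.string_value,
    CAST(value.int_value AS STRING),
    CAST(value.double_value AS STRING)
)
FROM UNNEST(event_params)
WHERE key = 'some_param'
) AS some_param_value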

The pattern we built avoids all of these pitfalls.


What’s Next

We built the foundation: event-level data enriched with session context, plus an aggregated acquisition mart for dashboards.

The intermediate model opens up analyses that session-aggregated data can’t support:

  • Custom funnel analysis with flexible step definitions
  • Time-between-events metrics
  • Event sequence pattern matching
  • Experimental attribution models

Future articles will cover:

  • Incremental GA4 models: Handling late-arriving events and managing costs at scale
  • GA4 attribution in dbt: Building custom attribution models beyond last-click

Appendix

Key event_params and their value types

| Parameter            | Type         | Description                           |
| -------------------- | ------------ | ------------------------------------- |
| ga_session_id        | int_value    | Session identifier (timestamp-based)  |
| ga_session_number    | int_value    | Sequential session count per user     |
| page_location        | string_value | Full URL                              |
| page_title           | string_value | Page title                            |
| page_referrer        | string_value | Previous page URL                     |
| engagement_time_msec | int_value    | Time page was in foreground           |
| entrances            | int_value    | 1 if first pageview of session        |
| session_engaged      | string_value | "1" if session was engaged            |

Attribution fields by availability date

| Field                             | Scope                     | Available From |
| --------------------------------- | ------------------------- | -------------- |
| traffic_source                    | User (first touch)        | Always         |
| collected_traffic_source          | Event (raw UTMs)          | May 2023       |
| session_traffic_source_last_click | Session (last non-direct) | July 2024      |
| cross_channel_campaign            | Session (cross-channel)   | October 2024   |
