ServicesAboutNotesContact Get in touch →
EN FR
Note

GA4 Events Sessionized Model

The implementation of the wide event-grain intermediate model for GA4 — the CTE structure, window function patterns, and design decisions that make downstream analysis flexible.

Planted
ga4dbtbigquerydata modelinganalyticsincremental processing

The sessionized events model is the workhorse of a GA4 dbt project. It takes the cleaned, typed base events and adds session context to every row — landing page, traffic source, conversion flags, session duration — via window functions. Everything downstream reads from this model.

The philosophy is event-grain sessionization: preserve the one-row-per-event grain while making session-level analysis trivial by pre-computing session context. This note covers the implementation specifics.

Model Configuration

-- models/intermediate/ga4/int__ga4__events_sessionized.sql
{{
config(
materialized='incremental',
incremental_strategy='insert_overwrite',
partition_by={
"field": "event__date",
"data_type": "date",
"granularity": "day"
},
cluster_by=['user__pseudo_id', 'session__key', 'event__name']
)
}}
{% set lookback_days = var('ga4_static_incremental_days', 3) %}

The same static lookback pattern applies here. insert_overwrite replaces date partitions atomically. The 3-day lookback ensures window functions re-execute over sessions that may span the partition boundary — a session that starts on day N-3 and has events on day N-2 and N-1 needs all its events present for correct session context.

Clustering by user__pseudo_id, session__key, and event__name improves query performance for the most common access patterns: per-user analysis, per-session analysis, and event-type filtering.

CTE Structure

The model uses a five-stage CTE pipeline:

events → with_clean_urls → with_session_metrics → with_channel → final
event_items (joined in final)

Stage 1: Select from Base

WITH events AS (
SELECT
event__key,
session__key,
event__date,
event__timestamp_utc,
user__pseudo_id,
user__id,
session__ga_id,
session__number,
event__name,
page__location,
page__title,
event__source,
event__medium,
event__campaign,
session__source,
session__medium,
session__campaign,
event__engagement_time_msec,
device__category,
geo__country,
transaction__revenue
FROM {{ ref('base__ga4__events') }}
{% if is_incremental() %}
WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
{% endif %}
)

Explicit column selection rather than SELECT * makes the model’s contract clear. Adding a new column to the base model won’t silently appear in downstream models.

Stage 2: URL Cleaning

with_clean_urls AS (
SELECT
*,
REGEXP_REPLACE(
SPLIT(page__location, '?')[SAFE_OFFSET(0)],
r'#.*$',
''
) AS page__path
FROM events
)

page__path strips query parameters and fragments from page__location. Landing page analysis using full URLs is noisy — the same /blog/post visited via different UTM combinations would appear as separate pages. Using the path alone makes landing page aggregations meaningful.

SPLIT(url, '?')[SAFE_OFFSET(0)] gets everything before the ?. The subsequent REGEXP_REPLACE removes fragments (#section-name). SAFE_OFFSET prevents errors when the URL has no ?.

Stage 3: Session Metrics via Window Functions

with_session_metrics AS (
SELECT
e.*,
-- Navigation
FIRST_VALUE(page__path IGNORE NULLS) OVER w AS session__landing_page,
FIRST_VALUE(page__location IGNORE NULLS) OVER w AS session__landing_page_full,
LAST_VALUE(page__path IGNORE NULLS) OVER w AS session__exit_page,
-- Traffic source (first non-null in session)
FIRST_VALUE(
COALESCE(event__source, session__source) IGNORE NULLS
) OVER w AS session__source_final,
FIRST_VALUE(
COALESCE(event__medium, session__medium) IGNORE NULLS
) OVER w AS session__medium_final,
FIRST_VALUE(
COALESCE(event__campaign, session__campaign) IGNORE NULLS
) OVER w AS session__campaign_final,
-- Engagement counts
COUNT(*) OVER p AS session__events,
COUNT(CASE WHEN event__name = 'page_view' THEN 1 END) OVER p
AS session__pageviews,
SUM(event__engagement_time_msec) OVER p AS session__engagement_time_msec,
-- Session duration
TIMESTAMP_DIFF(
MAX(event__timestamp_utc) OVER p,
MIN(event__timestamp_utc) OVER p,
SECOND
) AS session__duration_seconds,
-- Conversion flags
MAX(CASE WHEN event__name = 'purchase' THEN 1 ELSE 0 END) OVER p
AS session__has_purchase,
MAX(CASE WHEN event__name = 'add_to_cart' THEN 1 ELSE 0 END) OVER p
AS session__has_add_to_cart,
MAX(CASE WHEN event__name = 'begin_checkout' THEN 1 ELSE 0 END) OVER p
AS session__has_checkout,
MAX(CASE WHEN event__name = 'sign_up' THEN 1 ELSE 0 END) OVER p
AS session__has_signup,
MAX(CASE WHEN event__name = 'generate_lead' THEN 1 ELSE 0 END) OVER p
AS session__has_lead,
-- Revenue
SUM(transaction__revenue) OVER p AS session__revenue,
-- Event sequencing
ROW_NUMBER() OVER w AS event__sequence_number,
CASE
WHEN ROW_NUMBER() OVER w = 1 THEN TRUE
ELSE FALSE
END AS event__is_session_start,
CASE
WHEN ROW_NUMBER() OVER (
PARTITION BY session__key ORDER BY event__timestamp_utc DESC
) = 1 THEN TRUE
ELSE FALSE
END AS event__is_session_end,
-- Time between events
TIMESTAMP_DIFF(
event__timestamp_utc,
LAG(event__timestamp_utc) OVER w,
SECOND
) AS event__seconds_since_previous
FROM with_clean_urls e
WINDOW
w AS (
PARTITION BY session__key
ORDER BY event__timestamp_utc
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
),
p AS (PARTITION BY session__key)
)

Two named windows encode the two types of session-scoped computation:

  • w (ordered) — for FIRST_VALUE, LAST_VALUE, ROW_NUMBER, LAG. Requires ordering and the full frame.
  • p (partition only) — for MAX, COUNT, SUM. No order needed; scans the entire partition naturally.

See GA4 Window Function Pitfalls for why the ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame is required for LAST_VALUE.

Stage 4: Channel Grouping

with_channel AS (
SELECT
*,
{{ default_channel_grouping('session__source_final', 'session__medium_final') }}
AS session__channel_grouping
FROM with_session_metrics
)

Channel grouping is a separate CTE because the macro generates a multi-line CASE statement. Keeping it isolated makes the model readable and makes it easy to swap in a custom channel grouping macro.

Stage 5: Item Aggregation Join

event_items AS (
SELECT
event__key,
COUNT(*) AS event__items,
SUM(item__revenue) AS event__items_revenue
FROM {{ ref('int__ga4__event_items') }}
{% if is_incremental() %}
WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
{% endif %}
GROUP BY event__key
),
final AS (
SELECT
e.*,
COALESCE(i.event__items, 0) AS event__items,
COALESCE(i.event__items_revenue, 0) AS event__items_revenue
FROM with_channel e
LEFT JOIN event_items i ON e.event__key = i.event__key
)
SELECT * FROM final

Item data is aggregated to event grain before the join, avoiding row expansion. The COALESCE ensures non-ecommerce events show 0 items rather than null.

Traffic Source Resolution

The COALESCE(event__source, session__source) pattern handles two different locations where GA4 stores source information:

  • event__source — from event_params, present on events tagged with UTM parameters
  • session__source — from collected_traffic_source, GA4’s session-level source

Modern GA4 implementations (July 2024+) use session_traffic_source_last_click for clean session attribution. For compatibility with older data or properties that haven’t been updated, the COALESCE approach picks up whichever source field is populated. See GA4 Traffic Source Fields for the full picture of GA4’s attribution field landscape.

What Downstream Models Get

With int__ga4__events_sessionized as their source, mart models need only simple aggregations:

-- Session mart: just GROUP BY
SELECT
session__key,
ANY_VALUE(session__landing_page) AS session__landing_page,
ANY_VALUE(session__channel_grouping) AS session__channel_grouping,
ANY_VALUE(session__has_purchase) AS session__has_purchase,
COUNT(*) AS session__events
FROM int__ga4__events_sessionized
GROUP BY session__key

Event-level questions that would require joins against separate session tables now work as simple filters:

-- Events in converting sessions
SELECT * FROM int__ga4__events_sessionized
WHERE session__has_purchase = 1
AND event__name != 'purchase'
ORDER BY session__key, event__sequence_number

One intermediate model with pre-computed session context enables flexible event-level and session-level analysis without joins in downstream models.