ServicesAboutNotesContact Get in touch →
EN FR
Note

GA4 dbt Unnesting Layer Architecture

How to structure a dbt project for GA4 unnesting — base layer for parameter extraction, intermediate for event-specific models, mart for analytics-ready aggregations.

Planted
ga4dbtbigquerydata modelingdata engineering

The SQL patterns for unnesting GA4 data translate directly into dbt models. A well-structured dbt project separates GA4 concerns into base, intermediate, and mart layers, with each layer handling a specific responsibility. The base layer unnests. The intermediate layer filters and reshapes. The mart layer aggregates.

This separation matters because GA4 unnesting is expensive and repetitive. If every mart model independently reaches into event_params with correlated subqueries, you’re paying the nested scan cost for every downstream table. Concentrate the unnesting in one place, materialize it, and let everything downstream work with flat columns.

Base Layer: The Unnesting Model

The base model does two things: converts sharded tables to a partitioned table, and extracts commonly used parameters from the nested arrays.

-- models/base/ga4/base__ga4_bigquery__events.sql
{{
config(
materialized='incremental',
partition_by={'field': 'event_date', 'data_type': 'date'},
incremental_strategy='insert_overwrite',
cluster_by=['event_name']
)
}}
WITH source AS (
SELECT
event_date,
event_timestamp,
event_name,
event_params,
user_pseudo_id,
user_properties,
traffic_source,
device,
geo,
ecommerce,
items
FROM {{ source('ga4', 'events') }}
WHERE TRUE
{% if is_incremental() %}
AND PARSE_DATE('%Y%m%d', _TABLE_SUFFIX) >= DATE_SUB(
(SELECT MAX(event_date) FROM {{ this }}),
INTERVAL 3 DAY
)
{% else %}
AND _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY))
{% endif %}
)
SELECT
-- Primary keys
CONCAT(
user_pseudo_id, '-',
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id'),
'-',
event_timestamp
) AS event__key,
CONCAT(
user_pseudo_id, '-',
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')
) AS session__key,
-- Timestamps
PARSE_DATE('%Y%m%d', event_date) AS event__date,
TIMESTAMP_MICROS(event_timestamp) AS event__timestamp,
-- Event identification
event_name AS event__name,
user_pseudo_id AS user__pseudo_id,
-- Common parameters (unnested)
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS event__ga_session_id,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS event__page_location,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_title') AS event__page_title,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_referrer') AS event__page_referrer,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'engagement_time_msec') AS event__engagement_time_msec,
(SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'session_engaged') AS event__session_engaged,
(SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'entrances') AS event__entrances,
-- Traffic source
traffic_source.source AS event__traffic_source,
traffic_source.medium AS event__traffic_medium,
traffic_source.name AS event__traffic_campaign,
-- Device and geo
device.category AS event__device_category,
device.operating_system AS event__operating_system,
device.web_info.browser AS event__browser,
geo.country AS event__country,
geo.city AS event__city,
-- E-commerce (keep nested for intermediate layer)
ecommerce,
items,
-- Keep original arrays for custom extraction
event_params,
user_properties
FROM source

Several design decisions here deserve explanation:

The event__key includes event_timestamp. GA4 doesn’t provide a unique event ID. The composite of user + session + timestamp is the closest approximation. It’s not perfectly unique — two events from the same user in the same session at the same microsecond produce a collision — but it’s sufficient for most practical purposes.

The 3-day lookback window in the incremental logic re-processes recent partitions to catch GA4’s late-arriving data. GA4 can take up to 72 hours to finalize daily tables. Without the lookback, you’d miss events that arrive after the initial export.

The ecommerce, items, event_params, and user_properties fields are preserved. The base model flattens the most common parameters but keeps the original nested structures for downstream models that need access to less common parameters or the items array. This avoids the need to join back to the raw source.

The Parameter Extraction Macro

The correlated subqueries above are repetitive. A reusable macro makes the base model more readable:

-- macros/unnest_event_param.sql
{% macro unnest_event_param(param_key, value_type='string') %}
(SELECT value.{{ value_type }}_value
FROM UNNEST(event_params)
WHERE key = '{{ param_key }}')
{% endmacro %}

Usage:

SELECT
{{ unnest_event_param('page_location', 'string') }} AS event__page_location,
{{ unnest_event_param('ga_session_id', 'int') }} AS event__ga_session_id,
{{ unnest_event_param('value', 'double') }} AS event__value
FROM {{ source('ga4', 'events') }}

For parameters with uncertain types, use the COALESCE variant.

Intermediate Layer: Event-Specific Models

The intermediate layer breaks out specific event types for easier downstream consumption. Each model filters the base table to a single event type and applies event-specific transformations:

-- models/intermediate/ga4/int__ga4_events__page_views.sql
SELECT
event__key,
session__key,
event__date,
event__timestamp,
user__pseudo_id,
event__page_location,
event__page_title,
event__page_referrer,
REGEXP_REPLACE(
REGEXP_REPLACE(event__page_location, r'^https?://[^/]+', ''),
r'[\?#].*$', ''
) AS event__page_path,
IFNULL(event__entrances, 0) = 1 AS event__is_landing_page,
event__traffic_source,
event__traffic_medium,
event__device_category,
event__country
FROM {{ ref('base__ga4_bigquery__events') }}
WHERE event__name = 'page_view'
-- models/intermediate/ga4/int__ga4_events__purchases.sql
SELECT
event__key,
session__key,
event__date,
event__timestamp,
user__pseudo_id,
ecommerce.transaction_id AS purchase__transaction_id,
ecommerce.purchase_revenue AS purchase__revenue,
ecommerce.total_item_quantity AS purchase__total_item_quantity,
ecommerce.shipping_value AS purchase__shipping_value,
ecommerce.tax_value AS purchase__tax_value,
event__traffic_source,
event__traffic_medium,
event__device_category,
event__country,
items
FROM {{ ref('base__ga4_bigquery__events') }}
WHERE event__name = 'purchase'
AND ecommerce.transaction_id IS NOT NULL

The purchase items model uses the items UNNEST pattern at this layer:

-- models/intermediate/ga4/int__ga4_events__purchase_items.sql
SELECT
e.event__key,
e.session__key,
e.event__date,
e.ecommerce.transaction_id AS purchase__transaction_id,
item.item_id AS item__id,
item.item_name AS item__name,
item.item_brand AS item__brand,
item.item_category AS item__category,
item.item_category2 AS item__category2,
item.item_category3 AS item__category3,
item.price AS item__price,
item.quantity AS item__quantity,
item.item_revenue AS item__revenue
FROM {{ ref('base__ga4_bigquery__events') }} AS e,
UNNEST(items) AS item
WHERE event__name = 'purchase'

This is the one place where FROM clause UNNEST is intentional — the item grain model exists specifically to expand the items array. The CROSS JOIN is safe here because purchase events always have items.

Mart Layer: Analytics-Ready Aggregations

The mart layer aggregates into business-ready tables:

-- models/marts/ga4/mrt__analytics__sessions.sql
{{
config(
materialized='table',
partition_by={'field': 'session__date', 'data_type': 'date'},
cluster_by=['session__traffic_source', 'session__traffic_medium']
)
}}
SELECT
session__key,
ANY_VALUE(user__pseudo_id) AS user__pseudo_id,
MIN(event__date) AS session__date,
MIN(event__timestamp) AS session__started_at,
MAX(event__timestamp) AS session__ended_at,
TIMESTAMP_DIFF(MAX(event__timestamp), MIN(event__timestamp), SECOND) AS session__duration_seconds,
COUNT(*) AS session__events,
COUNTIF(event__name = 'page_view') AS session__pageviews,
SUM(event__engagement_time_msec) / 1000 AS session__engagement_seconds,
MAX(event__session_engaged) = '1' AS session__is_engaged,
COUNTIF(event__name = 'purchase') > 0 AS session__is_converted,
SUM(CASE WHEN event__name = 'purchase' THEN ecommerce.purchase_revenue ELSE 0 END) AS session__revenue,
ANY_VALUE(event__traffic_source) AS session__traffic_source,
ANY_VALUE(event__traffic_medium) AS session__traffic_medium,
ANY_VALUE(event__traffic_campaign) AS session__traffic_campaign,
ANY_VALUE(event__device_category) AS session__device_category,
ANY_VALUE(event__country) AS session__country
FROM {{ ref('base__ga4_bigquery__events') }}
GROUP BY session__key

Notice that the mart references the base model directly, not the intermediate models. The session mart needs all events (not just page views or purchases), so it groups the base layer by session__key. The intermediate models are most useful for event-specific marts (page performance, product performance) or when other marts join to them.

Testing the Unnesting

Add tests that catch common UNNEST errors:

models/base/ga4/schema.yml
version: 2
models:
- name: base__ga4_bigquery__events
columns:
- name: event__key
tests:
- unique
- not_null
- name: session__key
tests:
- not_null

The most insidious UNNEST error in e-commerce is revenue multiplication. A custom test catches it:

-- tests/assert_revenue_not_multiplied.sql
WITH order_revenue AS (
SELECT
purchase__transaction_id,
purchase__revenue
FROM {{ ref('int__ga4_events__purchases') }}
),
item_revenue AS (
SELECT
purchase__transaction_id,
SUM(item__revenue) AS total_item_revenue
FROM {{ ref('int__ga4_events__purchase_items') }}
GROUP BY purchase__transaction_id
)
SELECT
o.purchase__transaction_id,
o.purchase__revenue,
i.total_item_revenue,
ABS(o.purchase__revenue - i.total_item_revenue) AS difference
FROM order_revenue AS o
JOIN item_revenue AS i ON o.purchase__transaction_id = i.purchase__transaction_id
WHERE ABS(o.purchase__revenue - i.total_item_revenue) > 0.01

This test fails if the sum of item-level revenue doesn’t match the order-level revenue — a strong signal that something in the UNNEST logic is duplicating or losing rows.

The Naming Convention

The double-underscore prefix in column names (event__page_location, session__key, purchase__revenue) serves two purposes. It namespaces columns by their grain or domain, making it obvious what each column represents when you see it in a downstream query. And it prevents naming collisions when multiple models are joined — event__date and session__date can coexist without ambiguity.

This convention is a choice, not a requirement. Consistency within the project matters more than the specific scheme; whichever pattern is chosen should be applied uniformly across all GA4 models.