Adrienne Vermorel

GA4 + dbt: A Production-Ready Project Template

You’ve mastered the GA4 BigQuery schema. You can unnest event parameters in your sleep. You’ve built session tables and implemented user stitching. Now comes the real challenge: combining everything into a dbt project that actually works in production.

This article delivers a complete project template centered on one core idea: a wide intermediate table at event-level grain with sessionized columns. Every event row carries its session context (landing page, exit page, whether the session converted, total pageviews). This design enables flexible downstream analysis without repeatedly joining session tables.

We’ll build this from scratch, borrowing patterns from established packages where they make sense. You’ll get incremental processing that handles late-arriving data, tests that catch GA4-specific issues, and documentation that keeps the project maintainable.


What We Can Learn from Existing Packages

Before building custom, let’s examine what’s already out there. The open-source ecosystem has converged on a few approaches worth understanding.

Velir/dbt-ga4: The Community Standard

The Velir/dbt-ga4 package dominates the space with 380+ GitHub stars and active development. Its creator positioned it as a home for communal knowledge around GA4 and BigQuery.

The architecture follows a clean base-to-marts pattern:

models/
├── base/
│   ├── base_ga4__events
│   ├── stg_ga4__events
│   ├── stg_ga4__event_page_view
│   └── stg_ga4__sessions_traffic_sources
└── marts/
    └── core/
        ├── dim_ga4__sessions
        └── fct_ga4__sessions

Three patterns from Velir worth borrowing:

Variable-driven parameter extraction. Rather than hardcoding which event parameters to extract, Velir uses dbt_project.yml variables:

vars:
  ga4:
    page_view_custom_parameters:
      - name: "clean_event"
        value_type: "string_value"
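
Downstream, a model inside the package can loop over that list and emit one extracted column per configured parameter (the ga4: block scopes the vars to the package, so they resolve with a plain var() call). A minimal sketch — not Velir's exact implementation — reusing the extract_event_param macro we define later in this article:

-- Illustrative Jinja: one column per configured custom parameter
{% set custom_params = var('page_view_custom_parameters', []) %}

SELECT
    event_timestamp,
    {% for param in custom_params %}
    {{ extract_event_param('event_params', param.name, param.value_type) }} AS {{ param.name }}
    {{- "," if not loop.last }}
    {% endfor %}
FROM {{ source('ga4', 'events') }}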

Static incremental lookback. Instead of dynamically querying the destination table for the latest date (which defeats BigQuery's shard pruning), Velir reprocesses the last N days on every run. We’ll adopt this pattern.

Composite session keys. Sessions are identified by user_pseudo_id + ga_session_id, not ga_session_id alone. This prevents collisions across users.

The trade-off: non-partitioned session models become expensive at scale, and the documentation explicitly recommends disabling certain models for high-traffic sites.

Other Notable Packages

Fivetran/dbt_ga4_export targets teams using Fivetran connectors. It supports multiple databases (Snowflake, Redshift, Databricks) but provides only 7 models total, assuming event parameters are flattened during ingestion.

MO Data Consulting’s package takes a different approach: dynamically flattening all event_params into individual columns at compile time. Convenient for exploration, but can create unwieldy tables.

admindio/simple_ga4_dbt offers a streamlined, cost-efficient approach with built-in attribution models and minimal dependencies.

Why Build Custom?

These packages optimize for session-grain marts. Our goal is different: a wide event-level table where every row carries session context. This enables:

  • Funnel analysis without session joins
  • Event-level attribution with session awareness
  • Flexible aggregation at any grain downstream

We’ll build our own structure, borrowing the best patterns from existing work.


Project Architecture

Folder Structure

models/
├── base/
│   └── ga4/
│       ├── _ga4__sources.yml
│       ├── _ga4__models.yml
│       └── base__ga4__events.sql              ← Single base model
├── intermediate/
│   └── ga4/
│       ├── _int_ga4__models.yml
│       ├── int__ga4__event_items.sql          ← Item-level grain
│       └── int__ga4__events_sessionized.sql   ← The main table
└── marts/
    └── ga4/
        ├── _mrt_ga4__models.yml
        ├── mrt__analytics__sessions.sql
        └── mrt__analytics__users.sql
macros/
└── ga4/
    ├── extract_event_param.sql
    ├── generate_surrogate_key.sql
    └── default_channel_grouping.sql

How This Ties to the Series

Each previous article maps to a component:

| Article | Maps to |
| --- | --- |
| 1. Schema Reference | base__ga4__events.sql (sharded to partitioned + cleaning) |
| 2. Unnesting Patterns | Parameter extraction macros in base__ga4__events.sql |
| 3. Session Tables | Session columns in int__ga4__events_sessionized.sql |
| 4. User Stitching | Identity resolution feeding mrt__analytics__users.sql |

Naming Conventions

We follow dbt-style layer prefixes:

  • base__ (source data, cleaned and typed)
  • int__ (intermediate transformations, not exposed to end users)
  • mrt__ (mart tables for business reporting)

The Base Layer

GA4 exports data as date-sharded tables (events_YYYYMMDD) rather than a single partitioned table. This breaks standard incremental strategies: a WHERE clause on _TABLE_SUFFIX that depends on another table (say, the destination’s latest date) defeats shard pruning, so BigQuery scans every shard to evaluate the filter.
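
The cost difference is easy to see in two ad-hoc queries (table names are illustrative):

-- Prunes shards: the bound is resolvable before execution
-- (constants and CURRENT_DATE()-based expressions both qualify)
SELECT event_name
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY));

-- Scans every shard: the bound depends on a subquery against another table
SELECT event_name
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX >= (
    SELECT FORMAT_DATE('%Y%m%d', MAX(event__date))
    FROM `my-project.dwh.base__ga4__events`
);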

Our single base model handles everything: sharded-to-partitioned conversion, cleaning, typing, and parameter extraction.

-- models/base/ga4/base__ga4__events.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            "field": "event__date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=['user__pseudo_id', 'session__key', 'event__name']
    )
}}

{% set lookback_days = var('ga4_static_incremental_days', 3) %}

WITH source AS (
    SELECT
        event_date,
        event_timestamp,
        event_name,
        event_params,
        event_bundle_sequence_id,
        user_id,
        user_pseudo_id,
        user_properties,
        device,
        geo,
        ecommerce,
        items,
        collected_traffic_source,
        is_active_user
    FROM {{ source('ga4', 'events') }}
    WHERE _TABLE_SUFFIX >= '{{ var("ga4_start_date", "20230101") }}'
        {% if is_incremental() %}
        AND _TABLE_SUFFIX >= FORMAT_DATE(
            '%Y%m%d',
            DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
        )
        {% endif %}
        -- Exclude intraday tables
        AND _TABLE_SUFFIX NOT LIKE '%intraday%'
),

cleaned AS (
    SELECT
        -- Timestamps
        PARSE_DATE('%Y%m%d', event_date) AS event__date,
        TIMESTAMP_MICROS(event_timestamp) AS event__timestamp_utc,
        event_timestamp AS event__timestamp_raw,
        -- User identifiers
        user_pseudo_id AS user__pseudo_id,
        user_id AS user__id,
        -- Session identifiers
        {{ extract_event_param('event_params', 'ga_session_id', 'int_value') }}
            AS session__ga_id,
        {{ extract_event_param('event_params', 'ga_session_number', 'int_value') }}
            AS session__number,
        -- Event details
        event_name AS event__name,
        event_bundle_sequence_id AS event__bundle_sequence_id,
        -- Page information
        {{ extract_event_param('event_params', 'page_location', 'string_value') }}
            AS page__location,
        {{ extract_event_param('event_params', 'page_title', 'string_value') }}
            AS page__title,
        {{ extract_event_param('event_params', 'page_referrer', 'string_value') }}
            AS page__referrer,
        -- Traffic source (event-level)
        {{ extract_event_param('event_params', 'source', 'string_value') }}
            AS event__source,
        {{ extract_event_param('event_params', 'medium', 'string_value') }}
            AS event__medium,
        {{ extract_event_param('event_params', 'campaign', 'string_value') }}
            AS event__campaign,
        -- Engagement
        {{ extract_event_param('event_params', 'session_engaged', 'string_value') }}
            AS session__is_engaged,
        {{ extract_event_param('event_params', 'engagement_time_msec', 'int_value') }}
            AS event__engagement_time_msec,
        {{ extract_event_param('event_params', 'entrances', 'int_value') }}
            AS event__is_entrance,
        -- Device and geo
        device.category AS device__category,
        device.mobile_brand_name AS device__brand,
        device.operating_system AS device__os,
        device.web_info.browser AS device__browser,
        geo.country AS geo__country,
        geo.city AS geo__city,
        -- Traffic source (session-level, from collected_traffic_source)
        collected_traffic_source.manual_source AS session__source,
        collected_traffic_source.manual_medium AS session__medium,
        collected_traffic_source.manual_campaign_name AS session__campaign,
        collected_traffic_source.gclid AS session__gclid,
        -- Ecommerce
        ecommerce.transaction_id AS transaction__id,
        ecommerce.purchase_revenue AS transaction__revenue,
        -- Keep nested structures for downstream models
        event_params AS event__params,
        user_properties AS user__properties,
        items AS items__array
    FROM source
)

SELECT
    -- Primary key
    {{ generate_surrogate_key(['user__pseudo_id', 'event__timestamp_raw', 'event__name']) }}
        AS event__key,
    -- Session key (composite)
    {{ generate_surrogate_key(['user__pseudo_id', 'session__ga_id']) }}
        AS session__key,
    -- Timestamps
    event__date,
    event__timestamp_utc,
    event__timestamp_raw,
    -- User identifiers
    user__pseudo_id,
    user__id,
    -- Session identifiers
    session__ga_id,
    session__number,
    -- Event details
    event__name,
    event__bundle_sequence_id,
    -- Page information
    page__location,
    page__title,
    page__referrer,
    -- Traffic source (event-level)
    event__source,
    event__medium,
    event__campaign,
    -- Engagement
    session__is_engaged,
    event__engagement_time_msec,
    event__is_entrance,
    -- Device and geo
    device__category,
    device__brand,
    device__os,
    device__browser,
    geo__country,
    geo__city,
    -- Traffic source (session-level)
    session__source,
    session__medium,
    session__campaign,
    session__gclid,
    -- Ecommerce
    transaction__id,
    transaction__revenue,
    -- Nested structures
    event__params,
    user__properties,
    items__array
FROM cleaned

Why a 3-day lookback? GA4 can update events for 72+ hours after the initial export. Conversions tracked via Google Ads, cross-device stitching updates, and consent-mode adjustments all cause late-arriving data. The static lookback reprocesses recent partitions on every run, ensuring nothing is missed.

Why insert_overwrite? For event data (immutable, date-partitioned), this strategy replaces entire partitions cleanly. It’s faster and cheaper than merge, which scans both source and destination.


Macros

The Parameter Extraction Macro

The extract_event_param macro encapsulates the inline subquery pattern:

-- macros/ga4/extract_event_param.sql
{% macro extract_event_param(params_column, param_key, value_type='string_value') %}
    (
        SELECT value.{{ value_type }}
        FROM UNNEST({{ params_column }})
        WHERE key = '{{ param_key }}'
        LIMIT 1
    )
{% endmacro %}
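
Called with the default value type, {{ extract_event_param('event_params', 'page_title') }} compiles to an inline scalar subquery:

(
    SELECT value.string_value
    FROM UNNEST(event_params)
    WHERE key = 'page_title'
    LIMIT 1
)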

This approach avoids row multiplication (no Cartesian product from UNNEST) while keeping the SQL readable. For numeric parameters stored in different value fields:

{% macro extract_event_param_numeric(params_column, param_key) %}
    (
        SELECT COALESCE(
            value.int_value,
            CAST(value.float_value AS INT64),
            CAST(value.double_value AS INT64)
        )
        FROM UNNEST({{ params_column }})
        WHERE key = '{{ param_key }}'
        LIMIT 1
    )
{% endmacro %}
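
The folder listing also references generate_surrogate_key.sql, which we haven't shown yet. A minimal sketch, assuming you want a BigQuery-native hash; if your project already depends on dbt_utils, delegating to dbt_utils.generate_surrogate_key(columns) is a drop-in alternative:

-- macros/ga4/generate_surrogate_key.sql
-- Sketch: null-safe MD5 hash of the pipe-delimited column values
{% macro generate_surrogate_key(columns) %}
    TO_HEX(MD5(CONCAT(
        {%- for col in columns %}
        COALESCE(CAST({{ col }} AS STRING), '_null_')
        {{- ", '|'," if not loop.last }}
        {%- endfor %}
    )))
{% endmacro %}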

The Intermediate Layer

Handling the Items Array

Ecommerce items require a separate model at item-level grain:

-- models/intermediate/ga4/int__ga4__event_items.sql
{{
    config(
        materialized='view'
    )
}}

WITH events_with_items AS (
    SELECT
        event__key,
        event__date,
        event__timestamp_utc,
        user__pseudo_id,
        session__key,
        event__name,
        transaction__id,
        items__array
    FROM {{ ref('base__ga4__events') }}
    WHERE event__name IN ('purchase', 'add_to_cart', 'remove_from_cart',
            'view_item', 'begin_checkout', 'add_payment_info')
        AND ARRAY_LENGTH(items__array) > 0
)

SELECT
    e.event__key,
    e.event__date,
    e.event__timestamp_utc,
    e.user__pseudo_id,
    e.session__key,
    e.event__name,
    e.transaction__id,
    -- Item details
    item.item_id AS item__id,
    item.item_name AS item__name,
    item.item_brand AS item__brand,
    item.item_category AS item__category,
    item.item_category2 AS item__category2,
    item.item_category3 AS item__category3,
    item.item_variant AS item__variant,
    item.price AS item__price,
    item.quantity AS item__quantity,
    item.coupon AS item__coupon,
    item.item_list_name AS item__list_name,
    item.item_list_index AS item__list_index,
    -- Calculate item revenue
    COALESCE(item.price, 0) * COALESCE(item.quantity, 1) AS item__revenue
FROM events_with_items e,
    UNNEST(e.items__array) AS item

This model intentionally uses Cartesian UNNEST because we need the item-level grain.
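
With item grain in place, product-level questions become simple aggregations. An illustrative ad-hoc query (not a model in the project):

-- Revenue by product over the last 28 days
SELECT
    item__name,
    SUM(item__quantity) AS units_sold,
    SUM(item__revenue) AS revenue
FROM {{ ref('int__ga4__event_items') }}
WHERE event__name = 'purchase'
    AND event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY item__name
ORDER BY revenue DESC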

The Wide Sessionized Table

Every event row gets session context attached via window functions.

-- models/intermediate/ga4/int__ga4__events_sessionized.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            "field": "event__date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=['user__pseudo_id', 'session__key', 'event__name']
    )
}}

{% set lookback_days = var('ga4_static_incremental_days', 3) %}

WITH events AS (
    SELECT
        event__key,
        session__key,
        event__date,
        event__timestamp_utc,
        user__pseudo_id,
        user__id,
        session__ga_id,
        session__number,
        event__name,
        page__location,
        page__title,
        event__source,
        event__medium,
        event__campaign,
        session__source,
        session__medium,
        session__campaign,
        event__engagement_time_msec,
        device__category,
        geo__country,
        transaction__revenue
    FROM {{ ref('base__ga4__events') }}
    {% if is_incremental() %}
    WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
    {% endif %}
),

-- Aggregate items per event (avoid repeated joins)
event_items AS (
    SELECT
        event__key,
        COUNT(*) AS event__items,
        SUM(item__revenue) AS event__items_revenue
    FROM {{ ref('int__ga4__event_items') }}
    {% if is_incremental() %}
    WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
    {% endif %}
    GROUP BY event__key
),

-- Clean page URLs for landing/exit page analysis
with_clean_urls AS (
    SELECT
        *,
        -- Strip the query string and fragment for cleaner landing/exit pages
        REGEXP_REPLACE(
            SPLIT(page__location, '?')[SAFE_OFFSET(0)],
            r'#.*$',
            ''
        ) AS page__path
    FROM events
),

-- Add session-level metrics via window functions
with_session_metrics AS (
    SELECT
        e.*,
        -- Landing and exit pages
        FIRST_VALUE(page__path IGNORE NULLS) OVER session_window AS session__landing_page,
        FIRST_VALUE(page__location IGNORE NULLS) OVER session_window AS session__landing_page_full,
        LAST_VALUE(page__path IGNORE NULLS) OVER session_window AS session__exit_page,
        -- Session traffic source (first non-null source in session)
        FIRST_VALUE(
            COALESCE(event__source, session__source) IGNORE NULLS
        ) OVER session_window AS session__source_final,
        FIRST_VALUE(
            COALESCE(event__medium, session__medium) IGNORE NULLS
        ) OVER session_window AS session__medium_final,
        FIRST_VALUE(
            COALESCE(event__campaign, session__campaign) IGNORE NULLS
        ) OVER session_window AS session__campaign_final,
        -- Session engagement metrics
        COUNT(*) OVER (PARTITION BY session__key) AS session__events,
        COUNT(CASE WHEN event__name = 'page_view' THEN 1 END)
            OVER (PARTITION BY session__key) AS session__pageviews,
        SUM(event__engagement_time_msec)
            OVER (PARTITION BY session__key) AS session__engagement_time_msec,
        -- Session duration
        TIMESTAMP_DIFF(
            MAX(event__timestamp_utc) OVER (PARTITION BY session__key),
            MIN(event__timestamp_utc) OVER (PARTITION BY session__key),
            SECOND
        ) AS session__duration_seconds,
        -- Session conversion flags
        MAX(CASE WHEN event__name = 'purchase' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_purchase,
        MAX(CASE WHEN event__name = 'add_to_cart' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_add_to_cart,
        MAX(CASE WHEN event__name = 'begin_checkout' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_checkout,
        MAX(CASE WHEN event__name = 'sign_up' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_signup,
        MAX(CASE WHEN event__name = 'generate_lead' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_lead,
        -- Session revenue
        SUM(transaction__revenue) OVER (PARTITION BY session__key) AS session__revenue,
        -- Event sequencing within session
        ROW_NUMBER() OVER (
            PARTITION BY session__key
            ORDER BY event__timestamp_utc
        ) AS event__sequence_number,
        -- Is this the first/last event in session?
        CASE
            WHEN ROW_NUMBER() OVER (
                PARTITION BY session__key
                ORDER BY event__timestamp_utc
            ) = 1 THEN TRUE
            ELSE FALSE
        END AS event__is_session_start,
        CASE
            WHEN ROW_NUMBER() OVER (
                PARTITION BY session__key
                ORDER BY event__timestamp_utc DESC
            ) = 1 THEN TRUE
            ELSE FALSE
        END AS event__is_session_end,
        -- Time since previous event (for engagement analysis)
        TIMESTAMP_DIFF(
            event__timestamp_utc,
            LAG(event__timestamp_utc) OVER (
                PARTITION BY session__key
                ORDER BY event__timestamp_utc
            ),
            SECOND
        ) AS event__seconds_since_previous
    FROM with_clean_urls e
    -- One full-frame named window serves FIRST_VALUE and LAST_VALUE alike.
    -- It is referenced as OVER session_window (no parentheses): BigQuery
    -- restricts parenthesized references to named windows that carry a
    -- window frame clause.
    WINDOW session_window AS (
        PARTITION BY session__key
        ORDER BY event__timestamp_utc
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    )
),

-- Add channel grouping
with_channel AS (
    SELECT
        *,
        {{ default_channel_grouping('session__source_final', 'session__medium_final') }}
            AS session__channel_grouping
    FROM with_session_metrics
),

-- Join item aggregates
final AS (
    SELECT
        e.*,
        COALESCE(i.event__items, 0) AS event__items,
        COALESCE(i.event__items_revenue, 0) AS event__items_revenue
    FROM with_channel e
    LEFT JOIN event_items i ON e.event__key = i.event__key
)

SELECT * FROM final

Critical Window Function Details

The LAST_VALUE trap. The default frame ends at the current row, so without explicit framing LAST_VALUE simply returns the current row's value:

-- WRONG: the default frame ends at the current row, so this returns the current row's page
LAST_VALUE(page__path) OVER (PARTITION BY session__key ORDER BY event__timestamp_utc)

-- CORRECT: an explicit full frame returns the session's actual last page
LAST_VALUE(page__path) OVER (
    PARTITION BY session__key
    ORDER BY event__timestamp_utc
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)

IGNORE NULLS matters. GA4 doesn't populate page_location on every event. IGNORE NULLS skips those rows, so the window functions return actual page values:

FIRST_VALUE(page__path IGNORE NULLS) OVER session_window AS session__landing_page

Boolean flags via MAX. To propagate a flag to all events in a session:

MAX(CASE WHEN event__name = 'purchase' THEN 1 ELSE 0 END)
OVER (PARTITION BY session__key) AS session__has_purchase
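
Together, these session columns are what make the wide table pay off downstream. As an illustration (an ad-hoc query, not a model in this project), landing-page conversion rates need no join back to a session table:

-- Landing-page conversion rate, straight off the wide table
SELECT
    session__landing_page,
    COUNT(DISTINCT session__key) AS sessions,
    COUNT(DISTINCT IF(session__has_purchase = 1, session__key, NULL)) AS converting_sessions,
    SAFE_DIVIDE(
        COUNT(DISTINCT IF(session__has_purchase = 1, session__key, NULL)),
        COUNT(DISTINCT session__key)
    ) AS conversion_rate
FROM {{ ref('int__ga4__events_sessionized') }}
WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY session__landing_page
ORDER BY sessions DESC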

The Channel Grouping Macro

Google’s default channel grouping logic, encapsulated in a reusable macro:

-- macros/ga4/default_channel_grouping.sql
{% macro default_channel_grouping(source_column, medium_column) %}
    CASE
        -- Direct
        WHEN {{ source_column }} IS NULL
            OR {{ source_column }} = '(direct)'
            OR {{ source_column }} = '(none)'
            THEN 'Direct'
        -- Paid Search
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(cpc|ppc|paidsearch)$')
            OR ({{ source_column }} = 'google' AND {{ medium_column }} = 'sem')
            THEN 'Paid Search'
        -- Paid Social
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(paidsocial|paid-social|paid_social)$')
            OR (REGEXP_CONTAINS({{ source_column }}, r'facebook|instagram|linkedin|twitter|tiktok')
                AND REGEXP_CONTAINS({{ medium_column }}, r'^(cpc|ppc|paid)'))
            THEN 'Paid Social'
        -- Display
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(display|cpm|banner)$')
            THEN 'Display'
        -- Organic Search: source matches a search site, or medium is exactly 'organic'
        -- (paid mediums were already caught above)
        WHEN {{ medium_column }} = 'organic'
            OR REGEXP_CONTAINS({{ source_column }}, r'google|bing|yahoo|duckduckgo|baidu')
            THEN 'Organic Search'
        -- Organic Social
        WHEN REGEXP_CONTAINS({{ source_column }}, r'facebook|instagram|linkedin|twitter|tiktok|pinterest|youtube')
            AND ({{ medium_column }} IS NULL OR {{ medium_column }} IN ('social', 'referral', '(none)'))
            THEN 'Organic Social'
        -- Email
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(email|e-mail|e_mail)$')
            THEN 'Email'
        -- Affiliates
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(affiliate|affiliates)$')
            THEN 'Affiliates'
        -- Referral
        WHEN {{ medium_column }} = 'referral'
            THEN 'Referral'
        -- Other
        ELSE 'Other'
    END
{% endmacro %}

Testing and Data Quality

Standard dbt tests miss GA4-specific issues. This section covers a comprehensive testing strategy.

Source Configuration with Freshness

models/base/ga4/_ga4__sources.yml
version: 2

sources:
  - name: ga4
    database: "{{ var('ga4_project_id') }}"
    schema: "{{ var('ga4_dataset') }}"
    description: "GA4 BigQuery Export"
    tables:
      - name: events
        identifier: "events_*"
        description: "Date-sharded GA4 events table"
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
        loaded_at_field: "TIMESTAMP_MICROS(event_timestamp)"

Base Model Tests

models/base/ga4/_ga4__models.yml
version: 2

models:
  - name: base__ga4__events
    description: "GA4 events: sharded to partitioned, cleaned, typed, with session keys"
    columns:
      - name: event__key
        description: "Surrogate key for the event"
        tests:
          - unique
          - not_null
      - name: user__pseudo_id
        description: "Client ID (device identifier)"
        tests:
          - not_null
      - name: event__timestamp_utc
        description: "Event timestamp in UTC"
        tests:
          - not_null
      - name: session__key
        description: "Composite session identifier"
        tests:
          - not_null

GA4-Specific Singular Tests

Test for missing session_start events:

-- tests/singular/test_sessions_missing_session_start.sql
-- Sessions should have a session_start event.
-- Some sessions legitimately lack it (ad blockers, consent),
-- but a high percentage indicates tracking issues.
WITH session_stats AS (
    SELECT
        session__key,
        COUNT(*) AS session__events,
        MAX(CASE WHEN event__name = 'session_start' THEN 1 ELSE 0 END) AS session__has_start
    FROM {{ ref('int__ga4__events_sessionized') }}
    WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY session__key
)

SELECT
    session__key,
    session__events,
    session__has_start
FROM session_stats
WHERE session__has_start = 0
    AND session__events > 3  -- Only flag sessions with enough events to matter
LIMIT 100

Test for orphaned transactions:

-- tests/singular/test_purchase_without_session.sql
-- Purchases should have valid session context
SELECT
    event__key,
    event__date,
    session__ga_id,
    user__pseudo_id
FROM {{ ref('base__ga4__events') }}
WHERE event__name = 'purchase'
    AND transaction__id IS NOT NULL
    AND (session__ga_id IS NULL OR user__pseudo_id IS NULL)
    AND event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
LIMIT 100

Schema Tests for the Wide Table

models/intermediate/ga4/_int_ga4__models.yml
version: 2

models:
  - name: int__ga4__events_sessionized
    description: "Wide event table with session context on every row"
    columns:
      - name: event__key
        tests:
          - unique
          - not_null
      - name: session__key
        tests:
          - not_null
      - name: session__landing_page
        description: "First page path in the session"
        tests:
          - not_null:
              where: "event__name = 'page_view'"
      - name: session__pageviews
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000  # Flag suspicious sessions
      - name: session__duration_seconds
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 86400  # 24 hours max
      - name: session__channel_grouping
        tests:
          - accepted_values:
              values: ['Direct', 'Organic Search', 'Paid Search', 'Paid Social',
                       'Organic Social', 'Email', 'Referral', 'Display',
                       'Affiliates', 'Other']
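
Because these tests are attached to models, dbt's node selection can run them in isolation, for example:

# Run only the tests attached to the two core models
dbt test --select base__ga4__events int__ga4__events_sessionized

# Or run everything tagged 'ga4' (tags are set in dbt_project.yml, shown later)
dbt test --select tag:ga4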

Documentation and Lineage

Good documentation makes the project maintainable.

Source Documentation

models/base/ga4/_ga4__sources.yml
version: 2

sources:
  - name: ga4
    description: |
      GA4 BigQuery Export tables.
      See [Google's schema documentation](https://support.google.com/analytics/answer/7029846)
      for the complete field reference.

      **Important notes:**
      - Tables are date-sharded as `events_YYYYMMDD`
      - Intraday tables (`events_intraday_YYYYMMDD`) are excluded from processing
      - Data can arrive up to 72 hours late
    tables:
      - name: events
        description: |
          Raw GA4 event data exported to BigQuery.
          Each row represents a single event (page_view, purchase, custom events, etc.).
          Event parameters are stored in nested RECORD fields that require unnesting.

Model Documentation

models/intermediate/ga4/_int_ga4__models.yml
version: 2

models:
  - name: int__ga4__events_sessionized
    description: |
      Wide event table with session context attached to every row.

      **Grain:** One row per event

      **Key design decisions:**
      - Session metrics (session__landing_page, session__exit_page, session__has_purchase)
        are computed via window functions and attached to every event in the session
      - Traffic source uses the first non-null source in the session (matches GA4 behavior)
      - Channel grouping follows Google's default rules

      **Usage:**
      - Use for event-level analysis with session context
      - Aggregate to session grain using `WHERE event__is_session_start = TRUE`
      - Join to `mrt__analytics__users` for user-level analysis
    columns:
      - name: session__key
        description: |
          Composite key of user__pseudo_id + session__ga_id.

          **Why not just session__ga_id?**
          Session IDs are only unique within a user_pseudo_id. Different users can have
          the same ga_session_id value.
      - name: session__landing_page
        description: |
          First page path (without query parameters) viewed in the session.
          Computed using `FIRST_VALUE(page__path IGNORE NULLS)` to handle events that
          don't have page information.
      - name: session__has_purchase
        description: |
          1 if any event in this session was a purchase, 0 otherwise.
          This flag is on every row in the session, enabling funnel analysis without joins.

The Project README

Create a README at the project root:

# GA4 dbt Project

This project transforms raw GA4 BigQuery exports into analytics-ready tables.

## Quick Start

1. Configure your variables in `dbt_project.yml`:

   ```yaml
   vars:
     ga4_project_id: "your-gcp-project"
     ga4_dataset: "analytics_123456789"
     ga4_start_date: "20230101"
     ga4_static_incremental_days: 3
   ```

2. Run the project:

   ```bash
   # Initial full load
   dbt build --full-refresh

   # Daily incremental
   dbt build
   ```

## Key Models

| Model | Grain | Description |
| --- | --- | --- |
| int__ga4__events_sessionized | Event | Every event with session context |
| mrt__analytics__sessions | Session | One row per session |
| mrt__analytics__users | User | One row per user with stitched identity |

## Architecture

base__ga4__events (sharded → partitioned, cleaned, typed)
    ↓
int__ga4__events_sessionized (sessionized)
    ↓
mrt__analytics__sessions / mrt__analytics__users (marts)

## Known Limitations

- **Data latency:** GA4 data can arrive up to 72 hours late. We reprocess the last 3 days on every run to capture late-arriving events.
- **UI discrepancies:** Expect 1-2% variance compared to GA4 reports. The UI uses sampled data and different aggregation methods.
- **Session attribution:** We use the first non-null source in the session for attribution, which matches GA4's session-scoped traffic source behavior.

Lineage Visualization

Generate dbt docs for visual lineage:

dbt docs generate
dbt docs serve

For a simple Mermaid diagram in your README:

graph LR
    A[events_*] --> B[base__ga4__events]
    B --> C[int__ga4__event_items]
    B --> D[int__ga4__events_sessionized]
    C --> D
    D --> E[mrt__analytics__sessions]
    D --> F[mrt__analytics__users]

Putting It All Together

The Complete dbt_project.yml

name: 'ga4_analytics'
version: '1.0.0'
profile: 'ga4_analytics'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

vars:
  # GCP Configuration
  ga4_project_id: "{{ env_var('GA4_PROJECT_ID') }}"
  ga4_dataset: "{{ env_var('GA4_DATASET', 'analytics_123456789') }}"

  # Processing Configuration
  ga4_start_date: "20230101"
  ga4_static_incremental_days: 3

  # Business Configuration
  ga4_conversion_events:
    - 'purchase'
    - 'sign_up'
    - 'generate_lead'
    - 'contact_form_submit'

  # URL Cleaning
  ga4_query_params_to_remove:
    - 'gclid'
    - 'fbclid'
    - 'utm_id'
    - '_ga'

models:
  ga4_analytics:
    base:
      +schema: base
      +materialized: view
      +tags: ['ga4', 'base']
      ga4:
        +materialized: incremental
    intermediate:
      +schema: intermediate
      +materialized: incremental
      +tags: ['ga4', 'intermediate']
    marts:
      +schema: marts
      +materialized: table
      +tags: ['ga4', 'marts']

tests:
  +severity: warn
  +store_failures: true

Running the Project

Initial setup (full refresh):

# Test connection
dbt debug

# Run full refresh to backfill all data
dbt build --full-refresh

# Generate documentation
dbt docs generate

Daily operations:

# Standard incremental run
dbt build

# Backfill from a specific start date
dbt build --full-refresh --vars '{"ga4_start_date": "20240101"}'

Recommended schedule: Run daily, 4-6 hours after midnight in your primary timezone. This allows time for GA4’s daily export to complete and most late-arriving data to land.
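
If you schedule with plain cron rather than an orchestrator, that guidance translates to something like the following (the path, target name, and log location are placeholders):

# Hypothetical crontab entry: 06:00 daily, after the GA4 daily export has usually landed
0 6 * * * cd /path/to/ga4_analytics && dbt build --target prod >> logs/dbt_cron.log 2>&1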


The Mart Layer

With the wide intermediate table complete, building marts is straightforward.

Session Mart

-- models/marts/ga4/mrt__analytics__sessions.sql
{{
    config(
        materialized='table',
        partition_by={
            "field": "event__date",
            "data_type": "date"
        },
        cluster_by=['session__channel_grouping', 'device__category']
    )
}}

SELECT
    session__key,
    user__pseudo_id,
    session__ga_id,
    session__number,
    -- Timestamps
    -- (event__date is part of the grouping, so a session that spans
    -- midnight produces one row per date)
    event__date,
    MIN(event__timestamp_utc) AS session__started_at,
    MAX(event__timestamp_utc) AS session__ended_at,
    -- User identifier (consumed by the user mart downstream)
    ANY_VALUE(user__id) AS user__id,
    -- Session attributes (identical on every row, so take any)
    ANY_VALUE(session__landing_page) AS session__landing_page,
    ANY_VALUE(session__exit_page) AS session__exit_page,
    ANY_VALUE(session__source_final) AS session__source,
    ANY_VALUE(session__medium_final) AS session__medium,
    ANY_VALUE(session__campaign_final) AS session__campaign,
    ANY_VALUE(session__channel_grouping) AS session__channel_grouping,
    ANY_VALUE(device__category) AS device__category,
    ANY_VALUE(geo__country) AS geo__country,
    -- Engagement metrics
    ANY_VALUE(session__pageviews) AS session__pageviews,
    ANY_VALUE(session__events) AS session__events,
    ANY_VALUE(session__duration_seconds) AS session__duration_seconds,
    ANY_VALUE(session__engagement_time_msec) / 1000 AS session__engagement_time_seconds,
    -- Conversion flags
    ANY_VALUE(session__has_purchase) AS session__has_purchase,
    ANY_VALUE(session__has_add_to_cart) AS session__has_add_to_cart,
    ANY_VALUE(session__has_checkout) AS session__has_checkout,
    ANY_VALUE(session__has_signup) AS session__has_signup,
    -- Revenue
    ANY_VALUE(session__revenue) AS session__revenue
FROM {{ ref('int__ga4__events_sessionized') }}
GROUP BY 1, 2, 3, 4, 5

User Mart

This model incorporates user stitching from Article 4:

-- models/marts/ga4/mrt__analytics__users.sql
{{
    config(
        materialized='table',
        cluster_by=['user__first_seen_at']
    )
}}

WITH user_sessions AS (
    SELECT
        user__pseudo_id,
        user__id,
        session__key,
        event__date,
        session__started_at,
        session__channel_grouping,
        device__category,
        geo__country,
        session__has_purchase,
        session__revenue
    FROM {{ ref('mrt__analytics__sessions') }}
),

user_first_last AS (
    SELECT
        user__pseudo_id,
        -- First seen
        -- ARRAY_AGG(... ORDER BY ... LIMIT 1) picks first/last values
        -- inside a GROUP BY, where window functions aren't allowed
        MIN(event__date) AS user__first_seen_at,
        ARRAY_AGG(session__channel_grouping IGNORE NULLS
            ORDER BY session__started_at ASC LIMIT 1)[SAFE_OFFSET(0)] AS user__first_channel,
        ARRAY_AGG(device__category IGNORE NULLS
            ORDER BY session__started_at ASC LIMIT 1)[SAFE_OFFSET(0)] AS user__first_device,
        -- Last seen
        MAX(event__date) AS user__last_seen_at,
        ARRAY_AGG(session__channel_grouping IGNORE NULLS
            ORDER BY session__started_at DESC LIMIT 1)[SAFE_OFFSET(0)] AS user__last_channel,
        -- Aggregates
        COUNT(DISTINCT session__key) AS user__sessions,
        MAX(session__has_purchase) AS user__has_purchased,
        SUM(session__revenue) AS user__lifetime_revenue
    FROM user_sessions
    GROUP BY user__pseudo_id
),

-- Stitch user_id (from Article 4)
user_identity AS (
    SELECT
        user__pseudo_id,
        -- Most recent non-null user_id wins
        ARRAY_AGG(user__id ORDER BY event__date DESC LIMIT 1)[SAFE_OFFSET(0)] AS user__stitched_id
    FROM user_sessions
    WHERE user__id IS NOT NULL
    GROUP BY user__pseudo_id
)

SELECT
    u.user__pseudo_id,
    i.user__stitched_id AS user__id,
    u.user__first_seen_at,
    u.user__last_seen_at,
    u.user__first_channel,
    u.user__first_device,
    u.user__last_channel,
    u.user__sessions,
    u.user__has_purchased,
    u.user__lifetime_revenue,
    -- Derived metrics
    DATE_DIFF(u.user__last_seen_at, u.user__first_seen_at, DAY) AS user__lifespan_days,
    DATE_DIFF(CURRENT_DATE(), u.user__last_seen_at, DAY) AS user__days_since_last_visit
FROM user_first_last u
LEFT JOIN user_identity i ON u.user__pseudo_id = i.user__pseudo_id

Common Pitfalls and Solutions

After implementing this pattern across multiple projects, these issues come up repeatedly:

| Pitfall | Symptom | Solution |
| --- | --- | --- |
| Resource exceeded | Query fails with "Resources exceeded" | Add partitioning to the intermediate model; reduce the lookback window; use clustering |
| Slow builds | Incremental runs take 30+ minutes | Check for full table scans in the compiled SQL; ensure the WHERE clause uses the partition column |
| Data mismatch vs. UI | 1-2% variance with GA4 reports | Expected. The GA4 UI uses sampling and different aggregation. Document and accept. |
| Missing late data | Conversions appear days later | Increase ga4_static_incremental_days to 4-5 |
| Session IDs collide | Duplicate session__keys | Always use the composite key: user__pseudo_id + session__ga_id |
| Landing page is NULL | Many sessions show a NULL landing page | Use IGNORE NULLS in the window function; some events lack page__location |
| Channel grouping differs | Your "Paid Search" doesn't match GA4 | Align the regex patterns in the macro with your UTM conventions |
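
For the slow-build row in particular, the fastest diagnostic is reading what dbt actually sends to BigQuery (the compiled path follows dbt's standard target/compiled/<project> layout):

# Render the model's SQL without running it
dbt compile --select int__ga4__events_sessionized

# Check the rendered SQL for a filter on the partition column (event__date)
cat target/compiled/ga4_analytics/models/intermediate/ga4/int__ga4__events_sessionized.sql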

Conclusion

You now have a complete, production-ready GA4 dbt project:

  • Single base model that converts sharded tables to partitioned, cleans data, and extracts parameters
  • Reusable macros for parameter extraction and channel grouping
  • Wide intermediate table where every event carries session context
  • Mart layer with session and user tables
  • Comprehensive testing that catches GA4-specific issues
  • Documentation that makes the project maintainable

This architecture ties together everything from the series: the schema knowledge from Article 1, unnesting patterns from Article 2, sessionization logic from Article 3, and user stitching from Article 4.

The wide intermediate table approach differs from typical GA4 packages that build session-grain marts directly. By keeping event-level grain with session context attached, you enable flexible downstream analysis (funnel conversion rates, event-level attribution, engagement analysis) without repeatedly joining back to session tables.