Adrienne Vermorel
GA4 + dbt: A Production-Ready Project Template
You’ve mastered the GA4 BigQuery schema. You can unnest event parameters in your sleep. You’ve built session tables and implemented user stitching. Now comes the real challenge: combining everything into a dbt project that actually works in production.
This article delivers a complete project template centered on one core idea: a wide intermediate table at event-level grain with sessionized columns. Every event row carries its session context (landing page, exit page, whether the session converted, total pageviews). This design enables flexible downstream analysis without repeatedly joining session tables.
We’ll build this from scratch, borrowing patterns from established packages where they make sense. You’ll get incremental processing that handles late-arriving data, tests that catch GA4-specific issues, and documentation that keeps the project maintainable.
What We Can Learn from Existing Packages
Before building custom, let’s examine what’s already out there. The open-source ecosystem has converged on a few approaches worth understanding.
Velir/dbt-ga4: The Community Standard
The Velir/dbt-ga4 package dominates the space with 380+ GitHub stars and active development. Its creator positioned it as a home for communal knowledge around GA4 and BigQuery.
The architecture follows a clean base-to-marts pattern:
```
models/
├── base/
│   ├── base/base_ga4__events
│   ├── stg_ga4__events
│   ├── stg_ga4__event_page_view
│   └── stg_ga4__sessions_traffic_sources
└── marts/
    └── core/
        ├── dim_ga4__sessions
        └── fct_ga4__sessions
```

Three patterns from Velir worth borrowing:
Variable-driven parameter extraction. Rather than hardcoding which event parameters to extract, Velir uses dbt_project.yml variables:
```yaml
vars:
  ga4:
    page_view_custom_parameters:
      - name: "clean_event"
        value_type: "string_value"
```

Static incremental lookback. Instead of dynamically querying the destination table for the latest date (which prevents BigQuery from pruning shards), Velir reprocesses the last N days on every run. We’ll adopt this pattern.
Composite session keys. Sessions are identified by user_pseudo_id + ga_session_id, not ga_session_id alone. This prevents collisions across users.
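As a sketch of why the composite key matters (the table path and MD5-based key format here are illustrative, not taken from any particular package):

```sql
-- Illustrative: build a collision-safe session key directly on the raw
-- export. The same ga_session_id under two different user_pseudo_ids
-- yields two distinct session keys.
SELECT
    user_pseudo_id,
    (SELECT value.int_value
     FROM UNNEST(event_params)
     WHERE key = 'ga_session_id') AS ga_session_id,
    TO_HEX(MD5(CONCAT(
        user_pseudo_id,
        '-',
        CAST((SELECT value.int_value
              FROM UNNEST(event_params)
              WHERE key = 'ga_session_id') AS STRING)
    ))) AS session_key
FROM `my-project.analytics_123456789.events_*`
```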
One caveat: the package’s non-partitioned session models become expensive at scale. The documentation explicitly recommends disabling certain models for high-traffic sites.
Other Notable Packages
Fivetran/dbt_ga4_export targets teams using Fivetran connectors. It supports multiple databases (Snowflake, Redshift, Databricks) but provides only 7 models total, assuming event parameters are flattened during ingestion.
MO Data Consulting’s package takes a different approach: dynamically flattening all event_params into individual columns at compile time. Convenient for exploration, but can create unwieldy tables.
admindio/simple_ga4_dbt offers a streamlined, cost-efficient approach with built-in attribution models and minimal dependencies.
Why Build Custom?
These packages optimize for session-grain marts. Our goal is different: a wide event-level table where every row carries session context. This enables:
- Funnel analysis without session joins
- Event-level attribution with session awareness
- Flexible aggregation at any grain downstream
We’ll build our own structure, borrowing the best patterns from existing work.
Project Architecture
Folder Structure
```
models/
├── base/
│   └── ga4/
│       ├── _ga4__sources.yml
│       ├── _ga4__models.yml
│       └── base__ga4__events.sql              ← Single base model
├── intermediate/
│   └── ga4/
│       ├── _int_ga4__models.yml
│       ├── int__ga4__event_items.sql          ← Item-level grain
│       └── int__ga4__events_sessionized.sql   ← The main table
└── marts/
    └── ga4/
        ├── _mrt_ga4__models.yml
        ├── mrt__analytics__sessions.sql
        └── mrt__analytics__users.sql
```
```
macros/
└── ga4/
    ├── extract_event_param.sql
    ├── generate_surrogate_key.sql
    └── default_channel_grouping.sql
```

How This Ties to the Series
Each previous article maps to a component:
| Article | Maps to |
|---|---|
| 1. Schema Reference | base__ga4__events.sql (sharded to partitioned + cleaning) |
| 2. Unnesting Patterns | Parameter extraction macros in base__ga4__events.sql |
| 3. Session Tables | Session columns in int__ga4__events_sessionized.sql |
| 4. User Stitching | Identity resolution feeding mrt__analytics__users.sql |
Naming Conventions
We follow dbt’s standard prefixes:
- `base_` (source data, cleaned and typed)
- `int_` (intermediate transformations, not exposed to end users)
- `mrt_` (mart tables for business reporting)
The Base Layer
GA4 exports data as date-sharded tables (events_YYYYMMDD) rather than a single partitioned table. This breaks standard incremental strategies because dynamic WHERE clauses cause full table scans (BigQuery must scan all shards to evaluate the filter).
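To make the failure mode concrete, here is a sketch of the anti-pattern (table and column names are illustrative). BigQuery can only prune shards when the `_TABLE_SUFFIX` filter is a constant expression:

```sql
-- Anti-pattern (illustrative): the subquery makes the suffix filter
-- non-constant, so BigQuery scans every events_YYYYMMDD shard to
-- evaluate it; you pay for the full history on every run
SELECT *
FROM `my-project.analytics_123456789.events_*`
WHERE _TABLE_SUFFIX > (
    SELECT FORMAT_DATE('%Y%m%d', MAX(event__date))
    FROM `my-project.dataset.base__ga4__events`
)
```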
Our single base model handles everything: sharded-to-partitioned conversion, cleaning, typing, and parameter extraction.
```sql
-- models/base/ga4/base__ga4__events.sql

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      "field": "event__date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=['user__pseudo_id', 'session__key', 'event__name']
  )
}}

{% set lookback_days = var('ga4_static_incremental_days', 3) %}

WITH source AS (

    SELECT
        event_date,
        event_timestamp,
        event_name,
        event_params,
        event_bundle_sequence_id,
        user_id,
        user_pseudo_id,
        user_properties,
        device,
        geo,
        ecommerce,
        items,
        collected_traffic_source,
        is_active_user
    FROM {{ source('ga4', 'events') }}
    WHERE _TABLE_SUFFIX >= '{{ var("ga4_start_date", "20230101") }}'
    {% if is_incremental() %}
      AND _TABLE_SUFFIX >= FORMAT_DATE(
          '%Y%m%d',
          DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
      )
    {% endif %}
      -- Exclude intraday tables
      AND _TABLE_SUFFIX NOT LIKE '%intraday%'

),

cleaned AS (

    SELECT
        -- Timestamps
        PARSE_DATE('%Y%m%d', event_date) AS event__date,
        TIMESTAMP_MICROS(event_timestamp) AS event__timestamp_utc,
        event_timestamp AS event__timestamp_raw,

        -- User identifiers
        user_pseudo_id AS user__pseudo_id,
        user_id AS user__id,

        -- Session identifiers
        {{ extract_event_param('event_params', 'ga_session_id', 'int_value') }} AS session__ga_id,
        {{ extract_event_param('event_params', 'ga_session_number', 'int_value') }} AS session__number,

        -- Event details
        event_name AS event__name,
        event_bundle_sequence_id AS event__bundle_sequence_id,

        -- Page information
        {{ extract_event_param('event_params', 'page_location', 'string_value') }} AS page__location,
        {{ extract_event_param('event_params', 'page_title', 'string_value') }} AS page__title,
        {{ extract_event_param('event_params', 'page_referrer', 'string_value') }} AS page__referrer,

        -- Traffic source (event-level)
        {{ extract_event_param('event_params', 'source', 'string_value') }} AS event__source,
        {{ extract_event_param('event_params', 'medium', 'string_value') }} AS event__medium,
        {{ extract_event_param('event_params', 'campaign', 'string_value') }} AS event__campaign,

        -- Engagement
        {{ extract_event_param('event_params', 'session_engaged', 'string_value') }} AS session__is_engaged,
        {{ extract_event_param('event_params', 'engagement_time_msec', 'int_value') }} AS event__engagement_time_msec,
        {{ extract_event_param('event_params', 'entrances', 'int_value') }} AS event__is_entrance,

        -- Device and geo
        device.category AS device__category,
        device.mobile_brand_name AS device__brand,
        device.operating_system AS device__os,
        device.web_info.browser AS device__browser,
        geo.country AS geo__country,
        geo.city AS geo__city,

        -- Traffic source (session-level, from collected_traffic_source)
        collected_traffic_source.manual_source AS session__source,
        collected_traffic_source.manual_medium AS session__medium,
        collected_traffic_source.manual_campaign_name AS session__campaign,
        collected_traffic_source.gclid AS session__gclid,

        -- Ecommerce
        ecommerce.transaction_id AS transaction__id,
        ecommerce.purchase_revenue AS transaction__revenue,

        -- Keep nested structures for downstream models
        event_params AS event__params,
        user_properties AS user__properties,
        items AS items__array

    FROM source

)

SELECT
    -- Primary key
    {{ generate_surrogate_key(['user__pseudo_id', 'event__timestamp_raw', 'event__name']) }} AS event__key,

    -- Session key (composite)
    {{ generate_surrogate_key(['user__pseudo_id', 'session__ga_id']) }} AS session__key,

    -- Timestamps
    event__date,
    event__timestamp_utc,
    event__timestamp_raw,

    -- User identifiers
    user__pseudo_id,
    user__id,

    -- Session identifiers
    session__ga_id,
    session__number,

    -- Event details
    event__name,
    event__bundle_sequence_id,

    -- Page information
    page__location,
    page__title,
    page__referrer,

    -- Traffic source (event-level)
    event__source,
    event__medium,
    event__campaign,

    -- Engagement
    session__is_engaged,
    event__engagement_time_msec,
    event__is_entrance,

    -- Device and geo
    device__category,
    device__brand,
    device__os,
    device__browser,
    geo__country,
    geo__city,

    -- Traffic source (session-level)
    session__source,
    session__medium,
    session__campaign,
    session__gclid,

    -- Ecommerce
    transaction__id,
    transaction__revenue,

    -- Nested structures
    event__params,
    user__properties,
    items__array

FROM cleaned
```

Why a 3-day lookback? GA4 can update events for 72+ hours after the initial export. Conversions tracked via Google Ads, cross-device stitching updates, and consent-mode adjustments all cause late-arriving data. The static lookback reprocesses recent partitions on every run, ensuring nothing is missed.
Why insert_overwrite? For event data (immutable, date-partitioned), this strategy replaces entire partitions cleanly. It’s faster and cheaper than merge, which scans both source and destination.
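Conceptually, an insert_overwrite run compiles to something like the following (a simplified sketch, not the exact SQL dbt-bigquery generates; table names and dates are illustrative):

```sql
-- Sketch of what insert_overwrite does under the hood: new rows for the
-- lookback window are staged, then the touched date partitions are
-- replaced wholesale via a MERGE whose join condition never matches
MERGE INTO `my-project.dataset.base__ga4__events` AS dest
USING (
    SELECT * FROM `my-project.dataset.base__ga4__events__dbt_tmp`
) AS src
ON FALSE
WHEN NOT MATCHED BY SOURCE
    AND dest.event__date IN (DATE '2024-06-01', DATE '2024-06-02', DATE '2024-06-03')
    THEN DELETE
WHEN NOT MATCHED THEN INSERT ROW
```

Because `ON FALSE` never matches, every staged row is inserted and every existing row in the targeted partitions is deleted, so each partition is replaced atomically.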
Macros
The Parameter Extraction Macro
The extract_event_param macro encapsulates the inline subquery pattern:
```sql
-- macros/ga4/extract_event_param.sql

{% macro extract_event_param(params_column, param_key, value_type='string_value') %}
    (
        SELECT value.{{ value_type }}
        FROM UNNEST({{ params_column }})
        WHERE key = '{{ param_key }}'
        LIMIT 1
    )
{% endmacro %}
```

This approach avoids row multiplication (no Cartesian product from UNNEST) while keeping the SQL readable. For numeric parameters stored in different value fields:

```sql
{% macro extract_event_param_numeric(params_column, param_key) %}
    (
        SELECT COALESCE(
            value.int_value,
            CAST(value.float_value AS INT64),
            CAST(value.double_value AS INT64)
        )
        FROM UNNEST({{ params_column }})
        WHERE key = '{{ param_key }}'
        LIMIT 1
    )
{% endmacro %}
```

The Intermediate Layer
Handling the Items Array
Ecommerce items require a separate model at item-level grain:
```sql
-- models/intermediate/ga4/int__ga4__event_items.sql

{{
  config(
    materialized='view'
  )
}}

WITH events_with_items AS (

    SELECT
        event__key,
        event__date,
        event__timestamp_utc,
        user__pseudo_id,
        session__key,
        event__name,
        transaction__id,
        items__array
    FROM {{ ref('base__ga4__events') }}
    WHERE event__name IN (
        'purchase', 'add_to_cart', 'remove_from_cart',
        'view_item', 'begin_checkout', 'add_payment_info'
    )
      AND ARRAY_LENGTH(items__array) > 0

)

SELECT
    e.event__key,
    e.event__date,
    e.event__timestamp_utc,
    e.user__pseudo_id,
    e.session__key,
    e.event__name,
    e.transaction__id,

    -- Item details
    item.item_id AS item__id,
    item.item_name AS item__name,
    item.item_brand AS item__brand,
    item.item_category AS item__category,
    item.item_category2 AS item__category2,
    item.item_category3 AS item__category3,
    item.item_variant AS item__variant,
    item.price AS item__price,
    item.quantity AS item__quantity,
    item.coupon AS item__coupon,
    item.item_list_name AS item__list_name,
    item.item_list_index AS item__list_index,

    -- Calculate item revenue
    COALESCE(item.price, 0) * COALESCE(item.quantity, 1) AS item__revenue

FROM events_with_items e,
    UNNEST(e.items__array) AS item
```

This model intentionally uses Cartesian UNNEST because we need the item-level grain.
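As a quick usage sketch (the metric choices are illustrative), the item grain makes product-level questions one GROUP BY away:

```sql
-- Illustrative: top products by purchase revenue over the last 28 days
SELECT
    item__name,
    SUM(item__quantity) AS units_sold,
    SUM(item__revenue) AS revenue
FROM {{ ref('int__ga4__event_items') }}
WHERE event__name = 'purchase'
  AND event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY item__name
ORDER BY revenue DESC
LIMIT 20
```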
The Wide Sessionized Table
Every event row gets session context attached via window functions.
```sql
-- models/intermediate/ga4/int__ga4__events_sessionized.sql

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      "field": "event__date",
      "data_type": "date",
      "granularity": "day"
    },
    cluster_by=['user__pseudo_id', 'session__key', 'event__name']
  )
}}

{% set lookback_days = var('ga4_static_incremental_days', 3) %}

WITH events AS (

    SELECT
        event__key,
        session__key,
        event__date,
        event__timestamp_utc,
        user__pseudo_id,
        user__id,
        session__ga_id,
        session__number,
        event__name,
        page__location,
        page__title,
        event__source,
        event__medium,
        event__campaign,
        session__source,
        session__medium,
        session__campaign,
        event__engagement_time_msec,
        device__category,
        geo__country,
        transaction__revenue
    FROM {{ ref('base__ga4__events') }}
    {% if is_incremental() %}
    WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
    {% endif %}

),

-- Aggregate items per event (avoid repeated joins)
event_items AS (

    SELECT
        event__key,
        COUNT(*) AS event__items,
        SUM(item__revenue) AS event__items_revenue
    FROM {{ ref('int__ga4__event_items') }}
    {% if is_incremental() %}
    WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
    {% endif %}
    GROUP BY event__key

),

-- Clean page URLs for landing/exit page analysis
with_clean_urls AS (

    SELECT
        *,
        -- Remove query parameters and fragments for cleaner landing/exit pages
        REGEXP_REPLACE(
            SPLIT(page__location, '?')[SAFE_OFFSET(0)],
            r'#.*$',
            ''
        ) AS page__path
    FROM events

),

-- Add session-level metrics via window functions
with_session_metrics AS (

    SELECT
        e.*,

        -- Landing and exit pages
        FIRST_VALUE(page__path IGNORE NULLS) OVER (session_window) AS session__landing_page,
        FIRST_VALUE(page__location IGNORE NULLS) OVER (session_window) AS session__landing_page_full,
        LAST_VALUE(page__path IGNORE NULLS) OVER (session_window_full) AS session__exit_page,

        -- Session traffic source (first non-null source in session)
        FIRST_VALUE(
            COALESCE(event__source, session__source) IGNORE NULLS
        ) OVER (session_window) AS session__source_final,
        FIRST_VALUE(
            COALESCE(event__medium, session__medium) IGNORE NULLS
        ) OVER (session_window) AS session__medium_final,
        FIRST_VALUE(
            COALESCE(event__campaign, session__campaign) IGNORE NULLS
        ) OVER (session_window) AS session__campaign_final,

        -- Session engagement metrics
        COUNT(*) OVER (PARTITION BY session__key) AS session__events,
        COUNT(CASE WHEN event__name = 'page_view' THEN 1 END)
            OVER (PARTITION BY session__key) AS session__pageviews,
        SUM(event__engagement_time_msec)
            OVER (PARTITION BY session__key) AS session__engagement_time_msec,

        -- Session duration
        TIMESTAMP_DIFF(
            MAX(event__timestamp_utc) OVER (PARTITION BY session__key),
            MIN(event__timestamp_utc) OVER (PARTITION BY session__key),
            SECOND
        ) AS session__duration_seconds,

        -- Session conversion flags
        MAX(CASE WHEN event__name = 'purchase' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_purchase,
        MAX(CASE WHEN event__name = 'add_to_cart' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_add_to_cart,
        MAX(CASE WHEN event__name = 'begin_checkout' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_checkout,
        MAX(CASE WHEN event__name = 'sign_up' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_signup,
        MAX(CASE WHEN event__name = 'generate_lead' THEN 1 ELSE 0 END)
            OVER (PARTITION BY session__key) AS session__has_lead,

        -- Session revenue
        SUM(transaction__revenue) OVER (PARTITION BY session__key) AS session__revenue,

        -- Event sequencing within session
        ROW_NUMBER() OVER (
            PARTITION BY session__key
            ORDER BY event__timestamp_utc
        ) AS event__sequence_number,

        -- Is this the first/last event in session?
        CASE
            WHEN ROW_NUMBER() OVER (
                PARTITION BY session__key
                ORDER BY event__timestamp_utc
            ) = 1 THEN TRUE
            ELSE FALSE
        END AS event__is_session_start,

        CASE
            WHEN ROW_NUMBER() OVER (
                PARTITION BY session__key
                ORDER BY event__timestamp_utc DESC
            ) = 1 THEN TRUE
            ELSE FALSE
        END AS event__is_session_end,

        -- Time since previous event (for engagement analysis)
        TIMESTAMP_DIFF(
            event__timestamp_utc,
            LAG(event__timestamp_utc) OVER (
                PARTITION BY session__key
                ORDER BY event__timestamp_utc
            ),
            SECOND
        ) AS event__seconds_since_previous

    FROM with_clean_urls e
    WINDOW
        session_window AS (
            PARTITION BY session__key
            ORDER BY event__timestamp_utc
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ),
        session_window_full AS (
            PARTITION BY session__key
            ORDER BY event__timestamp_utc
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        )

),

-- Add channel grouping
with_channel AS (

    SELECT
        *,
        {{ default_channel_grouping('session__source_final', 'session__medium_final') }} AS session__channel_grouping
    FROM with_session_metrics

),

-- Join item aggregates
final AS (

    SELECT
        e.*,
        COALESCE(i.event__items, 0) AS event__items,
        COALESCE(i.event__items_revenue, 0) AS event__items_revenue
    FROM with_channel e
    LEFT JOIN event_items i
        ON e.event__key = i.event__key

)

SELECT * FROM final
```

Critical Window Function Details
The LAST_VALUE trap. Without explicit framing, the default window frame ends at the current row, so LAST_VALUE returns the current row’s value:

```sql
-- WRONG: returns the current row's page
LAST_VALUE(page__path) OVER (
    PARTITION BY session__key
    ORDER BY event__timestamp_utc
)

-- CORRECT: returns the actual last page of the session
LAST_VALUE(page__path) OVER (
    PARTITION BY session__key
    ORDER BY event__timestamp_utc
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
```

IGNORE NULLS matters. GA4 doesn’t populate page_location for every event. Using IGNORE NULLS ensures the window functions skip those rows and return actual page values:

```sql
FIRST_VALUE(page__path IGNORE NULLS) OVER (session_window) AS session__landing_page
```

Boolean flags via MAX. To propagate a flag to all events in a session:

```sql
MAX(CASE WHEN event__name = 'purchase' THEN 1 ELSE 0 END)
    OVER (PARTITION BY session__key) AS session__has_purchase
```

The Channel Grouping Macro
Google’s default channel grouping logic, encapsulated in a reusable macro:
```sql
-- macros/ga4/default_channel_grouping.sql

{% macro default_channel_grouping(source_column, medium_column) %}
    CASE
        -- Direct
        WHEN {{ source_column }} IS NULL
            OR {{ source_column }} = '(direct)'
            OR {{ source_column }} = '(none)'
            THEN 'Direct'

        -- Paid Search
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(cpc|ppc|paidsearch)$')
            OR ({{ source_column }} = 'google' AND {{ medium_column }} = 'sem')
            THEN 'Paid Search'

        -- Paid Social
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(paidsocial|paid-social|paid_social)$')
            OR (
                REGEXP_CONTAINS({{ source_column }}, r'facebook|instagram|linkedin|twitter|tiktok')
                AND REGEXP_CONTAINS({{ medium_column }}, r'^(cpc|ppc|paid)')
            )
            THEN 'Paid Social'

        -- Display
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(display|cpm|banner)$')
            THEN 'Display'

        -- Organic Search
        WHEN {{ medium_column }} = 'organic'
            OR (
                REGEXP_CONTAINS({{ source_column }}, r'google|bing|yahoo|duckduckgo|baidu')
                AND {{ medium_column }} = 'organic'
            )
            THEN 'Organic Search'

        -- Organic Social
        WHEN REGEXP_CONTAINS({{ source_column }}, r'facebook|instagram|linkedin|twitter|tiktok|pinterest|youtube')
            AND (
                {{ medium_column }} IS NULL
                OR {{ medium_column }} IN ('social', 'referral', '(none)')
            )
            THEN 'Organic Social'

        -- Email
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(email|e-mail|e_mail)$')
            THEN 'Email'

        -- Affiliates
        WHEN REGEXP_CONTAINS({{ medium_column }}, r'^(affiliate|affiliates)$')
            THEN 'Affiliates'

        -- Referral
        WHEN {{ medium_column }} = 'referral'
            THEN 'Referral'

        -- Other
        ELSE 'Other'
    END
{% endmacro %}
```

Testing and Data Quality
Standard dbt tests miss GA4-specific issues. This section covers a comprehensive testing strategy.
Source Configuration with Freshness
```yaml
version: 2

sources:
  - name: ga4
    database: "{{ var('ga4_project_id') }}"
    schema: "{{ var('ga4_dataset') }}"
    description: "GA4 BigQuery Export"

    tables:
      - name: events
        identifier: "events_*"
        description: "Date-sharded GA4 events table"
        loaded_at_field: "TIMESTAMP_MICROS(event_timestamp)"
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
```

Base Model Tests
```yaml
version: 2

models:
  - name: base__ga4__events
    description: "GA4 events: sharded to partitioned, cleaned, typed, with session keys"

    columns:
      - name: event__key
        description: "Surrogate key for the event"
        tests:
          - unique
          - not_null

      - name: user__pseudo_id
        description: "Client ID (device identifier)"
        tests:
          - not_null

      - name: event__timestamp_utc
        description: "Event timestamp in UTC"
        tests:
          - not_null

      - name: session__key
        description: "Composite session identifier"
        tests:
          - not_null
```

GA4-Specific Singular Tests
Test for missing session_start events:
```sql
-- tests/singular/test_sessions_missing_session_start.sql

-- Sessions should have a session_start event.
-- Some sessions may legitimately miss it (ad blockers, consent),
-- but a high percentage indicates tracking issues.

WITH session_stats AS (

    SELECT
        session__key,
        COUNT(*) AS session__events,
        MAX(CASE WHEN event__name = 'session_start' THEN 1 ELSE 0 END) AS session__has_start
    FROM {{ ref('int__ga4__events_sessionized') }}
    WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
    GROUP BY session__key

)

SELECT
    session__key,
    session__events,
    session__has_start
FROM session_stats
WHERE session__has_start = 0
  AND session__events > 3  -- Legitimate sessions with multiple events
LIMIT 100
```

Test for orphaned transactions:
```sql
-- tests/singular/test_purchase_without_session.sql

-- Purchases should have valid session context

SELECT
    event__key,
    event__date,
    session__ga_id,
    user__pseudo_id
FROM {{ ref('base__ga4__events') }}
WHERE event__name = 'purchase'
  AND transaction__id IS NOT NULL
  AND (session__ga_id IS NULL OR user__pseudo_id IS NULL)
  AND event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
LIMIT 100
```

Schema Tests for the Wide Table
```yaml
version: 2

models:
  - name: int__ga4__events_sessionized
    description: "Wide event table with session context on every row"

    columns:
      - name: event__key
        tests:
          - unique
          - not_null

      - name: session__key
        tests:
          - not_null

      - name: session__landing_page
        description: "First page path in the session"
        tests:
          - not_null:
              where: "event__name = 'page_view'"

      - name: session__pageviews
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000  # Flag suspicious sessions

      - name: session__duration_seconds
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 86400  # 24 hours max

      - name: session__channel_grouping
        tests:
          - accepted_values:
              values: ['Direct', 'Organic Search', 'Paid Search', 'Paid Social',
                       'Organic Social', 'Email', 'Referral', 'Display',
                       'Affiliates', 'Other']
```

Documentation and Lineage
Good documentation makes the project maintainable.
Source Documentation
```yaml
version: 2

sources:
  - name: ga4
    description: |
      GA4 BigQuery Export tables.

      See [Google's schema documentation](https://support.google.com/analytics/answer/7029846)
      for the complete field reference.

      **Important notes:**
      - Tables are date-sharded as `events_YYYYMMDD`
      - Intraday tables (`events_intraday_YYYYMMDD`) are excluded from processing
      - Data can arrive up to 72 hours late

    tables:
      - name: events
        description: |
          Raw GA4 event data exported to BigQuery.

          Each row represents a single event (page_view, purchase, custom
          events, etc.). Event parameters are stored in nested RECORD fields
          that require unnesting.
```

Model Documentation
```yaml
version: 2

models:
  - name: int__ga4__events_sessionized
    description: |
      Wide event table with session context attached to every row.

      **Grain:** One row per event

      **Key design decisions:**
      - Session metrics (session__landing_page, session__exit_page,
        session__has_purchase) are computed via window functions and
        attached to every event in the session
      - Traffic source uses the first non-null source in the session
        (matches GA4 behavior)
      - Channel grouping follows Google's default rules

      **Usage:**
      - Use for event-level analysis with session context
      - Aggregate to session grain using `WHERE event__is_session_start = TRUE`
      - Join to `mrt__analytics__users` for user-level analysis

    columns:
      - name: session__key
        description: |
          Composite key of user__pseudo_id + session__ga_id.

          **Why not just session__ga_id?** Session IDs are only unique within
          a user_pseudo_id. Different users can have the same ga_session_id value.

      - name: session__landing_page
        description: |
          First page path (without query parameters) viewed in the session.

          Computed using `FIRST_VALUE(page__path IGNORE NULLS)` to handle
          events that don't have page information.

      - name: session__has_purchase
        description: |
          1 if any event in this session was a purchase, 0 otherwise.

          This flag is on every row in the session, enabling funnel analysis
          without joins.
```

The Project README
Create a README at the project root:
# GA4 dbt Project
This project transforms raw GA4 BigQuery exports into analytics-ready tables.
## Quick Start
1. Configure your variables in `dbt_project.yml`:

   ```yaml
   vars:
     ga4_project_id: "your-gcp-project"
     ga4_dataset: "analytics_123456789"
     ga4_start_date: "20230101"
     ga4_static_incremental_days: 3
   ```

2. Run the project:

   ```bash
   # Initial full load
   dbt build --full-refresh

   # Daily incremental
   dbt build
   ```
## Key Models

| Model | Grain | Description |
|---|---|---|
| `int__ga4__events_sessionized` | Event | Every event with session context |
| `mrt__analytics__sessions` | Session | One row per session |
| `mrt__analytics__users` | User | One row per user with stitched identity |
## Architecture

```
base__ga4__events            (sharded → partitioned, cleaned, typed)
        ↓
int__ga4__events_sessionized (sessionized)
        ↓
mrt__analytics__sessions / mrt__analytics__users (marts)
```

## Known Limitations
- **Data latency:** GA4 data can arrive up to 72 hours late. We reprocess the last 3 days on every run to capture late-arriving events.
- **UI discrepancies:** Expect 1-2% variance compared to GA4 reports. The UI uses sampled data and different aggregation methods.
- **Session attribution:** We use the first non-null source in the session for attribution, which matches GA4’s session-scoped traffic source behavior.
Lineage Visualization
Generate dbt docs for visual lineage:
```bash
dbt docs generate
dbt docs serve
```

For a simple Mermaid diagram in your README:
```mermaid
graph LR
    A[events_*] --> B[base__ga4__events]
    B --> C[int__ga4__event_items]
    B --> D[int__ga4__events_sessionized]
    C --> D
    D --> E[mrt__analytics__sessions]
    D --> F[mrt__analytics__users]
```

Putting It All Together
The Complete dbt_project.yml
```yaml
name: 'ga4_analytics'
version: '1.0.0'

profile: 'ga4_analytics'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]

target-path: "target"
clean-targets:
  - "target"
  - "dbt_packages"

vars:
  # GCP Configuration
  ga4_project_id: "{{ env_var('GA4_PROJECT_ID') }}"
  ga4_dataset: "{{ env_var('GA4_DATASET', 'analytics_123456789') }}"

  # Processing Configuration
  ga4_start_date: "20230101"
  ga4_static_incremental_days: 3

  # Business Configuration
  ga4_conversion_events:
    - 'purchase'
    - 'sign_up'
    - 'generate_lead'
    - 'contact_form_submit'

  # URL Cleaning
  ga4_query_params_to_remove:
    - 'gclid'
    - 'fbclid'
    - 'utm_id'
    - '_ga'

models:
  ga4_analytics:
    base:
      +schema: base
      +materialized: view
      +tags: ['ga4', 'base']
      ga4:
        +materialized: incremental

    intermediate:
      +schema: intermediate
      +materialized: incremental
      +tags: ['ga4', 'intermediate']

    marts:
      +schema: marts
      +materialized: table
      +tags: ['ga4', 'marts']

tests:
  +severity: warn
  +store_failures: true
```

Running the Project
Initial setup (full refresh):
```bash
# Test connection
dbt debug

# Run full refresh to backfill all data
dbt build --full-refresh

# Generate documentation
dbt docs generate
```

Daily operations:

```bash
# Standard incremental run
dbt build

# Run with a specific start date (for backfills)
dbt build --vars '{"ga4_start_date": "20240101"}'
```

Recommended schedule: Run daily, 4-6 hours after midnight in your primary timezone. This allows time for GA4’s daily export to complete and most late-arriving data to land.
The Mart Layer
With the wide intermediate table complete, building marts is straightforward.
Session Mart
```sql
-- models/marts/ga4/mrt__analytics__sessions.sql

{{
  config(
    materialized='table',
    partition_by={
      "field": "event__date",
      "data_type": "date"
    },
    cluster_by=['session__channel_grouping', 'device__category']
  )
}}

SELECT
    session__key,
    user__pseudo_id,
    session__ga_id,
    session__number,

    -- Timestamps
    event__date,
    MIN(event__timestamp_utc) AS session__started_at,
    MAX(event__timestamp_utc) AS session__ended_at,

    -- Session attributes (identical on every row, so take any)
    ANY_VALUE(session__landing_page) AS session__landing_page,
    ANY_VALUE(session__exit_page) AS session__exit_page,
    ANY_VALUE(session__source_final) AS session__source,
    ANY_VALUE(session__medium_final) AS session__medium,
    ANY_VALUE(session__campaign_final) AS session__campaign,
    ANY_VALUE(session__channel_grouping) AS session__channel_grouping,
    ANY_VALUE(device__category) AS device__category,
    ANY_VALUE(geo__country) AS geo__country,

    -- Engagement metrics
    ANY_VALUE(session__pageviews) AS session__pageviews,
    ANY_VALUE(session__events) AS session__events,
    ANY_VALUE(session__duration_seconds) AS session__duration_seconds,
    ANY_VALUE(session__engagement_time_msec) / 1000 AS session__engagement_time_seconds,

    -- Conversion flags
    ANY_VALUE(session__has_purchase) AS session__has_purchase,
    ANY_VALUE(session__has_add_to_cart) AS session__has_add_to_cart,
    ANY_VALUE(session__has_checkout) AS session__has_checkout,
    ANY_VALUE(session__has_signup) AS session__has_signup,

    -- Revenue
    ANY_VALUE(session__revenue) AS session__revenue

FROM {{ ref('int__ga4__events_sessionized') }}
GROUP BY 1, 2, 3, 4, 5
```

User Mart
This model incorporates user stitching from Article 4:
```sql
-- models/marts/ga4/mrt__analytics__users.sql

{{
  config(
    materialized='table',
    cluster_by=['user__first_seen_at']
  )
}}

WITH user_sessions AS (

    SELECT
        user__pseudo_id,
        user__id,
        session__key,
        event__date,
        session__started_at,
        session__channel_grouping,
        device__category,
        geo__country,
        session__has_purchase,
        session__revenue
    FROM {{ ref('mrt__analytics__sessions') }}

),

user_first_last AS (

    -- Window functions can't be mixed with GROUP BY in the same SELECT,
    -- so first/last attributes use ordered ARRAY_AGG instead of
    -- FIRST_VALUE/LAST_VALUE
    SELECT
        user__pseudo_id,

        -- First seen
        MIN(event__date) AS user__first_seen_at,
        ARRAY_AGG(session__channel_grouping IGNORE NULLS
                  ORDER BY session__started_at ASC LIMIT 1)[SAFE_OFFSET(0)] AS user__first_channel,
        ARRAY_AGG(device__category IGNORE NULLS
                  ORDER BY session__started_at ASC LIMIT 1)[SAFE_OFFSET(0)] AS user__first_device,

        -- Last seen
        MAX(event__date) AS user__last_seen_at,
        ARRAY_AGG(session__channel_grouping IGNORE NULLS
                  ORDER BY session__started_at DESC LIMIT 1)[SAFE_OFFSET(0)] AS user__last_channel,

        -- Aggregates
        COUNT(DISTINCT session__key) AS user__sessions,
        MAX(session__has_purchase) AS user__has_purchased,
        SUM(session__revenue) AS user__lifetime_revenue

    FROM user_sessions
    GROUP BY user__pseudo_id

),

-- Stitch user_id (from Article 4)
user_identity AS (

    SELECT DISTINCT
        user__pseudo_id,
        FIRST_VALUE(user__id IGNORE NULLS) OVER (
            PARTITION BY user__pseudo_id
            ORDER BY event__date DESC
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS user__stitched_id
    FROM user_sessions
    WHERE user__id IS NOT NULL

)

SELECT
    u.user__pseudo_id,
    i.user__stitched_id AS user__id,
    u.user__first_seen_at,
    u.user__last_seen_at,
    u.user__first_channel,
    u.user__first_device,
    u.user__last_channel,
    u.user__sessions,
    u.user__has_purchased,
    u.user__lifetime_revenue,

    -- Derived metrics
    DATE_DIFF(u.user__last_seen_at, u.user__first_seen_at, DAY) AS user__lifespan_days,
    DATE_DIFF(CURRENT_DATE(), u.user__last_seen_at, DAY) AS user__days_since_last_visit

FROM user_first_last u
LEFT JOIN user_identity i
    ON u.user__pseudo_id = i.user__pseudo_id
```

Common Pitfalls and Solutions
After implementing this pattern across multiple projects, these issues come up repeatedly:
| Pitfall | Symptom | Solution |
|---|---|---|
| Resource exceeded | Query fails with “Resources exceeded” | Add partitioning to intermediate model; reduce lookback window; use clustering |
| Slow builds | Incremental runs take 30+ minutes | Check for full table scans in compiled SQL; ensure WHERE clause uses partition column |
| Data mismatch vs. UI | 1-2% variance with GA4 reports | Expected. GA4 UI uses sampling and different aggregation. Document and accept. |
| Missing late data | Conversions appear days later | Increase ga4_static_incremental_days to 4-5 |
| Session IDs collide | Duplicate session__keys | Always use composite key: user__pseudo_id + session__ga_id |
| Landing page is NULL | Many sessions show NULL landing page | Use IGNORE NULLS in window function; some events don’t have page__location |
| Channel grouping differs | Your “Paid Search” doesn’t match GA4 | Review and align regex patterns in macro with your UTM conventions |
Conclusion
You now have a complete, production-ready GA4 dbt project:
- Single base model that converts sharded tables to partitioned, cleans data, and extracts parameters
- Reusable macros for parameter extraction and channel grouping
- Wide intermediate table where every event carries session context
- Mart layer with session and user tables
- Comprehensive testing that catches GA4-specific issues
- Documentation that makes the project maintainable
This architecture ties together everything from the series: the schema knowledge from Article 1, unnesting patterns from Article 2, sessionization logic from Article 3, and user stitching from Article 4.
The wide intermediate table approach differs from typical GA4 packages that build session-grain marts directly. By keeping event-level grain with session context attached, you enable flexible downstream analysis (funnel conversion rates, event-level attribution, engagement analysis) without repeatedly joining back to session tables.
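For example, a channel-level funnel needs nothing but the wide table (a sketch using the session flags defined earlier; the conversion steps chosen are illustrative):

```sql
-- Illustrative funnel: because every event row carries its session's
-- conversion flags, no join back to a session table is needed
SELECT
    session__channel_grouping,
    COUNT(DISTINCT session__key) AS sessions,
    COUNT(DISTINCT IF(session__has_add_to_cart = 1, session__key, NULL)) AS reached_cart,
    COUNT(DISTINCT IF(session__has_checkout = 1, session__key, NULL)) AS reached_checkout,
    COUNT(DISTINCT IF(session__has_purchase = 1, session__key, NULL)) AS purchased
FROM {{ ref('int__ga4__events_sessionized') }}
WHERE event__date >= DATE_SUB(CURRENT_DATE(), INTERVAL 28 DAY)
GROUP BY session__channel_grouping
ORDER BY sessions DESC
```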