GA4 exports data to BigQuery as date-sharded tables: events_20260101, events_20260102, and so on. This isn’t a single table with a date partition — it’s separate tables per day, all referenced via a wildcard pattern (events_*). The distinction matters enormously for how you build incremental models.
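You can see the sharded layout for yourself in the dataset's metadata. A quick check (project and dataset names here are placeholders):

```sql
-- One row per daily shard: events_20260101, events_20260102, ...
SELECT table_name
FROM `my-project.analytics_123456.INFORMATION_SCHEMA.TABLES`
WHERE table_name LIKE 'events_2026%'
ORDER BY table_name
```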
The Problem with Standard Incremental Patterns
The typical incremental approach queries the destination table to find the latest processed date, then filters the source to only newer records:
```sql
{% if is_incremental() %}
    WHERE created_at > (SELECT MAX(created_at) FROM {{ this }})
{% endif %}
```

This breaks with GA4’s sharded tables. When you filter with WHERE event_date > '2026-01-28' on a wildcard table, BigQuery cannot prune shards based on that column filter — it has to scan all events_* tables to evaluate the WHERE clause. A 3-year history means scanning 1,000+ shards every incremental run, regardless of how recent your filter is.
The solution is static incremental lookback: filter by _TABLE_SUFFIX rather than the event date column. BigQuery knows which physical tables to open based on the suffix, pruning shards before any data is read.
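As a standalone illustration of what suffix pruning buys you, consider this ad-hoc query (again, project and dataset names are placeholders). BigQuery opens only the three shards named by the range and never touches the rest:

```sql
SELECT
    event_name,
    COUNT(*) AS events
FROM `my-project.analytics_123456.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20260126' AND '20260128'  -- prunes to 3 shards
GROUP BY event_name
```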
The Base Model Pattern
The GA4 base model handles four responsibilities: sharded-to-partitioned conversion, cleaning, typing, and parameter extraction. All of it happens in a single model.
```sql
-- models/base/ga4/base__ga4__events.sql
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={
            "field": "event__date",
            "data_type": "date",
            "granularity": "day"
        },
        cluster_by=['user__pseudo_id', 'session__key', 'event__name']
    )
}}

{% set lookback_days = var('ga4_static_incremental_days', 3) %}

WITH source AS (

    SELECT
        event_date,
        event_timestamp,
        event_name,
        event_params,
        user_pseudo_id,
        user_id,
        device,
        geo,
        ecommerce,
        items,
        collected_traffic_source,
        is_active_user
    FROM {{ source('ga4', 'events') }}
    WHERE _TABLE_SUFFIX >= '{{ var("ga4_start_date", "20230101") }}'
    {% if is_incremental() %}
        AND _TABLE_SUFFIX >= FORMAT_DATE(
            '%Y%m%d',
            DATE_SUB(CURRENT_DATE(), INTERVAL {{ lookback_days }} DAY)
        )
    {% endif %}
        -- Exclude intraday tables
        AND _TABLE_SUFFIX NOT LIKE '%intraday%'

)

-- The cleaning, typing, and parameter-extraction CTEs are omitted here;
-- this abridged final SELECT just shows the shape of the output.
SELECT
    PARSE_DATE('%Y%m%d', event_date) AS event__date,
    TIMESTAMP_MICROS(event_timestamp) AS event__timestamp_utc,
    event_name AS event__name,
    user_pseudo_id AS user__pseudo_id,
    -- one possible session key (an illustrative construction):
    -- user_pseudo_id plus the ga_session_id event parameter
    CONCAT(
        user_pseudo_id, '.',
        CAST((SELECT value.int_value
              FROM UNNEST(event_params)
              WHERE key = 'ga_session_id') AS STRING)
    ) AS session__key
FROM source
```

Key decisions:
_TABLE_SUFFIX filter, not event_date filter. The suffix filter tells BigQuery exactly which physical tables to open. On a 3-year history, this difference can mean reading 3 days of data instead of all 1,000+ shards.
insert_overwrite with date partitioning. The destination table is properly partitioned by event__date. On each incremental run, dbt replaces only the partitions that correspond to the lookback window. This is atomic (no brief inconsistent state) and efficient (no row-level comparison like merge).
Static CURRENT_DATE() - N lookback, not MAX(event_date) - N. The static approach has a subtle advantage: it works even if the destination table is empty or if today’s data hasn’t landed yet. The dynamic approach (selecting MAX(event_date) from the destination) breaks in those edge cases, triggers a full scan of the destination table, and forfeits shard pruning entirely, because BigQuery cannot use a subquery result to prune _TABLE_SUFFIX.
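These decisions reinforce each other. dbt's insert_overwrite strategy also accepts an explicit partitions list (the "static" variant described in the dbt-BigQuery docs), which tells dbt up front which partitions to replace instead of computing them from a temp table. A minimal sketch, assuming the same var as the model above:

```sql
{% set days = var('ga4_static_incremental_days', 3) %}
{% set partitions_to_replace = [] %}
{% for i in range(days + 1) %}
    {% do partitions_to_replace.append(
        'date_sub(current_date(), interval ' ~ i ~ ' day)'
    ) %}
{% endfor %}

-- Replace exactly the lookback window; keep this list in sync with the
-- _TABLE_SUFFIX filter so source reads and destination writes agree.
{{
    config(
        materialized='incremental',
        incremental_strategy='insert_overwrite',
        partition_by={"field": "event__date", "data_type": "date"},
        partitions=partitions_to_replace
    )
}}
```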
Why 3 Days?
GA4 data isn’t finalized when it first lands. Several processes retroactively update already-exported events:
- Google Ads conversion stitching — Conversions from Ads campaigns may not be attributed until the Ads system processes the click, which can take 24-72 hours
- Cross-device stitching updates — When a user’s user_id is resolved across devices, Google may backfill session attribution
- Consent Mode adjustments — Behavioral modeling for consented sessions updates as more data accumulates
A 3-day lookback catches the vast majority of these retroactive updates. Some teams use 4-5 days for paid campaigns where attribution accuracy is critical for reporting.
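Because the window is exposed as a var (ga4_static_incremental_days in the model above), widening it is a one-line project setting rather than a model edit:

```yaml
# dbt_project.yml
vars:
  ga4_static_incremental_days: 5  # wider window when ad attribution accuracy matters
```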
The Intraday Exclusion
GA4 exports intraday tables (events_intraday_YYYYMMDD) continuously throughout the day, then consolidates into a final daily table the following day. The intraday tables are useful for near-real-time reporting but create a problem for incremental models: if you process an intraday table and then the final daily table arrives with different data, you’ll have inconsistencies.
Excluding intraday tables with AND _TABLE_SUFFIX NOT LIKE '%intraday%' keeps the model simple and consistent. The explicit exclusion is needed because the events_* wildcard also matches events_intraday_* tables, and their suffixes (e.g. intraday_20260129) sort after any digit-only date string, so they sail through the >= date filters. Accept the 24-hour lag in exchange for data you can trust.
Column Naming Convention
The base model establishes the naming convention for the entire project. Using the double-underscore pattern (entity__attribute) makes the data model self-documenting:
- event__date, event__timestamp_utc — event-scoped attributes
- user__pseudo_id, user__id — user identity
- session__ga_id, session__key — session identity
- page__location, page__title — page attributes
- device__category, geo__country — context dimensions
This convention scales: when you see session__landing_page in a downstream model, you immediately know it’s a session-scoped attribute about a page, computed from the session context — not a raw GA4 field.
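A hypothetical downstream query shows how the convention reads in practice (model and column names follow the examples above):

```sql
-- Session start times, built purely from self-describing columns
SELECT
    session__key,
    MIN(event__timestamp_utc) AS session__start_at
FROM {{ ref('base__ga4__events') }}
GROUP BY session__key
```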
Source Configuration
Pair the base model with a source configuration that expresses data freshness expectations:
```yaml
sources:
  - name: ga4
    database: "{{ var('ga4_project_id') }}"
    schema: "{{ var('ga4_dataset') }}"
    tables:
      - name: events
        identifier: "events_*"
        freshness:
          warn_after: {count: 24, period: hour}
          error_after: {count: 48, period: hour}
        loaded_at_field: "TIMESTAMP_MICROS(event_timestamp)"
```

The identifier: "events_*" tells dbt about the wildcard pattern. Freshness checks on the loaded_at_field (run via dbt source freshness) will alert you if GA4’s export pipeline has stalled — a real failure mode that’s otherwise silent.
Relationship to the Project Structure
The base model feeds everything else. Its output — a properly partitioned, cleaned, typed event table with all parameters extracted — becomes the foundation that both the GA4 Events Sessionized Model and the item-level model reference.
This is base layer philosophy applied to a GA4-specific challenge: concentrate the complexity in one place so every downstream model inherits clean, trustworthy data.