Adrienne Vermorel

Building Session Tables from GA4 Event Data

This is article 3 of 5 in the GA4 BigQuery series. Previous articles covered the complete schema reference and unnesting patterns for every use case.

You now understand GA4’s nested schema and can extract event parameters, items, and user properties. The next challenge is sessionization: adding session context to your events so you can answer questions like “which marketing channel drove this purchase?” or “what did users do before converting?”

Unlike Universal Analytics, which provided pre-aggregated session tables, GA4 exports raw events without session-level structure. Sessionization becomes your responsibility, but that’s actually an advantage. Instead of being locked into Google’s session definition, you control the logic entirely.

This article builds a sessionized events table: every event enriched with session identity, attribution, and position. This approach preserves full event-level detail while making session analysis trivial. Need a session-grain table for dashboards? It’s a simple GROUP BY away. Need to analyze the event sequence before a purchase? The data is already there.

Why event-grain with session context beats session-grain

The obvious approach to sessionization is building a session-grain table: one row per session with aggregated metrics like duration, page views, and revenue. Many tutorials teach this pattern, and it works fine for basic reporting. But it has a fundamental limitation.

Once you aggregate to session grain, you lose event-level detail. The sequence of events, the timing between interactions, and the specific pages viewed all collapse into summary metrics. When a stakeholder asks “what pages do users visit before purchasing?” you’re back to joining the session table with raw events. When they ask “how long after landing do users add to cart?” you need the event timestamps you aggregated away.

The enriched event-grain approach solves this by adding session context as columns on every event. Each row remains an individual event, but now carries its session identity, the session’s traffic source, the landing page, and its position within the session sequence. This enables queries that would require complex joins with session-grain tables:

  • “Show me all events in sessions that converted” → simple WHERE clause
  • “What’s the average time from session start to first add-to-cart?” → direct calculation
  • “Which landing pages lead to the longest session engagement?” → GROUP BY landing_page with event-level timestamps
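
For instance, the first of these reduces to a subquery filter on the enriched table. A sketch, using this article's model name (adjust table and column names to your project):

```sql
-- All events in sessions that contain a purchase
SELECT *
FROM int__ga4__events_sessionized
WHERE session_key IN (
  SELECT DISTINCT session_key
  FROM int__ga4__events_sessionized
  WHERE event_name = 'purchase'
)
```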

The trade-off is storage: more columns means a larger table. But BigQuery’s columnar storage and partition pruning minimize the cost impact, and the analytical flexibility far outweighs the incremental storage expense.

The session mart you might need for dashboards becomes trivial to derive. GROUP BY the session key, aggregate the metrics you need, and you’re done. One source of truth, multiple output shapes.

The session key: two fields, not one

The most common mistake in GA4 sessionization is using ga_session_id alone as the session identifier. This field is stored in the event_params array as an integer value, and it represents the Unix timestamp (in seconds) when the session started. The problem is that multiple users can start sessions at the exact same second.

With any reasonable traffic volume, timestamp collisions are guaranteed. Using only ga_session_id groups events from different users into the same “session,” corrupting every downstream metric. Attribution is wrong when a session contains events from users who arrived via different channels.
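
You can measure the collision rate on your own export with a quick check. This is a sketch; the project, dataset, and date are placeholders:

```sql
-- Count ga_session_id values shared by more than one user on a single day.
-- Each of these would be merged into one bogus "session" without user_pseudo_id.
SELECT
  ga_session_id,
  COUNT(DISTINCT user_pseudo_id) AS distinct_users
FROM (
  SELECT
    user_pseudo_id,
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
  FROM `your_project.analytics_XXXXXX.events_*`
  WHERE _TABLE_SUFFIX = '20240101'
)
GROUP BY ga_session_id
HAVING distinct_users > 1
```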

The correct pattern concatenates user_pseudo_id with ga_session_id to create a truly unique session key:

CONCAT(
  user_pseudo_id,
  '.',
  CAST(
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id')
    AS STRING
  )
) AS session_key

This session_key becomes the foundation for all window functions. Every session-scoped calculation partitions by this field.

ga_session_id lives in value.int_value, not value.string_value. Extracting it as a string returns nulls.

A few edge cases require handling. Events can arrive without a ga_session_id when users reject consent (both ga_session_id and user_pseudo_id become null), when Measurement Protocol hits lack session context, or when session_start events get filtered in subproperties. Filter these events in your base model:

WHERE (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') IS NOT NULL

Cross-device sessions present a different challenge. The user_pseudo_id is device-specific, so the same person on phone and laptop creates two separate pseudo IDs and therefore two separate sessions that can’t be linked without additional identity resolution. Article 4 covers user stitching for handling this scenario when users authenticate.

Descriptive session columns via window functions

With the session key established, you can add session-level context to every event using window functions. The pattern is consistent: PARTITION BY session_key, optionally ORDER BY event_timestamp when you need first or last values.

Session timestamps establish the session boundaries:

MIN(event_timestamp) OVER (PARTITION BY session_key) AS session_start_ts,
MAX(event_timestamp) OVER (PARTITION BY session_key) AS session_end_ts

Having both timestamps on every event enables duration calculations at query time without re-aggregation. Session duration in seconds is (session_end_ts - session_start_ts) / 1000000, since event timestamps are in microseconds.
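
For example, average session duration becomes a one-step query. The inner SELECT DISTINCT collapses to one row per session so long sessions aren't over-weighted by their event counts (table and column names follow this article's conventions):

```sql
-- Average session duration in seconds, straight from the enriched event table
SELECT
  AVG((session_end_ts - session_start_ts) / 1000000) AS avg_session_duration_seconds
FROM (
  -- One row per session: the timestamps repeat on every event
  SELECT DISTINCT session_key, session_start_ts, session_end_ts
  FROM int__ga4__events_sessionized
)
```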

Session attribution requires pulling the traffic source from the first event in the session. GA4 provides the session_traffic_source_last_click struct (available from July 2024 onward), which contains session-scoped attribution matching the GA4 interface:

FIRST_VALUE(session_traffic_source_last_click.manual_campaign.source IGNORE NULLS)
  OVER (PARTITION BY session_key ORDER BY event_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_source,
FIRST_VALUE(session_traffic_source_last_click.manual_campaign.medium IGNORE NULLS)
  OVER (PARTITION BY session_key ORDER BY event_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_medium,
FIRST_VALUE(session_traffic_source_last_click.manual_campaign.campaign_name IGNORE NULLS)
  OVER (PARTITION BY session_key ORDER BY event_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_campaign

The IGNORE NULLS clause ensures you get the first non-null value, handling events that lack attribution data. The explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame matters: the default frame ends at the current row, so any event occurring before the session's first non-null value would itself get NULL. Wrap these in COALESCE(..., '(direct)') or '(none)' for clean output.

Landing page follows the same pattern:

FIRST_VALUE(
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location')
  IGNORE NULLS
) OVER (PARTITION BY session_key ORDER BY event_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS landing_page

Device and geography context rarely changes within a session, but using window functions guarantees consistency across all events:

FIRST_VALUE(device.category IGNORE NULLS)
  OVER (PARTITION BY session_key ORDER BY event_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_device_category,
FIRST_VALUE(geo.country IGNORE NULLS)
  OVER (PARTITION BY session_key ORDER BY event_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_country

Event ordering and sequence analysis

Beyond descriptive context, positional columns enable sequence analysis: understanding not just what happened in a session, but in what order. The challenge is that event_timestamp alone doesn’t guarantee correct ordering. Multiple events can share the same timestamp, and the export doesn’t preserve the order they fired on the device.

GA4 provides three fields specifically for deterministic event ordering:

  • batch_event_index: sequential order of each event within a batch, based on occurrence order on the device
  • batch_ordering_id: monotonically increasing number, incremented with each network request from a page
  • batch_page_id: sequential number assigned to each page, increasing across the engagement

These fields appear at the root level of the export schema. Combined, they provide the exact sequence of events as they occurred on the user’s device, resolving timestamp ties deterministically.

Event numbering within the session uses these fields for ordering:

ROW_NUMBER() OVER (
  PARTITION BY session_key
  ORDER BY event_timestamp, batch_page_id, batch_ordering_id, batch_event_index
) AS event__number_in_session

The compound ORDER BY ensures correct sequencing: first by timestamp for coarse ordering, then by the batch fields to break ties in the exact order events fired.

First and last event flags derive from the event number and total count:

ROW_NUMBER() OVER w_ordered AS event__number_in_session,
COUNT(*) OVER (PARTITION BY session_key) AS session__events
-- w_ordered: PARTITION BY session_key
--            ORDER BY event_timestamp, batch_page_id, batch_ordering_id, batch_event_index

Then in your final SELECT:

event__number_in_session = 1 AS event__is_session_start,
event__number_in_session = session__events AS event__is_session_end

These flags power entry and exit analysis without aggregation. Filter to event__is_session_start = TRUE for landing page analysis, or event__is_session_end = TRUE for exit pages.
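
As a sketch, exit-page analysis needs nothing more than the flag and a GROUP BY (model name per this article):

```sql
-- Exit pages: the page_location of each session's last event
SELECT
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS exit_page,
  COUNT(*) AS sessions
FROM int__ga4__events_sessionized
WHERE event__is_session_end
GROUP BY exit_page
ORDER BY sessions DESC
```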

Time elapsed within session enables pacing analysis:

(event_timestamp - session_start_ts) / 1000000 AS event__seconds_since_session_start

This answers “how long after landing do users typically purchase?” directly from the event table.
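
For instance, the median time from session start to a purchase is a direct aggregation over purchase events (a sketch against this article's model):

```sql
-- Median seconds from session start to a purchase event
SELECT
  APPROX_QUANTILES(event__seconds_since_session_start, 100)[OFFSET(50)] AS median_seconds_to_purchase
FROM int__ga4__events_sessionized
WHERE event_name = 'purchase'
```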

Session-scoped attribution deep dive

GA4’s BigQuery export contains traffic source data in four different locations, each with different scope and use cases. Using the wrong source produces incorrect attribution reports.

The traffic_source struct contains user-scoped, first-touch attribution (the source that originally acquired the user, regardless of current session). Use this for user acquisition analysis, not session attribution.

The collected_traffic_source struct contains event-scoped raw collection data without attribution modeling applied. This is useful for building custom attribution models but requires additional logic to determine session-level values.

The session_traffic_source_last_click struct (July 2024 onward) contains session-scoped attribution with GA4’s last-non-direct model applied. This matches what you see in the GA4 interface and is the recommended source for session attribution.

The event_params array sometimes contains legacy source and medium keys, but these are inconsistently populated and should only be used as a fallback.

For data from July 2024 forward, session_traffic_source_last_click is the clear choice. It handles the complexity of last-non-direct attribution automatically: if a session starts as direct traffic, GA4 looks back up to 90 days for a previous non-direct source and credits that instead.

For historical data before July 2024, you’ll need to build session attribution from collected_traffic_source using the FIRST_VALUE pattern shown earlier. Consider whether maintaining two code paths is worth the complexity, or whether your analysis can start from July 2024.
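
A minimal sketch of that fallback, using the manual_source and manual_medium fields on collected_traffic_source. Note this yields a first-touch-within-session value, not GA4's last-non-direct model:

```sql
-- Pre-July-2024 fallback: session attribution from raw collection data
FIRST_VALUE(collected_traffic_source.manual_source IGNORE NULLS)
  OVER (PARTITION BY session_key ORDER BY event_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_source,
FIRST_VALUE(collected_traffic_source.manual_medium IGNORE NULLS)
  OVER (PARTITION BY session_key ORDER BY event_timestamp
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS session_medium
```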

One known issue affects Google Ads traffic. Sessions with gclid parameters sometimes appear as organic or direct because GA4 doesn’t always decode the click ID to source and medium in the export. If accurate paid search attribution matters, add a correction:

CASE
  WHEN collected_traffic_source.gclid IS NOT NULL THEN 'google'
  ELSE session_traffic_source_last_click.manual_campaign.source
END AS session_source,
CASE
  WHEN collected_traffic_source.gclid IS NOT NULL THEN 'cpc'
  ELSE session_traffic_source_last_click.manual_campaign.medium
END AS session_medium

For complete Google Ads campaign data (campaign name, ad group, keywords), consider joining with Google Ads Data Transfer exports using the gclid field.
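
An illustrative join sketch follows. The transfer table name and customer ID are placeholders, and it assumes a gclid column extracted from collected_traffic_source; verify the column names against your own Data Transfer schema before relying on this:

```sql
-- Sketch: recover campaign IDs for gclid sessions from a Google Ads
-- Data Transfer click table (names are illustrative, not authoritative)
SELECT
  s.session_key,
  ads.campaign_id
FROM sessionized_events AS s
LEFT JOIN `your_project.google_ads.ads_ClickStats_1234567890` AS ads
  ON s.gclid = ads.click_view_gclid
WHERE s.gclid IS NOT NULL
```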

dbt implementation patterns

The sessionized events model sits in the intermediate layer of your dbt project. It takes cleaned events from base models and adds session context, feeding downstream marts and analysis models.

# dbt_project.yml structure
models:
  your_project:
    base:
      ga4:
        +materialized: view
    intermediate:
      ga4:
        +materialized: incremental
    marts:
      ga4:
        +materialized: table

For the incremental configuration, insert_overwrite with date partitioning performs best on BigQuery:

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['session_key']
  )
}}

Clustering by session_key optimizes window function performance. Since every window partitions by this field, BigQuery can process each session’s events together without scanning unrelated data.

The lookback window deserves careful consideration. GA4 updates export tables for up to 72 hours after the event date, so you need to reprocess recent partitions on each run. But window functions need all events in a session to calculate correctly. If a session spans two days and you only reprocess today, the window functions for yesterday’s portion of that session become stale.

A 7-day lookback provides a safe margin:

{% if is_incremental() %}
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
{% endif %}

Sessions rarely span more than a few hours, so 7 days captures any reasonable session completely while limiting reprocessing cost.

To keep the model readable, create a macro for repeated window patterns:

-- macros/ga4/session_first_value.sql
{% macro session_first_value(column, partition_key='session_key', order_key='event_timestamp') %}
  FIRST_VALUE({{ column }} IGNORE NULLS) OVER (
    PARTITION BY {{ partition_key }}
    ORDER BY {{ order_key }}
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  )
{% endmacro %}

Usage in your model:

{{ session_first_value('session_traffic_source_last_click.manual_campaign.source') }} AS session_source

Why your numbers won’t match the GA4 UI

Expect 1-5% variance between BigQuery queries and GA4 interface reports. This is normal and shouldn’t trigger extensive reconciliation efforts.

Several factors contribute to the discrepancy. The GA4 interface uses HyperLogLog++ for probabilistic counting, which trades perfect accuracy for speed. BigQuery returns exact counts. For large datasets, the difference is noticeable.

Consent Mode behavioral modeling presents a bigger gap. GA4 models behavior for users who reject tracking, estimating conversions and engagement based on consenting users’ patterns. This modeled data appears in GA4 reports but never reaches BigQuery exports.

Google Signals enables cross-device user deduplication in the GA4 interface for users signed into Google accounts. BigQuery sees each device as a separate user_pseudo_id, so the same person on phone and desktop counts as two users.

Data processing delays mean events arrive up to 72 hours after occurrence. Comparing data less than 72 hours old will show discrepancies.

The practical approach: document the expected variance, compare data older than 72 hours, and use APPROX_COUNT_DISTINCT() when you need to mimic GA4’s counting behavior. Perfect reconciliation is not achievable given the architectural differences.
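
The two counting behaviors can be compared side by side; the approximate figure is the one closer to what the GA4 interface reports (project and dataset names are placeholders):

```sql
-- Exact vs. approximate user counts over one month of export data
SELECT
  COUNT(DISTINCT user_pseudo_id) AS exact_users,
  APPROX_COUNT_DISTINCT(user_pseudo_id) AS approx_users
FROM `your_project.analytics_XXXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20240101' AND '20240131'
```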

Performance considerations

Partition pruning has the largest impact on query cost and speed. Always filter on _TABLE_SUFFIX in your base model or source query:

WHERE _TABLE_SUFFIX >= FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))

This tells BigQuery to scan only the specified date partitions. Without this filter, every query scans your entire event history. On a large property, that’s the difference between scanning 5 GB and 500 GB (a 100x cost difference).

Clustering by session_key optimizes window function execution. BigQuery physically organizes data so events from the same session are stored together, minimizing I/O during partition-by operations.

Window functions add compute cost, but you can minimize it by combining windows with identical PARTITION BY and ORDER BY clauses. BigQuery can compute multiple window functions in a single pass over the data when the frame specification matches:

-- These compute in a single pass
FIRST_VALUE(source IGNORE NULLS) OVER w AS session_source,
FIRST_VALUE(medium IGNORE NULLS) OVER w AS session_medium,
FIRST_VALUE(campaign IGNORE NULLS) OVER w AS session_campaign,
MIN(event_timestamp) OVER w AS session_start_ts
-- Define the window once; the explicit frame lets every row see the whole session
WINDOW w AS (
  PARTITION BY session_key
  ORDER BY event_timestamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)

This named window syntax keeps the model readable while ensuring optimal execution.

From enriched events to session mart

The sessionized events table makes deriving a session-grain mart trivial. All session context already exists on every row, so aggregation requires no joins:

-- models/marts/mrt__ga4__sessions.sql
SELECT
  session_key,
  -- Dimensions (constant per session, use ANY_VALUE)
  ANY_VALUE(user_pseudo_id) AS user_pseudo_id,
  ANY_VALUE(session__source) AS session__source,
  ANY_VALUE(session__medium) AS session__medium,
  ANY_VALUE(session__campaign) AS session__campaign,
  ANY_VALUE(session__landing_page) AS session__landing_page,
  ANY_VALUE(session__device_category) AS session__device_category,
  ANY_VALUE(session__country) AS session__country,
  -- Timestamps
  MIN(event_timestamp) AS session__started_at,
  MAX(event_timestamp) AS session__ended_at,
  -- Engagement metrics
  COUNT(*) AS session__events,
  COUNTIF(event_name = 'page_view') AS session__page_views,
  COUNT(DISTINCT (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location')) AS session__unique_pages,
  MAX((SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'session_engaged')) = '1' AS session__is_engaged,
  -- Conversions
  COUNTIF(event_name = 'purchase') AS session__purchases,
  SUM(IF(event_name = 'purchase', ecommerce.purchase_revenue, 0)) AS session__total_revenue
FROM {{ ref('int__ga4__events_sessionized') }}
GROUP BY session_key

This mart can be a table for dashboard performance or a view for always-fresh data. Either way, the logic is simple and maintainable.

The pattern’s power becomes clear when requirements change. Need to add a new session-level metric? Add it to the mart’s SELECT. Need to change how sessions are identified or attributed? Change the intermediate model once, and the mart inherits the fix automatically. One source of truth, multiple output shapes.

The complete sessionized events model

Here’s the full implementation bringing together all the patterns discussed:

-- models/intermediate/int__ga4__events_sessionized.sql
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['session_key']
  )
}}

WITH events AS (
  SELECT
    event_date,
    event_timestamp,
    event_name,
    user_pseudo_id,
    event_params,
    user_properties,
    ecommerce,
    items,
    device,
    geo,
    traffic_source,
    collected_traffic_source,
    session_traffic_source_last_click,
    -- Batch ordering fields for deterministic event sequencing
    batch_page_id,
    batch_ordering_id,
    batch_event_index,
    -- Extract ga_session_id once for reuse
    (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id
  FROM {{ ref('base__ga4__events') }}
  {% if is_incremental() %}
  WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
  {% endif %}
),

with_session_key AS (
  SELECT
    *,
    CONCAT(user_pseudo_id, '.', CAST(ga_session_id AS STRING)) AS session_key
  FROM events
  WHERE ga_session_id IS NOT NULL
),

sessionized AS (
  SELECT
    -- Original event fields
    event_date,
    event_timestamp,
    event_name,
    user_pseudo_id,
    ga_session_id,
    session_key,
    batch_page_id,
    batch_ordering_id,
    batch_event_index,
    event_params,
    user_properties,
    ecommerce,
    items,
    device,
    geo,
    traffic_source,
    collected_traffic_source,
    session_traffic_source_last_click,
    -- Session timestamps
    MIN(event_timestamp) OVER w AS session__started_at,
    MAX(event_timestamp) OVER w AS session__ended_at,
    -- Session attribution (w_span covers the whole session, so rows before
    -- the first non-null value still get the session's value)
    COALESCE(
      FIRST_VALUE(session_traffic_source_last_click.manual_campaign.source IGNORE NULLS) OVER w_span,
      '(direct)'
    ) AS session__source,
    COALESCE(
      FIRST_VALUE(session_traffic_source_last_click.manual_campaign.medium IGNORE NULLS) OVER w_span,
      '(none)'
    ) AS session__medium,
    FIRST_VALUE(session_traffic_source_last_click.manual_campaign.campaign_name IGNORE NULLS) OVER w_span AS session__campaign,
    -- Landing page
    FIRST_VALUE(
      (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location')
      IGNORE NULLS
    ) OVER w_span AS session__landing_page,
    -- Device and geo
    FIRST_VALUE(device.category IGNORE NULLS) OVER w_span AS session__device_category,
    FIRST_VALUE(geo.country IGNORE NULLS) OVER w_span AS session__country,
    -- Positional: event numbering (ROW_NUMBER can't take a window frame,
    -- so it uses the frameless w_ordered)
    ROW_NUMBER() OVER w_ordered AS event__number_in_session,
    COUNT(*) OVER w AS session__events,
    -- Positional: time elapsed
    event_timestamp - MIN(event_timestamp) OVER w AS event__microseconds_since_session_start
  FROM with_session_key
  WINDOW
    w AS (PARTITION BY session_key),
    w_ordered AS (PARTITION BY session_key ORDER BY event_timestamp, batch_page_id, batch_ordering_id, batch_event_index),
    w_span AS (w_ordered ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
)

SELECT
  *,
  -- Derived positional flags
  event__number_in_session = 1 AS event__is_session_start,
  event__number_in_session = session__events AS event__is_session_end,
  event__microseconds_since_session_start / 1000000 AS event__seconds_since_session_start
FROM sessionized

Test the model with these schema tests:

# models/intermediate/schema.yml
version: 2

models:
  - name: int__ga4__events_sessionized
    columns:
      - name: session_key
        tests:
          - not_null
      - name: event_timestamp
        tests:
          - not_null
      - name: event__number_in_session
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 1
      - name: session__source
        tests:
          - not_null

Conclusion

The sessionized events table you’ve built here becomes the foundation for everything else. A session-grain mart for dashboards is just a GROUP BY on the session key. Event sequence analysis before conversion uses position and timing data already on every row. Filtering to sessions from a specific campaign is a WHERE clause.

This works because you’re adding session context to events rather than aggregating events away into sessions. You keep the detail and gain the flexibility to slice it however the question demands.

One limitation remains: these sessions are device-bound. When someone browses on their phone and later purchases on their laptop, you see two separate sessions from two different “users” with no connection between them. The next article addresses this with user stitching, connecting anonymous browsing history to known identities when users authenticate.