ServicesAboutNotesContact Get in touch →
EN FR
Note

GA4 User Mart Pattern

Building a user-grain mart from GA4 session data — first/last touch attribution, lifetime value aggregation, and identity stitching with user_pseudo_id and user_id.

Planted
ga4dbtbigquerydata modelinganalytics

The user mart answers the question that session-level analysis can’t: what does a user’s complete journey across all their sessions look like? First acquisition channel, lifetime revenue, recency, session count — these require aggregating session data to user grain and resolving the identity problem that comes with GA4’s device-bound identifiers.

The Identity Problem

GA4’s user_pseudo_id is a device-level identifier — a first-party cookie tied to the browser or app instance. The same person using their phone and laptop creates two user_pseudo_id values. Sessions from different devices are separate in GA4’s data model.

When users authenticate (login, purchase, signup), GA4 can record a user_id alongside the user_pseudo_id. This authenticated identifier provides the bridge between devices. A user who browses on mobile and purchases on desktop appears as two anonymous device IDs until authentication links them.

The user mart handles two scenarios:

  1. Unauthenticated users: Only user_pseudo_id. Each device is a separate “user” in the mart.
  2. Authenticated users: A user_id can be stitched across multiple user_pseudo_id values.

The Model

-- models/marts/ga4/mrt__analytics__users.sql
{{
config(
materialized='table',
cluster_by=['user__first_seen_at']
)
}}
WITH user_sessions AS (
SELECT
user__pseudo_id,
user__id,
session__key,
event__date,
session__started_at,
session__channel_grouping,
device__category,
geo__country,
session__has_purchase,
session__revenue
FROM {{ ref('mrt__analytics__sessions') }}
),
user_first_last AS (
SELECT
user__pseudo_id,
-- First seen
MIN(event__date) AS user__first_seen_at,
FIRST_VALUE(session__channel_grouping) OVER (
PARTITION BY user__pseudo_id
ORDER BY event__date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS user__first_channel,
FIRST_VALUE(device__category) OVER (
PARTITION BY user__pseudo_id
ORDER BY event__date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS user__first_device,
-- Last seen
MAX(event__date) AS user__last_seen_at,
LAST_VALUE(session__channel_grouping) OVER (
PARTITION BY user__pseudo_id
ORDER BY event__date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS user__last_channel,
-- Aggregates
COUNT(DISTINCT session__key) AS user__sessions,
MAX(session__has_purchase) AS user__has_purchased,
SUM(session__revenue) AS user__lifetime_revenue
FROM user_sessions
GROUP BY user__pseudo_id
),
-- Stitch authenticated identity
user_identity AS (
SELECT DISTINCT
user__pseudo_id,
FIRST_VALUE(user__id IGNORE NULLS) OVER (
PARTITION BY user__pseudo_id
ORDER BY event__date DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS user__stitched_id
FROM user_sessions
WHERE user__id IS NOT NULL
)
SELECT
u.user__pseudo_id,
i.user__stitched_id AS user__id,
u.user__first_seen_at,
u.user__last_seen_at,
u.user__first_channel,
u.user__first_device,
u.user__last_channel,
u.user__sessions,
u.user__has_purchased,
u.user__lifetime_revenue,
-- Derived metrics
DATE_DIFF(u.user__last_seen_at, u.user__first_seen_at, DAY)
AS user__lifespan_days,
DATE_DIFF(CURRENT_DATE(), u.user__last_seen_at, DAY)
AS user__days_since_last_visit
FROM user_first_last u
LEFT JOIN user_identity i ON u.user__pseudo_id = i.user__pseudo_id

The Identity Stitching Logic

The user_identity CTE implements a simple form of user backstitching:

FIRST_VALUE(user__id IGNORE NULLS) OVER (
PARTITION BY user__pseudo_id
ORDER BY event__date DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS user__stitched_id

ORDER BY event__date DESC with FIRST_VALUE means “take the most recent user_id associated with this user_pseudo_id.” IGNORE NULLS skips sessions where the user wasn’t authenticated. The DISTINCT in the outer query deduplicates the result to one row per user_pseudo_id.

This approach handles the case where a user authenticates with a user_id partway through their history — all their sessions, even earlier anonymous ones, get the stitched identity.

The LEFT JOIN back to user_first_last ensures users who never authenticated still appear in the mart, with a null user__id. You’re not losing unauthenticated users — they just don’t have a stitched identifier.

First and Last Touch Attribution

The window functions on user__first_channel and user__last_channel provide the two most common attribution questions at user grain:

  • user__first_channel: What channel originally acquired this user? Useful for acquisition cohort analysis — how do users from Paid Search compare to Organic over their lifetime?
  • user__last_channel: What channel drove the most recent visit? Useful for re-engagement analysis.

Both use FIRST_VALUE and LAST_VALUE respectively, with explicit full-partition framing. The pattern mirrors the session-level landing/exit page logic from the sessionized model.

Note: these are first-touch and last-touch at the session level — the first/last session channel, not the first/last event UTM parameter. For finer-grained attribution, you’d need to look at event-level sources from the sessionized table.

Derived Metrics

Two derived columns make the mart immediately useful for retention analysis:

DATE_DIFF(u.user__last_seen_at, u.user__first_seen_at, DAY)
AS user__lifespan_days

Users with user__lifespan_days = 0 are one-session users — they visited once and never returned. High lifespan with low session count suggests infrequent but loyal users. High lifespan with high session count is your engaged core.

DATE_DIFF(CURRENT_DATE(), u.user__last_seen_at, DAY)
AS user__days_since_last_visit

This is recency in an RFM (Recency, Frequency, Monetary) framework. Combined with user__sessions (frequency) and user__lifetime_revenue (monetary), the mart has the three components for RFM segmentation without any additional joins.

Limitations

One mart per user_pseudo_id, not per person. The mart groups by user_pseudo_id, which is device-bound. A single person using three devices appears as three rows unless they authenticated on all three. Cross-device identity resolution beyond user_id stitching requires probabilistic matching or first-party data not available in GA4 exports.

The mart snapshots current state. With materialized='table', the mart rebuilds entirely on each run. Lifetime metrics are always current. If you need point-in-time snapshots (what was a user’s lifetime value on a specific date?), you’d need to add date-based incremental logic or maintain history via dbt snapshots.

No SCD tracking. User attributes like user__first_channel are computed from historical data and stable. But user__last_channel and user__days_since_last_visit change on every run. If attribution changes are important to track, implement a Type 2 SCD via dbt snapshots.