ServicesAboutNotesContact Get in touch →
EN FR
Note

Custom Sessionization Patterns

How to build custom session definitions from raw events using LAG and running sums, with configurable timeouts, campaign-based splits, and session metrics.

Planted
bigqueryga4analyticsdata modeling

GA4 defines sessions with a 30-minute inactivity timeout. That’s a reasonable default, but your business might need something different. High-engagement apps might need 15-minute timeouts to distinguish browsing sessions from purchase sessions. Content sites might want 45-minute timeouts because readers take breaks mid-article. Some teams need campaign-based session breaks where new UTM parameters always trigger a new session, regardless of timing.

Custom sessionization gives you full control over what constitutes a “session” in your data. The technique uses window functions — specifically LAG for gap detection and a running SUM for session numbering — to build sessions from raw event-level data.

The Core Pattern: GAP Detection + Running Sum

The pattern works in two steps. First, LAG identifies session boundaries by comparing each event’s timestamp to the previous event. Then a running sum of the boundary flags assigns session numbers.

WITH events_with_gaps AS (
SELECT
user_pseudo_id,
event_timestamp,
event_name,
traffic_source.source AS traffic__source,
traffic_source.medium AS traffic__medium,
CASE
WHEN LAG(event_timestamp) OVER (
PARTITION BY user_pseudo_id ORDER BY event_timestamp
) IS NULL THEN 1 -- First event = new session
WHEN TIMESTAMP_DIFF(
event_timestamp,
LAG(event_timestamp) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp),
MINUTE
) > 30 THEN 1 -- Gap > 30 min = new session
ELSE 0
END AS event__is_new_session
FROM `project.analytics_XXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131'
),
sessionized AS (
SELECT
user_pseudo_id,
event_timestamp,
event_name,
traffic__source,
traffic__medium,
event__is_new_session,
SUM(event__is_new_session) OVER (
PARTITION BY user_pseudo_id
ORDER BY event_timestamp
) AS session_number
FROM events_with_gaps
)
SELECT
user_pseudo_id,
event_timestamp,
event_name,
traffic__source,
traffic__medium,
session_number,
CONCAT(user_pseudo_id, '_', session_number) AS custom_session_id
FROM sessionized;

The first CTE identifies boundaries: the very first event for a user (where LAG returns NULL) and any event that follows more than 30 minutes of inactivity. The second CTE uses SUM(event__is_new_session) as a running total — every time the flag is 1, the session number increments by one. Events between boundaries share the same cumulative value, which becomes the session identifier.

The CONCAT(user_pseudo_id, '_', session_number) creates a globally unique session ID. This is critical — session_number alone is not unique across users.

Configuring the Timeout

The 30-minute timeout is just a constant. Change it to match your business:

-- 15-minute timeout for high-engagement apps
WHEN TIMESTAMP_DIFF(..., MINUTE) > 15 THEN 1
-- 45-minute timeout for content-heavy sites
WHEN TIMESTAMP_DIFF(..., MINUTE) > 45 THEN 1
-- 2-hour timeout for B2B platforms with long workflows
WHEN TIMESTAMP_DIFF(..., HOUR) > 2 THEN 1

How to pick the right timeout: look at the distribution of inter-event gaps in your data. Plot a histogram of time between consecutive events per user. You’ll typically see a bimodal distribution — a cluster of short gaps (within-session activity) and a cluster of long gaps (between sessions). Set your timeout at the valley between those clusters.

-- Analyze inter-event gap distribution
SELECT
CASE
WHEN gap_minutes <= 5 THEN '0-5 min'
WHEN gap_minutes <= 15 THEN '5-15 min'
WHEN gap_minutes <= 30 THEN '15-30 min'
WHEN gap_minutes <= 60 THEN '30-60 min'
WHEN gap_minutes <= 120 THEN '60-120 min'
ELSE '120+ min'
END AS gap_bucket,
COUNT(*) AS event_count
FROM (
SELECT
TIMESTAMP_DIFF(
event_timestamp,
LAG(event_timestamp) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp),
MINUTE
) AS gap_minutes
FROM events
)
WHERE gap_minutes IS NOT NULL
GROUP BY 1
ORDER BY 1;

If the data shows a clear drop-off between 20 and 40 minutes, a 30-minute timeout makes sense. If activity clusters tightly under 10 minutes with a long tail, 15 minutes might be more appropriate.

Campaign-Based Session Splitting

Sometimes timing isn’t the only session boundary. A user who clicks a new ad (different UTM parameters) is arguably starting a new session even if it’s been only 5 minutes since their last event. This matters for attribution modeling because it determines which session gets credit for a conversion.

Add campaign-based splitting to the boundary detection:

CASE
WHEN LAG(event_timestamp) OVER (w) IS NULL THEN 1
WHEN TIMESTAMP_DIFF(event_timestamp, LAG(event_timestamp) OVER (w), MINUTE) > 30 THEN 1
-- New UTM source/medium = new session
WHEN traffic_source.source != LAG(traffic_source.source) OVER (w)
OR traffic_source.medium != LAG(traffic_source.medium) OVER (w)
THEN 1
ELSE 0
END AS event__is_new_session

Be careful with NULL handling here. If traffic_source.source is NULL for organic traffic, the != comparison returns NULL (not TRUE). Use IFNULL or COALESCE to handle missing attribution data:

WHEN IFNULL(traffic_source.source, 'none') != IFNULL(LAG(traffic_source.source) OVER (w), 'none') THEN 1

Session Metrics

Once you have sessions, aggregate event-level data into session-level metrics in a subsequent CTE:

, session_metrics AS (
SELECT
custom_session_id,
user_pseudo_id,
MIN(event_timestamp) AS session__started_at,
MAX(event_timestamp) AS session__ended_at,
TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), SECOND) AS session__duration_seconds,
COUNT(*) AS session__events,
COUNTIF(event_name = 'page_view') AS session__pageviews
FROM sessionized
GROUP BY custom_session_id, user_pseudo_id
)

Common session-level metrics to compute:

  • Duration: TIMESTAMP_DIFF(MAX, MIN, SECOND) — note this is 0 for single-event sessions, which is a GA4 quirk too
  • Pageviews: COUNTIF(event_name = 'page_view')
  • Engagement: Define your own rule (e.g., duration > 10 seconds OR pageviews > 1 OR conversion event present)
  • Bounce: The inverse of your engagement definition
  • Entry page: Use FIRST_VALUE(page_location) OVER (PARTITION BY session_id ORDER BY event_timestamp) before aggregating
  • Exit page: Same with LAST_VALUE and an explicit ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING frame

Why Not Just Use ga_session_id?

GA4’s ga_session_id (extracted from event_params) works for many cases, but it has limitations:

  1. It uses Google’s 30-minute timeout, which you can’t change
  2. It resets at midnight in the user’s timezone in some implementations, splitting sessions that span midnight
  3. It’s not globally unique — multiple users starting sessions at the same second share the same ga_session_id
  4. You can’t add custom boundary rules like campaign-based splitting

If GA4’s session definition matches your needs, use CONCAT(user_pseudo_id, '.', ga_session_id) as your session key and skip custom sessionization. Build your own only when the business logic demands it.

Cross-Device Session Stitching

When users have a user_id (logged-in state), you can build sessions across devices:

PARTITION BY COALESCE(user_id, user_pseudo_id) ORDER BY event_timestamp

This groups events by authenticated user when available, falling back to anonymous user when not. The tradeoff: anonymous sessions for the same person on different devices remain separate (you can’t stitch what you can’t identify), and the moment a user logs in, their pre-login events on that device join the authenticated session.

For more sophisticated identity resolution — matching anonymous sessions to known users retroactively — you need a separate identity graph, which is beyond what sessionization alone can handle.

Production Considerations

Materialization: Custom sessionization queries are expensive because they scan all events within the date range and compute window functions across them. Materialize as an incremental dbt model with insert_overwrite on the event date partition. Reprocess the last 2-3 days to handle late-arriving events.

Testing: Validate that your custom sessions produce reasonable metrics. Compare session counts, average duration, and bounce rates against GA4’s built-in session metrics. Large discrepancies (>20%) suggest a timeout or boundary rule that doesn’t match user behavior.

Naming: Distinguish your custom sessions from GA4 sessions in column names. Use custom_session_id instead of session_id, and document the timeout and boundary rules in your model’s YAML description. Future you will thank present you when debugging a dashboard discrepancy.