Note

Window Function Patterns for Analytics SQL

Practical window function patterns for analytics SQL — ROW_NUMBER, LEAD/LAG, running totals, session detection, and deduplication

Planted
bigquery · analytics · data modeling

Window functions let you compute values across rows that are related to the current row without collapsing the result set. Unlike GROUP BY, which reduces rows, a window function keeps every row and adds a calculated column alongside it. This distinction makes them essential for analytics work where you need both the detail and the context: ranking within groups, comparing adjacent rows, accumulating running totals, and detecting boundaries in sequences.

Every window function has the same anatomy: a function (ROW_NUMBER, SUM, LAG, etc.), an OVER clause, and inside it, an optional PARTITION BY (which groups rows) and ORDER BY (which sequences them within each group). The frame specification (e.g., ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) further controls which rows participate in the calculation.

Deduplication with ROW_NUMBER

The most common window function pattern in analytics engineering is deduplication. Source data arrives with duplicate rows — CDC replays, late-arriving corrections, overlapping incremental loads — and you need exactly one row per entity.

SELECT *
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY updated_at DESC
    ) AS rn
  FROM raw_orders
)
WHERE rn = 1

PARTITION BY order_id creates a separate window per order. ORDER BY updated_at DESC puts the most recent version first. Filtering to rn = 1 keeps only that latest version.
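As a quick sanity check, the pattern runs anywhere window functions exist. A minimal sketch in SQLite via Python's sqlite3 (the order rows and status column are invented for illustration; BigQuery syntax is the same for this query):

```python
import sqlite3

# Duplicate versions of the same order; the latest updated_at should win.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id TEXT, updated_at INTEGER, status TEXT)")
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", [
    ("o1", 1, "pending"),
    ("o1", 2, "shipped"),   # later version of o1
    ("o2", 1, "pending"),
])

rows = conn.execute("""
    SELECT order_id, status
    FROM (
        SELECT
            *,
            ROW_NUMBER() OVER (
                PARTITION BY order_id
                ORDER BY updated_at DESC
            ) AS rn
        FROM raw_orders
    )
    WHERE rn = 1
    ORDER BY order_id
""").fetchall()

print(rows)  # one row per order_id, latest version only
```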

BigQuery supports the QUALIFY clause, which eliminates the subquery wrapper entirely:

SELECT *
FROM raw_orders
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY updated_at DESC
) = 1

Same execution plan, cleaner code. QUALIFY filters after window functions evaluate, just like HAVING filters after GROUP BY. This pattern appears constantly in dbt incremental models where you need to handle duplicate arrivals during merge operations.

ROW_NUMBER assigns unique sequential integers even when values tie. RANK and DENSE_RANK handle ties differently: RANK leaves gaps (1, 2, 2, 4), DENSE_RANK doesn’t (1, 2, 2, 3). For deduplication, always use ROW_NUMBER — you want exactly one winner per partition regardless of ties.
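The tie behavior is easy to see on a small table. A SQLite demo via Python (players and scores are invented; two rows tie at 80):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (player TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)",
                 [("a", 90), ("b", 80), ("c", 80), ("d", 70)])

rows = conn.execute("""
    SELECT
        player,
        ROW_NUMBER() OVER (ORDER BY score DESC) AS row_num,   -- unique even for the 80 tie
        RANK()       OVER (ORDER BY score DESC) AS rnk,       -- 1, 2, 2, 4 (gap after ties)
        DENSE_RANK() OVER (ORDER BY score DESC) AS dense_rnk  -- 1, 2, 2, 3 (no gap)
    FROM scores
    ORDER BY score DESC, player
""").fetchall()

for r in rows:
    print(r)
```

ROW_NUMBER breaks the 80-point tie arbitrarily, which is exactly why it is the right choice for deduplication: one winner, always.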

LEAD and LAG for Row Comparison

LAG looks backward and LEAD looks forward within the ordered partition. They let you compare the current row to its neighbors without self-joins.

The canonical use case is detecting session boundaries from event streams. GA4 exports raw events with no session structure — you build sessions by finding gaps in activity. The GA4 event model requires this kind of reconstruction because sessions don’t exist in the raw data.

WITH events_with_gaps AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    event_name,
    LAG(event_timestamp) OVER (
      PARTITION BY user_pseudo_id
      ORDER BY event_timestamp
    ) AS prev_event_timestamp,
    CASE
      WHEN LAG(event_timestamp) OVER (
        PARTITION BY user_pseudo_id ORDER BY event_timestamp
      ) IS NULL THEN 1
      WHEN TIMESTAMP_DIFF(
        event_timestamp,
        LAG(event_timestamp) OVER (
          PARTITION BY user_pseudo_id ORDER BY event_timestamp
        ),
        MINUTE
      ) > 30 THEN 1
      ELSE 0
    END AS is_new_session
  FROM events
)

The first event for any user has no predecessor, so LAG returns NULL — that marks a new session. Any gap exceeding 30 minutes also marks a new session. This flag becomes the input for the running total pattern described in the next section.

LEAD is useful for calculating time-to-next-action: how long until the user does something else?

SELECT
  event_name,
  event_timestamp,
  TIMESTAMP_DIFF(
    LEAD(event_timestamp) OVER (
      PARTITION BY session_key ORDER BY event_timestamp
    ),
    event_timestamp,
    SECOND
  ) AS seconds_until_next_event
FROM sessionized_events

Both functions accept an optional offset parameter (LAG(col, 2) looks two rows back) and a default value (LAG(col, 1, 0) returns 0 instead of NULL when there’s no previous row).
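Both variants can be exercised in a few lines. A SQLite demo via Python (the readings table and values are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (t INTEGER, value INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)",
                 [(1, 10), (2, 20), (3, 30), (4, 40)])

rows = conn.execute("""
    SELECT
        t,
        value,
        LAG(value)       OVER (ORDER BY t) AS prev,        -- NULL on the first row
        LAG(value, 2)    OVER (ORDER BY t) AS two_back,    -- NULL on the first two rows
        LAG(value, 1, 0) OVER (ORDER BY t) AS prev_or_zero -- default 0 instead of NULL
    FROM readings
    ORDER BY t
""").fetchall()

for r in rows:
    print(r)
```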

Running Totals and Cumulative Aggregation

A running sum of the session boundary flags from the LAG pattern above assigns each event to its session:

SUM(is_new_session) OVER (
  PARTITION BY user_pseudo_id
  ORDER BY event_timestamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS session_number

Every time is_new_session = 1, the running sum increments. Events between boundaries share the same cumulative value, which becomes a session identifier. This two-step combination — LAG to detect boundaries, then SUM to assign group numbers — is the standard sessionization technique used in GA4 session building and custom session definitions.
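The two steps can be wired together end to end. A runnable sketch in SQLite (epoch seconds stand in for TIMESTAMP_DIFF, with 1800 s as the 30-minute timeout; user IDs and timestamps are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, ts INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?)", [
    ("u1", 0), ("u1", 600), ("u1", 5000),   # 4400 s gap -> new session
    ("u2", 100), ("u2", 200),
])

rows = conn.execute("""
    WITH flagged AS (
        SELECT
            user_id,
            ts,
            CASE
                WHEN LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) IS NULL THEN 1
                WHEN ts - LAG(ts) OVER (PARTITION BY user_id ORDER BY ts) > 1800 THEN 1
                ELSE 0
            END AS is_new_session
        FROM events
    )
    SELECT
        user_id,
        ts,
        SUM(is_new_session) OVER (
            PARTITION BY user_id
            ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS session_number
    FROM flagged
    ORDER BY user_id, ts
""").fetchall()

for r in rows:
    print(r)  # u1's third event lands in session 2
```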

Running totals apply broadly beyond sessions. Cumulative revenue, running user counts, and progressive metric tracking all use the same frame:

SELECT
  order_date,
  daily_revenue,
  SUM(daily_revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  ) AS cumulative_revenue
FROM daily_sales

The explicit ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW frame matters. Without it, BigQuery defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which can produce unexpected results when multiple rows share the same ORDER BY value — the range-based frame includes all ties, while the rows-based frame processes them sequentially.
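The divergence is easy to reproduce with two rows tying on the ORDER BY value. A SQLite sketch (the sales figures are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10), (2, 20), (2, 30), (3, 40)])  # two rows tie on day 2

rows = conn.execute("""
    SELECT
        day,
        amount,
        SUM(amount) OVER (
            ORDER BY day
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS rows_sum,   -- advances one row at a time, even through ties
        SUM(amount) OVER (
            ORDER BY day
            RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS range_sum   -- 10, 60, 60, 100: peers of the current value are included
    FROM sales
    ORDER BY day, amount
""").fetchall()

for r in rows:
    print(r)
```

The two day-2 rows get the same range_sum (all peers included) but different rows_sums, which is usually what a running total intends.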

FIRST_VALUE and LAST_VALUE for Session Attribution

When you need to propagate a value from a specific position across all rows in a partition, FIRST_VALUE and LAST_VALUE are the tools. The most common application is session-level attribution: every event in a session should carry the traffic source from the session’s first event.

FIRST_VALUE(session_source IGNORE NULLS) OVER (
  PARTITION BY session_key
  ORDER BY event_timestamp
) AS session_source,
FIRST_VALUE(session_medium IGNORE NULLS) OVER (
  PARTITION BY session_key
  ORDER BY event_timestamp
) AS session_medium,
FIRST_VALUE(landing_page IGNORE NULLS) OVER (
  PARTITION BY session_key
  ORDER BY event_timestamp
) AS landing_page

IGNORE NULLS is critical. Not every event carries attribution data — only the entry event typically has source and medium populated. Without IGNORE NULLS, the first event might have a NULL source (e.g., if the session_start event was dropped), and every row inherits that NULL.

LAST_VALUE requires more care. Its default frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) means it only sees rows up to the current row, not the entire partition. To get the actual last value in the partition:

LAST_VALUE(event_name IGNORE NULLS) OVER (
  PARTITION BY session_key
  ORDER BY event_timestamp
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_event_in_session

The explicit UNBOUNDED FOLLOWING frame ensures the function considers the full partition. This is a common source of bugs — LAST_VALUE without a frame specification silently returns the current row’s value, not the partition’s last value.
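The bug is reproducible in any engine with standard frames. A SQLite sketch (SQLite has no IGNORE NULLS, so that part is omitted; the events are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ev (session_key TEXT, ts INTEGER, event_name TEXT)")
conn.executemany("INSERT INTO ev VALUES (?, ?, ?)", [
    ("s1", 1, "page_view"), ("s1", 2, "add_to_cart"), ("s1", 3, "purchase"),
])

rows = conn.execute("""
    SELECT
        ts,
        LAST_VALUE(event_name) OVER (
            PARTITION BY session_key ORDER BY ts
        ) AS default_frame,      -- just echoes the current row's value
        LAST_VALUE(event_name) OVER (
            PARTITION BY session_key ORDER BY ts
            ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
        ) AS full_partition      -- 'purchase' on every row
    FROM ev
    ORDER BY ts
""").fetchall()

for r in rows:
    print(r)
```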

Gap-and-Island Detection

Identifying consecutive sequences — login streaks, uptime periods, contiguous date ranges — uses a clever ROW_NUMBER subtraction. The insight: subtracting a sequential integer from consecutive values produces the same constant for each unbroken group.

WITH numbered AS (
  SELECT
    user_id,
    activity_date,
    DATE_SUB(
      activity_date,
      INTERVAL ROW_NUMBER() OVER (
        PARTITION BY user_id ORDER BY activity_date
      ) DAY
    ) AS island_id
  FROM user_activity
)
SELECT
  user_id,
  MIN(activity_date) AS streak_start,
  MAX(activity_date) AS streak_end,
  DATE_DIFF(MAX(activity_date), MIN(activity_date), DAY) + 1 AS streak_days
FROM numbered
GROUP BY user_id, island_id

If a user is active on Jan 1, 2, 3, then skips Jan 4, and returns on Jan 5, 6:

  • Jan 1 minus row 1 = Dec 31 (island A)
  • Jan 2 minus row 2 = Dec 31 (island A)
  • Jan 3 minus row 3 = Dec 31 (island A)
  • Jan 5 minus row 4 = Jan 1 (island B)
  • Jan 6 minus row 5 = Jan 1 (island B)

The first three dates share an island_id; the last two share a different one. Grouping by island_id yields the streak boundaries. This runs in O(n) compared to O(n²) for self-join approaches.
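The walkthrough can be verified directly. A SQLite version of the query, with date() and julianday() standing in for BigQuery's DATE_SUB and DATE_DIFF:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_activity (user_id TEXT, activity_date TEXT)")
conn.executemany("INSERT INTO user_activity VALUES (?, ?)", [
    ("u1", "2024-01-01"), ("u1", "2024-01-02"), ("u1", "2024-01-03"),
    ("u1", "2024-01-05"), ("u1", "2024-01-06"),  # Jan 4 is skipped
])

rows = conn.execute("""
    WITH numbered AS (
        SELECT
            user_id,
            activity_date,
            -- subtract the row number (in days) to form the island constant
            date(activity_date,
                 '-' || ROW_NUMBER() OVER (
                     PARTITION BY user_id ORDER BY activity_date
                 ) || ' days') AS island_id
        FROM user_activity
    )
    SELECT
        user_id,
        MIN(activity_date) AS streak_start,
        MAX(activity_date) AS streak_end,
        CAST(julianday(MAX(activity_date)) - julianday(MIN(activity_date)) AS INTEGER) + 1
            AS streak_days
    FROM numbered
    GROUP BY user_id, island_id
    ORDER BY streak_start
""").fetchall()

print(rows)  # a 3-day streak, then a 2-day streak
```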

Use cases include SLA uptime tracking, subscription continuity analysis, gaming achievement streaks, and detecting gaps in time-series data that indicate pipeline failures.

Attribution Positioning with ROW_NUMBER and COUNT

Multi-touch attribution models rely on window functions to identify each touchpoint’s position within a conversion journey. The combination of ROW_NUMBER for position and COUNT(*) OVER() for total touches enables any weighting scheme:

SELECT
  transaction_id,
  channel,
  revenue,
  ROW_NUMBER() OVER (
    PARTITION BY transaction_id ORDER BY touchpoint_timestamp
  ) AS touch_position,
  ROW_NUMBER() OVER (
    PARTITION BY transaction_id ORDER BY touchpoint_timestamp DESC
  ) AS reverse_position,
  COUNT(*) OVER (
    PARTITION BY transaction_id
  ) AS total_touches
FROM touchpoints

From these three columns, you can compute any attribution model. First-touch: filter to touch_position = 1. Last-touch: filter to reverse_position = 1. Linear: divide revenue by total_touches. Position-based (40-20-40): use CASE logic on position and total_touches. For detailed implementations of each model, see the notes on first-touch, last-touch, and linear attribution and on position-based and time-decay models.
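A sketch of the position-based CASE logic, demonstrated in SQLite via Python. The channels and revenue are invented, and the 2-touch journey is split 50/50 as an assumed convention (40-20-40 leaves no middle touches to take the 20%):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE touchpoints (transaction_id TEXT, channel TEXT, ts INTEGER, revenue REAL)")
conn.executemany("INSERT INTO touchpoints VALUES (?, ?, ?, ?)", [
    ("t1", "organic", 1, 100.0),
    ("t1", "email",   2, 100.0),
    ("t1", "paid",    3, 100.0),
    ("t1", "direct",  4, 100.0),
])

rows = conn.execute("""
    WITH positioned AS (
        SELECT
            transaction_id,
            channel,
            revenue,
            ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY ts) AS touch_position,
            ROW_NUMBER() OVER (PARTITION BY transaction_id ORDER BY ts DESC) AS reverse_position,
            COUNT(*) OVER (PARTITION BY transaction_id) AS total_touches
        FROM touchpoints
    )
    SELECT
        channel,
        revenue * CASE
            WHEN total_touches = 1 THEN 1.0            -- single touch gets everything
            WHEN total_touches = 2 THEN 0.5            -- assumed convention: split evenly
            WHEN touch_position = 1 THEN 0.4           -- first touch: 40%
            WHEN reverse_position = 1 THEN 0.4         -- last touch: 40%
            ELSE 0.2 / (total_touches - 2)             -- middle touches share 20%
        END AS attributed_revenue
    FROM positioned
    ORDER BY touch_position
""").fetchall()

for r in rows:
    print(r)  # first and last get 40 each, the two middle touches get 10 each
```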

Performance: Named Windows and Computation Sharing

When multiple window functions share the same PARTITION BY and ORDER BY, define the window once and reference it by name. BigQuery computes all functions sharing a window specification in a single pass over the data:

SELECT
  event_timestamp,
  FIRST_VALUE(source IGNORE NULLS) OVER w AS session_source,
  FIRST_VALUE(medium IGNORE NULLS) OVER w AS session_medium,
  FIRST_VALUE(campaign IGNORE NULLS) OVER w AS session_campaign,
  ROW_NUMBER() OVER w AS event_number,
  MIN(event_timestamp) OVER w AS session_start
FROM sessionized_events
WINDOW w AS (PARTITION BY session_key ORDER BY event_timestamp)

Five window functions, one data scan. Without named windows, writing the full OVER clause five times still produces the same execution plan, but the named syntax makes the intent explicit and the SQL maintainable.

On BigQuery specifically, clustering the table on the PARTITION BY column provides a further optimization. If your window partitions by session_key and the table is clustered by session_key, BigQuery reads each session’s events from physically adjacent storage blocks rather than scanning the full table. For large event tables, this can reduce slot time significantly.

The execution order to remember: FROM, WHERE, GROUP BY, HAVING, window functions, QUALIFY, DISTINCT, ORDER BY, LIMIT. Window functions execute after aggregation but before final filtering, which is why QUALIFY can reference window results directly.