Adrienne Vermorel

10 BigQuery SQL Patterns Every Analytics Engineer Should Know

Most analytics engineers learn BigQuery through trial and error, picking up patterns ad hoc from Stack Overflow answers and inherited codebases. It works, but it’s slow and expensive.

After auditing hundreds of dbt models and years of production BigQuery work, I’ve identified a canon of patterns that separate functional SQL from elegant, cost-efficient SQL. These are fundamental techniques that compound over time, the kind of knowledge that makes you faster at solving new problems because you’ve internalized the right primitives.

This article covers 10 patterns across 5 categories: query optimization, nested data handling, window functions, dbt incremental strategies, and marketing analytics. Each pattern includes production-ready code, anti-patterns to avoid, and real performance benchmarks.


Query Optimization & Cost Awareness

These three patterns directly impact your BigQuery bill and query performance. Master them first.

Pattern 1: Partition + Clustering Combo

Partitioning divides your table into segments by date, timestamp, or integer range. Clustering sorts data within those partitions by up to four columns. Combined, they let BigQuery skip irrelevant data blocks entirely, turning full table scans into targeted reads.

-- Create a table with both strategies
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_name
AS
SELECT
  event_timestamp,
  user_id,
  event_name,
  event_params
FROM raw.events;
-- Enforce partition filters at the table level
ALTER TABLE `project.dataset.events`
SET OPTIONS (require_partition_filter = true);
-- Query that triggers both partition and cluster pruning
SELECT user_id, event_name, COUNT(*)
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2025-01-01'  -- Partition pruning
  AND user_id = 'abc123'                    -- Cluster pruning
GROUP BY 1, 2;

The require_partition_filter option prevents anyone from accidentally running a full table scan. Queries without a partition filter simply fail.

What to avoid:

Using functions on partition columns defeats pruning. If your table is partitioned by TIMESTAMP, the filter WHERE DATE(event_timestamp) = '2025-01-01' forces a full scan because BigQuery can’t evaluate the function at planning time. Instead, store a separate DATE column and partition on that, or use a range filter: WHERE event_timestamp >= '2025-01-01' AND event_timestamp < '2025-01-02'.

Subqueries in partition filters also prevent pruning. WHERE event_date = (SELECT MAX(event_date) FROM other_table) won’t prune because BigQuery needs literal values at query planning time. Materialize the date first or use a variable.
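
A minimal sketch of the scripting-variable approach, assuming a hypothetical other_table with an event_date column; once the value sits in a variable, the main statement filters on a constant:

-- Resolve the filter value first, then query with it
DECLARE max_event_date DATE;
SET max_event_date = (SELECT MAX(event_date) FROM `project.dataset.other_table`);
SELECT user_id, event_name, COUNT(*) AS events
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = max_event_date
GROUP BY 1, 2;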

Over-partitioning creates its own problems. If your partitions contain less than 1GB each, the metadata overhead outweighs the pruning benefits. For small tables, clustering alone often performs better.

Performance impact: Partitioning alone delivers 10–100x cost reduction on large tables. Adding clustering provides another 2–10x reduction within partitions. On a 10TB table with proper filters, you might scan just 10–100GB.


Pattern 2: Materialized Views for Repeated Aggregations

Dashboards and reports often run the same heavy aggregations hourly. Each execution scans the same data, burns the same slots, costs the same money. Materialized views precompute those results and refresh incrementally.

BigQuery’s optimizer uses MVs transparently. Queries against the base table get automatically rewritten to use the MV when beneficial, with no code changes required.

CREATE MATERIALIZED VIEW `project.dataset.mv_daily_revenue`
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 30,
  max_staleness = INTERVAL 4 HOUR  -- Accept staleness for dashboards
)
AS
SELECT
  DATE(transaction_time) AS transaction_date,
  product_category,
  SUM(amount) AS total_revenue,
  COUNT(*) AS transaction_count
FROM `project.dataset.transactions`
GROUP BY 1, 2;

The max_staleness option is the key tuning lever. Setting it to 4 hours tells BigQuery “I’m okay with data that’s up to 4 hours old”, which lets the optimizer use the MV more aggressively and reduces refresh frequency. For real-time dashboards, omit it. For weekly reports, set it to days.

What to avoid:

Non-deterministic functions like CURRENT_TIMESTAMP() in your MV definition block incremental refresh. The MV falls back to full recomputation, defeating the purpose.

MVs also can’t include ORDER BY, LIMIT, HAVING, or subqueries in the SELECT clause. Keep them simple: aggregations over filtered base tables.

Don’t ignore staleness settings. The default is always-fresh, which means constant refresh overhead. If your dashboard users can tolerate 30-minute-old data, tell BigQuery.
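
Staleness can also be relaxed on an existing view without recreating it. A minimal sketch against the mv_daily_revenue view above (the 30-minute value is an assumption to match the dashboard example):

-- Allow dashboard queries to read results up to 30 minutes old
ALTER MATERIALIZED VIEW `project.dataset.mv_daily_revenue`
SET OPTIONS (
  max_staleness = INTERVAL 30 MINUTE,
  refresh_interval_minutes = 30
);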

Performance impact: Google reports queries scanning 500GB reduced to 2.5MB (a 200,000x reduction) with properly designed MVs. Query speedup ranges from 5-100x for complex aggregations. Limits: 20 MVs per dataset, 100 per project.


Pattern 3: HyperLogLog++ Sketches for Composable Distinct Counts

Suppose you need unique user counts for 7-day, 30-day, and 90-day rolling windows. With COUNT(DISTINCT user_id), each window requires scanning all the underlying data. Triple the windows, triple the cost.

HyperLogLog++ sketches solve this. They’re probabilistic data structures that estimate cardinality with ~97-99% accuracy, and they’re additive. Precompute daily sketches, then merge them for any date range in milliseconds.

-- Step 1: Create daily sketches (run once, append daily)
CREATE OR REPLACE TABLE `project.dataset.daily_user_sketches` AS
SELECT
  event_date,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch  -- Precision 10–24
FROM `project.dataset.events`
GROUP BY event_date;
-- Step 2: Merge sketches for any date range (instant, regardless of range)
SELECT HLL_COUNT.MERGE(user_sketch) AS unique_users_30d
FROM `project.dataset.daily_user_sketches`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
-- Simpler alternative for one-off queries
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `project.dataset.events`
WHERE event_date >= '2025-01-01';

The precision parameter (10-24) controls the tradeoff between accuracy and sketch size. Precision 15 gives approximately 0.4% error, which is good enough for most analytics use cases. GA4 uses precision 14 for user counts and 12 for sessions internally.

What to avoid:

APPROX_COUNT_DISTINCT is convenient but not composable. You can’t add yesterday’s approximate count to today’s and get a valid 2-day count. For rolling windows, you must use HLL_COUNT.INIT and HLL_COUNT.MERGE.
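
The composability also works upward: daily sketches can be rolled into coarser sketches that remain mergeable. A minimal sketch reusing the daily_user_sketches table from above (the weekly table name is an assumption):

-- Collapse daily sketches into weekly sketches that can still be merged later
CREATE OR REPLACE TABLE `project.dataset.weekly_user_sketches` AS
SELECT
  DATE_TRUNC(event_date, WEEK) AS week_start,
  HLL_COUNT.MERGE_PARTIAL(user_sketch) AS user_sketch  -- still a sketch, not a count
FROM `project.dataset.daily_user_sketches`
GROUP BY week_start;
-- Final counts come from HLL_COUNT.MERGE over whichever grain you stored
SELECT HLL_COUNT.MERGE(user_sketch) AS unique_users_quarter
FROM `project.dataset.weekly_user_sketches`
WHERE week_start >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 90 DAY), WEEK);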

Low precision values (below 12) produce noticeable variance in results; avoid them when accuracy matters.

Performance impact: 4-8x faster execution than exact COUNT(DISTINCT), 8x lower slot consumption. The real win is rolling windows: a 90-day unique user count that took 30 seconds now takes 200 milliseconds.


Nested Data Mastery

BigQuery’s nested and repeated fields are powerful but poorly understood. This pattern is essential for anyone working with GA4, Firebase, or event data.

Pattern 4: Inline UNNEST Subqueries

GA4 and Firebase store event parameters as arrays of structs (a schema like event_params ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, ...>>>). The naive approach uses CROSS JOIN UNNEST, which explodes rows and breaks when parameters are missing.

The correct approach uses inline UNNEST subqueries. They extract values cleanly, return NULL for missing parameters, and keep one row per event.

-- The right way: inline UNNEST subqueries
SELECT
  event_date,
  event_name,
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'campaign') AS campaign  -- NULL if missing
FROM `project.analytics_XXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131'
  AND event_name = 'page_view';

For repeated use, wrap the extraction in a temporary function:

CREATE TEMP FUNCTION GetParamString(params ANY TYPE, param_key STRING)
AS ((SELECT value.string_value FROM UNNEST(params) WHERE key = param_key));
CREATE TEMP FUNCTION GetParamInt(params ANY TYPE, param_key STRING)
AS ((SELECT value.int_value FROM UNNEST(params) WHERE key = param_key));
SELECT
  event_name,
  GetParamString(event_params, 'page_location') AS page_location,
  GetParamInt(event_params, 'ga_session_id') AS ga_session_id
FROM `project.analytics_XXXXX.events_*`
WHERE _TABLE_SUFFIX = '20250115';

What to avoid:

Multiple CROSS JOIN UNNEST joins against optional parameters silently drop every event that is missing one of those parameters, because the join keeps only rows with a match.
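
For reference, this is what the row-dropping version looks like; a sketch with a single parameter to keep it short:

-- Anti-pattern sketch: any event without a 'campaign' param vanishes from the result
SELECT
  event_name,
  ep.value.string_value AS campaign
FROM `project.analytics_XXXXX.events_*`
CROSS JOIN UNNEST(event_params) AS ep
WHERE _TABLE_SUFFIX = '20250115'
  AND ep.key = 'campaign';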

Forgetting the _TABLE_SUFFIX filter on date-sharded tables scans the entire dataset, potentially years of data. Always filter it first.

Using SAFE_OFFSET or array indexing assumes parameters appear in a consistent order. They don’t. Always filter by key.

Performance impact: Nested data queries avoid shuffle operations, making them faster than equivalent JOINs on flattened tables. Storage savings reach 35-74% compared to fully denormalized schemas.


Window Function Patterns

Window functions are BigQuery’s secret weapon for analytics. These two patterns unlock elegant solutions to common problems.

Pattern 5: QUALIFY for Elegant Window Filtering

Filtering on window function results (like “get the most recent order per customer”) traditionally requires a subquery or CTE:

-- The old way: subquery wrapper
SELECT
  customer_id,
  order_id,
  order_date,
  order__status
FROM (
  SELECT
    customer_id,
    order_id,
    order_date,
    order__status,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
  FROM `project.dataset.base__shopify__orders`
)
WHERE rn = 1;

BigQuery’s QUALIFY clause does this inline:

-- The elegant way: QUALIFY
SELECT
  customer_id,
  order_id,
  order_date,
  order__status
FROM `project.dataset.base__shopify__orders`
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY order_date DESC
) = 1;

Same result, less nesting, easier to read. QUALIFY works with any window function:

-- Top 3 products per category by revenue
SELECT
  product__category,
  product__name,
  product__revenue
FROM `project.dataset.mrt__sales__products`
QUALIFY RANK() OVER (PARTITION BY product__category ORDER BY product__revenue DESC) <= 3;
-- First and last event per session
SELECT
  session_id,
  event_name,
  event_time,
  user_id
FROM `project.dataset.int__ga4__events_sessionized`
QUALIFY ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY event_time) = 1
  OR ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY event_time DESC) = 1;

What to avoid:

Wrapping simple window filters in CTEs when QUALIFY would suffice. For complex multi-step transformations, CTEs are still appropriate, but for single-step filtering, QUALIFY is cleaner.

Performance impact: The execution plan is identical to the subquery approach. QUALIFY is pure syntactic sugar, but it makes your code significantly more readable.

The execution order is: FROM → WHERE → GROUP BY → HAVING → WINDOW → QUALIFY → DISTINCT → ORDER BY → LIMIT.
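
Because WHERE runs before the window is computed and QUALIFY runs after, the two combine naturally: WHERE narrows the scan, QUALIFY filters on the window result. A minimal dedup sketch on the orders table above (order__updated_at is a hypothetical tiebreaker column):

-- Keep only the latest version of each order loaded this year
SELECT
  order_id,
  customer_id,
  order__status
FROM `project.dataset.base__shopify__orders`
WHERE order_date >= '2025-01-01'
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY order_id
  ORDER BY order__updated_at DESC
) = 1;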


Pattern 6: Gap-and-Island Detection for Streaks

Identifying consecutive sequences (user login streaks, production uptime, date ranges without gaps) is surprisingly tricky. The gap-and-island pattern uses a clever ROW_NUMBER subtraction to group consecutive values.

The insight is that subtracting a sequential row number from consecutive dates produces the same value for every date in an unbroken run, and that value becomes a grouping key.

WITH numbered_activity AS (
  SELECT
    user_id,
    activity_date,
    -- Subtract row number from date; consecutive dates yield same value
    DATE_SUB(activity_date, INTERVAL ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY activity_date
    ) DAY) AS island_id
  FROM `project.dataset.user_activity`
)
SELECT
  user_id,
  MIN(activity_date) AS streak__started_at,
  MAX(activity_date) AS streak__ended_at,
  DATE_DIFF(MAX(activity_date), MIN(activity_date), DAY) + 1 AS streak__duration_days
FROM numbered_activity
GROUP BY user_id, island_id
HAVING DATE_DIFF(MAX(activity_date), MIN(activity_date), DAY) + 1 >= 7;  -- Only 7+ day streaks

Example data showing the island calculation:

activity_date | ROW_NUMBER | date - row_number (island_id)
2025-01-01 | 1 | 2024-12-31
2025-01-02 | 2 | 2024-12-31
2025-01-03 | 3 | 2024-12-31
2025-01-05 | 4 | 2025-01-01 (gap creates new island)
2025-01-06 | 5 | 2025-01-01

The first three dates form one island (streak of 3). The last two form another (streak of 2).

What to avoid:

Self-joins for streak detection have O(n²) complexity. The window function approach is O(n).

Recursive CTEs might seem natural for this problem, and BigQuery does support WITH RECURSIVE, but an iterative approach is slower and harder to reason about here. The gap-and-island pattern is the standard set-based solution.

Use cases: Gaming achievement streaks, habit tracking apps, SLA uptime calculation, consolidating overlapping date ranges, identifying gaps in time series data.
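
For the gap side of the same problem, a LEAD comparison surfaces the missing ranges directly. A minimal sketch on the same user_activity table:

-- Each row whose next activity is more than 1 day away marks a gap
SELECT
  user_id,
  activity_date AS gap__starts_after,
  next_activity_date AS gap__ends_before,
  DATE_DIFF(next_activity_date, activity_date, DAY) - 1 AS gap__missing_days
FROM (
  SELECT
    user_id,
    activity_date,
    LEAD(activity_date) OVER (PARTITION BY user_id ORDER BY activity_date) AS next_activity_date
  FROM `project.dataset.user_activity`
)
WHERE DATE_DIFF(next_activity_date, activity_date, DAY) > 1;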


dbt Incremental Strategies

dbt on BigQuery offers multiple incremental strategies. Choosing wrong means either correctness issues or 10x higher costs than necessary.

Pattern 7: insert_overwrite with Static Partitions

The default merge strategy works by scanning the entire destination table to find matching rows. For append-heavy event tables that rarely update, this is wasteful. insert_overwrite replaces entire partitions with no row matching required.

{% set partitions_to_replace = [
  'current_date()',
  'date_sub(current_date(), interval 1 day)',
  'date_sub(current_date(), interval 2 day)'
] %}
{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'copy_partitions': true
    },
    partitions=partitions_to_replace,
    cluster_by=['user_id', 'event_name']
  )
}}
SELECT
  DATE(event_timestamp) AS event_date,
  user_id,
  event_name,
  event_params
FROM {{ source('analytics', 'raw_events') }}
{% if is_incremental() %}
WHERE DATE(event_timestamp) IN ({{ partitions_to_replace | join(', ') }})
{% endif %}

The copy_partitions: true option routes the partition replacement through BigQuery copy jobs, which are free: dbt writes the new results to a temporary table and then copies the affected partitions into the target instead of rewriting them with a billed query.

What to avoid:

Using merge for event and fact tables that never update. insert_overwrite can be 10x+ faster because it doesn’t scan the destination.

Dynamic partition detection on huge tables introduces unpredictability. Static partition lists make your pipeline’s cost and runtime deterministic.

Forgetting copy_partitions: true when applicable. It’s free performance.

Performance impact: Cost scales linearly with partition size, not total table size. A 10TB table with 1GB daily partitions processes in the same time as a 100GB table with the same partition size. Best for: event streams, logs, immutable facts.


Pattern 8: Incremental Predicates for MERGE Optimization

Sometimes you need merge for tables with updates, late-arriving corrections, or CDC patterns. But without optimization, merge scans the full destination table for the ON clause match.

Incremental predicates add filters to the MERGE statement itself, enabling partition pruning on the destination table.

{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={
      'field': 'order_date',
      'data_type': 'date'
    },
    cluster_by=['order_id'],
    incremental_predicates=[
      "DBT_INTERNAL_DEST.order_date >= date_sub(current_date(), interval 14 day)"
    ]
  )
}}
SELECT
  order_id,
  customer_id,
  order__status,
  order_date,
  order__updated_at
FROM {{ source('orders', 'raw_orders') }}
{% if is_incremental() %}
WHERE order__updated_at >= (SELECT DATE_SUB(MAX(order__updated_at), INTERVAL 3 DAY) FROM {{ this }})
{% endif %}

The predicate DBT_INTERNAL_DEST.order_date >= date_sub(current_date(), interval 14 day) tells BigQuery to only scan the last 14 days of the destination table when matching rows.

Two optimizations compound here: the predicate enables partition pruning, and cluster_by=['order_id'] (clustering on the unique key) enables cluster pruning within those partitions.

What to avoid:

Running merge without incremental_predicates on large partitioned tables. You’re scanning terabytes when you need gigabytes.

Forgetting to cluster on unique_key. The merge operation matches on this column, and clustering makes that match 50%+ faster.

Setting predicates too narrow and missing late-arriving data. If orders can be updated up to 7 days after creation, your predicate needs at least a 7-day window.

Performance impact: Achieves 88%+ reduction in destination table data scanned. Combined with unique_key clustering, total scan reduction can exceed 95%.


Marketing Analytics Patterns

These patterns solve problems specific to marketing data: custom session definitions and attribution modeling.

Pattern 9: Sessionization with LAG + Running Sum

GA4 defines sessions with a 30-minute inactivity timeout. But your business might need different rules: 15-minute timeouts for high-engagement apps, campaign-based session breaks, or cross-device stitching.

This pattern builds custom sessions from raw events using gap detection:

WITH events_with_gaps AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    event_name,
    traffic_source.source AS traffic__source,
    traffic_source.medium AS traffic__medium,
    CASE
      WHEN LAG(event_timestamp) OVER (
        PARTITION BY user_pseudo_id ORDER BY event_timestamp
      ) IS NULL THEN 1  -- First event = new session
      WHEN TIMESTAMP_DIFF(
        event_timestamp,
        LAG(event_timestamp) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp),
        MINUTE
      ) > 30 THEN 1  -- Gap > 30 min = new session
      ELSE 0
    END AS event__is_new_session
  FROM `project.analytics_XXXXX.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131'
),
sessionized AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    event_name,
    traffic__source,
    traffic__medium,
    event__is_new_session,
    SUM(event__is_new_session) OVER (
      PARTITION BY user_pseudo_id
      ORDER BY event_timestamp
    ) AS session_number
  FROM events_with_gaps
)
SELECT
  user_pseudo_id,
  event_timestamp,
  event_name,
  traffic__source,
  traffic__medium,
  session_number,
  CONCAT(user_pseudo_id, '_', CAST(session_number AS STRING)) AS custom_session_id
FROM sessionized;

The pattern works in two steps: LAG identifies session boundaries by comparing each event’s timestamp to the previous event, then a running sum of the boundary flags creates session numbers.

To add session-level metrics, aggregate in a subsequent CTE:

, session_metrics AS (
  SELECT
    CONCAT(user_pseudo_id, '_', CAST(session_number AS STRING)) AS custom_session_id,
    user_pseudo_id,
    MIN(event_timestamp) AS session__started_at,
    MAX(event_timestamp) AS session__ended_at,
    TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), SECOND) AS session__duration_seconds,
    COUNT(*) AS session__events,
    COUNTIF(event_name = 'page_view') AS session__pageviews
  FROM sessionized
  GROUP BY custom_session_id, user_pseudo_id
)

What to avoid:

Relying solely on ga_session_id when your business logic differs. GA4’s session definition may not match your needs.

Session timeouts that are too short fragment genuine sessions; too long merges separate visits. Test with your actual data.

Extensions: Add campaign-based session splitting (new UTM parameters trigger new session), cross-device session stitching using user_id, or engagement-based rules (certain events always start new sessions).
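
As a sketch of the campaign-based variant, the boundary flag just gains another branch. This reuses the traffic_source.source field from the query above for brevity; in a real GA4 schema you would more likely compare UTM values pulled from event_params:

-- Boundary flag that also breaks a session when the event's source changes
SELECT
  user_pseudo_id,
  event_timestamp,
  CASE
    WHEN LAG(event_timestamp) OVER (
      PARTITION BY user_pseudo_id ORDER BY event_timestamp
    ) IS NULL THEN 1  -- First event = new session
    WHEN TIMESTAMP_DIFF(
      event_timestamp,
      LAG(event_timestamp) OVER (PARTITION BY user_pseudo_id ORDER BY event_timestamp),
      MINUTE
    ) > 30 THEN 1  -- Inactivity gap = new session
    WHEN traffic_source.source != LAG(traffic_source.source) OVER (
      PARTITION BY user_pseudo_id ORDER BY event_timestamp
    ) THEN 1  -- Source changed = new session
    ELSE 0
  END AS event__is_new_session
FROM `project.analytics_XXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131';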


Pattern 10: Multi-Touch Attribution Modeling

Last-click attribution is the default because it’s simple. It’s also wrong: it over-credits bottom-funnel channels like branded search while ignoring the awareness campaigns that started the journey.

This pattern calculates four attribution models in a single query: first-touch, last-touch, linear, and position-based.

WITH touchpoints AS (
  SELECT
    c.transaction_id,
    c.transaction__revenue,
    t.touchpoint__source,
    t.touchpoint__medium,
    t.touchpoint__campaign,
    t.touchpoint__timestamp,
    ROW_NUMBER() OVER (
      PARTITION BY c.transaction_id ORDER BY t.touchpoint__timestamp
    ) AS touch_position,
    ROW_NUMBER() OVER (
      PARTITION BY c.transaction_id ORDER BY t.touchpoint__timestamp DESC
    ) AS reverse_position,
    COUNT(*) OVER (PARTITION BY c.transaction_id) AS total_touches
  FROM `project.dataset.conversions` c
  JOIN `project.dataset.touchpoints` t
    ON c.user_id = t.user_id
    AND t.touchpoint__timestamp <= c.conversion__timestamp
    AND t.touchpoint__timestamp >= TIMESTAMP_SUB(c.conversion__timestamp, INTERVAL 30 DAY)  -- 30-day lookback
)
SELECT
  touchpoint__source,
  touchpoint__medium,
  -- First-touch: 100% credit to first interaction
  SUM(CASE WHEN touch_position = 1 THEN transaction__revenue ELSE 0 END) AS attribution__first_touch_revenue,
  -- Last-touch: 100% credit to last interaction
  SUM(CASE WHEN reverse_position = 1 THEN transaction__revenue ELSE 0 END) AS attribution__last_touch_revenue,
  -- Linear: Equal credit to all touchpoints
  SUM(transaction__revenue / total_touches) AS attribution__linear_revenue,
  -- Position-based: 40% first, 40% last, 20% split among middle
  SUM(transaction__revenue * CASE
    WHEN total_touches = 1 THEN 1.0
    WHEN touch_position = 1 THEN 0.4
    WHEN reverse_position = 1 THEN 0.4
    WHEN total_touches = 2 THEN 0  -- Edge case: only first and last, no middle
    ELSE 0.2 / (total_touches - 2)
  END) AS attribution__position_based_revenue
FROM touchpoints
GROUP BY touchpoint__source, touchpoint__medium
ORDER BY attribution__position_based_revenue DESC;

The lookback window (30 days in this example) is crucial business logic. Too short and you miss upper-funnel touchpoints. Too long and you attribute revenue to ancient, irrelevant interactions.

What to avoid:

Joining on user_id without time bounds attributes touchpoints from months or years ago. Always include a lookback window.

Forgetting edge cases. When total_touches = 2, there’s no middle to split the 20%. The formula 0.2 / (total_touches - 2) divides by zero. Handle it explicitly.

Performance notes: A single scan with window functions calculates all four models. For large datasets, consider pre-aggregating touchpoints to daily grain before the conversion join.

Extensions: Time-decay attribution (recent touchpoints get exponentially more credit), channel-path analysis with STRING_AGG to see common conversion paths, or custom weighting based on your business model.
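
As a sketch of the time-decay variant: give each touchpoint a weight that halves every 7 days before the conversion, then normalize weights within each transaction so every conversion's revenue is fully distributed. The half-life and the table names reused from the example above are assumptions:

-- Time-decay attribution with an assumed 7-day half-life
WITH weighted AS (
  SELECT
    t.touchpoint__source,
    c.transaction_id,
    c.transaction__revenue,
    POW(0.5, TIMESTAMP_DIFF(c.conversion__timestamp, t.touchpoint__timestamp, HOUR) / 168.0) AS decay_weight
  FROM `project.dataset.conversions` c
  JOIN `project.dataset.touchpoints` t
    ON c.user_id = t.user_id
    AND t.touchpoint__timestamp <= c.conversion__timestamp
    AND t.touchpoint__timestamp >= TIMESTAMP_SUB(c.conversion__timestamp, INTERVAL 30 DAY)
),
normalized AS (
  SELECT
    touchpoint__source,
    transaction__revenue * decay_weight
      / SUM(decay_weight) OVER (PARTITION BY transaction_id) AS attributed_revenue
  FROM weighted
)
SELECT
  touchpoint__source,
  SUM(attributed_revenue) AS attribution__time_decay_revenue
FROM normalized
GROUP BY touchpoint__source
ORDER BY attribution__time_decay_revenue DESC;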


Quick Reference

# | Pattern | Category | Key Benefit
1 | Partition + Clustering | Optimization | 10-100x cost reduction
2 | Materialized Views | Optimization | Auto-optimized repeated queries
3 | HLL++ Sketches | Optimization | Composable distinct counts
4 | Inline UNNEST | Nested Data | Clean GA4 parameter extraction
5 | QUALIFY | Window Functions | Elegant window filtering
6 | Gap-and-Island | Window Functions | Streak and sequence detection
7 | insert_overwrite | dbt | Linear scaling for event tables
8 | Incremental Predicates | dbt | 88% destination scan reduction
9 | Sessionization | Marketing | Custom session definitions
10 | Attribution Modeling | Marketing | Multi-touch credit allocation

Each pattern encodes a way of thinking about data problems. Partition pruning teaches you to think about data layout. Window functions teach you to think in terms of row context. HLL sketches teach you that approximate answers at 1% of the cost are often better than exact answers you can’t afford to compute.

The best analytics engineers I know have internalized these patterns so deeply they apply them automatically. That fluency comes from practice. Pick one pattern you don’t use today, apply it to a real problem this week, and watch it become second nature.