Adrienne Vermorel
10 BigQuery SQL Patterns Every Analytics Engineer Should Know
Most analytics engineers learn BigQuery through trial and error, picking up patterns ad hoc from Stack Overflow answers and inherited codebases. It works, but it’s slow and expensive.
After auditing hundreds of dbt models and years of production BigQuery work, I’ve identified a canon of patterns that separate functional SQL from elegant, cost-efficient SQL. These are fundamental techniques that compound over time, the kind of knowledge that makes you faster at solving new problems because you’ve internalized the right primitives.
This article covers 10 patterns across 5 categories: query optimization, nested data handling, window functions, dbt incremental strategies, and marketing analytics. Each pattern includes production-ready code, anti-patterns to avoid, and real performance benchmarks.
Query Optimization & Cost Awareness
These three patterns directly impact your BigQuery bill and query performance. Master them first.
Pattern 1: Partition + Clustering Combo
Partitioning divides your table into segments by date, timestamp, or integer range. Clustering sorts data within those partitions by up to four columns. Combined, they let BigQuery skip irrelevant data blocks entirely, turning full table scans into targeted reads.
```sql
-- Create a table with both strategies
CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_name
AS
SELECT event_timestamp, user_id, event_name, event_params
FROM raw.events;

-- Enforce partition filters at the table level
ALTER TABLE `project.dataset.events`
SET OPTIONS (require_partition_filter = true);

-- Query that triggers both partition and cluster pruning
SELECT user_id, event_name, COUNT(*)
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2025-01-01'  -- Partition pruning
  AND user_id = 'abc123'                    -- Cluster pruning
GROUP BY 1, 2;
```

The require_partition_filter option prevents anyone from accidentally running a full table scan. Queries without a partition filter simply fail.
What to avoid:
Using functions on partition columns defeats pruning. If your table is partitioned directly on a TIMESTAMP column, a filter like WHERE DATE(event_timestamp) = '2025-01-01' may force a full scan because BigQuery can't always evaluate the function at planning time. Instead, store a separate DATE column and partition on that, or use a range filter: WHERE event_timestamp >= '2025-01-01' AND event_timestamp < '2025-01-02'.
Subqueries in partition filters also prevent pruning. WHERE event_date = (SELECT MAX(event_date) FROM other_table) won't prune because BigQuery needs literal values at query planning time. Materialize the date first or use a scripting variable (see the sketch below).
Over-partitioning creates its own problems. If your partitions contain less than 1GB each, the metadata overhead outweighs the pruning benefits. For small tables, clustering alone often performs better.
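A minimal sketch of the scripting-variable workaround mentioned above; the table and column names are illustrative, and the idea is simply to resolve the dynamic date in its own statement so the main query filters on a concrete value the planner can typically prune on:

```sql
-- Resolve the dynamic date first, then filter the partitioned table with it
DECLARE max_event_date DATE;
SET max_event_date = (SELECT MAX(event_date) FROM `project.dataset.other_table`);

SELECT user_id, event_name, COUNT(*) AS events
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = max_event_date  -- Value is resolved before this statement runs
GROUP BY 1, 2;
```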
Performance impact: Partitioning alone delivers 10–100x cost reduction on large tables. Adding clustering provides another 2–10x reduction within partitions. On a 10TB table with proper filters, you might scan just 10–100GB.
Pattern 2: Materialized Views for Repeated Aggregations
Dashboards and reports often run the same heavy aggregations hourly. Each execution scans the same data, burns the same slots, costs the same money. Materialized views precompute those results and refresh incrementally.
BigQuery’s optimizer uses MVs transparently. Queries against the base table get automatically rewritten to use the MV when beneficial, with no code changes required.
```sql
CREATE MATERIALIZED VIEW `project.dataset.mv_daily_revenue`
OPTIONS (
  enable_refresh = true,
  refresh_interval_minutes = 30,
  max_staleness = INTERVAL '4' HOUR  -- Accept staleness for dashboards
)
AS
SELECT
  DATE(transaction_time) AS transaction_date,
  product_category,
  SUM(amount) AS total_revenue,
  COUNT(*) AS transaction_count
FROM `project.dataset.transactions`
GROUP BY 1, 2;
```

The max_staleness option is the key tuning lever. Setting it to 4 hours tells BigQuery “I’m okay with data that’s up to 4 hours old”, which lets the optimizer use the MV more aggressively and reduces refresh frequency. For real-time dashboards, omit it. For weekly reports, set it to days.
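To see the transparent rewrite in action, dashboard queries can keep targeting the base table. The sketch below rolls the MV's grain up further (dropping product_category), which is the kind of query the optimizer can answer from mv_daily_revenue instead of rescanning transactions, provided the result is within the configured staleness:

```sql
-- Written against the base table, but eligible to be served from mv_daily_revenue
SELECT
  DATE(transaction_time) AS transaction_date,
  SUM(amount) AS total_revenue
FROM `project.dataset.transactions`
WHERE DATE(transaction_time) >= '2025-01-01'
GROUP BY transaction_date;
```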
What to avoid:
Non-deterministic functions like CURRENT_TIMESTAMP() in your MV definition block incremental refresh. The MV falls back to full recomputation, defeating the purpose.
MVs also can’t include ORDER BY, LIMIT, HAVING, or subqueries in the SELECT clause. Keep them simple: aggregations over filtered base tables.
Don’t ignore staleness settings. The default is always-fresh, which means constant refresh overhead. If your dashboard users can tolerate 30-minute-old data, tell BigQuery.
Performance impact: Google reports queries scanning 500GB reduced to 2.5MB (a 200,000x reduction) with properly designed MVs. Query speedup ranges from 5-100x for complex aggregations. Limits: 20 MVs per dataset, 100 per project.
Pattern 3: HyperLogLog++ Sketches for Composable Distinct Counts
Suppose you need unique user counts for 7-day, 30-day, and 90-day rolling windows. With COUNT(DISTINCT user_id), each window requires scanning all the underlying data. Triple the windows, triple the cost.
HyperLogLog++ sketches solve this. They’re probabilistic data structures that estimate cardinality with ~97-99% accuracy, and they’re additive. Precompute daily sketches, then merge them for any date range in milliseconds.
```sql
-- Step 1: Create daily sketches (run once, append daily)
CREATE OR REPLACE TABLE `project.dataset.daily_user_sketches` AS
SELECT
  event_date,
  HLL_COUNT.INIT(user_id, 15) AS user_sketch  -- Precision 10–24
FROM `project.dataset.events`
GROUP BY event_date;

-- Step 2: Merge sketches for any date range (instant, regardless of range)
SELECT HLL_COUNT.MERGE(user_sketch) AS unique_users_30d
FROM `project.dataset.daily_user_sketches`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- Simpler alternative for one-off queries
SELECT APPROX_COUNT_DISTINCT(user_id) AS approx_users
FROM `project.dataset.events`
WHERE event_date >= '2025-01-01';
```

The precision parameter (10–24) controls the tradeoff between accuracy and sketch size. Precision 15 gives approximately 0.4% error, which is good enough for most analytics use cases. GA4 uses precision 14 for user counts and 12 for sessions internally.
What to avoid:
APPROX_COUNT_DISTINCT is convenient but not composable. You can’t add yesterday’s approximate count to today’s and get a valid 2-day count. For rolling windows, you must use HLL_COUNT.INIT and HLL_COUNT.MERGE.
Low precision values (below 12) produce noticeable variance; avoid them when accuracy matters.
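When in doubt about a precision level, a quick side-by-side on one bounded slice of data makes the tradeoff concrete. A minimal sketch against the same events table as above:

```sql
-- Compare the exact count with sketch-based estimates at two precision levels
SELECT
  COUNT(DISTINCT user_id) AS exact_users,
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_id, 12)) AS hll_precision_12,
  HLL_COUNT.EXTRACT(HLL_COUNT.INIT(user_id, 15)) AS hll_precision_15
FROM `project.dataset.events`
WHERE event_date = '2025-01-01';
```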
Performance impact: 4-8x faster execution than exact COUNT(DISTINCT), 8x lower slot consumption. The real win is rolling windows: a 90-day unique user count that took 30 seconds now takes 200 milliseconds.
Nested Data Mastery
BigQuery’s nested and repeated fields are powerful but poorly understood. This pattern is essential for anyone working with GA4, Firebase, or event data.
Pattern 4: Inline UNNEST Subqueries
GA4 and Firebase store event parameters as arrays of structs (a schema like event_params ARRAY<STRUCT<key STRING, value STRUCT<string_value STRING, int_value INT64, ...>>>). The naive approach uses CROSS JOIN UNNEST, which explodes rows and breaks when parameters are missing.
The correct approach uses inline UNNEST subqueries. They extract values cleanly, return NULL for missing parameters, and keep one row per event.
```sql
-- The right way: inline UNNEST subqueries
SELECT
  event_date,
  event_name,
  user_pseudo_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'page_location') AS page_location,
  (SELECT value.int_value FROM UNNEST(event_params) WHERE key = 'ga_session_id') AS ga_session_id,
  (SELECT value.string_value FROM UNNEST(event_params) WHERE key = 'campaign') AS campaign  -- NULL if missing
FROM `project.analytics_XXXXX.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131'
  AND event_name = 'page_view';
```

For repeated use, wrap the extraction in a temporary function:

```sql
CREATE TEMP FUNCTION GetParamString(params ANY TYPE, param_key STRING)
AS ((SELECT value.string_value FROM UNNEST(params) WHERE key = param_key));

CREATE TEMP FUNCTION GetParamInt(params ANY TYPE, param_key STRING)
AS ((SELECT value.int_value FROM UNNEST(params) WHERE key = param_key));

SELECT
  event_name,
  GetParamString(event_params, 'page_location') AS page_location,
  GetParamInt(event_params, 'ga_session_id') AS ga_session_id
FROM `project.analytics_XXXXX.events_*`
WHERE _TABLE_SUFFIX = '20250115';
```

What to avoid:
Multiple CROSS JOIN UNNEST statements filtered by key silently drop every event that is missing any one of the parameters (see the sketch below).
Forgetting the _TABLE_SUFFIX filter on date-sharded tables scans the entire dataset, potentially years of data. Always filter it first.
Using SAFE_OFFSET or array indexing assumes parameters appear in a consistent order. They don’t. Always filter by key.
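For concreteness, here is a minimal sketch of the CROSS JOIN anti-pattern from above; each filtered UNNEST behaves like an inner join, so any page_view without a campaign parameter vanishes from the result:

```sql
-- Anti-pattern: events missing the 'campaign' parameter are dropped entirely
SELECT
  e.event_name,
  pl.value.string_value AS page_location,
  cp.value.string_value AS campaign
FROM `project.analytics_XXXXX.events_*` AS e
CROSS JOIN UNNEST(e.event_params) AS pl
CROSS JOIN UNNEST(e.event_params) AS cp
WHERE e._TABLE_SUFFIX = '20250115'
  AND e.event_name = 'page_view'
  AND pl.key = 'page_location'
  AND cp.key = 'campaign';
```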
Performance impact: Nested data queries avoid shuffle operations, making them faster than equivalent JOINs on flattened tables. Storage savings reach 35-74% compared to fully denormalized schemas.
Window Function Patterns
Window functions are BigQuery’s secret weapon for analytics. These two patterns unlock elegant solutions to common problems.
Pattern 5: QUALIFY for Elegant Window Filtering
Filtering on window function results (like “get the most recent order per customer”) traditionally requires a subquery or CTE:
```sql
-- The old way: subquery wrapper
SELECT customer_id, order_id, order_date, order__status
FROM (
  SELECT
    customer_id,
    order_id,
    order_date,
    order__status,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rn
  FROM `project.dataset.base__shopify__orders`
)
WHERE rn = 1;
```

BigQuery’s QUALIFY clause does this inline:

```sql
-- The elegant way: QUALIFY
SELECT customer_id, order_id, order_date, order__status
FROM `project.dataset.base__shopify__orders`
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id
  ORDER BY order_date DESC
) = 1;
```

Same result, less nesting, easier to read. QUALIFY works with any window function:

```sql
-- Top 3 products per category by revenue
SELECT product__category, product__name, product__revenue
FROM `project.dataset.mrt__sales__products`
QUALIFY RANK() OVER (PARTITION BY product__category ORDER BY product__revenue DESC) <= 3;

-- First and last event per session
SELECT session_id, event_name, event_time, user_id
FROM `project.dataset.int__ga4__events_sessionized`
QUALIFY ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY event_time) = 1
  OR ROW_NUMBER() OVER (PARTITION BY session_id ORDER BY event_time DESC) = 1;
```

What to avoid:
Wrapping simple window filters in CTEs when QUALIFY would suffice. For complex multi-step transformations, CTEs are still appropriate, but for single-step filtering, QUALIFY is cleaner.
Performance impact: The execution plan is identical to the subquery approach. QUALIFY is pure syntactic sugar, but it makes your code significantly more readable.
The execution order is: FROM → WHERE → GROUP BY → HAVING → WINDOW → QUALIFY → DISTINCT → ORDER BY → LIMIT.
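Because QUALIFY is evaluated after GROUP BY, it can also filter on windows computed over aggregates, which saves a CTE in top-N-by-metric queries. A minimal sketch, assuming a hypothetical orders table with an amount column:

```sql
-- Top 10 customers by total spend, aggregated and filtered in one statement
SELECT
  customer_id,
  SUM(amount) AS total_spend
FROM `project.dataset.orders`
GROUP BY customer_id
QUALIFY RANK() OVER (ORDER BY SUM(amount) DESC) <= 10;
```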
Pattern 6: Gap-and-Island Detection for Streaks
Identifying consecutive sequences (user login streaks, production uptime, date ranges without gaps) is surprisingly tricky. The gap-and-island pattern uses a clever ROW_NUMBER subtraction to group consecutive values.
The insight is that subtracting a sequential row number from consecutive dates produces the same result for each consecutive group, which becomes a grouping key.
```sql
WITH numbered_activity AS (
  SELECT
    user_id,
    activity_date,
    -- Subtract row number from date; consecutive dates yield same value
    DATE_SUB(activity_date, INTERVAL ROW_NUMBER() OVER (
      PARTITION BY user_id
      ORDER BY activity_date
    ) DAY) AS island_id
  FROM `project.dataset.user_activity`
)

SELECT
  user_id,
  MIN(activity_date) AS streak__started_at,
  MAX(activity_date) AS streak__ended_at,
  DATE_DIFF(MAX(activity_date), MIN(activity_date), DAY) + 1 AS streak__duration_days
FROM numbered_activity
GROUP BY user_id, island_id
HAVING DATE_DIFF(MAX(activity_date), MIN(activity_date), DAY) + 1 >= 7;  -- Only 7+ day streaks
```

Example data showing the island calculation:
| activity_date | ROW_NUMBER | date - row_number (island_id) |
|---|---|---|
| 2025-01-01 | 1 | 2024-12-31 |
| 2025-01-02 | 2 | 2024-12-31 |
| 2025-01-03 | 3 | 2024-12-31 |
| 2025-01-05 | 4 | 2025-01-01 (gap creates new island) |
| 2025-01-06 | 5 | 2025-01-01 |
The first three dates form one island (streak of 3). The last two form another (streak of 2).
What to avoid:
Self-joins for streak detection have O(n²) complexity. The window function approach is O(n).
Recursive CTEs might seem like a natural fit, but they are verbose and perform poorly for this kind of sequence grouping. The set-based gap-and-island pattern is the standard solution.
Use cases: Gaming achievement streaks, habit tracking apps, SLA uptime calculation, consolidating overlapping date ranges, identifying gaps in time series data.
dbt Incremental Strategies
dbt on BigQuery offers multiple incremental strategies. Choosing wrong means either correctness issues or 10x higher costs than necessary.
Pattern 7: insert_overwrite with Static Partitions
The default merge strategy works by scanning the entire destination table to find matching rows. For append-heavy event tables that rarely update, this is wasteful. insert_overwrite replaces entire partitions with no row matching required.
```sql
{% set partitions_to_replace = [
  'current_date()',
  'date_sub(current_date(), interval 1 day)',
  'date_sub(current_date(), interval 2 day)'
] %}

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'copy_partitions': true
    },
    partitions=partitions_to_replace,
    cluster_by=['user_id', 'event_name']
  )
}}

SELECT
  DATE(event_timestamp) AS event_date,
  user_id,
  event_name,
  event_params
FROM {{ source('analytics', 'raw_events') }}

{% if is_incremental() %}
WHERE DATE(event_timestamp) IN ({{ partitions_to_replace | join(', ') }})
{% endif %}
```

The copy_partitions: true option (set inside partition_by) enables a zero-cost write path: dbt replaces each partition with a BigQuery copy job instead of a billed DML statement.
What to avoid:
Using merge for event and fact tables that never update. insert_overwrite can be 10x+ faster because it doesn’t scan the destination.
Dynamic partition detection on huge tables introduces unpredictability. Static partition lists make your pipeline’s cost and runtime deterministic.
Forgetting copy_partitions: true when applicable. It’s free performance.
Performance impact: Cost scales linearly with partition size, not total table size. A 10TB table with 1GB daily partitions processes in the same time as a 100GB table with the same partition size. Best for: event streams, logs, immutable facts.
Pattern 8: Incremental Predicates for MERGE Optimization
Sometimes you need merge for tables with updates, late-arriving corrections, or CDC patterns. But without optimization, merge scans the full destination table for the ON clause match.
Incremental predicates add filters to the MERGE statement itself, enabling partition pruning on the destination table.
```sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    partition_by={
      'field': 'order_date',
      'data_type': 'date'
    },
    cluster_by=['order_id'],
    incremental_predicates=[
      "DBT_INTERNAL_DEST.order_date >= date_sub(current_date(), interval 14 day)"
    ]
  )
}}

SELECT
  order_id,
  customer_id,
  order__status,
  order_date,
  order__updated_at
FROM {{ source('orders', 'raw_orders') }}

{% if is_incremental() %}
WHERE order__updated_at >= (SELECT DATE_SUB(MAX(order__updated_at), INTERVAL 3 DAY) FROM {{ this }})
{% endif %}
```

The predicate DBT_INTERNAL_DEST.order_date >= date_sub(current_date(), interval 14 day) tells BigQuery to only scan the last 14 days of the destination table when matching rows.
Two optimizations compound here: the predicate enables partition pruning, and cluster_by=['order_id'] (clustering on the unique key) enables cluster pruning within those partitions.
What to avoid:
Running merge without incremental_predicates on large partitioned tables. You’re scanning terabytes when you need gigabytes.
Forgetting to cluster on unique_key. The merge operation matches on this column, and clustering makes that match 50%+ faster.
Setting predicates too narrow and missing late-arriving data. If orders can be updated up to 7 days after creation, your predicate needs at least a 7-day window.
Performance impact: Achieves 88%+ reduction in destination table data scanned. Combined with unique_key clustering, total scan reduction can exceed 95%.
Marketing Analytics Patterns
These patterns solve problems specific to marketing data: custom session definitions and attribution modeling.
Pattern 9: Sessionization with LAG + Running Sum
GA4 defines sessions with a 30-minute inactivity timeout. But your business might need different rules: 15-minute timeouts for high-engagement apps, campaign-based session breaks, or cross-device stitching.
This pattern builds custom sessions from raw events using gap detection:
```sql
WITH events_with_gaps AS (
  SELECT
    user_pseudo_id,
    -- GA4 exports event_timestamp as INT64 microseconds; convert it for downstream use
    TIMESTAMP_MICROS(event_timestamp) AS event_timestamp,
    event_name,
    traffic_source.source AS traffic__source,
    traffic_source.medium AS traffic__medium,
    CASE
      WHEN LAG(event_timestamp) OVER (
        PARTITION BY user_pseudo_id
        ORDER BY event_timestamp
      ) IS NULL THEN 1  -- First event = new session
      WHEN TIMESTAMP_DIFF(
        TIMESTAMP_MICROS(event_timestamp),
        TIMESTAMP_MICROS(LAG(event_timestamp) OVER (
          PARTITION BY user_pseudo_id
          ORDER BY event_timestamp
        )),
        MINUTE
      ) > 30 THEN 1  -- Gap > 30 min = new session
      ELSE 0
    END AS event__is_new_session
  FROM `project.analytics_XXXXX.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131'
),

sessionized AS (
  SELECT
    user_pseudo_id,
    event_timestamp,
    event_name,
    traffic__source,
    traffic__medium,
    event__is_new_session,
    SUM(event__is_new_session) OVER (
      PARTITION BY user_pseudo_id
      ORDER BY event_timestamp
    ) AS session_number
  FROM events_with_gaps
)

SELECT
  user_pseudo_id,
  event_timestamp,
  event_name,
  traffic__source,
  traffic__medium,
  session_number,
  CONCAT(user_pseudo_id, '_', CAST(session_number AS STRING)) AS custom_session_id
FROM sessionized;
```

The pattern works in two steps: LAG identifies session boundaries by comparing each event’s timestamp to the previous event, then a running sum of the boundary flags creates session numbers.
To add session-level metrics, aggregate in a subsequent CTE:
```sql
, session_metrics AS (
  SELECT
    CONCAT(user_pseudo_id, '_', CAST(session_number AS STRING)) AS custom_session_id,
    user_pseudo_id,
    MIN(event_timestamp) AS session__started_at,
    MAX(event_timestamp) AS session__ended_at,
    TIMESTAMP_DIFF(MAX(event_timestamp), MIN(event_timestamp), SECOND) AS session__duration_seconds,
    COUNT(*) AS session__events,
    COUNTIF(event_name = 'page_view') AS session__pageviews
  FROM sessionized
  GROUP BY user_pseudo_id, session_number
)
```

What to avoid:
Relying solely on ga_session_id when your business logic differs. GA4’s session definition may not match your needs.
Session timeouts that are too short fragment genuine sessions; too long merges separate visits. Test with your actual data.
Extensions: Add campaign-based session splitting (new UTM parameters trigger new session), cross-device session stitching using user_id, or engagement-based rules (certain events always start new sessions).
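As a sketch of the campaign-based variant, the boundary CASE simply gains one more condition. This assumes a GA4-style export and uses traffic_source.name as a stand-in for the per-event campaign (in practice you would more likely pull it from event_params or collected_traffic_source):

```sql
-- A new session starts on a 30-minute gap OR when the campaign changes
WITH events AS (
  SELECT
    user_pseudo_id,
    TIMESTAMP_MICROS(event_timestamp) AS event_ts,
    traffic_source.name AS traffic__campaign
  FROM `project.analytics_XXXXX.events_*`
  WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250131'
)

SELECT
  user_pseudo_id,
  event_ts,
  traffic__campaign,
  SUM(event__is_new_session) OVER (
    PARTITION BY user_pseudo_id
    ORDER BY event_ts
  ) AS session_number
FROM (
  SELECT
    *,
    CASE
      WHEN LAG(event_ts) OVER (PARTITION BY user_pseudo_id ORDER BY event_ts) IS NULL THEN 1
      WHEN TIMESTAMP_DIFF(event_ts, LAG(event_ts) OVER (PARTITION BY user_pseudo_id ORDER BY event_ts), MINUTE) > 30 THEN 1
      WHEN traffic__campaign IS DISTINCT FROM
           LAG(traffic__campaign) OVER (PARTITION BY user_pseudo_id ORDER BY event_ts) THEN 1  -- Campaign change = new session
      ELSE 0
    END AS event__is_new_session
  FROM events
);
```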
Pattern 10: Multi-Touch Attribution Modeling
Last-click attribution is the default because it’s simple. It’s also wrong: it over-credits bottom-funnel channels like branded search while ignoring the awareness campaigns that started the journey.
This pattern calculates four attribution models in a single query: first-touch, last-touch, linear, and position-based.
```sql
WITH touchpoints AS (
  SELECT
    c.transaction_id,
    c.transaction__revenue,
    t.touchpoint__source,
    t.touchpoint__medium,
    t.touchpoint__campaign,
    t.touchpoint__timestamp,
    ROW_NUMBER() OVER (
      PARTITION BY c.transaction_id
      ORDER BY t.touchpoint__timestamp
    ) AS touch_position,
    ROW_NUMBER() OVER (
      PARTITION BY c.transaction_id
      ORDER BY t.touchpoint__timestamp DESC
    ) AS reverse_position,
    COUNT(*) OVER (PARTITION BY c.transaction_id) AS total_touches
  FROM `project.dataset.conversions` c
  JOIN `project.dataset.touchpoints` t
    ON c.user_id = t.user_id
    AND t.touchpoint__timestamp <= c.conversion__timestamp
    AND t.touchpoint__timestamp >= TIMESTAMP_SUB(c.conversion__timestamp, INTERVAL 30 DAY)  -- 30-day lookback
)

SELECT
  touchpoint__source,
  touchpoint__medium,

  -- First-touch: 100% credit to first interaction
  SUM(CASE WHEN touch_position = 1 THEN transaction__revenue ELSE 0 END) AS attribution__first_touch_revenue,

  -- Last-touch: 100% credit to last interaction
  SUM(CASE WHEN reverse_position = 1 THEN transaction__revenue ELSE 0 END) AS attribution__last_touch_revenue,

  -- Linear: Equal credit to all touchpoints
  SUM(transaction__revenue / total_touches) AS attribution__linear_revenue,

  -- Position-based: 40% first, 40% last, 20% split among middle
  SUM(transaction__revenue * CASE
    WHEN total_touches = 1 THEN 1.0
    WHEN touch_position = 1 THEN 0.4
    WHEN reverse_position = 1 THEN 0.4
    WHEN total_touches = 2 THEN 0  -- Edge case: only first and last, no middle
    ELSE 0.2 / (total_touches - 2)
  END) AS attribution__position_based_revenue

FROM touchpoints
GROUP BY touchpoint__source, touchpoint__medium
ORDER BY attribution__position_based_revenue DESC;
```

The lookback window (30 days in this example) is crucial business logic. Too short and you miss upper-funnel touchpoints. Too long and you attribute revenue to ancient, irrelevant interactions.
What to avoid:
Joining on user_id without time bounds attributes touchpoints from months or years ago. Always include a lookback window.
Forgetting edge cases. When total_touches = 2, there’s no middle to split the 20%. The formula 0.2 / (total_touches - 2) divides by zero. Handle it explicitly.
Performance notes: A single scan with window functions calculates all four models. For large datasets, consider pre-aggregating touchpoints to daily grain before the conversion join.
Extensions: Time-decay attribution (recent touchpoints get exponentially more credit), channel-path analysis with STRING_AGG to see common conversion paths, or custom weighting based on your business model.
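As a sketch of the time-decay variant: reuse the touchpoints CTE from above (assuming it also carries c.conversion__timestamp), give each touchpoint a weight that halves every 7 days before the conversion, then normalize the weights per transaction so credit still sums to revenue:

```sql
-- Time-decay attribution with a 7-day half-life
SELECT
  touchpoint__source,
  touchpoint__medium,
  SUM(attribution__time_decay_revenue) AS attribution__time_decay_revenue
FROM (
  SELECT
    touchpoint__source,
    touchpoint__medium,
    transaction__revenue
      * decay_weight
      / SUM(decay_weight) OVER (PARTITION BY transaction_id) AS attribution__time_decay_revenue
  FROM (
    SELECT
      *,
      -- Weight halves for every 7 days between the touch and the conversion
      POW(0.5, TIMESTAMP_DIFF(conversion__timestamp, touchpoint__timestamp, HOUR) / (7 * 24)) AS decay_weight
    FROM touchpoints  -- assumes the CTE also selects c.conversion__timestamp
  )
)
GROUP BY touchpoint__source, touchpoint__medium;
```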
Quick Reference
| # | Pattern | Category | Key Benefit |
|---|---|---|---|
| 1 | Partition + Clustering | Optimization | 10-100x cost reduction |
| 2 | Materialized Views | Optimization | Auto-optimized repeated queries |
| 3 | HLL++ Sketches | Optimization | Composable distinct counts |
| 4 | Inline UNNEST | Nested Data | Clean GA4 parameter extraction |
| 5 | QUALIFY | Window Functions | Elegant window filtering |
| 6 | Gap-and-Island | Window Functions | Streak and sequence detection |
| 7 | insert_overwrite | dbt | Linear scaling for event tables |
| 8 | Incremental Predicates | dbt | 88% destination scan reduction |
| 9 | Sessionization | Marketing | Custom session definitions |
| 10 | Attribution Modeling | Marketing | Multi-touch credit allocation |
Each pattern encodes a way of thinking about data problems. Partition pruning teaches you to think about data layout. Window functions teach you to think in terms of row context. HLL sketches teach you that approximate answers at 1% of the cost are often better than exact answers you can’t afford to compute.
The best analytics engineers I know have internalized these patterns so deeply they apply them automatically. That fluency comes from practice. Pick one pattern you don’t use today, apply it to a real problem this week, and watch it become second nature.