Note

BigQuery Partition Pruning Patterns

How to combine partitioning and clustering in BigQuery for maximum scan reduction, including anti-patterns that silently defeat pruning.

Planted
bigquery · cost optimization · data engineering

Partitioning divides a table into segments by date, timestamp, or integer range. Clustering sorts data within those partitions by up to four columns. Together, the two mechanisms let BigQuery skip irrelevant data blocks, turning would-be full table scans into targeted reads.

On a well-designed table with proper filters, a query may scan 10–100 GB out of a 10 TB table — 100–1000x less data billed under on-demand pricing.

The Partition + Clustering Combo

The standard pattern for event and fact tables is to partition by date and cluster by the columns you filter on most frequently:

CREATE TABLE `project.dataset.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY user_id, event_name
AS
SELECT
  event_timestamp,
  user_id,
  event_name,
  event_params
FROM raw.events;

A query that filters on both the partition column and a clustered column triggers both types of pruning:

SELECT user_id, event_name, COUNT(*)
FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2025-01-01' -- Partition pruning
AND user_id = 'abc123' -- Cluster pruning
GROUP BY 1, 2;

Partitioning alone delivers 10-100x cost reduction on large tables. Adding clustering provides another 2-10x reduction within partitions.

Enforcing Partition Filters

The require_partition_filter option prevents accidental full table scans by making queries without a partition filter fail outright:

ALTER TABLE `project.dataset.events`
SET OPTIONS (require_partition_filter = true);

This is useful on shared datasets where unfiltered SELECT * queries can occur. At the on-demand rate of $6.25 per TB, a full scan of a 10 TB table costs $62.50; require_partition_filter stops that query from running at all.

In a dbt context, your base models should always include partition filters. If you’re building on top of date-sharded GA4 exports, the _TABLE_SUFFIX filter serves the same purpose.
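For date-sharded tables, the equivalent guard is a _TABLE_SUFFIX range. A minimal sketch against a GA4-style export (the dataset name and date range are illustrative):

```sql
-- _TABLE_SUFFIX restricts which daily shards are read, the same way a
-- partition filter restricts which partitions are scanned.
SELECT event_name, COUNT(*) AS events
FROM `project.analytics_123456.events_*`
WHERE _TABLE_SUFFIX BETWEEN '20250101' AND '20250107'
GROUP BY event_name;
```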

Anti-Patterns That Defeat Pruning

These patterns silently prevent partition pruning. BigQuery does not warn; the query runs, returns correct results, and costs 10–100x more than it should.

Functions on Partition Columns

If your table is partitioned by TIMESTAMP at daily granularity, a filter like WHERE DATE(event_timestamp) = '2025-01-01' still prunes, because DATE() is one of the few functions the planner knows how to map onto the partition scheme. Wrap the column in almost anything else, such as TIMESTAMP_ADD(event_timestamp, INTERVAL 6 HOUR) or an expression that mixes in a second column, and you force a full scan. The optimizer needs to match the filter expression to the partition scheme at planning time, and an opaque function breaks that match.

The fix depends on your schema:

-- Option A: Partition on a DATE column directly
CREATE TABLE events
PARTITION BY event_date -- Store a separate DATE column
...

-- Option B: Use a range filter on the timestamp
WHERE event_timestamp >= '2025-01-01'
  AND event_timestamp < '2025-01-02'

Option A is cleaner for tables you control. Option B works when you’re querying tables you didn’t design.

Subqueries in Partition Filters

-- This does NOT prune
WHERE event_date = (SELECT MAX(event_date) FROM other_table)

BigQuery needs literal values at query planning time to perform partition pruning. A subquery produces a value at execution time, which is too late. The optimizer can’t know which partition to skip until it already knows the answer to the subquery.

Workarounds:

  1. Materialize the date into a variable first (in scripting mode)
  2. Use a Jinja macro in dbt to inject the date as a literal
  3. Accept the cost if the table is small enough that it doesn’t matter
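The first workaround looks like this in BigQuery scripting; the table and column names follow the snippet above and are illustrative:

```sql
-- Resolve the date once, then filter on the resulting variable.
-- A comparison against a scripting variable can still prune partitions,
-- because the value is known before the main query is planned.
DECLARE max_date DATE;
SET max_date = (SELECT MAX(event_date) FROM other_table);

SELECT *
FROM `project.dataset.events_by_date`
WHERE event_date = max_date;
```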

Over-Partitioning

If your partitions contain less than 1 GB each, the metadata overhead outweighs the pruning benefits. Each partition has fixed overhead for metadata management, and with many tiny partitions, BigQuery spends more time managing partition metadata than it saves from pruning.

The practical threshold: if your daily partitions are under 1 GB, consider using a wider partition grain (monthly instead of daily) with clustering, or clustering alone for smaller tables. Clustering doesn’t have a minimum size threshold the way partitioning does.
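A sketch of the wider grain, reusing the events table from earlier; TIMESTAMP_TRUNC at MONTH granularity replaces the daily DATE() partitioning:

```sql
CREATE TABLE `project.dataset.events_monthly`
PARTITION BY TIMESTAMP_TRUNC(event_timestamp, MONTH) -- monthly instead of daily
CLUSTER BY user_id, event_name
AS
SELECT event_timestamp, user_id, event_name, event_params
FROM raw.events;
```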

Clustering Column Order Matters

Clustering sorts data hierarchically by up to four columns. The first column has the strongest pruning effect. Place your most frequently filtered column first:

CLUSTER BY user_id, event_name, campaign

If most of your queries filter on user_id, this order is correct. If most queries filter on event_name but only sometimes on user_id, reverse them. Check INFORMATION_SCHEMA.JOBS to see which columns appear most often in WHERE clauses.
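One rough way to run that check is to treat the query text as a string. This is a hypothetical heuristic, not an exact parse: substring matching will overcount columns that appear outside WHERE clauses, and the column names are the ones from this example:

```sql
-- Count recent jobs whose SQL mentions each candidate clustering column.
SELECT
  COUNTIF(REGEXP_CONTAINS(query, r'user_id')) AS user_id_queries,
  COUNTIF(REGEXP_CONTAINS(query, r'event_name')) AS event_name_queries
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY);
```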

Clustering benefits do not appear in dry-run cost estimates. A dry run may estimate 50 GB, but the actual query scans 5 GB due to cluster pruning. Check actual bytes billed in INFORMATION_SCHEMA.JOBS to evaluate clustering effectiveness.

Interaction with dbt Incremental Models

Partition pruning becomes even more critical for dbt incremental models. The insert_overwrite strategy with static partitions is designed to exploit partition pruning: it replaces only the partitions you specify, and costs scale with partition size rather than total table size.
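A sketch of that configuration in a dbt-bigquery model, assuming a daily event_date partition column; the two-day replacement window is illustrative:

```sql
{% set partitions_to_replace = [
    'date_sub(current_date(), interval 1 day)',
    'current_date()'
] %}

{{ config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'event_date', 'data_type': 'date'},
    partitions = partitions_to_replace
) }}

-- model SQL follows
```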

For merge strategy models, incremental_predicates add partition filters to the destination scan:

{{ config(
    incremental_predicates = [
        "DBT_INTERNAL_DEST.event_date >= date_sub(current_date(), interval 14 day)"
    ]
) }}

Without this predicate, a merge on a 10 TB partitioned table still scans all 10 TB on the destination side — even though the partitioning exists. The predicate tells BigQuery which partitions to check during the merge match, achieving 88%+ scan reduction on the destination table.

Measuring Pruning Effectiveness

The quickest way to check whether your queries are actually pruning:

SELECT
  job_id,
  query,
  total_bytes_billed,
  total_bytes_processed,
  ROUND(total_bytes_billed / POW(1024, 3), 2) AS gb_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
AND referenced_tables IS NOT NULL
ORDER BY total_bytes_billed DESC
LIMIT 20;

Compare total_bytes_billed against the table’s total size. If you’re scanning more than 5-10% of a large partitioned + clustered table on queries with proper filters, something is off. Check the anti-patterns above.

In the BigQuery architecture, partitioning and clustering are the primary mechanisms for reducing how much data Dremel reads from Colossus.