This guide covers when to use partitioning, clustering, or both. The decision hinges on three factors: dominant query patterns, table size, and whether partition-level operations (expiration, targeted deletes, predictable cost estimates) are required. Applied correctly, the combination can reduce bytes scanned, and with them on-demand query costs, by 90–99%.
The Decision Tree
```mermaid
flowchart TD
    A{Table > 64MB?}
    A -->|No| B[Skip optimization]
    A -->|Yes| C{80%+ queries filter<br>on one date column?}
    C -->|Yes| D[Partition by date]
    C -->|No| E{High-cardinality<br>filters?}
    D --> F{Partitions > 10GB?}
    F -->|Yes| G[Partition + Cluster]
    F -->|No| H[Monthly partition<br>+ Cluster]
    E -->|Yes| I[Cluster only]
    E -->|No| J[Case by case]
```

Work through it top to bottom. The first question eliminates tables too small to benefit. The second separates date-dominant workloads (where partitioning shines) from multi-dimensional access patterns (where clustering is the better fit).
Use Partitioning Alone When
Your queries consistently filter on a single date/timestamp column — 80% or more of queries include a date range filter. Partitioning gives you three things that clustering can’t:
Predictable cost estimates before execution. Partition pruning tells you exactly how many bytes will be scanned upfront. Clustering estimates aren’t finalized until the query completes. If your team cares about knowing what a query costs before running it, partitioning delivers.
Partition lifecycle management. You can set automatic expiration (partition_expiration_days), delete entire date ranges efficiently, or write to specific partitions without touching others. This is essential for data retention policies and compliance workflows.
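A sketch of both operations, assuming a table `project.dataset.events` partitioned on a DATE column `event_date` (names are placeholders):

```sql
-- Expire partitions automatically once they are 90 days old
ALTER TABLE `project.dataset.events`
SET OPTIONS (partition_expiration_days = 90);

-- Delete a date range; when the filter aligns with partition
-- boundaries, BigQuery removes whole partitions efficiently
-- instead of rewriting the rest of the table
DELETE FROM `project.dataset.events`
WHERE event_date BETWEEN '2023-01-01' AND '2023-01-31';
```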
Efficient incremental processing. The insert_overwrite strategy in dbt replaces entire partitions atomically — no row-level comparison, no full table scan. This requires partitioning.
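A minimal incremental model sketch, assuming a raw `events` source and a DATE column `event_date` (source and column names are placeholders):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
    event_date,
    user_id,
    event_name
FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
-- Only rebuild partitions at or beyond the current high-water mark
WHERE event_date >= _dbt_max_partition
{% endif %}
```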
The condition: each partition should average at least 10GB. Smaller partitions create metadata overhead without proportional benefit. If your daily data volume is under 10GB, consider monthly partitioning combined with clustering for sub-month granularity.
Use Clustering Alone When
Clustering fits when partitioning doesn’t match your access patterns or table size:
Table is 64MB-10GB. Too small for meaningful partitions, but large enough for block pruning benefits. The metadata overhead of partitions isn’t justified at this scale, but clustering still delivers measurable scan reduction.
Multiple filter columns with no single dominant one. If queries filter on user_id, product_id, region, and status in various combinations, clustering handles this. Partitioning can only address one column.
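A sketch of a clustering-only table for that scenario, with placeholder names. BigQuery allows at most four clustering columns, and order matters, so list the most frequently filtered column first:

```sql
-- Cluster on the four commonly filtered dimensions, most-used first
CREATE TABLE `project.dataset.orders_clustered`
CLUSTER BY user_id, product_id, region, status
AS
SELECT *
FROM `project.dataset.orders`;
```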
High-cardinality filters. Columns like user_id or session_id have too many distinct values for partitioning — you’d hit the 10,000 partition limit immediately. Clustering handles high cardinality gracefully because it works at the block level.
Partitioning would create tiny partitions. If daily data volume is 100MB, daily partitions are wasteful. Each partition adds metadata overhead, and BigQuery must stitch them together at query time. Cluster instead.
Use Both When
Combine both when you have large tables with multi-dimensional query patterns:
Tables exceeding 10GB per partition where queries filter on date and other dimensions. This is the most common scenario for production analytics tables.
Event tables where the standard pattern is: partition by transaction/event date, cluster by commonly filtered dimension keys. The partition handles the time dimension; clustering handles the rest.
You need cost predictability plus granular filtering. Partitioning gives you predictable byte estimates on the time dimension. Clustering gives you additional scan reduction on other filter columns.
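Putting the pieces together, a sketch of the combined pattern with placeholder names; the `require_partition_filter` option rejects queries that omit a partition filter, which the table below recommends above 100GB:

```sql
-- Partition on the event date, cluster on the other filter dimensions
CREATE TABLE `project.dataset.events`
PARTITION BY event_date
CLUSTER BY customer_id, event_name
OPTIONS (require_partition_filter = TRUE)
AS
SELECT *
FROM `project.dataset.events_raw`;
```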
Size Thresholds That Actually Matter
Google’s documentation states that tables or partitions larger than 64MB benefit from clustering, and its Recommender suggests partitioning for tables exceeding 100GB and clustering for tables exceeding 10GB. In practice, the thresholds that drive real decisions look like this:
| Table/Partition Size | Recommendation |
|---|---|
| < 64MB | Skip optimization entirely |
| 64MB - 1GB | Clustering only if heavily filtered |
| 1GB - 10GB | Clustering recommended |
| 10GB - 100GB | Both clustering + partitioning |
| > 100GB | Both mandatory; enforce partition filters |
The 1GB threshold is where clustering improvements become meaningful in production. Above 100GB, skipping optimization carries significant cost risk: at the $6.25/TB on-demand rate, a single unfiltered query on a 100GB table costs about $0.63, and on a 10TB table, $62.50.
The Small Partitions Trap
Daily partitioning with 100MB of daily data creates 365 tiny partitions per year, each adding metadata overhead that BigQuery must stitch together at query time. This is one of the most common mistakes with partitioning.
The fix: use monthly partitioning combined with clustering on the date column for sub-month granularity. You get the lifecycle management benefits of partitioning (expiration, partition-level operations) without the overhead of hundreds of small partitions. Queries that filter by specific dates still benefit from clustering within each monthly partition.
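A sketch of that configuration in dbt, assuming a DATE column `event_date`:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
        'field': 'event_date',
        'data_type': 'date',
        'granularity': 'month'
    },
    cluster_by=['event_date']
) }}
```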
Validating Your Choices
Measure before and after making changes.
Before Deployment
Check current table stats using INFORMATION_SCHEMA:
```sql
SELECT
  table_name,
  total_rows,
  total_partitions,
  total_logical_bytes / POW(1024, 3) AS size_gb
FROM `region-us`.INFORMATION_SCHEMA.TABLE_STORAGE
WHERE table_schema = 'dataset'
  AND table_name = 'your_table'
```

TABLE_STORAGE is a region-qualified view, so adjust region-us to your table's location. Use BigQuery’s Recommender (the lightbulb icon in the console) for ML-based suggestions derived from actual query patterns. These recommendations are starting points; they tend toward conservative thresholds.
Dry-run queries to see bytes scanned before execution. Note that dry runs only reflect partition pruning savings — they don’t show clustering benefits, which are only visible in actual execution.
```bash
bq query --dry_run --use_legacy_sql=false 'SELECT ...'
```

After Deployment
Monitor bytes scanned and slot time in BigQuery’s query history. Compare before/after for representative queries:
```sql
SELECT
  total_bytes_processed,
  total_bytes_billed,
  total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_id = 'your-job-id'
```

Watch for partition filter warnings in your dbt logs. They indicate queries that couldn’t prune partitions — a sign that your filter expressions might be defeating pruning.
If queries with proper filters scan more than 5–10% of a large partitioned and clustered table, check the anti-patterns in BigQuery Partition Pruning Patterns — functions on partition columns, subqueries in filters, and wrong clustering column order are the common causes.
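For example, one of those anti-patterns and its fix, assuming a table partitioned on a DATE column `event_date` (the control table in the subquery is hypothetical):

```sql
-- Defeats pruning: the filter value comes from a subquery, so
-- BigQuery cannot determine the target partitions at plan time
WHERE event_date >= (SELECT MAX(processed_date)
                     FROM `project.dataset.etl_control`)

-- Prunes: a constant expression is resolvable before execution
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
```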
dbt Configuration Gotchas
Three dbt-specific traps when working with partitioning and clustering:
_dbt_max_partition is BigQuery SQL, not Jinja. Use it directly in your SQL without double brackets:
```sql
-- Correct
WHERE event_date >= _dbt_max_partition
```
```sql
-- Wrong: will fail
WHERE event_date >= {{ _dbt_max_partition }}
```

Changing the partition column causes data loss. When you modify partition_by on an existing incremental table, dbt’s insert_overwrite deletes partitions based on the new column’s values — which may not align with existing data. Always run --full-refresh when modifying partition configuration.
Data type mismatches break pruning silently. If your partition column is DATE but you filter with a TIMESTAMP literal, pruning won’t occur. BigQuery won’t warn you — the query runs, returns correct results, and costs 10-100x more than it should. Be explicit about types in your filter conditions.
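A sketch of the explicit form, assuming a table partitioned on a DATE column `event_date` (names are placeholders):

```sql
-- Typed DATE literals match the partition column, so pruning applies
SELECT COUNT(*)
FROM `project.dataset.events`
WHERE event_date BETWEEN DATE '2024-06-01' AND DATE '2024-06-30';
```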