Adrienne Vermorel
Partitioning vs. Clustering: The Decision Framework
Getting partitioning and clustering right can reduce your BigQuery costs by 90-99%. Getting it wrong (or worse, not thinking about it at all) means you’re burning money on every query.
Most analytics engineers understand what partitioning and clustering do. The harder question is when to use each, and how they interact with your dbt models. This tutorial provides a practical decision framework based on your actual query patterns, table sizes, and operational requirements.
Quick refresher: how they work differently
The fundamental mechanical difference shapes everything that follows.
Partitioning: physical boundaries
Partitioning physically divides your table into segments based on a single column’s values. When you query with a filter on the partition column, BigQuery performs partition pruning: it completely skips partitions that don’t match your filter. Those bytes are never scanned, never billed.
BigQuery supports three partitioning types:
Time-unit column partitioning uses a DATE, TIMESTAMP, or DATETIME column with HOUR, DAY, MONTH, or YEAR granularity. This is the standard choice for event data and time-series workloads.
Ingestion-time partitioning assigns rows to partitions based on when BigQuery ingests them, using the pseudocolumns _PARTITIONTIME and _PARTITIONDATE. Useful when your source data lacks reliable timestamps.
Integer range partitioning divides data by INTEGER column ranges you define (start, end, interval). Works for ID-based access patterns or customer segmentation scenarios.
Key constraints: you can only partition by one column, you’re limited to 10,000 partitions per table, and partition pruning only works with constant expressions. A filter like WHERE event_date >= (SELECT MAX(date) FROM other_table) won’t prune because the value depends on a subquery.
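To make the constant-expression rule concrete, here is a minimal sketch (table and column names are illustrative):

```sql
-- Prunes: the filter value is a constant known at planning time.
SELECT * FROM `project.dataset.events`
WHERE event_date >= '2025-01-01';

-- Also prunes: functions of constants are resolved before execution.
SELECT * FROM `project.dataset.events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);

-- Does not prune: the filter value depends on a subquery.
SELECT * FROM `project.dataset.events`
WHERE event_date >= (SELECT MAX(date) FROM `project.dataset.other_table`);
```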
Clustering: sorted blocks
Clustering sorts data within storage blocks based on up to four columns. This enables block-level pruning—BigQuery maintains metadata about value ranges in each block and skips blocks that can’t contain matching rows.
Unlike partitioning, clustering doesn’t create physical divisions. BigQuery uses an LSM tree-like approach: new data arrives in locally sorted “delta” blocks, and a background process merges these into fully sorted baseline blocks. This automatic re-clustering happens at no cost (no slot consumption, no charges).
Column order matters critically. BigQuery sorts hierarchically by your columns in order. Queries only benefit from clustering when filtering on a prefix of the clustering columns. If you cluster by [region, customer_id, product_id]:
- Filter on `region` alone → optimized
- Filter on `region` AND `customer_id` → optimized
- Filter on `customer_id` alone, skipping `region` → not optimized (see the example queries below)
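In query form, assuming a hypothetical `project.dataset.sales` table clustered by those three columns (the `amount` column is illustrative):

```sql
-- Benefits from block pruning: filters on a prefix of the clustering columns.
SELECT SUM(amount) FROM `project.dataset.sales`
WHERE region = 'EMEA';

SELECT SUM(amount) FROM `project.dataset.sales`
WHERE region = 'EMEA' AND customer_id = 'C-1042';

-- Does not benefit: skips the first clustering column.
SELECT SUM(amount) FROM `project.dataset.sales`
WHERE customer_id = 'C-1042';
```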
Partitioning creates hard boundaries that eliminate entire segments. Clustering organizes data within those segments for finer-grained skipping.
The decision framework
The choice between partitioning and clustering hinges on three factors: your dominant query patterns, table size, and whether you need partition-level operations (expiration, targeted deletes, predictable cost estimates).
Here’s the decision tree:
```mermaid
flowchart TD
    A{Table > 64MB?}
    A -->|No| B[Skip optimization]
    A -->|Yes| C{80%+ queries filter<br>on one date column?}
    C -->|Yes| D[Partition by date]
    C -->|No| E{High-cardinality<br>filters?}
    D --> F{Partitions > 10GB?}
    F -->|Yes| G[Partition + Cluster]
    F -->|No| H[Monthly partition<br>+ Cluster]
    E -->|Yes| I[Cluster only]
    E -->|No| J[Case by case]
```

Use partitioning alone when
Your queries consistently filter on a single date/timestamp column (80% or more of queries include a date range filter). Partitioning gives you:
- Predictable cost estimates before execution. Partition pruning tells you exactly how many bytes will be scanned upfront (clustering estimates aren’t finalized until the query completes).
- Partition lifecycle management. You can set automatic expiration (`partition_expiration_days`), delete entire date ranges efficiently, or write to specific partitions without touching others (see the sketch after this list).
- One caveat: sufficient partition size. Each partition should average at least 10GB; smaller partitions create metadata overhead without proportional benefit.
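For reference, the lifecycle operations above look like this in plain BigQuery DDL/DML, a minimal sketch with illustrative names (in dbt you would set `partition_expiration_days` in the model config instead):

```sql
-- Expire partitions automatically after one year.
ALTER TABLE `project.dataset.events`
SET OPTIONS (partition_expiration_days = 365);

-- Delete an entire date range efficiently via the partition column.
DELETE FROM `project.dataset.events`
WHERE event_date BETWEEN '2023-01-01' AND '2023-01-31';
```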
Use clustering alone when
Use clustering when partitioning doesn’t fit your access patterns or table size:
- Table is 64MB–10GB. Too small for meaningful partitions, but large enough for block pruning benefits.
- Multiple filter columns with no single dominant one. If queries filter on `user_id`, `product_id`, `region`, and `status` in various combinations, clustering handles this; partitioning can't.
- High-cardinality filters. Columns like `user_id` or `session_id` have too many distinct values for partitioning (you'd hit the 10,000 partition limit). Clustering handles high cardinality gracefully.
- Partitioning would create tiny partitions. If daily data volume is 100MB, daily partitions are wasteful. Cluster instead (a minimal dbt sketch follows).
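A minimal dbt sketch of the clustering-only case (model and column names are illustrative):

```sql
{{ config(
    materialized='table',
    cluster_by=['user_id', 'product_id', 'region', 'status']
) }}

SELECT * FROM {{ ref('stg_orders') }}
```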
Use both when
Combine both when you have large tables with multi-dimensional query patterns:
- Tables exceeding 10GB per partition where queries filter on date and other dimensions.
- Event tables where the standard pattern is: partition by transaction/event date, cluster by commonly filtered dimension keys.
- You need cost predictability (partition) plus granular filtering (cluster) for columns beyond the partition key.
Size thresholds that actually matter
Google’s documentation states that tables or partitions larger than 64MB benefit from clustering (smaller data sees “negligible” improvement). Google’s Recommender system uses different thresholds: it suggests partitioning for tables exceeding 100GB and clustering for tables exceeding 10GB.
In practice, 1GB is the minimum where you’ll see meaningful clustering improvements in production.
| Table/Partition Size | Recommendation |
|---|---|
| < 64MB | Skip optimization |
| 64MB – 1GB | Clustering only if heavily filtered |
| 1GB – 10GB | Clustering recommended |
| 10GB – 100GB | Both clustering + partitioning |
| > 100GB | Both mandatory; enforce partition filters |
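To see where your own tables fall on these thresholds, here's a quick sketch against the dataset's `__TABLES__` metadata (project and dataset names are illustrative):

```sql
SELECT
    table_id,
    row_count,
    size_bytes / POW(1024, 3) AS size_gb
FROM `project.dataset.__TABLES__`
ORDER BY size_bytes DESC
```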
The small partitions trap
Daily partitioning with 100MB of daily data creates 365 tiny partitions per year. Each partition adds metadata overhead, and BigQuery must stitch them together at query time.
Fix this by using monthly partitioning combined with clustering on the date column for sub-month granularity. You get the lifecycle management benefits of partitioning without the overhead of hundreds of small partitions.
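A sketch of that pattern as a dbt config (field and column names are illustrative):

```sql
{{ config(
    materialized='incremental',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'month'
    },
    cluster_by=['event_date', 'event_name']
) }}
```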
Real-world configuration patterns
Different data domains call for different optimization strategies. Here’s what works in practice.
Event data (GA4, clickstream)
GA4 exports create daily sharded tables (events_YYYYMMDD). For consolidated event tables or custom analytics models:
```sql
{{ config(
    materialized='incremental',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['event_name', 'user_pseudo_id', 'traffic_source'],
    require_partition_filter=true
) }}
```

Why this works: Queries filter by date range first (partition pruning), then drill down by event type and user (clustering). The `traffic_source` cluster column supports attribution analysis without additional scans.
Marketing and advertising data
Google Ads and Facebook Ads data typically arrives with campaign hierarchy (campaign → ad group → ad):
```sql
{{ config(
    materialized='incremental',
    partition_by={
      'field': 'report_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['campaign_id', 'ad_group_id', 'ad_id']
) }}
```

Why this works: Marketers analyze performance within date ranges, drilling down through the campaign hierarchy. Clustering column order matches the drill-down path.
Multi-tenant SaaS
For shared tables serving multiple customers, the critical decision is where tenant_id goes:
```sql
{{ config(
    materialized='incremental',
    partition_by={
      'field': 'created_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['tenant_id', 'user_id', 'entity_type'],
    require_partition_filter=true
) }}
```

Why this works: `tenant_id` as the first clustering column ensures efficient tenant isolation without requiring separate tables per customer. All queries for a specific tenant benefit from block pruning.
Time-series and IoT
For high-volume sensor data with device-specific queries:
```sql
-- For very high volume with <6 month retention, consider 'hour' granularity
-- (which requires a TIMESTAMP or DATETIME partition column).
{{ config(
    materialized='incremental',
    partition_by={
      'field': 'measurement_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['sensor_id', 'device_type', 'location']
) }}
```

Why this works: Device-specific queries within time windows are the dominant pattern. Clustering on `sensor_id` first handles the high cardinality of device identifiers.
dbt-bigquery configuration
The dbt-bigquery adapter provides direct configuration for partitioning and clustering.
Basic configuration
```sql
{{ config(
    materialized='incremental',
    partition_by={
      'field': 'event_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['user_id', 'event_type'],
    require_partition_filter=true,
    partition_expiration_days=365
) }}

SELECT
    DATE(event_timestamp) AS event_date,
    user_id,
    event_type,
    event_params
FROM {{ source('analytics', 'raw_events') }}

{% if is_incremental() %}
WHERE DATE(event_timestamp) >= _dbt_max_partition
{% endif %}
```

For integer range partitioning:
```sql
partition_by={
    'field': 'customer_id',
    'data_type': 'int64',
    'range': {
      'start': 0,
      'end': 1000000,
      'interval': 1000
    }
}
```

Incremental strategy interactions
Your choice of incremental strategy affects how partitioning and clustering behave during model runs.
Merge strategy (the default) performs a full scan of both source and destination tables to identify matching rows. Clustering significantly improves merge performance by reducing the rows BigQuery must compare. Cluster on your unique_key columns when using merge.
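As a sketch, a merge-based model clustered on its unique key might look like this (column and field names are illustrative):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id',
    partition_by={
      'field': 'order_date',
      'data_type': 'date',
      'granularity': 'day'
    },
    cluster_by=['order_id']
) }}
```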
Insert_overwrite strategy replaces entire partitions rather than merging individual rows. This requires partition_by configuration but not unique_key. For time-series data with immutable records, insert_overwrite is more efficient than merge:
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'event_date',
      'data_type': 'date'
    },
    cluster_by=['event_name', 'user_id']
) }}
```

Microbatch strategy (introduced in dbt 1.9) processes time-based batches independently with built-in retry and backfill capabilities. On BigQuery, it uses insert_overwrite under the hood:
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='created_at',
    batch_size='day',
    partition_by={
      'field': 'created_at',
      'data_type': 'timestamp',
      'granularity': 'day'
    }
) }}
```

The copy_partitions optimization
For insert_overwrite with ingestion-time partitioning, enabling copy_partitions uses BigQuery’s copy table API instead of MERGE statements. This eliminates insert costs and provides roughly 40% faster performance on large partitions:
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={
      'field': 'created_date',
      'data_type': 'timestamp',
      'granularity': 'day',
      'time_ingestion_partitioning': true,
      'copy_partitions': true
    }
) }}
```

Critical dbt gotchas
_dbt_max_partition is a BigQuery scripting variable that dbt declares for you, not a Jinja variable. Use it directly in your SQL without double curly braces:
```sql
-- Correct
WHERE event_date >= _dbt_max_partition

-- Wrong
WHERE event_date >= {{ _dbt_max_partition }}
```

Changing the partition column causes data loss. When you modify `partition_by` on an existing incremental table, dbt's insert_overwrite deletes partitions based on the new column's values, which may not align with existing data. Always run `--full-refresh` when modifying partition configuration.
Data type mismatches break pruning silently. If your partition column is DATE but you filter with a TIMESTAMP literal, pruning won’t occur. Be explicit about types in your filter conditions.
Anti-patterns to avoid
Each of these mistakes can eliminate the performance benefits you’re trying to achieve.
Over-partitioning
Symptom: Hundreds or thousands of partitions, each under 1GB.
Problem: BigQuery must stitch small partitions together at query time. The metadata overhead negates any speed improvement.
Fix: Use coarser granularity (monthly instead of daily) combined with clustering for finer time-based filtering.
Wrong clustering column order
Symptom: Queries filter on the second or third clustering column without including the first.
Problem: Clustering works hierarchically. Skipping the first column means no block pruning occurs.
Fix: Always place the most frequently filtered column first. Audit your actual query patterns—don’t guess.
Functions on partition columns
Symptom: Filters like WHERE DATE(event_timestamp) = '2025-01-10' on a TIMESTAMP-partitioned table.
Problem: Wrapping the partition column in a function prevents partition pruning.
Fix: Create a dedicated DATE column for partitioning, or ensure your filter matches the partition column’s type exactly.
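A sketch of the two filter styles on a table partitioned by a TIMESTAMP column (table and column names are illustrative):

```sql
-- Risky per the anti-pattern above: the partition column is wrapped in a function.
SELECT * FROM `project.dataset.events`
WHERE DATE(event_timestamp) = '2025-01-10';

-- Safe: filter directly on the partition column with matching types.
SELECT * FROM `project.dataset.events`
WHERE event_timestamp >= TIMESTAMP('2025-01-10')
  AND event_timestamp < TIMESTAMP('2025-01-11');
```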
Dynamic filter expressions
Symptom: Partition filters that depend on subqueries or runtime values.
Problem: BigQuery can’t prune partitions when the filter value isn’t known at query planning time.
Fix: Materialize filter values in CTEs or separate queries. For incremental models, use _dbt_max_partition which BigQuery can evaluate.
Missing require_partition_filter
Symptom: Accidental full table scans on multi-terabyte tables.
Problem: One careless query without a date filter can cost hundreds of dollars.
Fix: Always enable require_partition_filter=true on large production tables. It forces queries to include a partition filter, preventing expensive mistakes.
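In dbt this is the `require_partition_filter=true` config shown earlier; for tables managed outside dbt, you can enforce it directly (table name is illustrative):

```sql
ALTER TABLE `project.dataset.events`
SET OPTIONS (require_partition_filter = true);
```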
Comparing clustered columns to other columns
Symptom: Filters like WHERE customer_id = order_id.
Problem: Block pruning requires comparison against constant values. Column-to-column comparisons scan all blocks.
Fix: If you need this pattern, consider whether your data model should be restructured.
Validating your choices
Don’t optimize blindly. Measure before and after.
Before deployment
Check current table stats using INFORMATION_SCHEMA:
```sql
SELECT
    table_name,
    total_rows,
    total_logical_bytes / POW(1024, 3) AS size_gb,
    ARRAY_TO_STRING(clustering_columns, ', ') AS clustering
FROM `project.dataset.INFORMATION_SCHEMA.TABLE_STORAGE`
WHERE table_name = 'your_table'
```

Use BigQuery's Recommender (the lightbulb icon in the console) to see ML-based suggestions for partitioning and clustering based on your actual query patterns.
Dry-run queries to see bytes scanned before execution:
```
-- In the BigQuery console, enable "Dry run" mode
-- Or use the bq command line:
bq query --dry_run --use_legacy_sql=false 'SELECT ...'
```

After deployment
Monitor bytes scanned and slot time in BigQuery’s query history. Compare before/after for representative queries.
Check query execution details for partition pruning effectiveness:
```sql
SELECT
    total_bytes_processed,
    total_bytes_billed,
    total_slot_ms
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_id = 'your-job-id'
```

Watch for partition filter warnings in your dbt logs. They indicate queries that couldn't prune partitions.
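To track the trend rather than a single job, one possible sketch aggregates bytes billed for queries that reference the table (the `your_table` placeholder mirrors the earlier examples):

```sql
SELECT
    DATE_TRUNC(DATE(creation_time), WEEK) AS week,
    COUNT(*) AS query_count,
    SUM(total_bytes_billed) / POW(1024, 4) AS tb_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE job_type = 'QUERY'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 60 DAY)
  AND EXISTS (
    SELECT 1 FROM UNNEST(referenced_tables) AS t
    WHERE t.table_id = 'your_table'
  )
GROUP BY week
ORDER BY week
```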
Conclusion
The partition-vs-clustering decision comes down to a few key principles.
- Partition when you need cost predictability and lifecycle management on a single time-based column
- Cluster when you filter on multiple columns or high-cardinality values
- Combine both for large tables with multi-dimensional query patterns
Size thresholds matter: clustering benefits start around 64MB but become significant at 1GB+, while partitions should average at least 10GB each.
For dbt users, match your incremental strategy to your data pattern. Insert_overwrite with copy_partitions provides the most efficient path for time-series data. Merge benefits from clustering on join columns.
Most importantly: measure your actual query patterns before optimizing. BigQuery’s Recommender, INFORMATION_SCHEMA views, and query execution details give you the data to validate your choices. Start with the configuration suggested by your dominant query patterns, monitor bytes scanned and slot time, and iterate.