
BigQuery Partitioning Configuration Patterns

Domain-specific partitioning and clustering configurations for BigQuery in dbt: event data, marketing, multi-tenant SaaS, and IoT patterns with rationale.

bigquery · dbt · data engineering · data modeling

The decision framework determines whether to partition, cluster, or both. This note covers domain-specific implementations across four data types. Each pattern includes a complete dbt config block and rationale for column choices.

Event Data (GA4, Clickstream)

GA4 exports create daily sharded tables (events_YYYYMMDD). For a full reference on this export structure, see the GA4 BigQuery export schema. For consolidated event tables or custom analytics models:

{{ config(
    materialized='incremental',
    partition_by={
        'field': 'event_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    cluster_by=['event_name', 'user_pseudo_id', 'traffic_source'],
    require_partition_filter=true
) }}

Why this works: Queries filter by date range first (partition pruning), then drill down by event type and user (clustering). The traffic_source cluster column supports attribution analysis without additional scans.

Setting require_partition_filter=true is non-negotiable for event tables. GA4 tables grow quickly: a moderately trafficked site generates several GB per month. One accidental full table scan from an analyst exploring the data costs real money. The filter requirement makes that mistake impossible.
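For illustration, a typical analyst query against such a model might look like this (the dataset and table names are hypothetical; the date filter is exactly what require_partition_filter enforces):

SELECT
  event_date,
  COUNT(DISTINCT user_pseudo_id) AS daily_users
FROM analytics.fct_events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31'  -- partition pruning
  AND event_name = 'page_view'                          -- clustering narrows blocks scanned
GROUP BY event_date
ORDER BY event_date;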

The clustering column order follows the typical analysis path: “show me page_view events for user X from organic traffic.” If your team primarily analyzes by user first (e.g., identity resolution workloads), swap user_pseudo_id to the first position. Column order should match your dominant query patterns, not a generic recommendation.
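For a user-first team, that swap would look like:

cluster_by=['user_pseudo_id', 'event_name', 'traffic_source']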

Marketing and Advertising Data

Google Ads and Meta Ads data typically arrives with a campaign hierarchy (campaign → ad group → ad):

{{ config(
    materialized='incremental',
    partition_by={
        'field': 'report_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    cluster_by=['campaign_id', 'ad_group_id', 'ad_id']
) }}

Why this works: Marketers analyze performance within date ranges, drilling down through the campaign hierarchy. Clustering column order matches the drill-down path — campaign-level reports filter on campaign_id, ad group analysis adds ad_group_id, and creative performance adds ad_id.

This hierarchical ordering is a textbook case of why clustering column order matters. A query filtering only on ad_id (skipping campaign_id and ad_group_id) won’t benefit from clustering. But that query pattern is rare in practice — analysts almost always start from campaign-level and drill down.
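As a sketch (table and column names are illustrative), the first query below can prune clustered blocks while the second cannot:

-- Prunes: filters form a prefix of the clustering columns
SELECT SUM(cost) AS spend
FROM marts.fct_ad_performance
WHERE report_date >= '2024-01-01'
  AND campaign_id = 1234
  AND ad_group_id = 5678;

-- No clustering benefit: skips campaign_id and ad_group_id,
-- so only the date partition filter limits the scan
SELECT SUM(cost) AS spend
FROM marts.fct_ad_performance
WHERE report_date >= '2024-01-01'
  AND ad_id = 9012;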

For cross-platform ad reporting where you’ve unified data from Google Ads, Meta, LinkedIn, and other platforms into a single table, consider adding platform as the first clustering column:

cluster_by=['platform', 'campaign_id', 'ad_group_id', 'ad_id']

This serves the common pattern of filtering by platform first (“show me all Google Ads performance”), and the three remaining columns still cover the drill-down path. BigQuery allows up to four clustering columns, so platform fits without dropping ad_id.

Multi-Tenant SaaS

For shared tables serving multiple customers, the critical decision is where tenant_id goes:

{{ config(
    materialized='incremental',
    partition_by={
        'field': 'created_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    cluster_by=['tenant_id', 'user_id', 'entity_type'],
    require_partition_filter=true
) }}

Why this works: tenant_id as the first clustering column gives efficient tenant isolation without requiring separate tables per customer; all queries for a specific tenant benefit from block pruning. Partitioning by tenant_id isn't a real alternative: BigQuery only partitions on time or integer-range columns, and even an integer tenant key would quickly hit the 10,000 partition limit with many tenants.

The require_partition_filter is critical here for security as much as cost. Without it, a poorly written query could accidentally scan every tenant's data: a compliance nightmare on top of a cost problem.
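In practice, every query against the shared table then carries both filters (table and tenant names are illustrative):

SELECT entity_type, COUNT(*) AS event_count
FROM app.fct_tenant_activity
WHERE created_date >= '2024-06-01'  -- required by require_partition_filter
  AND tenant_id = 'tenant_123'      -- first clustering column drives block pruning
GROUP BY entity_type;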

If your tenants vary dramatically in size (one tenant has 90% of the data), the clustering benefit is uneven. The large tenant’s data may span many blocks, reducing the pruning ratio. In extreme cases, consider a separate table for the largest tenants and a shared table for the rest.

Time-Series and IoT

For high-volume sensor data with device-specific queries:

{# for very high volume with <6 month retention, consider granularity='hour'; see below #}
{{ config(
    materialized='incremental',
    partition_by={
        'field': 'measurement_date',
        'data_type': 'date',
        'granularity': 'day'
    },
    cluster_by=['sensor_id', 'device_type', 'location']
) }}

Why this works: Device-specific queries within time windows are the dominant pattern. Clustering on sensor_id first handles the high cardinality of device identifiers — there might be millions of sensors, which makes sensor_id unsuitable for partitioning (10,000 partition limit) but perfect for clustering.

The granularity decision depends on volume and retention. Hourly partitioning makes sense when (see the sketch after this list):

  • Daily data volume exceeds 100GB (each hourly partition is still several GB)
  • Retention is short (under 6 months, keeping you well under the 10,000 partition limit)
  • Queries typically target narrow time windows (a few hours)
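A minimal sketch of the hourly variant: note that hourly granularity requires a timestamp or datetime column rather than a date (measurement_ts is illustrative), and partition_expiration_days enforces the short retention:

{{ config(
    materialized='incremental',
    partition_by={
        'field': 'measurement_ts',
        'data_type': 'timestamp',
        'granularity': 'hour'
    },
    cluster_by=['sensor_id', 'device_type', 'location'],
    partition_expiration_days=180
) }}

With 180-day expiration, the table holds at most 180 × 24 = 4,320 hourly partitions, comfortably under the 10,000 limit.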

For lower-volume IoT systems (under 10GB/day), daily partitioning is the safer default. If daily partitions end up under 1GB, switch to monthly partitioning — the small partitions trap applies here.
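To check whether you're in that trap, BigQuery's INFORMATION_SCHEMA.PARTITIONS view reports per-partition sizes (dataset and table names are illustrative):

SELECT
  partition_id,
  total_rows,
  ROUND(total_logical_bytes / POW(1024, 3), 2) AS logical_gb
FROM iot_data.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'fct_sensor_readings'
ORDER BY partition_id DESC
LIMIT 30;

If most partitions land well under 1GB, coarser granularity will serve you better.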

Choosing the Right Pattern

The domain-specific configs above follow the same underlying logic from the decision framework:

  1. Partition by the time column that appears in 80%+ of queries — usually event_date, report_date, created_date, or measurement_date.
  2. Cluster by the columns that appear most in WHERE clauses, ordered from most to least frequent.
  3. Match clustering order to the natural drill-down path in your domain — campaign hierarchy for marketing, tenant then user for SaaS, sensor then device type for IoT.
  4. Enable require_partition_filter on any table that could cause expensive accidents if scanned fully.

Verify actual query patterns in INFORMATION_SCHEMA.JOBS before applying these patterns. The appropriate configuration depends on how data is actually queried, not assumptions.
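A starting point for that verification, assuming the regional JOBS view (adjust the region qualifier, lookback window, and table name to your project):

SELECT
  query,
  COUNT(*) AS runs,
  ROUND(SUM(total_bytes_processed) / POW(1024, 4), 2) AS tib_processed
FROM `region-us`.INFORMATION_SCHEMA.JOBS,
  UNNEST(referenced_tables) AS ref
WHERE job_type = 'QUERY'
  AND ref.table_id = 'fct_events'
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY query
ORDER BY tib_processed DESC
LIMIT 20;

The WHERE clauses of the most expensive queries tell you which columns deserve partitioning and clustering.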

For the incremental strategy configurations that complement these partitioning patterns (insert_overwrite, merge with predicates, microbatch), see dbt Incremental Strategy Configuration Patterns.