Note

Microbatch Automatic Upstream Filtering

How dbt's microbatch strategy automatically filters upstream models by event_time, reducing full table scans — and when to opt out with .render().

Tags: dbt, incremental processing, cost optimization

One of microbatch’s most useful features doesn’t get enough attention: automatic upstream filtering. When a microbatch model references another model via ref(), dbt checks whether that upstream model also has event_time configured. If it does, dbt automatically filters the upstream data to match the current batch’s time window. No extra SQL, no manual WHERE clauses.

This matters for cost and performance. Without automatic filtering, each batch would read the entire upstream table and then discard everything outside the batch window. On a large event table, that’s a full table scan per batch — multiplied by however many batches you’re processing.

How It Works

Say you have a microbatch model that aggregates page views into sessions:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session__started_at',
    batch_size='day',
    lookback=3,
    begin='2023-01-01'
) }}

SELECT
    session_id,
    user_id,
    MIN(viewed_at) AS session__started_at,
    MAX(viewed_at) AS session__ended_at,
    COUNT(*) AS session__page_views
FROM {{ ref('base__app__page_views') }}
GROUP BY session_id, user_id

If base__app__page_views has event_time='viewed_at' configured in its own model definition, dbt injects a time filter on viewed_at for each batch. When processing the January 10th batch, dbt filters base__app__page_views to only return rows where viewed_at falls within the January 10th window.

The generated SQL effectively becomes:

-- What dbt generates under the hood (per batch)
SELECT ...
FROM base__app__page_views
WHERE viewed_at >= '2024-01-10 00:00:00'
  AND viewed_at < '2024-01-11 00:00:00'

You never write this filter. dbt handles it based on the event_time configuration of the upstream model and the batch_size of the consuming model.
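For contrast, here is roughly what you would otherwise write by hand in a classic incremental model — a sketch, with the batch window hardcoded purely for illustration (in practice you would derive it from a variable or a MAX() lookup):

-- Manual equivalent in a non-microbatch incremental model (sketch)
SELECT ...
FROM {{ ref('base__app__page_views') }}
{% if is_incremental() %}
WHERE viewed_at >= TIMESTAMP '2024-01-10 00:00:00'
  AND viewed_at < TIMESTAMP '2024-01-11 00:00:00'
{% endif %}

Microbatch generates and advances these windows for you, one batch at a time, and retries failed batches independently.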

Dimension Tables and Full Reads

Not every upstream model has event_time. Dimension tables — products, users, configuration lookups — typically don’t have a meaningful timestamp for time-based filtering. When a microbatch model references a model without event_time, dbt reads the full table for each batch.

-- This model has no event_time configured
-- dbt reads all rows for every batch
SELECT
    product_id,
    product_name,
    product_category
FROM {{ ref('dim__products') }}

For small dimension tables (thousands to low millions of rows), this is fine. The full table read is fast and the data is needed in its entirety for lookups and joins. But if you’re referencing a large model without event_time — say, a denormalized fact table you’re using for enrichment — every batch triggers a full scan of that table. On BigQuery with on-demand pricing, those scans add up quickly across 30+ batches.

If a large upstream model genuinely doesn’t have a time dimension, consider whether it should be the source for a microbatch model at all. The cost profile of microbatch assumes that most upstream data can be time-filtered.

Opting Out With .render()

Sometimes you want to force a full table read even when the upstream model has event_time configured. The .render() method on a ref() call disables automatic filtering:

-- Force full table read, bypassing automatic event_time filtering
SELECT
    product_id,
    product_name,
    product_category
FROM {{ ref('mrt__product__products').render() }}

When would you need this? The most common scenario is when your microbatch model needs to join against a model that has event_time but where you need the complete dataset, not just the current batch’s window. For example, a slowly-changing dimension table that happens to have an updated_at column configured as event_time — you don’t want dbt to filter it to only today’s updates when you need the full current state of every row.

Another case: enrichment models where the event_time represents when the enrichment happened, not when the underlying entity was relevant. Filtering to the current batch window would exclude records that were enriched earlier but are still valid for joining.
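A sketch of that SCD case: the event side should stay batch-filtered, while the dimension join needs every current row, so only its ref() is wrapped in .render(). The dim__users model and its columns are illustrative, assuming it has event_time='updated_at' configured.

SELECT
    pv.session_id,
    pv.viewed_at,
    u.user_segment
FROM {{ ref('base__app__page_views') }} AS pv
-- Full read: without .render(), dbt would filter this dimension
-- to rows whose updated_at falls inside the current batch window
LEFT JOIN {{ ref('dim__users').render() }} AS u
    ON pv.user_id = u.user_id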

Interaction With lookback

The automatic upstream filtering respects the lookback configuration. With lookback=3 and batch_size='day', when processing today’s batch, dbt also reprocesses the previous three days. The upstream filter expands accordingly — base__app__page_views gets filtered to the four-day window (today plus three lookback days), not just today.

This means your lookback window applies consistently to both the model’s output and its inputs. Late-arriving upstream records within the lookback window are automatically included in reprocessing.
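Concretely, with lookback=3 and batch_size='day', a run on January 10th executes four separate batches, each carrying its own one-day upstream filter (dates illustrative):

-- Batch 2024-01-07: WHERE viewed_at >= '2024-01-07' AND viewed_at < '2024-01-08'
-- Batch 2024-01-08: WHERE viewed_at >= '2024-01-08' AND viewed_at < '2024-01-09'
-- Batch 2024-01-09: WHERE viewed_at >= '2024-01-09' AND viewed_at < '2024-01-10'
-- Batch 2024-01-10: WHERE viewed_at >= '2024-01-10' AND viewed_at < '2024-01-11'

Together the four filters cover the four-day window, and no batch ever reads outside it.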

Cost Implications Across Warehouses

The cost impact of automatic upstream filtering varies by warehouse:

BigQuery: The most significant impact. BigQuery’s on-demand pricing charges by bytes scanned. Automatic filtering that leverages partition pruning on a time-partitioned upstream table can reduce scanned bytes by orders of magnitude. If your upstream table is partitioned by the same column used as event_time, each batch scans only the relevant partitions.

Snowflake: Snowflake’s micro-partition pruning benefits from time-based filters, but the cost model is compute-time-based (warehouse credits) rather than scan-based. The performance improvement is real (less data processed per batch), but the cost savings are less dramatic than on BigQuery since you’re paying for compute time, not bytes scanned.

Databricks: Similar to Snowflake. Delta Lake’s data skipping and Z-ordering can amplify the benefit of time-based filters, especially if the upstream table is optimized for the event_time column.
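To make the BigQuery case concrete: the injected filter only prunes partitions if the upstream table is actually partitioned on its event_time column. A sketch of an upstream model configured that way with dbt-bigquery’s partition_by — column names follow this article’s example, the rest is standard config:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='viewed_at',
    batch_size='day',
    begin='2023-01-01',
    partition_by={
        'field': 'viewed_at',
        'data_type': 'timestamp',
        'granularity': 'day'
    }
) }}

SELECT ...

With this alignment, each batch’s WHERE clause on viewed_at maps directly onto daily partitions, so BigQuery scans one partition per batch instead of the whole table.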

Designing for Automatic Filtering

To get the most out of this feature, configure event_time on all time-series models in your DAG, not just the microbatch models themselves. When your base layer models, intermediate aggregations, and staging models all declare their timestamp column as event_time, the automatic filtering cascades through the entire dependency chain.

# In your base model's YAML
models:
  - name: base__app__page_views
    config:
      event_time: viewed_at
  - name: base__app__events
    config:
      event_time: event_occurred_at

This is a one-time setup that enables automatic filtering for every microbatch model downstream. Without event_time configured on upstream models, each batch reads the full upstream table regardless of batch size.