
Migrating Incremental Models to Microbatch

How to convert traditional dbt incremental models to the microbatch strategy — step-by-step migration, side-by-side code examples, and first-run considerations.

Tags: dbt, incremental processing, data engineering

Converting an existing incremental model to microbatch is usually straightforward. The mechanical changes are simple: remove the is_incremental() block, add event_time and batch_size configuration, and let dbt handle the filtering. The harder part is knowing when the conversion makes sense and handling the first run after migration.

When Migration Makes Sense

Not every incremental model benefits from microbatch. Before converting, check that your model meets these criteria:

  1. Clear timestamp column: Your model needs a column that cleanly defines when each record belongs in time. This becomes your event_time. Models that update records across arbitrary time ranges — like SCD Type 2 or complex CDC patterns — don’t fit the batch model.

  2. Hourly or coarser granularity: Microbatch’s minimum batch_size is hour. If your current model processes sub-hourly windows (15-minute or 5-minute intervals), microbatch won’t work. See the tradeoffs note for details.

  3. Time-bounded processing: Your current is_incremental() logic should already be filtering by a time column. If your incremental filter is based on a non-time column (like a monotonically increasing ID or a status flag), microbatch isn’t the right strategy.

  4. Backfill or retry value: Microbatch adds overhead (sequential batch execution) in exchange for batch-level retry and built-in backfill. If your model never fails and you never need to reprocess specific date ranges, that trade isn’t worth it.
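
To make that trade concrete, here's what batch-level retry and backfill look like once a model is on microbatch (a sketch; the model name is a placeholder):

# Backfill a specific window without touching the rest of the table
dbt run --select your_model --event-time-start "2024-03-01" --event-time-end "2024-03-08"

# Re-run only the batches that failed in the previous invocation
dbt retry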

Side-by-Side: Before and After

Here’s a typical traditional incremental model and its microbatch equivalent.

Traditional Incremental (delete+insert)

{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='date_day'
) }}

SELECT
    date_day,  -- must land in the target: unique_key and the filter below depend on it
    event_id,
    user_id,
    event_occurred_at,
    event_name,
    event_properties
FROM {{ ref('base__app__events') }}
{% if is_incremental() %}
WHERE date_day >= (
    SELECT {{ dbt.dateadd("day", -3, "max(date_day)") }}
    FROM {{ this }}
)
{% endif %}

Microbatch Equivalent

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_occurred_at',
    batch_size='day',
    lookback=3,
    begin='2020-01-01'
) }}

SELECT
    date_day,  -- no longer needed for filtering; kept for column parity with the original
    event_id,
    user_id,
    event_occurred_at,
    event_name,
    event_properties
FROM {{ ref('base__app__events') }}

The differences:

  • is_incremental() block removed: dbt handles time filtering automatically based on event_time and batch_size.
  • event_time added: Points to the timestamp column that defines batch boundaries. This replaces the manual WHERE date_day >= ... logic.
  • lookback replaces manual window: The lookback=3 parameter tells dbt to reprocess the previous 3 batches alongside the current one, mirroring the dateadd("day", -3, ...) pattern in the original.
  • begin date added: Required for microbatch. Defines the earliest date dbt will process when doing a full build.
  • unique_key removed: Not required for most warehouses. On BigQuery, microbatch uses insert_overwrite under the hood (partition replacement, not row matching). On Snowflake, it uses delete+insert. Only PostgreSQL’s microbatch requires unique_key because it delegates to merge.

Step-by-Step Migration

1. Map Your Existing Logic

Before changing any code, identify:

  • Which column is your time filter? This becomes event_time. It should be the column in your WHERE clause that filters records by time.
  • What’s your current lookback window? If you’re subtracting 3 days in your is_incremental() block, set lookback=3.
  • What granularity fits? If you process daily, use batch_size='day'. If your model runs hourly and processes the last hour, use batch_size='hour'.
  • What’s the earliest date in your table? This becomes begin. Check SELECT MIN(event_time_column) FROM your_table.
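
For example, to pick a begin value for the events model used throughout this note (the fully qualified table name is an assumption):

-- Earliest event in the source; round down to your batch granularity for `begin`
SELECT DATE(MIN(event_occurred_at)) AS earliest_day
FROM analytics.base__app__events;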

2. Add Warehouse-Specific Configuration

Microbatch uses different underlying strategies per warehouse. Some require extra config:

BigQuery requires partition_by matching your event_time:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    partition_by={
        "field": "event_occurred_at",
        "data_type": "timestamp",
        "granularity": "day"
    },
    event_time='event_occurred_at',
    batch_size='day',
    begin='2020-01-01'
) }}

The partition_by granularity should match your batch_size. Mismatches won’t break the model but reduce efficiency — a monthly partition with daily batches means multiple batches write to the same partition, losing the atomic partition-replacement benefit.

Snowflake and Databricks need no extra configuration. The underlying strategies (delete+insert and replace_where respectively) work without additional settings.

PostgreSQL requires unique_key because microbatch delegates to merge:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    unique_key='event_id',
    event_time='event_occurred_at',
    batch_size='day',
    begin='2020-01-01'
) }}

3. Configure Upstream event_time

If your upstream models (the ones referenced via ref()) have a timestamp column, configure event_time on them too. This enables automatic upstream filtering — dbt will filter source tables to match the current batch window, preventing full table scans.

schema.yml

models:
  - name: base__app__events
    config:
      event_time: event_occurred_at
This isn’t required for migration, but it’s a meaningful performance optimization. Without it, each batch reads the full upstream table.
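
To see what that saves, here is roughly what dbt compiles {{ ref('base__app__events') }} into for a single daily batch once the upstream event_time is set (illustrative, not exact compiled output; the dataset name is assumed):

(
    SELECT *
    FROM analytics.base__app__events
    WHERE event_occurred_at >= '2025-01-15'
      AND event_occurred_at < '2025-01-16'
)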

4. Remove Redundant Logic

Delete the is_incremental() block entirely. Any manual time filtering, lookback calculations, or conditional logic related to incremental processing is now handled by the microbatch configuration.

Also remove:

  • Manual dateadd / DATE_SUB lookback calculations
  • References to {{ this }} for determining the max processed date
  • Any unique_key configuration (unless on PostgreSQL)
  • Custom backfill variable overrides (microbatch has built-in backfill)

5. Validate With Unit Tests

Before deploying, validate that the microbatch version produces the same results as the traditional version. Run both side by side and compare outputs using unit tests or manual comparison queries:

-- Compare row counts by day
SELECT
    DATE(event_occurred_at) AS event_date,
    COUNT(*) AS row_count
FROM traditional_model
GROUP BY 1
EXCEPT DISTINCT
SELECT
    DATE(event_occurred_at) AS event_date,
    COUNT(*) AS row_count
FROM microbatch_model
GROUP BY 1;
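
Note that EXCEPT only returns rows present in the first query and missing from the second, so run the comparison in the reverse direction as well (EXCEPT DISTINCT is BigQuery's spelling, matching the query above; most other warehouses use plain EXCEPT):

-- Rows in the microbatch model that the traditional model lacks
SELECT
    DATE(event_occurred_at) AS event_date,
    COUNT(*) AS row_count
FROM microbatch_model
GROUP BY 1
EXCEPT DISTINCT
SELECT
    DATE(event_occurred_at) AS event_date,
    COUNT(*) AS row_count
FROM traditional_model
GROUP BY 1;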

If the results diverge, the most common causes are:

  • Timezone differences: Microbatch uses UTC for batch boundaries. If your original model used local time, records near midnight will land in different batches.
  • Lookback mismatch: Make sure lookback matches your original window size exactly.
  • Deduplication changes: If your original model relied on unique_key for deduplication during merge, and the new microbatch delegates to insert_overwrite (BigQuery), you may need explicit deduplication in your SELECT.
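
For the third case, one option is explicit window-based deduplication in the model's SELECT (a sketch using QUALIFY, which BigQuery and Snowflake support; keeping the latest row per event_id is an assumption about your data):

SELECT
    event_id,
    user_id,
    event_occurred_at,
    event_name,
    event_properties
FROM {{ ref('base__app__events') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY event_occurred_at DESC
) = 1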

Handling the First Run

When you first run after converting to microbatch, you have two options:

Option A: Full Refresh

The simplest approach. Run dbt run --full-refresh --select your_model to rebuild the table from scratch using microbatch logic. dbt processes every batch from begin to now, sequentially.

For small-to-medium tables, this is fine. For very large tables, it’s slow — a table with 5 years of daily data means 1,825 sequential batch queries. Consider option B.

Option B: Bounded Rebuild in Chunks

Use --event-time-start and --event-time-end to rebuild in manageable chunks:

# Rebuild 2024 data
dbt run --full-refresh --select int__sessions_aggregated \
  --event-time-start "2024-01-01" --event-time-end "2024-07-01"

dbt run --full-refresh --select int__sessions_aggregated \
  --event-time-start "2024-07-01" --event-time-end "2025-01-01"

This is especially useful on BigQuery where each batch’s compute cost is proportional to the data scanned. Processing in smaller chunks keeps individual query costs manageable and lets you retry specific chunks if something fails.

Option C: Keep Existing Data

If your existing table structure is compatible (same columns, same partitioning), you may not need a full refresh at all. The next incremental run will use microbatch logic to process the current batch plus your lookback window. One caveat: dbt picks those batches from the current timestamp and your lookback setting, not from what's already in the table, so the existing data must be reasonably up to date.

This works best when:

  • The table is already partitioned by the same column you’re using as event_time
  • You haven’t changed any transformation logic, only the incremental strategy
  • You’re comfortable with the existing historical data staying as-is
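
A quick sanity check before trusting this option (table name assumed):

-- If this lags the current time by more than your lookback window,
-- close the gap with a bounded rebuild (option B) first
SELECT MAX(event_occurred_at) AS last_processed
FROM analytics.your_model;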

Common Migration Pitfalls

Forgetting begin: Without begin, dbt doesn’t know how far back to process on a full refresh. The model will error.

UTC boundary surprises: Your original model may have used local-time batch boundaries (midnight Eastern, midnight CET). Microbatch uses UTC exclusively. Records near the end of the local business day will land in different batches than before. If this matters for your reporting, convert timestamps to UTC before setting them as event_time.
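
One way to avoid the surprise is deriving a UTC column before it becomes your event_time (Snowflake syntax shown; the local-time column event_occurred_at_local is hypothetical):

SELECT
    event_id,
    user_id,
    CONVERT_TIMEZONE('America/New_York', 'UTC', event_occurred_at_local) AS event_occurred_at,
    event_name,
    event_properties
FROM {{ ref('base__app__events') }}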

PostgreSQL unique_key requirement: Every other warehouse works without unique_key for microbatch, but PostgreSQL needs it because microbatch delegates to merge. Missing it causes errors, not silent failures.

Over-aggressive full_refresh=false: If you set full_refresh=false immediately after migration, you can’t easily do a full rebuild if something goes wrong. Wait until you’ve validated the microbatch behavior before adding this protection.
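
When you do add the guard, it's a single flag on the existing config block (a sketch reusing the earlier example's settings):

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_occurred_at',
    batch_size='day',
    lookback=3,
    begin='2020-01-01',
    full_refresh=false
) }}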