
Microbatch Backfill and Full Refresh Protection

How to use dbt's built-in microbatch backfill commands, retry failed batches, and protect large incremental tables from accidental full refreshes.

Tags: dbt, incremental processing, data engineering

With traditional incremental models, reprocessing a specific date range requires custom scripts, dbt variable overrides, or targeted full refreshes with manual partition management. Microbatch provides built-in CLI flags for targeted backfill, batch-level retry on failure, and full-refresh protection.
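
For orientation, here is a minimal sketch of what such a model might look like. The model name, column names, and the upstream stg__events model (assumed to declare its own event_time) are placeholders chosen to line up with the examples later in this note:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session__started_at',
    batch_size='day',
    begin='2020-01-01'
) }}

-- dbt filters refs to upstream models that declare an event_time,
-- so each batch reads only the matching slice of stg__events;
-- no manual is_incremental() date filtering is needed
SELECT
  session_id,
  MIN(event_occurred_at) AS session__started_at,
  COUNT(*) AS event_count
FROM {{ ref('stg__events') }}
GROUP BY session_id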

Targeted Backfill With CLI Flags

The --event-time-start and --event-time-end flags let you specify exact date ranges to reprocess:

Terminal window
# Reprocess September 1-3, 2024
dbt run --select int__sessions_aggregated \
  --event-time-start "2024-09-01" \
  --event-time-end "2024-09-04"

dbt generates separate queries for each batch within that range. With batch_size='day', this runs three batch queries (September 1, 2, and 3). Each batch replaces the corresponding period in the target table using the underlying warehouse strategy: insert_overwrite on BigQuery, delete+insert on Snowflake, replace_where on Databricks.

The end date is exclusive, matching standard interval conventions. --event-time-end "2024-09-04" processes up to but not including September 4th.
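
For example, to reprocess a single day, set the end date to the following day:

Terminal window
# Exclusive end date: this reprocesses only September 3
dbt run --select int__sessions_aggregated \
  --event-time-start "2024-09-03" \
  --event-time-end "2024-09-04"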

These flags work with model selection, so you can backfill a single model or a group:

Terminal window
# Backfill all models with the microbatch tag
dbt run --select tag:microbatch --event-time-start "2024-09-01" --event-time-end "2024-10-01"
# Backfill a model and all its downstream dependencies
dbt run --select int__sessions_aggregated+ --event-time-start "2024-09-01" --event-time-end "2024-09-04"

This replaces the custom backfill patterns that teams typically build with traditional incremental models — variable overrides, shell scripts that loop through dates, or one-off SQL that manually deletes and reinserts partitions.
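
For contrast, here is a sketch of the kind of variable-override pattern this replaces. The backfill_start / backfill_end vars, the default dates, and the model body are hypothetical but representative of what teams wire up by hand on top of a traditional incremental model:

{{ config(materialized='incremental', unique_key='session_id') }}

SELECT
  session_id,
  MIN(event_occurred_at) AS session__started_at,
  COUNT(*) AS event_count
FROM {{ ref('stg__events') }}
{% if is_incremental() %}
-- Backfill window supplied by hand, e.g.
-- dbt run --vars '{"backfill_start": "2024-09-01", "backfill_end": "2024-09-04"}'
WHERE event_occurred_at >= '{{ var("backfill_start", "2024-09-01") }}'
  AND event_occurred_at < '{{ var("backfill_end", "2024-09-04") }}'
{% endif %}
GROUP BY session_id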

Batch-Level Retry

When a microbatch run fails partway through, dbt retry picks up from the failed batch instead of starting over:

Terminal window
# Initial run processes 30 days, fails on day 17
dbt run --select int__sessions_aggregated \
  --event-time-start "2024-09-01" \
  --event-time-end "2024-10-01"
# Retry only the failed batch (day 17) and continue from there
dbt retry

With traditional incremental, a failure in a 30-day backfill means rerunning all 30 days. The first 16 days of work are wasted. With microbatch, days 1-16 are already committed to the target table, and retry starts from day 17.

This is especially valuable for:

  • Large historical backfills where reprocessing from scratch costs real money (BigQuery bytes scanned, Snowflake compute credits)
  • Flaky source systems where intermittent failures are common and retrying a single batch is much cheaper than retrying everything
  • Timeout-prone queries where individual batches might fail due to resource limits but most batches complete fine

The retry mechanism works because each batch is a self-contained operation. dbt tracks which batches succeeded and which failed in the run artifacts. The dbt retry command reads those artifacts and re-executes only the failures.
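
One way to see what a retry will pick up is to inspect the artifacts directly. This assumes the standard target/run_results.json layout, where each entry in results carries a unique_id and a status; batch-level detail, when present, sits inside the individual result entries:

Terminal window
# List the nodes that errored in the last run
jq -r '.results[] | select(.status == "error") | .unique_id' target/run_results.json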

Bounded Full Refresh

Traditional --full-refresh rebuilds a table from scratch — from the begin date to now, processing everything. For a table with years of history, that’s expensive and slow. Microbatch lets you scope a full refresh to a specific time range:

Terminal window
# Full refresh only January 2024
dbt run --full-refresh \
  --select int__sessions_aggregated \
  --event-time-start "2024-01-01" \
  --event-time-end "2024-02-01"

This rebuilds only the specified range while leaving the rest of the table intact. It’s the difference between “rebuild everything from 2020” and “rebuild just the month that had bad data.”

This is particularly useful when:

  • A source system corrected historical data for a specific period
  • You changed transformation logic and need to reprocess a bounded window
  • A schema change requires reprocessing but only affects data after a certain date

Protecting Against Accidental Full Refreshes

Large incremental tables can be extremely expensive to rebuild. A careless dbt run --full-refresh on a table with 3 years of event data can take hours and cost hundreds of dollars in compute. Microbatch provides a safety net:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_occurred_at',
    batch_size='day',
    begin='2020-01-01',
    full_refresh=false
) }}

With full_refresh=false, running dbt run --full-refresh on this model fails with an error rather than silently rebuilding the entire table. This is a guardrail, not a permanent lock — you can still do bounded refreshes using --event-time-start and --event-time-end.
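
In practice the guardrail looks something like this. The exact error wording depends on your dbt version, and the second command works because a targeted microbatch run replaces the selected batches anyway, so no --full-refresh flag is needed for a bounded rebuild:

Terminal window
# Blocked: full_refresh=false makes this error out instead of rebuilding the whole table
dbt run --full-refresh --select int__sessions_aggregated
# Still allowed: reprocess a bounded window without the --full-refresh flag
dbt run --select int__sessions_aggregated \
  --event-time-start "2024-01-01" \
  --event-time-end "2024-02-01"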

This protection matters most for:

  • Production tables where an accidental full refresh would cause hours of downtime while the table rebuilds
  • Large-volume event tables where the rebuild cost is significant (think terabytes of BigQuery scans at $6.25/TB)
  • Tables with no natural rebuild window — some tables are too large to rebuild even overnight

The full_refresh=false setting applies at the model level. You can set it globally in dbt_project.yml for all microbatch models and override it per-model where full refreshes are acceptable:

dbt_project.yml
models:
  my_project:
    marts:
      +full_refresh: false # Protect all mart models
    staging:
      +full_refresh: true # Staging can be rebuilt freely
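
An individual model that is safe to rebuild can then opt back in from its own config block, which takes precedence over the project-level setting (the model shown is hypothetical):

-- models/marts/a_small_mart.sql: opts back in even though marts/ defaults to full_refresh: false
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_occurred_at',
    batch_size='day',
    begin='2024-01-01',
    full_refresh=true
) }}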

Operational Patterns

Scheduled Backfill After Source Outages

When a source system has a known outage, you can schedule a targeted backfill for the affected window once the source recovers:

Terminal window
# Source was down March 15-17, data backfilled on March 20
dbt run --select tag:source_dependent \
  --event-time-start "2024-03-15" \
  --event-time-end "2024-03-18"

Chunked Historical Rebuild

For very large tables where processing the entire history at once would exceed resource limits or be unacceptably slow, process in monthly chunks:

Terminal window
# Rebuild 2024 one month at a time
for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
  # Exclusive end date: first day of the following month (rolls into 2025 after December)
  end=$(printf "%04d-%02d-01" $((2024 + 10#$month / 12)) $((10#$month % 12 + 1)))
  dbt run --select int__sessions_aggregated \
    --event-time-start "2024-${month}-01" \
    --event-time-end "${end}" || break
done

Each chunk processes independently. If month 7 fails, months 1-6 are already committed, and you can retry just month 7.

Validating Backfill Results

After a backfill, compare the reprocessed batches against expected counts or aggregates:

-- Check that backfilled days have expected row counts
SELECT
  DATE(session__started_at) AS batch_date,
  COUNT(*) AS row_count
FROM int__sessions_aggregated
WHERE session__started_at >= '2024-09-01'
  AND session__started_at < '2024-09-04'
GROUP BY 1
ORDER BY 1;

This is basic, but it catches the most common backfill failures: empty batches (source data not available), duplicate batches (lookback overlap not handled correctly), and count discrepancies versus the source.
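
To check counts against the source as well, a side-by-side comparison works. Here stg__events stands in for whatever upstream model feeds the backfilled table; if the model aggregates, the two counts won't match one-to-one, but empty or missing batches still stand out:

-- Compare backfilled days against the upstream source
WITH tgt AS (
  SELECT DATE(session__started_at) AS batch_date, COUNT(*) AS target_rows
  FROM int__sessions_aggregated
  WHERE session__started_at >= '2024-09-01' AND session__started_at < '2024-09-04'
  GROUP BY 1
),
src AS (
  SELECT DATE(event_occurred_at) AS batch_date, COUNT(*) AS source_rows
  FROM stg__events
  WHERE event_occurred_at >= '2024-09-01' AND event_occurred_at < '2024-09-04'
  GROUP BY 1
)
SELECT
  COALESCE(tgt.batch_date, src.batch_date) AS batch_date,
  tgt.target_rows,
  src.source_rows
FROM tgt
FULL OUTER JOIN src ON tgt.batch_date = src.batch_date
ORDER BY 1;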

Comparison With Traditional Backfill Approaches

| Aspect | Traditional Incremental | Microbatch |
| --- | --- | --- |
| Backfill mechanism | Variable overrides, custom scripts | Built-in --event-time-start/end flags |
| Failure recovery | Reprocess entire range | Retry only failed batches |
| Full refresh scope | Entire table | Bounded to specific date range |
| Protection | None built-in | full_refresh=false config |
| Granularity | Whatever your script handles | Per-batch (hour/day/month) |

With microbatch, backfill capability is part of the model configuration rather than a separate set of scripts. The standard dbt CLI handles targeted backfill without custom procedures.