dbt Microbatch Strategy Tradeoffs

The practical limitations and design tradeoffs of dbt's microbatch incremental strategy — UTC assumptions, no sub-hourly batches, and sequential execution.

Microbatch (dbt 1.9+) is an incremental strategy with built-in backfill, batch-level retry, and declarative time-based configuration instead of is_incremental() boilerplate. It has three limitations that determine whether it fits a given model.

How Microbatch Works

Instead of writing conditional is_incremental() logic, you configure time-based batching declaratively:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    lookback=3
) }}

-- No is_incremental() block needed
SELECT
    session_id,
    session_start,
    user_id,
    session_duration
FROM {{ ref('base__analytics__sessions') }}

dbt generates separate queries for each batch period (hour, day, or month). On a normal incremental run, it processes only the latest unprocessed batches plus the lookback window. For backfills, you specify explicit time ranges:

# Targeted backfill
dbt run --event-time-start "2024-09-01" --event-time-end "2024-09-04"
# Retry only failed batches from the last run
dbt retry

The batch-level retry is especially valuable: if a 30-day backfill fails on day 17, you retry only day 17 instead of all 30 days.
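
Under the hood, each batch is the same model query with a time-range filter injected on the event_time column of upstream inputs that declare one. Conceptually, the rendered SQL for a single daily batch looks something like this sketch (illustrative, not dbt's literal output, and assuming the upstream model also declares event_time):

SELECT
    session_id,
    session_start,
    user_id,
    session_duration
FROM (
    -- filter injected by microbatch for the 2024-09-02 batch
    SELECT * FROM {{ ref('base__analytics__sessions') }}
    WHERE session_start >= '2024-09-02'
      AND session_start < '2024-09-03'
) AS sessions_batch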

Tradeoff 1: UTC Timezone Assumption

Microbatch uses UTC for all time calculations. Batch boundaries always fall at midnight UTC (for daily batches) or the top of the hour (for hourly batches). There’s no configuration to change this.

If your event_time column is stored in a local timezone — say US/Eastern — records near midnight local time will land in unexpected batches. An event at 11:30 PM Eastern on March 15 (EDT, UTC-4) is 3:30 AM UTC on March 16, so it lands in the March 16 batch, not March 15.

For most analytical workloads this doesn’t matter. Aggregations that roll up to monthly or weekly granularity absorb the boundary offset. But if your business logic depends on records landing in the correct local-date batch — say, daily revenue reporting where you close the books at midnight local time — the UTC assumption creates off-by-one-day errors.

The fix: Convert your event_time column to UTC before using microbatch. If your source data is in a local timezone:

-- CONVERT_TIMEZONE is Snowflake syntax; BigQuery and Databricks
-- have equivalent timestamp conversion functions
SELECT
    session_id,
    CONVERT_TIMEZONE('US/Eastern', 'UTC', session_start) AS session_start_utc,
    user_id,
    session_duration
FROM {{ ref('base__analytics__sessions') }}

Then configure event_time='session_start_utc'. This ensures batch boundaries align with the actual UTC timestamps, and downstream reporting can convert back to local time as needed.
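
A minimal sketch of the combined config, reusing the parameters from the example above:

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start_utc',
    begin='2020-01-01',
    batch_size='day',
    lookback=3
) }}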

If converting isn’t feasible, accept that batch boundaries won’t align with your business day and design downstream logic accordingly.

Tradeoff 2: No Sub-Hourly Granularity

The smallest batch_size is hour. If you need to process data in 15-minute or 5-minute windows — common for real-time dashboards, alerting systems, or near-real-time reporting — microbatch won’t work.

This limitation is architectural. Microbatch generates one SQL query per batch, and each query has overhead: compilation, warehouse scheduling, query execution, result merging. At sub-hourly granularity, that per-query overhead dominates total runtime. A 24-hour backfill at 5-minute granularity would be 288 separate queries.

When this matters: Systems that need data freshness under 1 hour. Streaming-adjacent workloads. Models that feed operational dashboards with latency SLAs.

The alternative: Use traditional incremental logic with a manual time filter. You lose microbatch’s built-in backfill and retry capabilities, but you can filter to whatever granularity your use case requires:

{% if is_incremental() %}
-- reprocess a rolling 15-minute window past the current high-water mark
-- (DATEADD syntax shown is Snowflake's)
WHERE event_timestamp >= (
    SELECT DATEADD(MINUTE, -15, MAX(event_timestamp))
    FROM {{ this }}
)
{% endif %}
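
In context, a complete model using this pattern might look like the following sketch. The merge strategy, unique_key, source model, and column names are illustrative assumptions, not part of the original example:

-- unique_key, columns, and source model below are hypothetical
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id'
) }}

SELECT
    event_id,
    event_timestamp,
    user_id
FROM {{ ref('base__analytics__events') }}
{% if is_incremental() %}
WHERE event_timestamp >= (
    SELECT DATEADD(MINUTE, -15, MAX(event_timestamp))
    FROM {{ this }}
)
{% endif %}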

Tradeoff 3: Sequential Execution

Each batch runs one after another, not in parallel. A 30-day backfill with batch_size='day' means 30 separate query executions, sequentially.

A single query processing all 30 days at once is almost always faster than 30 separate queries. Each query carries fixed overhead: Jinja compilation, query planning, warehouse scheduling. Multiply that by 30 and the overhead accumulates. In practice, a 30-day traditional incremental run might take 5 minutes, while the same range run as microbatch takes 15-20 minutes.

The tradeoff is explicit: microbatch trades speed for failure isolation. If day 17 of a 30-day traditional incremental run fails, you reprocess all 30 days. With microbatch, you retry only day 17. For workloads where failure is common (flaky source systems, timeout-prone queries) or reprocessing is expensive (very large daily volumes), the retry benefit can outweigh the speed cost.
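
If backfill wall-clock time matters, one partial mitigation is to split a long backfill into explicit ranges with the documented --event-time flags and launch them as separate invocations that an orchestrator can run concurrently. This is a sketch, not an established pattern: whether parallel invocations against the same model are safe depends on how well the underlying strategy isolates partitions on your warehouse, and the model name here is a placeholder.

# Hypothetical: month-sized ranges launched as separate jobs
dbt run --select stg_sessions --event-time-start "2024-01-01" --event-time-end "2024-02-01"
dbt run --select stg_sessions --event-time-start "2024-02-01" --event-time-end "2024-03-01"
dbt run --select stg_sessions --event-time-start "2024-03-01" --event-time-end "2024-04-01"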

When sequential execution hurts most:

  • Large historical backfills (months or years of data)
  • Models with low per-batch volume where per-query overhead dominates
  • Time-sensitive backfill scenarios where you need data ready quickly

When it’s acceptable:

  • Daily or hourly incremental runs processing only the latest batch
  • Workloads with occasional failures where batch-level retry saves significant reprocessing
  • Teams that value operational simplicity over raw speed

How Microbatch Delegates to Warehouse Strategies

Under the hood, microbatch doesn’t implement its own merge logic. It delegates to existing strategies per warehouse:

  • BigQuery: Uses insert_overwrite (requires partition_by matching your batch_size)
  • Snowflake: Uses delete+insert
  • Databricks: Uses replace_where

This means microbatch inherits the properties of the underlying strategy. On BigQuery, it’s atomic and partition-based. On Snowflake, it has the same atomicity caveats as delete+insert. Knowing the underlying strategy helps you understand failure modes and performance characteristics without reading microbatch-specific documentation.
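
On BigQuery, for example, the model's partition granularity is expected to match the batch size. A minimal sketch reusing the session model from earlier (the partition_by shape follows dbt-bigquery's partitioning config):

{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='session_start',
    begin='2020-01-01',
    batch_size='day',
    partition_by={
        'field': 'session_start',
        'data_type': 'timestamp',
        'granularity': 'day'
    }
) }}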

Comparison With Traditional Incremental

Aspect              | Traditional Incremental      | Microbatch
--------------------|------------------------------|---------------------------------------
Query structure     | Single SQL for all new data  | Separate SQL per batch
Batch definition    | User-defined in SQL          | Configured via event_time, batch_size
Retry granularity   | Entire model                 | Individual batches
Backfill            | Custom logic or full refresh | Built-in --event-time-start/end
Time filtering      | is_incremental() block       | Automatic
Minimum granularity | Whatever you write in SQL    | 1 hour
Execution speed     | Single query (faster)        | Sequential batches (slower)

When Microbatch Is the Right Choice

Microbatch fits best when:

  1. Your data has a clear timestamp column and processes in natural time windows
  2. You want built-in backfill without writing custom scripts or one-off SQL
  3. Batch-level retry would save meaningful reprocessing time when failures occur
  4. You prefer declarative configuration over writing is_incremental() logic
  5. Hourly or daily granularity is sufficient for your use case

It doesn’t fit when you need sub-hourly processing, when backfill speed is critical, or when your data doesn’t have a clean timestamp column to partition on.

For a broader look at how microbatch compares to other strategies, see the strategy decision framework. For implementation details and backfill commands, see the microbatch strategy guide.