Incremental models process only the rows your filter selects on subsequent runs, then merge or insert them into the existing table. On first run, the model builds like a regular table. From there, it reads only new or changed data, depending on the strategy and configuration.
The cost and runtime benefit is significant — a model that takes 45 minutes and scans 2TB on full refresh can drop to 2 minutes and 20GB incrementally. The trade-off is complexity: misconfigured incremental models produce silent data quality issues that compound over time.
How it Works
The is_incremental() macro returns True when three conditions are met:
- The model already exists as a table in the warehouse
- The --full-refresh flag isn't passed
- The model is configured with materialized='incremental'
```sql
{{ config(materialized='incremental') }}

SELECT
    event_id,
    event_timestamp,
    user_id
FROM {{ ref('base__events') }}

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
```

The generated SQL varies by your incremental_strategy. Merge strategies create a MERGE statement and compare rows. Insert_overwrite replaces entire partitions atomically. Microbatch runs separate queries for each time period.
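For illustration, with the merge strategy (and unique_key='event_id' configured), the model above compiles to roughly the following. This is a schematic sketch, not dbt's exact output, and the analytics schema names are assumptions:

```sql
MERGE INTO analytics.events AS DBT_INTERNAL_DEST
USING (
    -- the model's SELECT, already filtered by the is_incremental() block
    SELECT event_id, event_timestamp, user_id
    FROM analytics.base__events
    WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM analytics.events)
) AS DBT_INTERNAL_SOURCE
    ON DBT_INTERNAL_DEST.event_id = DBT_INTERNAL_SOURCE.event_id
WHEN MATCHED THEN UPDATE SET
    event_timestamp = DBT_INTERNAL_SOURCE.event_timestamp,
    user_id = DBT_INTERNAL_SOURCE.user_id
WHEN NOT MATCHED THEN INSERT (event_id, event_timestamp, user_id)
    VALUES (
        DBT_INTERNAL_SOURCE.event_id,
        DBT_INTERNAL_SOURCE.event_timestamp,
        DBT_INTERNAL_SOURCE.user_id
    );
```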
Core Configuration
Three settings define your incremental behavior:
materialized='incremental': Enables incremental logic.
unique_key: Defines the grain for deduplication. Without it, most strategies append duplicates. Can be a single column or composite key, but all columns must be non-null to match correctly during merge operations.
incremental_strategy: Determines how new data merges into the existing table. Options vary by warehouse—not all strategies work everywhere.
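Put together, a typical incremental config might look like this (a sketch; the column and strategy choices are assumptions for illustration):

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}
```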
The Strategies
The right strategy depends on your data patterns, table size, and warehouse. My detailed strategy comparison covers implementation patterns and performance benchmarks.
Merge: Compares each incoming row against the destination using unique_key. Matching rows update; non-matching rows insert. Simple and atomic, but scans the entire destination table unless you add incremental_predicates. Above 100M rows, this full table scan becomes expensive. Works on BigQuery, Snowflake, and Databricks.
Delete+Insert: Deletes matching records first, then inserts fresh data. Two separate operations mean there's a brief window where deleted records are gone but new ones haven't landed yet; it isn't atomic like merge. Dramatically faster than merge at scale (3.4x faster at 500M rows on Snowflake). Snowflake and Databricks only.
Insert_Overwrite: Replaces entire partitions rather than comparing individual rows. Atomic and efficient for time-series data. On BigQuery, it intelligently replaces only affected partitions. On Snowflake, it replaces the entire table (not partition-safe). Available on BigQuery, Snowflake (full table only), and Databricks.
Microbatch (dbt 1.9+): Processes data in time-based batches without is_incremental() logic. Instead of one query, dbt runs separate queries for each configured period (hour, day, month). Built-in backfill support and batch-level retries. No conditional SQL needed—dbt handles filtering automatically.
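As a sketch of a microbatch setup (the event_occurred_at column and begin date are assumptions), the strategy is driven entirely by config rather than conditional SQL:

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_occurred_at',
    begin='2024-01-01',
    batch_size='day',
    lookback=3
) }}

-- No is_incremental() block: dbt injects a time filter per batch,
-- running one query per day since `begin` on the first build.
SELECT event_id, event_occurred_at, user_id
FROM {{ ref('base__events') }}
```

Here event_time names the column dbt slices batches on, batch_size sets one query per day, and lookback reprocesses the three most recent batches on every run. Upstream models need their own event_time config for dbt to filter the ref automatically.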
Append: Inserts new rows with no deduplication. Fastest, but only works for truly immutable data. Snowflake and Databricks only.
When Incremental Actually Helps
Start with table materialization. Switch to incremental when:
- Source data has millions or billions of rows
- Transformations are computationally expensive
- dbt runs are becoming visibly slow or costly
Tables under 10M rows rarely benefit from incremental complexity. The overhead often exceeds the savings. Merge performance noticeably degrades above 100M rows, and at 500M+, alternative strategies may be 3-5x faster.
Handling Late-Arriving Data
Late-arriving data—records that arrive after their event time period—is the primary reason incremental models drift from source truth. Without handling, your model slowly diverges from reality.
The standard solution is a lookback window: reprocess a rolling window of recent data on every run.
```sql
{% if is_incremental() %}
WHERE event_date >= (
    SELECT DATE_SUB(MAX(event_date), INTERVAL 3 DAY)
    FROM {{ this }}
)
{% endif %}
```

A 3-day window handles the majority of late arrivals in most systems. Smaller windows cost less but miss delayed data. Larger windows catch more but add meaningful compute cost. Measure your latency profile first: what percentage of your data arrives same-day, within 3 days, within a week? That distribution determines the right window size.
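One way to measure that profile, assuming your source carries a load timestamp (loaded_at here is hypothetical):

```sql
-- Distribution of how many days late rows arrive, relative to event time
SELECT
    DATE_DIFF(DATE(loaded_at), event_date, DAY) AS days_late,
    COUNT(*) AS events,
    ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM {{ ref('base__events') }}
GROUP BY days_late
ORDER BY days_late
```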
On large tables, use incremental_predicates to limit the destination scan during merge operations. This tells the warehouse to only check for matches within recent partitions, not the entire table:
```sql
{{ config(
    incremental_predicates=[
        "DBT_INTERNAL_DEST.created_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
    ]
) }}
```

No lookback window catches everything forever. Data can arrive arbitrarily late, and external systems can have outages. Schedule periodic full refreshes, weekly or monthly, to reset accumulated drift. For very large tables where full refresh is impractical, at least run one manually after known pipeline issues.
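A periodic reset can be as simple as a scheduled job that rebuilds the model from scratch (fct_events is a placeholder model name):

```bash
dbt run --select fct_events --full-refresh
```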
Warehouse-Specific Behaviors
BigQuery merge requires unique_key; without it, the operation fails rather than appending. Insert_overwrite with static partitions is extremely efficient and avoids the partition discovery overhead.
Snowflake's merge hits performance walls around 100M rows; delete+insert becomes viable and often preferred above that scale. Snowflake's insert_overwrite, by contrast, replaces the entire table rather than just affected partitions, a critical gotcha that has caused data loss for many teams. Treat it as a trap and use delete+insert with explicit date predicates instead.
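A minimal delete+insert setup on Snowflake might look like this (a sketch; the key and filter columns are assumptions):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='event_id'
) }}

SELECT event_id, event_timestamp, user_id
FROM {{ ref('base__events') }}

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
```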
Databricks supports merge on Delta Lake with ACID guarantees and schema evolution. Replace_where is a Databricks-exclusive predicate-based alternative that’s often cleaner than insert_overwrite when partition behavior is unpredictable.
Common Pitfalls
Not setting unique_key with merge: Results in append-only behavior with duplicates on every incremental run.
NULLs in unique_key columns: NULL values never compare equal during merge, so matching fails and rows are silently duplicated.
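One common mitigation, assuming the dbt_utils package is installed, is to build a guaranteed non-null surrogate key from the natural key columns:

```sql
SELECT
    -- generate_surrogate_key coalesces NULLs to a sentinel before hashing,
    -- so the result is always non-null and safe to use as unique_key
    {{ dbt_utils.generate_surrogate_key(['user_id', 'event_timestamp']) }} AS event_key,
    user_id,
    event_timestamp
FROM {{ ref('base__events') }}
```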
No lookback window: Late-arriving data is missed; models drift from source.
Using incremental for small tables: Adds complexity without measurable benefit.
Missing schema change handling: Without an on_schema_change configuration, dbt ignores newly added columns by default, and other schema changes can fail the run.
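dbt's on_schema_change config addresses this; append_new_columns adds newly appearing columns to the existing table on the next run (sync_all_columns also removes dropped ones):

```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    on_schema_change='append_new_columns'
) }}
```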
No periodic full refresh: Drift accumulates indefinitely.
Using merge for very large tables without predicates: Full table scans become expensive and negate incremental benefits.
Relying solely on unique_key for deduplication: The initial full refresh doesn't apply merge logic, so duplicates in the source land directly in the table. Include explicit deduplication in your SELECT:

```sql
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY updated_at DESC
) = 1
```

Deciding Between Strategies
Merge is the simplest starting point for tables under 100M rows with row-level updates. For large Snowflake tables, delete+insert often outperforms merge significantly. On BigQuery, insert_overwrite with static partitions is efficient and cost-effective for time-partitioned fact tables. For large historical backfills and situations where batch-level retry would save reprocessing time, microbatch simplifies the logic and provides built-in backfill support.
The microbatch strategy works especially well for time-partitioned event data where you don't want to write is_incremental() logic. Trade-offs: it uses UTC for all batch calculations (convert local timestamps first), its smallest batch_size is an hour (not suitable for sub-hourly streaming), and it runs batches sequentially by default.
Choose the simplest strategy that covers your requirements. Start by measuring how late your data actually arrives, how large your tables will grow, and which warehouse you’re using. That data determines whether merge is sufficient or whether you need to optimize further.
For comprehensive implementation patterns and warehouse-specific details, see the complete incremental models guide.