An idempotent model produces identical results regardless of how many times it runs. Most incremental models are not idempotent by default, and the failures are subtle — they typically go undetected until a data quality audit.
## Why Incremental Models Break Idempotency
The most common way idempotency breaks: relying solely on unique_key for deduplication.
With the merge strategy, unique_key tells dbt how to match incoming rows against existing rows. Matching rows get updated; non-matching rows get inserted. This works on incremental runs — but the first run (or any --full-refresh) doesn’t use merge logic at all. It behaves like a table materialization: insert everything, no deduplication.
If your source data contains duplicates and your model relies on unique_key to handle them, the first run inserts all duplicates. Subsequent incremental runs may update some of them, but the initial duplicates persist. Your table now has rows that wouldn’t exist if you’d run the model incrementally from the start.
```sql
-- This model is NOT idempotent
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
```

If `base__analytics__events` contains two rows with the same `event_id`, the full refresh inserts both. The merge strategy would have caught this on an incremental run, but the first run bypasses merge entirely.
## The Fix: Pre-Deduplicate in SELECT
Include explicit deduplication in your SELECT statement so it applies on both full refresh and incremental runs:
```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type,
    event_properties
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_timestamp >= (
    SELECT MAX(event_timestamp) - INTERVAL 3 DAY FROM {{ this }}
)
{% endif %}

QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY event_timestamp DESC
) = 1
```

The `QUALIFY ROW_NUMBER()` clause runs on every execution — full refresh or incremental. It picks the most recent version of each `event_id` before the data ever reaches the merge step. No duplicates enter the staging table, so no duplicates enter the target.
This pattern works across BigQuery, Snowflake, and Databricks. On warehouses that don’t support QUALIFY syntax, use a CTE:
```sql
WITH deduplicated AS (

    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY event_timestamp DESC
        ) AS rn
    FROM {{ ref('base__analytics__events') }}
    {% if is_incremental() %}
    WHERE event_timestamp >= (
        SELECT MAX(event_timestamp) - INTERVAL 3 DAY FROM {{ this }}
    )
    {% endif %}

)

-- EXCEPT(rn) is BigQuery/Databricks syntax; list the columns
-- explicitly on warehouses that don't support it
SELECT * EXCEPT(rn)
FROM deduplicated
WHERE rn = 1
```

## NULLs in unique_key Break Idempotency Too
NULL values in unique_key columns cause match failures during merge operations. In SQL, NULL = NULL evaluates to NULL (not TRUE), so two rows with NULL in a unique_key column never match. Instead of updating the existing row, dbt inserts a new one. Run the model again with the same data and you get another duplicate. Each run adds more.
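You can see the mechanics in any warehouse console — a minimal illustration, separate from the model itself:

```sql
-- NULL = NULL evaluates to NULL, which merge treats as "no match";
-- coalescing to a sentinel restores a real equality comparison
SELECT
    NULL = NULL                               AS raw_comparison,   -- NULL
    COALESCE(NULL, -1) = COALESCE(NULL, -1)   AS coalesced         -- TRUE
```

This is exactly why a merge keyed on a nullable column re-inserts the same row on every run.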
```sql
-- Composite key where session_number can be NULL
{{ config(unique_key=['user_id', 'session_number']) }}

-- If session_number is NULL for some rows, those rows
-- will NEVER match during merge, creating duplicates on every run
```

The fix: ensure all columns in `unique_key` are non-null. Either filter out NULLs, coalesce them to a sentinel value, or choose a different `unique_key` that doesn't include nullable columns.
```sql
{{ config(unique_key='surrogate_key') }}

SELECT
    {{ dbt_utils.generate_surrogate_key([
        'user_id',
        'coalesce(session_number, -1)'
    ]) }} AS surrogate_key,
    user_id,
    session_number,
    ...
```

## Lookback Windows and Idempotency
The lookback window pattern interacts with idempotency in an important way. If your lookback reprocesses 3 days of data and your deduplication picks the most recent version of each record, running the model multiple times produces the same result: the same 3-day window gets reprocessed, the same deduplication logic applies, and the same rows land in the target.
But if your lookback uses MAX(event_timestamp) FROM {{ this }} as its anchor and the table’s max timestamp changes between runs (because new data arrived), the lookback window shifts. This is expected behavior, not a violation of idempotency — it means the model produces correct results for the current state of the data, not that it produces identical output regardless of when it runs.
True idempotency means: given the same source data, produce the same output regardless of run count. The lookback window ensures that late-arriving data within the window gets incorporated correctly, and pre-deduplication ensures that duplicates in source data don’t propagate.
## Testing Idempotency
The most direct test: run your model twice in a row without changing source data and compare row counts. If the count changes, the model isn’t idempotent.
```shell
dbt run --select my_model
# Record row count
dbt run --select my_model
# Compare row count; should be identical
```

dbt's unit testing framework can validate idempotency more rigorously. Create test fixtures with known duplicates and verify that the model output contains only deduplicated rows regardless of whether it runs as a full refresh or incrementally.
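As a sketch, a dbt unit test (dbt 1.8+) for this might feed duplicate rows and expect a single survivor; the model name, timestamps, and the `is_incremental` override here are illustrative:

```yaml
unit_tests:
  - name: dedup_survives_full_refresh
    model: my_model
    overrides:
      macros:
        # Force the full-refresh code path, where merge never deduplicates
        is_incremental: false
    given:
      - input: ref('base__analytics__events')
        rows:
          - {event_id: 1, event_timestamp: "2024-01-01", user_id: 10, event_type: "click"}
          - {event_id: 1, event_timestamp: "2024-01-02", user_id: 10, event_type: "click"}
    expect:
      rows:
        # Only the most recent version of event_id 1 should survive
        - {event_id: 1, event_timestamp: "2024-01-02", user_id: 10, event_type: "click"}
```

A sibling test with `is_incremental: true` verifies the incremental path produces the same deduplicated output.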
For models using the append strategy, idempotency requires different handling since append never deduplicates. Either accept duplicates and handle them downstream, or don’t use append for data that can contain duplicates.
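If you do keep the append strategy, one option for the downstream handling is a thin deduplicating view over the append-only table — a sketch, with illustrative model names:

```sql
-- models/events_deduped.sql
-- Consumers query this view; the append-only table keeps every insert
{{ config(materialized='view') }}

SELECT *
FROM {{ ref('events_append_only') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY event_timestamp DESC
) = 1
```

The raw table stays cheap to write; deduplication cost moves to read time.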
## The Checklist
- Pre-deduplicate in your SELECT (`QUALIFY` or `ROW_NUMBER`), not just via `unique_key`
- Ensure no NULLs in `unique_key` columns
- Test both full-refresh and incremental modes to verify consistent results
- If using lookback windows, verify that reprocessed data merges correctly
- For the append strategy, handle deduplication downstream