
Late-Arriving Data in dbt — Hub

Hub note connecting all concepts around handling late-arriving data in dbt incremental models: measurement, lookback windows, partition strategies, deduplication, testing, and operational safety.

Tags: dbt · bigquery · snowflake · databricks · incremental processing · data quality · data engineering

Late-arriving data is why incremental models drift from source truth. A record with an event timestamp of March 20 lands in your warehouse on March 24. Your model processed March 20’s data three days ago and moved on. Without a strategy to catch late arrivals, that record is missed permanently, and the gap between your model and source truth widens with every run.

This hub connects the concepts you need to handle late-arriving data correctly in dbt.

The Core Problem

Two timestamps drive the issue: event time (when something happened) and load time (when the record reached your warehouse). Standard incremental filters only look at load time or max event time — they miss records that arrived late relative to the model’s last run.

-- This misses late-arriving data
{% if is_incremental() %}
WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}

If this model ran on January 15th and captured events through January 14th, any January 13th events arriving today get skipped entirely.

Start by Measuring

Before choosing any strategy, measure your data’s actual latency profile. The distribution of arrival delays determines which patterns make sense and what window size is appropriate.

Measuring Data Latency Before Choosing an Incremental Strategy
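A minimal measurement query, sketched in BigQuery SQL to match the examples below. The table and column names (`stg_events`, `event_date`, `_loaded_at`) are illustrative; substitute your staging model and its event/ingestion timestamps:

```sql
-- Sketch: distribution of arrival delays (days between event and load).
-- Column and model names are assumptions, not from the source article.
SELECT
  DATE_DIFF(DATE(_loaded_at), event_date, DAY) AS arrival_delay_days,
  COUNT(*) AS records,
  ROUND(100 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM {{ ref('stg_events') }}
GROUP BY arrival_delay_days
ORDER BY arrival_delay_days
```

If 99.9% of records land within 2 days, a 3-day lookback window is defensible; a long tail of week-late records argues for a wider window or scheduled full refreshes.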

The Lookback Window

The standard defense: reprocess a rolling window of recent data on every run. Instead of filtering to new-only data, subtract a fixed window from the current maximum and reprocess everything from there.

Make it configurable so you can extend the window for backfills without changing model code:

{% set lookback_days = var('lookback_days', 3) %}
{% if is_incremental() %}
WHERE event_date >= (
    SELECT DATE_SUB(MAX(event_date), INTERVAL {{ lookback_days }} DAY)
    FROM {{ this }}
)
{% endif %}

Late-Arriving Data and the Lookback Window Pattern

Limiting Destination Scans with Incremental Predicates

On large tables, the lookback window filters source data but the merge still scans the entire destination table. Incremental predicates constrain the destination scan to the same window, keeping merge viable at scale.

Incremental Predicates for dbt Merge
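A sketch of the config, assuming a date-partitioned destination and a 3-day window matching the lookback above. `DBT_INTERNAL_DEST` is the alias dbt's generated merge statement gives the destination table; the predicate string is passed through verbatim, so it must use your warehouse's SQL dialect (BigQuery here):

```sql
-- Sketch: constrain the merge's destination scan to the lookback window.
-- Table/column names are illustrative.
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='event_id',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)"
    ]
) }}
```

Keep the predicate window at least as wide as the lookback window, or the merge will silently fail to match rows the source filter still includes.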

Partition-Based Strategies

For time-partitioned tables, replacing entire partitions is often more efficient than row-by-row merging. The right approach differs by warehouse:

  • BigQuery: insert_overwrite with a static partition list
  • Snowflake: delete+insert with date predicates (not insert_overwrite — that replaces the entire table)
  • Databricks: replace_where with a predicate expression

Each has a blind spot: records arriving after their partition’s replacement window passes get permanently missed without a full refresh or longer window.
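The BigQuery variant can be sketched as follows, assuming a daily `event_date` partition and the 3-day lookback variable from earlier; `insert_overwrite` with a static `partitions` list replaces exactly those partitions on each run:

```sql
-- Sketch: static-partition insert_overwrite on BigQuery.
-- The lookback_days var and column names are illustrative.
{% set lookback_days = var('lookback_days', 3) %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=[
        'DATE_SUB(CURRENT_DATE(), INTERVAL ' ~ lookback_days ~ ' DAY)',
        'CURRENT_DATE()'
    ]
) }}
```

Note the `partitions` list here names only the window's endpoints; whether you enumerate every day in between or use dbt's dynamic partition detection instead is a per-project choice.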

dbt Incremental Strategy Configuration Patterns
dbt Incremental Strategy Warehouse Behaviors
Incremental Strategy Decision Framework

Deduplication and Idempotency

A lookback window means processing the same records multiple times. Your model must produce identical results regardless of how many times it runs. Relying solely on unique_key isn’t enough — the initial full refresh doesn’t apply merge logic. Add explicit deduplication in your SELECT:

QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY _loaded_at DESC
) = 1

When uniqueness spans multiple columns, use generate_surrogate_key from dbt-utils to create a single stable key.
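A sketch combining the surrogate key with the deduplication above, assuming uniqueness spans `event_id` plus a `source_system` column (both names illustrative):

```sql
-- Sketch: compound uniqueness collapsed into one stable key via dbt-utils.
SELECT
    {{ dbt_utils.generate_surrogate_key(['event_id', 'source_system']) }} AS event_key,
    event_id,
    source_system,
    event_date,
    _loaded_at
FROM {{ ref('stg_events') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY {{ dbt_utils.generate_surrogate_key(['event_id', 'source_system']) }}
    ORDER BY _loaded_at DESC
) = 1
```

The same `event_key` then serves as the model's `unique_key` config, so the merge and the in-query dedup agree on what "the same record" means.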

Idempotent Incremental Models in dbt
dbt-utils generate_surrogate_key

Testing Late-Arriving Data Handling

dbt unit tests (1.8+) can simulate late arrivals by overriding is_incremental() and providing fixture rows for {{ this }}. In production, compare incremental results against a full-refresh baseline to detect drift.
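A sketch of such a unit test, with model, column names, and expected rows all illustrative — the `expect` block must match your model's actual output columns. The `overrides.macros` key forces `is_incremental()` to return true, and the `this` input supplies fixture rows standing in for the existing table:

```yaml
# Sketch: unit test simulating a late arrival against an incremental run.
unit_tests:
  - name: lookback_catches_late_arrival
    model: fct_events
    overrides:
      macros:
        is_incremental: true
    given:
      - input: ref('stg_events')
        rows:
          # Event from the 20th that landed four days late
          - {event_id: 1, event_date: "2024-03-20", _loaded_at: "2024-03-24"}
      - input: this
        rows:
          # Existing table already contains data through the 22nd
          - {event_id: 2, event_date: "2024-03-22", _loaded_at: "2024-03-22"}
    expect:
      rows:
        - {event_id: 1, event_date: "2024-03-20"}
```

If the lookback window is wide enough, the late row survives the incremental filter and appears in the output; shrink the window in a second test case to confirm the failure mode is also what you expect.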

Testing Late-Arriving Data Handling in dbt

The Full Refresh Safety Net

No lookback window catches everything. Data can arrive arbitrarily late. The full_refresh: false config prevents accidental multi-hour rebuilds while keeping intentional full refreshes possible. Schedule periodic full refreshes to reset accumulated drift.
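A sketch of the guard, with an escape hatch via a var so intentional refreshes don't require editing model code (the var name is an assumption, not a dbt built-in):

```sql
-- Sketch: block accidental --full-refresh rebuilds of this model.
-- Override deliberately with: dbt run --vars '{allow_full_refresh: true}' --full-refresh
{{ config(
    materialized='incremental',
    full_refresh=var('allow_full_refresh', false)
) }}
```

With `full_refresh` resolved to false, a project-wide `dbt run --full-refresh` leaves this model's incremental behavior intact instead of rebuilding it from scratch.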

The full_refresh: false Guard in dbt

Microbatch as an Alternative

dbt’s microbatch strategy (1.9+) has built-in lookback support via the lookback config parameter. It handles filtering automatically without is_incremental() logic, and provides batch-level retry for failed periods. Trade-offs: minimum batch size is hourly, all time calculations use UTC, and batches run sequentially.
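A sketch of the equivalent microbatch config, assuming daily batches and the same 3-day tolerance as the manual lookback (column name and start date illustrative):

```sql
-- Sketch: microbatch with built-in lookback; no is_incremental() needed.
{{ config(
    materialized='incremental',
    incremental_strategy='microbatch',
    event_time='event_date',
    batch_size='day',
    lookback=3,
    begin='2024-01-01'
) }}
```

On each run, dbt reprocesses the current batch plus the three preceding ones, and `dbt retry` can re-run only the batches that failed.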

dbt Microbatch Strategy Tradeoffs

Source Article

→ Late-Arriving Data in dbt: Patterns That Actually Work