Adrienne Vermorel
Late-Arriving Data in dbt: Patterns That Actually Work
Late-arriving data is why incremental models drift from source truth. Your model processes yesterday’s events, marks them complete, and moves on. But some of those events don’t actually arrive until three days later. Without a strategy to catch them, your incremental model slowly diverges from reality.
This gap between event time and arrival time exists in every data pipeline. Understanding when your data arrives late, and by how much, determines which patterns will work for your use case.
The core problem
Two timestamps matter for incremental processing:
- Event time: When something happened (a purchase, a page view, a login)
- Load time: When that record landed in your warehouse
When a record’s event time falls before the maximum event time already in your target table, that data is “late.” Your standard incremental filter would skip it entirely.
```sql
-- This misses late-arriving data
{% if is_incremental() %}
WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
```

If your model ran yesterday and processed events through January 15th, any January 14th events that arrive today get ignored. They’re late, and your filter doesn’t know to look for them.
Measuring your latency profile
Before picking a strategy, understand your data. Run this against your source tables:
```sql
SELECT
    DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
    COUNT(*) AS record_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM source_table
WHERE _loaded_at > event_timestamp
GROUP BY 1
ORDER BY 1
```

This shows you the distribution. Maybe 95% of your data arrives same-day, 4% arrives within 3 days, and 1% trickles in over a week. That distribution shapes everything that follows.
The lookback window pattern
The fix is straightforward: instead of only processing new data, reprocess a rolling window of recent data on every run.
```sql
{% if is_incremental() %}
WHERE event_date >= (
    SELECT DATE_SUB(MAX(event_date), INTERVAL 3 DAY)
    FROM {{ this }}
)
{% endif %}
```

Combined with a unique_key configuration, this catches late arrivals by re-merging recent data. Records that already exist get updated; new late arrivals get inserted.
Choosing your window size
There’s no universal answer. Each option trades cost against coverage:
- 1 day: Minimal reprocessing cost. Catches data that arrives just slightly late. Misses anything delayed longer.
- 3 days: A common starting point. Handles the majority of late arrivals in most systems. The extra compute cost is usually negligible compared to the data quality benefit.
- 7 days: Catches nearly everything for most sources. Starts to add meaningful cost on large tables.
- 14+ days: Maximum coverage. Significant performance impact. Usually only justified for sources with known extreme latency, like third-party data feeds.
Your latency analysis tells you where your data falls. If 99.5% arrives within 3 days, a 3-day window makes sense. If you have a long tail extending to 10 days, you need to decide whether catching that last 0.5% is worth the cost.
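To turn that distribution into a window choice, a cumulative version of the measurement query above helps. Here’s a sketch that reuses the same illustrative source_table, _loaded_at, and event_timestamp names:

```sql
-- Cumulative share of records by lateness: pick the smallest window that
-- clears your coverage target (e.g. 99.5%)
SELECT
    days_late,
    ROUND(SUM(record_count) OVER (ORDER BY days_late) * 100.0
        / SUM(record_count) OVER (), 2) AS cumulative_pct
FROM (
    SELECT
        DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
        COUNT(*) AS record_count
    FROM source_table
    WHERE _loaded_at > event_timestamp
    GROUP BY 1
) AS lateness
ORDER BY days_late
```

The first row where cumulative_pct crosses your target is a reasonable default for the lookback window.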
Making the window configurable
Build in flexibility for backfills and edge cases:
```sql
{% set lookback_days = var('lookback_days', 3) %}

{% if is_incremental() %}
WHERE created_at >= (
    SELECT DATE_SUB(MAX(created_at), INTERVAL {{ lookback_days }} DAY)
    FROM {{ this }}
)
{% endif %}
```

Now you can override it when needed:
```bash
dbt run --select my_model --vars '{"lookback_days": 14}'
```

This keeps your daily runs efficient while allowing targeted backfills.
Incremental predicates for large tables
On tables with hundreds of millions of rows, the merge operation can get expensive even with a lookback window. Your warehouse scans the entire destination table to find matching records.
The incremental_predicates config limits that scan:
```sql
{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.created_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
    ]
) }}
```

This adds a filter to the MERGE statement’s destination side. Instead of scanning 500 million rows to find matches, the warehouse only scans the last week’s partitions.
BigQuery gotcha
Partition pruning on BigQuery only works with literal or constant values. Subqueries like SELECT MAX(date) won’t trigger pruning. If you need dynamic dates, compute them in a macro that resolves to a literal before the query runs.
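One way to do that, sketched below, is a small helper macro that runs the MAX lookup itself via run_query and splices the result into the compiled model as a date literal. The macro name, its arguments, and the event_date column are illustrative, not dbt built-ins:

```sql
{% macro max_date_literal(relation, column) %}
    {#- Hypothetical helper: fetch the current max of `column` at run time and
        return it as a DATE literal so BigQuery can still prune partitions.
        Assumes the relation already exists and is non-empty when called. -#}
    {% if execute %}
        {% set result = run_query("SELECT MAX(" ~ column ~ ") FROM " ~ relation) %}
        {{ return("DATE '" ~ result.columns[0].values()[0] ~ "'") }}
    {% else %}
        {{ return("DATE '1970-01-01'") }}
    {% endif %}
{% endmacro %}
```

The incremental filter then compiles down to a constant the optimizer can prune on:

```sql
{% if is_incremental() %}
WHERE event_date >= DATE_SUB({{ max_date_literal(this, 'event_date') }}, INTERVAL 3 DAY)
{% endif %}
```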
Partition-based strategies
For partitioned tables, replacing entire partitions is often more efficient than row-by-row merging. This approach works especially well for time-series data where late arrivals affect a predictable set of partitions.
BigQuery insert_overwrite
```sql
{% set partitions_to_replace = [
    'CURRENT_DATE()',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)'
] %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=partitions_to_replace
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_data
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_date IN ({{ partitions_to_replace | join(',') }})
{% endif %}
```

With static partitions, BigQuery knows exactly which partitions to replace. This avoids the partition discovery query that dynamic mode requires.
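The partition list doesn’t have to be hard-coded. If you want a wider window for backfills, you can build it from a variable, mirroring the lookback_days pattern above. A sketch, where the partition_window_days var name is an assumption:

```sql
{% set partition_window_days = var('partition_window_days', 3) %}
{% set partitions_to_replace = [] %}
{% for i in range(partition_window_days) %}
    {# appends CURRENT_DATE(), then 1 day back, 2 days back, ... #}
    {% do partitions_to_replace.append("DATE_SUB(CURRENT_DATE(), INTERVAL " ~ i ~ " DAY)") %}
{% endfor %}
```

A wider backfill run then uses the same --vars override shown earlier.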
Snowflake: delete+insert instead
Snowflake’s insert_overwrite replaces the entire table, not individual partitions. For partition-based overwrites, use delete+insert with explicit predicates:
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='event_date'
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_data
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_date >= DATEADD(DAY, -3, CURRENT_DATE)
{% endif %}
```

Databricks replace_where
Databricks offers replace_where for predicate-based partition replacement:
```sql
{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    incremental_predicates=["event_date >= CURRENT_DATE - INTERVAL 3 DAYS"]
) }}
```

This atomically replaces rows matching the predicate. You get partition-level efficiency with the flexibility of arbitrary filter conditions.
The partition window limitation
Partition-based strategies have a blind spot: data arriving after its partition’s window has passed gets lost.
Say you replace the last 3 days of partitions every run. On January 20th, you reprocess January 17-20. Any January 16th events arriving on January 20th don’t get picked up because January 16th isn’t in your replacement window anymore.
For most systems, this edge case is rare enough to ignore. For critical data, you need either:
- Longer partition windows (with the associated cost)
- Periodic full refreshes to catch anything that slipped through
- A merge-based approach that can update any row regardless of partition
Idempotency through deduplication
A lookback window means processing the same records multiple times. Your model must produce identical results regardless of how many times it runs.
The first defense is unique_key. Records matching an existing key get updated rather than duplicated. But this only works on subsequent runs. Your initial full refresh processes all data without the merge logic.
Add explicit deduplication to catch duplicates at any stage:
```sql
SELECT
    event_id,
    event_timestamp,
    user_id,
    event_data,
    _loaded_at
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_timestamp >= (
    SELECT DATE_SUB(MAX(event_timestamp), INTERVAL 3 DAY)
    FROM {{ this }}
)
{% endif %}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY _loaded_at DESC
) = 1
```

The QUALIFY clause keeps only the most recently loaded version of each record, preventing duplicates from entering your model even on the first run.
Surrogate keys for complex uniqueness
When uniqueness spans multiple columns, generate a surrogate key:
```sql
{{ config(
    unique_key='surrogate_key'
) }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['user_id', 'event_date', 'event_type']) }} AS surrogate_key,
    user_id,
    event_date,
    event_type,
    event_data
FROM {{ ref('base_events') }}
```

This avoids the edge cases that can arise from composite unique_key configurations across different warehouses.
Testing late-arriving data handling
Unit tests (dbt 1.8+) let you simulate late arrivals:
```yaml
unit_tests:
  - name: test_late_data_captured
    model: my_incremental_model
    overrides:
      macros:
        is_incremental: true
    given:
      - input: this
        rows:
          - {id: 1, value: 'original', updated_at: '2024-01-15'}
      - input: ref('source')
        rows:
          - {id: 1, value: 'updated', updated_at: '2024-01-17'}
          - {id: 2, value: 'late_arrival', updated_at: '2024-01-13'}
    expect:
      rows:
        - {id: 1, value: 'updated'}
        - {id: 2, value: 'late_arrival'}
```

The test verifies that the late-arriving record (id: 2, dated January 13th when the model already has January 15th data) gets captured within your lookback window.
For production validation, compare incremental and full refresh results:
```sql
{{ audit_helper.compare_queries(
    a_query="SELECT * FROM model_full_refresh WHERE event_date >= CURRENT_DATE - 7",
    b_query="SELECT * FROM model_incremental WHERE event_date >= CURRENT_DATE - 7",
    primary_key='id'
) }}
```

Drift between these indicates your lookback window isn’t catching everything.
The full refresh safety net
No lookback window catches everything forever. Data can arrive arbitrarily late. External systems can have outages that delay data for weeks.
Schedule periodic full refreshes, weekly or monthly depending on your tolerance for drift. For very large tables where full refresh is impractical, at least run one manually after any known data pipeline issues.
Consider adding full_refresh: false to critical models to prevent accidental full rebuilds:
```sql
{{ config(
    materialized='incremental',
    full_refresh=false
) }}
```

With this set, the model ignores the --full-refresh flag entirely, so passing it by accident can’t trigger a multi-hour rebuild; you have to remove or override the config deliberately before a full rebuild will run.
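If you still want the scheduled rebuilds from the safety-net section without editing the model each time, one option is to gate the config behind a variable. A sketch, where allow_full_refresh is an assumed var name rather than anything built into dbt:

```sql
{# Assumption: an allow_full_refresh var gates full refreshes; the default keeps them off #}
{{ config(
    materialized='incremental',
    full_refresh=var('allow_full_refresh', false)
) }}
```

The scheduled job then runs something like `dbt run --select my_model --full-refresh --vars '{"allow_full_refresh": true}'`, while everyday runs stay protected.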
Putting it together
Here’s what a production-ready incremental model handling late arrivals looks like:
```sql
{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
    ],
    full_refresh=false
) }}

{% set lookback_days = var('lookback_days', 3) %}

SELECT
    event_id,
    event_date,
    user_id,
    event_data,
    _loaded_at
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_date >= (
    SELECT DATE_SUB(MAX(event_date), INTERVAL {{ lookback_days }} DAY)
    FROM {{ this }}
)
{% endif %}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY _loaded_at DESC
) = 1
```

This model:
- Reprocesses the last 3 days by default
- Allows override via --vars for backfills
- Limits destination scans with predicates
- Prevents accidental full refreshes
- Deduplicates within the source data
The right pattern depends on your data’s latency profile, your table sizes, and your tolerance for drift. Start by measuring how late your data actually arrives, then pick the simplest approach that covers your needs.