Adrienne Vermorel

Late-Arriving Data in dbt: Patterns That Actually Work

Late-arriving data is why incremental models drift from source truth. Your model processes yesterday’s events, marks them complete, and moves on. But some of those events don’t actually arrive until three days later. Without a strategy to catch them, your incremental model slowly diverges from reality.

This gap between event time and arrival time exists in every data pipeline. Understanding when your data arrives late, and by how much, determines which patterns will work for your use case.

The core problem

Two timestamps matter for incremental processing:

  • Event time: When something happened (a purchase, a page view, a login)
  • Load time: When that record landed in your warehouse

When a record’s event time falls before the maximum event time already in your target table, that data is “late.” Your standard incremental filter would skip it entirely.

-- This misses late-arriving data
{% if is_incremental() %}
WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}

If your model ran yesterday and processed events through January 15th, any January 14th events that arrive today get ignored. They’re late, and your filter doesn’t know to look for them.

Measuring your latency profile

Before picking a strategy, understand your data. Run this against your source tables:

SELECT
    DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
    COUNT(*) AS record_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER(), 2) AS pct_of_total
FROM source_table
WHERE _loaded_at > event_timestamp
GROUP BY 1
ORDER BY 1

This shows you the distribution. Maybe 95% of your data arrives same-day, 4% arrives within 3 days, and 1% trickles in over a week. That distribution shapes everything that follows.
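The output might look something like this (illustrative numbers, not from a real system):

days_late | record_count | pct_of_total
----------|--------------|-------------
        0 |       951204 |        95.12
        1 |        31877 |         3.19
        2 |         9340 |         0.93
        3 |         4011 |         0.40
        4 |         2113 |         0.21
        5 |         1455 |         0.15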

The lookback window pattern

The fix is straightforward: instead of only processing new data, reprocess a rolling window of recent data on every run.

{% if is_incremental() %}
WHERE event_date >= (
    SELECT DATE_SUB(MAX(event_date), INTERVAL 3 DAY)
    FROM {{ this }}
)
{% endif %}

Combined with a unique_key configuration, this catches late arrivals by re-merging recent data. Records that already exist get updated; new late arrivals get inserted.
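The config half of the pattern looks something like this (a minimal sketch; the key name event_id is an assumption):

{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}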

Choosing your window size

There’s no universal answer. Each option trades cost against coverage:

1 day: Minimal reprocessing cost. Catches data that arrives just slightly late. Misses anything delayed longer.

3 days: A common starting point. Handles the majority of late arrivals in most systems. The extra compute cost is usually negligible compared to the data quality benefit.

7 days: Catches nearly everything for most sources. Starts to add meaningful cost on large tables.

14+ days: Maximum coverage. Significant performance impact. Usually only justified for sources with known extreme latency, like third-party data feeds.

Your latency analysis tells you where your data falls. If 99.5% arrives within 3 days, a 3-day window makes sense. If you have a long tail extending to 10 days, you need to decide whether catching that last 0.5% is worth the cost.
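One way to make that call is to turn the latency distribution into cumulative coverage and pick the smallest window that clears your threshold. A sketch, reusing the column names from the earlier query:

SELECT
    days_late,
    ROUND(SUM(pct_of_total) OVER (ORDER BY days_late), 2) AS cumulative_pct
FROM (
    SELECT
        DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER() AS pct_of_total
    FROM source_table
    WHERE _loaded_at > event_timestamp
    GROUP BY 1
) AS per_day
ORDER BY days_late

If cumulative_pct crosses 99.5 at days_late = 3, a 3-day window covers you; if it takes until day 10, you're back to the cost trade-off.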

Making the window configurable

Build in flexibility for backfills and edge cases:

{% set lookback_days = var('lookback_days', 3) %}

{% if is_incremental() %}
WHERE created_at >= (
    SELECT DATE_SUB(MAX(created_at), INTERVAL {{ lookback_days }} DAY)
    FROM {{ this }}
)
{% endif %}

Now you can override it when needed:

dbt run --select my_model --vars '{"lookback_days": 14}'

This keeps your daily runs efficient while allowing targeted backfills.

Incremental predicates for large tables

On tables with hundreds of millions of rows, the merge operation can get expensive even with a lookback window. Your warehouse scans the entire destination table to find matching records.

The incremental_predicates config limits that scan:

{{ config(
    materialized='incremental',
    unique_key='id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.created_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
    ]
) }}

This adds a filter to the MERGE statement’s destination side. Instead of scanning 500 million rows to find matches, the warehouse only scans the last week’s partitions.
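Conceptually, the predicate lands in the join condition of the merge. A simplified sketch of what the warehouse ends up executing (table and column names are illustrative, not dbt's exact generated SQL):

MERGE INTO analytics.my_model AS DBT_INTERNAL_DEST
USING analytics.my_model__dbt_tmp AS DBT_INTERNAL_SOURCE
    ON DBT_INTERNAL_SOURCE.id = DBT_INTERNAL_DEST.id
    -- the incremental predicate limits the destination scan to recent partitions
    AND DBT_INTERNAL_DEST.created_at >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)
WHEN MATCHED THEN
    UPDATE SET created_at = DBT_INTERNAL_SOURCE.created_at
WHEN NOT MATCHED THEN
    INSERT (id, created_at) VALUES (DBT_INTERNAL_SOURCE.id, DBT_INTERNAL_SOURCE.created_at)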

BigQuery gotcha

Partition pruning on BigQuery only works with literal or constant values. Subqueries like SELECT MAX(date) won’t trigger pruning. If you need dynamic dates, compute them in a macro that resolves to a literal before the query runs.
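One way to do that is a small helper macro that evaluates to a date string at compile time. A sketch, assuming a seven-day pruning horizon (the macro name is hypothetical):

-- macros/partition_prune_date.sql
{% macro partition_prune_date(days=7) %}
    {# returns an ISO date string like '2024-01-13', computed when dbt compiles the model #}
    {{ return((modules.datetime.date.today() - modules.datetime.timedelta(days=days)).isoformat()) }}
{% endmacro %}

In the model config, the predicate then resolves to a plain literal the planner can prune on, e.g. "DBT_INTERNAL_DEST.created_at >= DATE('" ~ partition_prune_date() ~ "')".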

Partition-based strategies

For partitioned tables, replacing entire partitions is often more efficient than row-by-row merging. This approach works especially well for time-series data where late arrivals affect a predictable set of partitions.

BigQuery insert_overwrite

{% set partitions_to_replace = [
    'CURRENT_DATE()',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)'
] %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=partitions_to_replace
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_data
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_date IN ({{ partitions_to_replace | join(',') }})
{% endif %}

With static partitions, BigQuery knows exactly which partitions to replace. This avoids the partition discovery query that dynamic mode requires.
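For comparison, the dynamic variant simply drops the partitions config (a sketch); dbt then figures out which partitions the new rows touch at run time, which is more flexible but adds that discovery step:

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_data
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
{% endif %}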

Snowflake: delete+insert instead

Snowflake has no partition-aware insert_overwrite: an INSERT OVERWRITE there replaces the entire table, not individual partitions. For partition-style overwrites, use delete+insert with a unique_key on the date column and an explicit date filter:

{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='event_date'
) }}

SELECT
    event_id,
    event_date,
    user_id,
    event_data
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_date >= DATEADD(DAY, -3, CURRENT_DATE)
{% endif %}

Databricks replace_where

Databricks offers replace_where for predicate-based partition replacement:

{{ config(
    materialized='incremental',
    incremental_strategy='replace_where',
    incremental_predicates=["event_date >= CURRENT_DATE - INTERVAL 3 DAYS"]
) }}

This atomically replaces rows matching the predicate. You get partition-level efficiency with the flexibility of arbitrary filter conditions.

The partition window limitation

Partition-based strategies have a blind spot: data arriving after its partition’s window has passed gets lost.

Say you replace the last 3 days of partitions every run. On January 20th, you reprocess January 17-20. Any January 16th events arriving on January 20th don’t get picked up because January 16th isn’t in your replacement window anymore.

For most systems, this edge case is rare enough to ignore. For critical data, you need one of the following (a detection query is sketched after the list):

  • Longer partition windows (with the associated cost)
  • Periodic full refreshes to catch anything that slipped through
  • A merge-based approach that can update any row regardless of partition
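To detect whether anything is actually slipping through, compare source and target row counts for dates older than the replacement window. A sketch that works as a dbt singular test (column names follow the earlier examples; the target model name is an assumption):

-- returns one row per event_date where the incremental model is missing data
SELECT
    s.event_date,
    COUNT(*) AS source_rows,
    COUNT(t.event_id) AS target_rows
FROM {{ ref('base_events') }} AS s
LEFT JOIN {{ ref('my_incremental_model') }} AS t
    ON s.event_id = t.event_id
WHERE s.event_date < DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
GROUP BY s.event_date
HAVING COUNT(*) <> COUNT(t.event_id)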

Idempotency through deduplication

A lookback window means processing the same records multiple times. Your model must produce identical results regardless of how many times it runs.

The first defense is unique_key. Records matching an existing key get updated rather than duplicated. But this only works on subsequent runs. Your initial full refresh processes all data without the merge logic.

Add explicit deduplication to catch duplicates at any stage:

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_data,
    _loaded_at
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_timestamp >= (
    SELECT DATE_SUB(MAX(event_timestamp), INTERVAL 3 DAY)
    FROM {{ this }}
)
{% endif %}
-- keep the most recently loaded version of each event
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY _loaded_at DESC
) = 1
The QUALIFY clause keeps only the most recently loaded version of each record, preventing duplicates from entering your model even on the first run.
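QUALIFY is available on BigQuery, Snowflake, and Databricks; if your warehouse doesn't support it, the same deduplication works as a CTE (incremental filter omitted for brevity):

WITH ranked AS (
    SELECT
        event_id,
        event_timestamp,
        user_id,
        event_data,
        _loaded_at,
        ROW_NUMBER() OVER (
            PARTITION BY event_id
            ORDER BY _loaded_at DESC
        ) AS row_num
    FROM {{ ref('base_events') }}
)

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_data,
    _loaded_at
FROM ranked
WHERE row_num = 1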

Surrogate keys for complex uniqueness

When uniqueness spans multiple columns, generate a surrogate key:

{{ config(
    unique_key='surrogate_key'
) }}

SELECT
    {{ dbt_utils.generate_surrogate_key(['user_id', 'event_date', 'event_type']) }} AS surrogate_key,
    user_id,
    event_date,
    event_type,
    event_data
FROM {{ ref('base_events') }}

This avoids the edge cases that can arise from composite unique_key configurations across different warehouses.
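It also gives you a single column to pin data tests to (schema YAML sketch; the model name is an assumption):

models:
  - name: my_incremental_model
    columns:
      - name: surrogate_key
        data_tests:
          - unique
          - not_null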

Testing late-arriving data handling

Unit tests (dbt 1.8+) let you simulate late arrivals:

unit_tests:
  - name: test_late_data_captured
    model: my_incremental_model
    overrides:
      macros:
        is_incremental: true
    given:
      - input: this
        rows:
          - {id: 1, value: 'original', updated_at: '2024-01-15'}
      - input: ref('source')
        rows:
          - {id: 1, value: 'updated', updated_at: '2024-01-17'}
          - {id: 2, value: 'late_arrival', updated_at: '2024-01-13'}
    expect:
      rows:
        - {id: 1, value: 'updated'}
        - {id: 2, value: 'late_arrival'}

The test verifies that the late-arriving record (id: 2, dated January 13th when the model already has January 15th data) gets captured within your lookback window.

For production validation, compare incremental and full refresh results:

{{ audit_helper.compare_queries(
    a_query="SELECT * FROM model_full_refresh WHERE event_date >= CURRENT_DATE - 7",
    b_query="SELECT * FROM model_incremental WHERE event_date >= CURRENT_DATE - 7",
    primary_key='id'
) }}

Drift between these indicates your lookback window isn’t catching everything.

The full refresh safety net

No lookback window catches everything forever. Data can arrive arbitrarily late. External systems can have outages that delay data for weeks.

Schedule periodic full refreshes, weekly or monthly depending on your tolerance for drift. For very large tables where full refresh is impractical, at least run one manually after any known data pipeline issues.
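In practice that's just a scheduled job passing the flag for the models that allow it (model selection is illustrative):

dbt run --select my_incremental_model --full-refresh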

Consider adding full_refresh: false to critical models to prevent accidental full rebuilds:

{{ config(
    materialized='incremental',
    full_refresh=false
) }}

With this set, even a project-wide dbt run --full-refresh leaves the model untouched; rebuilding it requires deliberately removing or overriding the config, which prevents accidental multi-hour rebuilds.

Putting it together

Here’s what a production-ready incremental model handling late arrivals looks like:

{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)"
    ],
    full_refresh=false
) }}

{% set lookback_days = var('lookback_days', 3) %}

SELECT
    event_id,
    event_date,
    user_id,
    event_data,
    _loaded_at
FROM {{ ref('base_events') }}
{% if is_incremental() %}
WHERE event_date >= (
    SELECT DATE_SUB(MAX(event_date), INTERVAL {{ lookback_days }} DAY)
    FROM {{ this }}
)
{% endif %}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY _loaded_at DESC
) = 1

This model:

  • Reprocesses the last 3 days by default
  • Allows override via --vars for backfills
  • Limits destination scans with predicates
  • Prevents accidental full refreshes
  • Deduplicates within the source data

The right pattern depends on your data’s latency profile, your table sizes, and your tolerance for drift. Start by measuring how late your data actually arrives, then pick the simplest approach that covers your needs.