Late-arriving data is the primary reason incremental models drift from source truth. A record with a timestamp of March 20 arrives on March 24. The incremental model already processed March 20's data three days earlier. Without explicit handling, that late record is missed permanently, and the gap between model and source widens with each run.
GA4 events, ad platform conversions, and CDC pipelines all exhibit late arrival. The lookback window pattern is the standard approach.
## The Pattern
Instead of processing only data newer than the most recent record in your table, you subtract a fixed window and reprocess a rolling period of recent data:
```sql
{% set lookback_days = var('lookback_days', 3) %}

{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge'
) }}

SELECT
    event_id,
    event_timestamp,
    user_id,
    event_type
FROM {{ ref('base__analytics__events') }}

{% if is_incremental() %}
WHERE event_timestamp >= (
    SELECT DATEADD(DAY, -{{ lookback_days }}, MAX(event_timestamp))
    FROM {{ this }}
)
{% endif %}
```

Every run reprocesses the last 3 days of data, catching records that arrived late within that window. The unique_key ensures that records already present in the target get updated rather than duplicated (assuming you're using a merge or delete+insert strategy).
Using var() with a default makes the lookback overridable for ad-hoc backfills:
```bash
dbt run --select my_model --vars '{"lookback_days": 10}'
```

This lets you extend the window after known pipeline issues without changing model code.
## Choosing the Window Size
The right window size depends on your data’s latency profile — how long records take to arrive after their event timestamp. This varies wildly by source system.
| Window | Coverage | Cost Impact |
|---|---|---|
| 1 day | Minimal; misses most late data | Lowest |
| 3 days | Good balance; handles 99%+ typical late arrivals | Moderate |
| 7 days | Catches nearly all late data | Higher |
| 14+ days | Maximum coverage | Significant |
Three days is a common default for event-driven models and covers the majority of late arrivals in systems like GA4, ad platforms, and typical CDC pipelines. Measuring the actual latency distribution — what percentage of records arrive same-day vs. within 3 days vs. within a week — determines the appropriate window for a specific source.
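A quick profiling query gets you that distribution. Here's a minimal sketch, assuming the warehouse table carries an ingestion timestamp (called _loaded_at below; GA4 exports, Fivetran, and most CDC tools write some equivalent column, so the name is an assumption to adapt):

```sql
-- Bucket records by how long after their event timestamp they landed
-- in the warehouse. _loaded_at is a stand-in for your loader's
-- ingestion-time column. Snowflake syntax, matching the first example.
SELECT
    CASE
        WHEN DATEDIFF(DAY, event_timestamp, _loaded_at) = 0 THEN '0: same day'
        WHEN DATEDIFF(DAY, event_timestamp, _loaded_at) <= 3 THEN '1: within 3 days'
        WHEN DATEDIFF(DAY, event_timestamp, _loaded_at) <= 7 THEN '2: within 7 days'
        ELSE '3: over 7 days'
    END AS arrival_bucket,
    COUNT(*) AS records,
    ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM {{ ref('base__analytics__events') }}
GROUP BY 1
ORDER BY 1
```

If 99% of records land within 3 days, a 3-day window is defensible; if a meaningful tail arrives later, widen it or lean harder on periodic full refreshes.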
The cost trade-off is real. A 3-day lookback means you reprocess 3 days of source data on every run instead of just the newest records. On a table that processes 100 GB per day, that's 300 GB scanned per run. At BigQuery's on-demand rate of $6.25 per TiB, that's roughly the difference between $0.60 and $1.80 per run. At scale or high frequency, this adds up.
## Lookback + Incremental Predicates
If you’re using the merge strategy on a large table, the lookback window filters the source side but the merge still scans the entire destination table to find matches. You need incremental predicates to limit the destination scan too:
```sql
{% set lookback_days = 3 %}

{{ config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    incremental_predicates=[
        "DBT_INTERNAL_DEST.event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL " ~ lookback_days ~ " DAY)"
    ]
) }}

SELECT *
FROM {{ ref('base__events') }}

{% if is_incremental() %}
WHERE event_date >= (
    SELECT DATE_SUB(MAX(event_date), INTERVAL {{ lookback_days }} DAY)
    FROM {{ this }}
)
{% endif %}
```

The predicate window should be at least as wide as the lookback window. Narrower destination predicates can cause the merge to miss legitimate matches.
## Lookback With insert_overwrite
On BigQuery, the insert_overwrite strategy replaces entire partitions. You can use a static partition list as a form of lookback:
```sql
{% set partitions_to_replace = [
    'CURRENT_DATE()',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)',
    'DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)'
] %}

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'},
    partitions=partitions_to_replace
) }}

-- Minimal model body for completeness: the source filter should cover
-- the same window as the partition list, so every replaced partition
-- is fully rebuilt.
SELECT *
FROM {{ ref('base__events') }}

{% if is_incremental() %}
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)
{% endif %}
```

This replaces the last 3 days of partitions on every run. It's often more cost-effective than merge with a lookback because there's no row-by-row comparison: the entire partition gets swapped.
## The Fundamental Limitation
Lookback windows don’t catch records arriving after the window closes. A record from January that arrives in March won’t be caught by a 3-day or even 14-day lookback.
This isn’t a bug — it’s an inherent trade-off. Making the window large enough to catch every possible late arrival would mean reprocessing so much data that you’d lose the performance benefit of incremental processing entirely.
The solution is layered:
- Lookback window handles routine late arrivals (covers 95-99% of cases)
- Periodic full refreshes reset accumulated drift (weekly or monthly, depending on your tolerance)
- On-demand full refresh after known pipeline outages or source system issues
For very large tables where full refresh is expensive or slow, schedule full refreshes during off-peak hours or on weekends. Some teams run incremental daily and full-refresh weekly. Others run incremental hourly and full-refresh monthly. The right cadence depends on your data latency profile and your tolerance for drift.
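One way to wire up that cadence is plain cron around dbt Core. A sketch, where the paths, times, and model name are all illustrative:

```bash
# Daily incremental run at 02:00; weekly full refresh on Sunday at 04:00
# to reset accumulated drift. --full-refresh rebuilds the incremental
# table from scratch instead of applying the incremental filter.
0 2 * * * cd /opt/dbt-project && dbt run --select my_model
0 4 * * 0 cd /opt/dbt-project && dbt run --select my_model --full-refresh
```

The same two-job split translates directly to dbt Cloud or any orchestrator: one frequent incremental job, one infrequent full-refresh job.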
## Monitoring Drift
You can measure drift by comparing incremental results against a full-refresh baseline on a sample of data. If your incremental model and a full refresh of the same time window produce different row counts or aggregates, you have drift.
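A minimal drift check, assuming you materialize a full-refresh copy into an audit schema (the analytics.my_model and audit.my_model__baseline names are illustrative) and compare daily row counts:

```sql
-- Non-empty output means drift: days where the incremental model and
-- the full-refresh baseline disagree on row counts.
SELECT
    COALESCE(i.event_date, b.event_date) AS event_date,
    COALESCE(i.row_count, 0) AS incremental_rows,
    COALESCE(b.row_count, 0) AS baseline_rows
FROM (
    SELECT CAST(event_timestamp AS DATE) AS event_date, COUNT(*) AS row_count
    FROM analytics.my_model
    GROUP BY 1
) i
FULL OUTER JOIN (
    SELECT CAST(event_timestamp AS DATE) AS event_date, COUNT(*) AS row_count
    FROM audit.my_model__baseline
    GROUP BY 1
) b
    ON i.event_date = b.event_date
WHERE COALESCE(i.row_count, 0) <> COALESCE(b.row_count, 0)
ORDER BY event_date
```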
dbt’s unit testing can validate that your lookback logic is correct for known late-arrival patterns. Test with records that arrive at the boundary of your lookback window to ensure they’re captured.
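Here's a sketch of such a test using dbt's unit test syntax (dbt Core 1.8+), with the model name, dates, and trimmed column list all illustrative. With MAX(event_timestamp) in the existing table at March 23 and a 3-day lookback, the window opens at March 20, so a late record stamped exactly March 20 must survive the filter:

```yaml
unit_tests:
  - name: lookback_captures_boundary_record
    model: my_model  # the merge-strategy model from the first example
    overrides:
      macros:
        is_incremental: true  # force the incremental branch of the Jinja
    given:
      - input: ref('base__analytics__events')
        rows:
          - {event_id: 1, event_timestamp: "2024-03-23 12:00:00"}  # on time
          - {event_id: 2, event_timestamp: "2024-03-20 12:00:00"}  # late, at the boundary
          - {event_id: 3, event_timestamp: "2024-03-19 12:00:00"}  # outside the window
      - input: this  # mock of the already-built target table
        rows:
          - {event_id: 1, event_timestamp: "2024-03-23 12:00:00"}
    expect:
      rows:
        - {event_id: 1, event_timestamp: "2024-03-23 12:00:00"}
        - {event_id: 2, event_timestamp: "2024-03-20 12:00:00"}
```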
The key mindset shift: incremental models are not guaranteed to be perfectly consistent with source. They’re an optimization that trades some consistency for dramatically lower cost and faster runtimes. The lookback window and periodic full refresh are how you keep that trade-off acceptable.