
Measuring Data Latency Before Choosing an Incremental Strategy

How to profile the gap between event time and load time in your source tables, and use that distribution to size lookback windows and choose the right incremental strategy.

dbt · bigquery · incremental processing · data quality · data engineering

The central question for any incremental model that handles late-arriving data is: how late does your data actually arrive? The answer depends on the source system. GA4 events can arrive days after they occur. Ad platform conversions backfill for 28 days. CDC pipelines replay records after outages in unpredictable bursts. Lookback windows should be sized from measured arrival distributions, not intuition.

Two Timestamps, One Gap

Two timestamps define whether a record is “late”:

  • Event time: When something actually happened (a purchase, a page view, a click)
  • Load time: When that record landed in your warehouse (often tracked as _loaded_at or _ingested_at)

A record is late when its event time falls before the maximum event time already processed by your incremental model. The gap between event time and load time — call it arrival latency — is what you’re measuring.
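To make that concrete, here is a minimal sketch of the standard dbt incremental filter that defines the high-water mark (the model and column names, stg_events, event_timestamp, _loaded_at, are placeholders, not anything prescribed here):

SELECT *
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
  -- Only rows newer than the latest event time already in the target table.
  -- A late-arriving row (old event_timestamp, recent _loaded_at) is skipped here
  -- unless the filter is widened with a lookback window.
  WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}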

The Latency Profile Query

Run this against your source tables before building any incremental model that will process that data:

SELECT
  DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
  COUNT(*) AS record_count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 2) AS pct_of_total
FROM source_table
WHERE _loaded_at > event_timestamp -- Only late records
GROUP BY 1
ORDER BY 1

This gives you a distribution: of all records that arrived after their event timestamp, what percentage arrived 1 day late, 2 days late, 3 days late, and so on.

A typical output might look like this:

days_late    record_count    pct_of_total
1            892,451         71.2%
2            187,332         14.9%
3            103,211         8.2%
4            42,118          3.4%
7            18,204          1.5%
14           7,891           0.6%
28+          3,102           0.2%

Reading this table: 71% of late records arrive within 1 day, 94% within 3 days, 97% within 4 days. That tail extending to 28+ days represents 0.2% of late records.

What the Distribution Tells You

The distribution answers three questions:

What window covers most of my data? Add up the cumulative percentage. In the example above, 3 days covers 94.3% of late arrivals, 4 days covers 97.7%, and 7 days pushes coverage past 99%. Each additional day of window buys a smaller slice of coverage than the one before it.

Is there a meaningful long tail? If 5% of records arrive after 14 days, a 3-day window silently misses a lot of data. If 0.2% arrive after 28 days (like above), a 7-day window is probably good enough with periodic full refreshes catching the rest.

Which sources need special handling? Some systems have well-behaved, predictable latency. Others have spiky patterns — mostly same-day with occasional multi-week delays from outages or backfills. Those need different strategies.

Sizing the Window from the Distribution

Once you have the distribution, match the window to your tolerance for missed data:

-- More complete version: include same-day records to see total picture
SELECT
  DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
  COUNT(*) AS record_count,
  ROUND(
    SUM(COUNT(*)) OVER (
      ORDER BY DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY)
      ROWS UNBOUNDED PRECEDING
    ) * 100.0 / SUM(COUNT(*)) OVER (),
    1
  ) AS cumulative_pct
FROM source_table
GROUP BY 1
ORDER BY 1
LIMIT 30

The cumulative_pct column directly tells you what percentage of all records a given window would capture. Find the window size where cumulative coverage reaches your threshold — typically 99% or 99.5% — and use that as your starting point.
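If you'd rather have the query hand you the number directly, the same logic can be wrapped to return the smallest window that clears your threshold. A sketch, assuming the same source_table and column names as above and a 99% target:

WITH latency AS (
  SELECT
    DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
    COUNT(*) AS record_count
  FROM source_table
  GROUP BY 1
),

cumulative AS (
  SELECT
    days_late,
    SUM(record_count) OVER (ORDER BY days_late) * 100.0
      / SUM(record_count) OVER () AS cumulative_pct
  FROM latency
)

-- Smallest window (in days) that captures at least 99% of all records
SELECT MIN(days_late) AS suggested_window_days
FROM cumulative
WHERE cumulative_pct >= 99.0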

Then factor in cost. On BigQuery with on-demand pricing, a 3-day lookback processes roughly three times as much source data per run as a 1-day lookback. On a table that processes 100GB daily, that's the difference between 100GB and 300GB scanned per run, roughly $0.50 versus $1.50 at standard on-demand rates. At scale or high frequency, this adds up.
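For reference, the chosen number ends up as the lookback in the model itself. A minimal sketch of a 3-day window in a dbt incremental model on BigQuery, using insert_overwrite over daily partitions (model and column names are illustrative):

{{ config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'event_date', 'data_type': 'date'}
) }}

SELECT
  *,
  DATE(event_timestamp) AS event_date
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
  -- Reprocess the last 3 partition days; sized from the latency profile, not guessed
  WHERE DATE(event_timestamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY)
{% endif %}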

Segmenting by Source System

Not all data in a pipeline has the same latency profile. Segment your analysis by source when you have multiple feeds:

SELECT
  source_system,
  DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
  COUNT(*) AS record_count,
  ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY source_system), 2) AS pct_of_source
FROM events_with_metadata
WHERE _loaded_at > event_timestamp
GROUP BY 1, 2
ORDER BY 1, 2

A unified events model might combine GA4 (mostly same-day) with ad platform conversions (commonly 3-7 day latency) and third-party enrichment (can be 14+ days). A single lookback window has to serve the worst case, which may be unnecessarily expensive for the better-behaved sources.

If sources have dramatically different latency profiles, consider separate models with different window configurations rather than forcing a single window to cover all cases.
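A lightweight way to do that without copy-pasting logic is to drive each model's window from a variable. A sketch of the filter, using an invented var name and default that each model or environment could override:

{% if is_incremental() %}
  -- Window driven by a var so each model can declare its own size
  -- (var name and default are illustrative, not a dbt convention)
  WHERE DATE(event_timestamp) >=
    DATE_SUB(CURRENT_DATE(), INTERVAL {{ var('lookback_days', 3) }} DAY)
{% endif %}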

Rerunning the Analysis

Data latency profiles change. A source system’s ETL pipeline gets slower, or a new API endpoint has different refresh behavior. Run this analysis:

  • Before building any new incremental model on a source you haven’t worked with
  • Quarterly on existing models, especially if incremental results start drifting from what a full refresh produces
  • After any known source system incident that may have changed how data arrives

The lookback window you chose based on last year’s latency profile may not fit this year’s data. The distribution query is fast — it’s a group-by aggregation on your source table — so there’s no reason not to check.
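One way to keep that check close at hand is to store the profile query in the project itself, for example as a dbt analysis that compiles but never runs on a schedule (the file path and source names here are assumptions):

-- analyses/source_latency_profile.sql
SELECT
  DATE_DIFF(DATE(_loaded_at), DATE(event_timestamp), DAY) AS days_late,
  COUNT(*) AS record_count
FROM {{ source('raw', 'events') }}
WHERE _loaded_at > event_timestamp
GROUP BY 1
ORDER BY 1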

Connecting the Analysis to Strategy

The latency profile doesn’t just size your lookback window. It also helps you choose the right incremental strategy:

  • Narrow, predictable latency (95%+ within 1-2 days): Standard merge or insert_overwrite with a 3-day window. Cost-effective and covers the vast majority of arrivals.
  • Moderate latency with a long tail (tail extending 7-14 days): Consider whether merge with incremental predicates is more appropriate than a wide insert_overwrite window. Or accept that periodic full refreshes will handle the tail.
  • Severe or unpredictable latency (weeks, from third-party feeds or outages): The long tail may be too expensive to catch with a wide window. Design for it explicitly: scheduled full refreshes, or alert when records arrive outside the window (sketched below).
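The alerting piece can be as simple as a scheduled query, or a dbt test, that counts recently loaded records whose event time falls outside the configured window. A sketch assuming a 7-day window and the same column names as earlier:

-- Records that landed in the last day but would be missed by a 7-day lookback.
-- A non-zero count means the window is silently dropping late arrivals.
SELECT COUNT(*) AS records_outside_window
FROM source_table
WHERE DATE(_loaded_at) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND DATE(event_timestamp) < DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY)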

The incremental model pattern is an optimization that trades some consistency for lower cost and faster runtimes. Measuring your actual latency profile is what makes that trade-off conscious rather than accidental.