
dlt Incremental Loading

How dlt tracks state between pipeline runs using cursor-based incremental loading — the dlt.sources.incremental() helper, declarative REST API config, and why state lives in the destination.

dlt · data engineering · etl · incremental processing

dlt’s incremental loading handles state tracking with minimal configuration and stores that state in the destination warehouse where it can be inspected directly. A pipeline that fetches the full dataset on every run is slower, more expensive, and more likely to hit rate limits — incremental loading addresses all three.

The Core Mechanism

dlt uses cursor-based incremental loading. You declare a cursor field — typically a timestamp or auto-incrementing ID — and dlt tracks the maximum value seen on each run. Subsequent runs fetch only records newer than that value.

The API for this is dlt.sources.incremental():

@dlt.resource(primary_key="id")
def orders(
    updated_at=dlt.sources.incremental(
        "updated_at",
        initial_value="2024-01-01T00:00:00Z"
    )
):
    for page in api.get_orders(since=updated_at.last_value):
        yield page

On first run, dlt uses initial_value as the historical backfill starting point. On subsequent runs, it uses the maximum updated_at seen in the previous run. The updated_at.last_value property provides the current cursor to pass to the API call. No checkpoint logic, no state file, no separate tracking table — declare the cursor field and dlt handles the rest.
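
A minimal sketch of running that resource end to end; the pipeline name, dataset name, and duckdb destination are illustrative, not part of the original example:

import dlt

pipeline = dlt.pipeline(
    pipeline_name="orders_pipeline",  # hypothetical name
    destination="duckdb",             # any supported destination works the same way
    dataset_name="raw_orders"
)

# First run backfills from initial_value; each later run resumes from the stored cursor.
load_info = pipeline.run(orders)
print(load_info)

Each successful run persists the new maximum cursor value along with the load, so the next run picks up where this one ended.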

Where State Lives

State persists in your destination warehouse, not in a separate state store. dlt creates a _dlt_pipeline_state table in your dataset that stores the cursor values and other pipeline metadata between runs.

This design choice has practical implications:

It’s visible. You can query the state table directly to see where your pipeline left off. No black-box state files, no external services to check.

It follows your data. If you restore a database backup or move to a different environment, the state moves with your data. There’s no separate state service to keep in sync.

It’s per-destination. If you run the same pipeline against dev and prod BigQuery datasets, each maintains its own state. They don’t interfere with each other.
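
Because it is an ordinary table, you can check it from the pipeline itself. A minimal sketch, reusing the hypothetical orders_pipeline from above; note that the state column stores serialized pipeline state rather than plain-text cursors, and the exact layout can vary between dlt versions:

# Sketch: peek at the state table dlt maintains in the destination.
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM _dlt_pipeline_state") as cursor:
        for row in cursor.fetchall():
            print(row)

The dlt pipeline <pipeline_name> info CLI command also surfaces pipeline state without writing any SQL.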

Declarative Incremental Config for REST APIs

For REST API sources, you configure incremental loading declaratively rather than writing Python logic:

{
    "name": "orders",
    "endpoint": {
        "path": "/orders",
        "params": {
            "since": {
                "type": "incremental",
                "cursor_path": "updated_at",
                "initial_value": "2024-01-01T00:00:00Z"
            }
        }
    }
}

This configuration does the same thing as the Python generator example above. The cursor_path tells dlt where to find the cursor value in the response, and initial_value sets the starting point. dlt handles tracking and injects the current cursor value into the since parameter on each run.
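
For context, a resource entry like this normally sits inside a full rest_api_source configuration alongside client settings. The base URL and surrounding structure below are a sketch, not part of the original example:

from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {"base_url": "https://api.example.com"},  # placeholder base URL
    "resources": [
        {
            "name": "orders",
            "endpoint": {
                "path": "/orders",
                "params": {
                    "since": {
                        "type": "incremental",
                        "cursor_path": "updated_at",
                        "initial_value": "2024-01-01T00:00:00Z"
                    }
                }
            }
        }
    ]
})

# Runs exactly like a hand-written resource, with the same state tracking.
pipeline.run(source)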

The declarative form is particularly useful for AI-assisted pipeline development: you can generate these configurations from API documentation without writing Python logic. See dlt for AI-Assisted Pipeline Development for the full workflow.

Memory Efficiency

The generator pattern that makes incremental loading work also handles memory. A generator-based resource yields pages rather than accumulating the entire dataset:

@dlt.resource(primary_key="id")
def orders(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
    for page in api.get_orders(since=updated_at.last_value):
        yield page  # yields one page at a time, not the entire dataset

This means a pipeline processing millions of records doesn’t hold them all in memory simultaneously. Each page is yielded, normalized, and loaded before the next page is fetched. For large datasets or APIs that return significant data volumes, this is the difference between a pipeline that works and one that OOMs.
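
The api.get_orders stand-in glosses over how pages actually arrive. One way to get the same page-at-a-time behavior is dlt's RESTClient helper, whose paginate() yields one page per iteration; the base URL, path, and since parameter here are illustrative:

import dlt
from dlt.sources.helpers.rest_client import RESTClient

client = RESTClient(base_url="https://api.example.com")  # placeholder base URL

@dlt.resource(primary_key="id")
def orders(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
    # paginate() yields one page of records per iteration, so memory use stays bounded
    # regardless of how many pages the endpoint returns.
    for page in client.paginate("/orders", params={"since": updated_at.last_value}):
        yield page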

Write Disposition Interaction

Incremental loading pairs with the merge write disposition for mutable data. Without primary_key, dlt can track cursor values but can’t deduplicate — you’d get duplicate rows if a record appears in multiple incremental windows.

For append-only data (event logs, click streams, anything that doesn’t update), incremental loading with append disposition works cleanly: track the maximum ID or timestamp, fetch only newer records, append them. No deduplication needed.
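
A minimal sketch of that append-only case, assuming a hypothetical api.get_events call and an auto-incrementing id:

@dlt.resource(write_disposition="append")
def events(
    last_id=dlt.sources.incremental("id", initial_value=0)
):
    # Immutable events: fetch only records with an id above the highest one seen so far
    # and append them; nothing ever needs to be updated in place.
    for page in api.get_events(after_id=last_id.last_value):
        yield page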

For mutable entities — users, orders, anything that updates — you need both incremental loading (to fetch only changed records) and merge disposition with a primary_key (to upsert rather than duplicate):

@dlt.resource(
    write_disposition="merge",
    primary_key="id"
)
def users(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01")
):
    for page in api.get_users(modified_since=updated_at.last_value):
        yield page

Relationship to dbt Incremental Models

If you’re using dbt downstream of dlt, you’re managing incrementality at two layers. dlt handles the extraction layer — fetching only new or changed records from the source. dbt handles the transformation layer — processing only new or changed records through your models.

These layers are independent but complementary. dlt’s incremental loading reduces what lands in your raw layer. dbt’s incremental models reduce what gets reprocessed in your transformation layer. A well-structured pipeline uses both.

The mental model differs between them: dlt’s incremental loading is about API state tracking (where did I leave off?), while dbt’s incremental models are about query optimization (how do I avoid scanning the full table?). Both improve efficiency, but they solve different problems.

When Incremental Loading Matters

Not every pipeline needs incremental loading. For small datasets that load in seconds and cost negligible compute, full replacement (replace disposition) is simpler and avoids state management edge cases.
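
For that small-dataset case, the resource stays stateless. A sketch, with a hypothetical api.get_countries lookup:

@dlt.resource(write_disposition="replace")
def countries():
    # Small lookup table: re-fetch everything and overwrite the destination table on each run.
    yield from api.get_countries()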

Incremental loading is appropriate when:

  • The source dataset is large enough that full extraction takes significant time
  • The source API rate-limits total data transfer
  • The destination table is large enough that full reload is expensive
  • Frequent pipeline runs require near-real-time data

For how incremental loading fits into the broader dlt pipeline structure, see dlt Core Concepts. For BigQuery-specific considerations around incremental loads and staging, see dlt and BigQuery Integration.