dlt (data load tool) organizes data loading around four concepts: sources, resources, pipelines, and schemas. Understanding these four building blocks explains why a dlt pipeline can be 20 lines of Python and still handle pagination, rate limiting, schema inference, and incremental loading automatically.
Sources
A source is a logical grouping of related data extractions. You declare one with the @dlt.source decorator. The source defines configuration that multiple endpoints share — base URL, authentication credentials, common headers. Think of it as the container that holds several resources.
```python
@dlt.source
def my_api(api_key=dlt.secrets.value):
    return [users(), orders()]
```

Sources don’t do the extraction work themselves. They’re coordination objects: they hold shared configuration and bundle resources together so you can run them as a unit.
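Once a pipeline exists (covered in the Pipelines section below), the whole source runs as a unit. A minimal sketch, where the pipeline name and dataset are placeholders and api_key resolves from dlt's secrets configuration:

```python
import dlt

# Running the source extracts and loads both users() and orders() in one call.
pipeline = dlt.pipeline(pipeline_name="my_api_sync", destination="bigquery", dataset_name="raw_data")
load_info = pipeline.run(my_api())
```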
Resources
Resources are the modular extraction units. Each resource represents one specific data extraction — one endpoint, one table, one stream. You declare them with @dlt.resource.
The key design choice in dlt is that resources are Python generators. Generators yield data incrementally rather than accumulating it all in memory first, which means a resource can process millions of records without crashing on memory limits:
```python
@dlt.resource(write_disposition="merge", primary_key="id")
def users():
    for page in paginate_api("/users"):
        yield page
```

Resources are composable. They can be run standalone or grouped into a source. They carry their own configuration (write disposition, primary key, schema hints), so the behavior is self-contained and doesn’t depend on how they’re invoked.
Pipelines
A pipeline executes the work. You create one with dlt.pipeline(), naming the destination and dataset:
```python
pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="bigquery",
    dataset_name="raw_data",
)
```

The pipeline handles the full extract-normalize-load cycle. It extracts from your resources, normalizes nested structures into relational tables, and loads into the destination. Between runs, it tracks state so incremental loading knows where to resume.
Running a pipeline is a single call:
```python
load_info = pipeline.run(users())
```

The load_info object contains details about what was loaded: row counts, schema changes, timing. This is your primary handle on what happened.
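A sketch of inspecting the result, assuming a recent dlt version; raise_on_failed_jobs() and the last_trace attribute come from dlt's LoadInfo and Pipeline objects, though exact attributes can shift between releases:

```python
load_info = pipeline.run(users())

# Human-readable summary: load packages, destination, timings.
print(load_info)

# Fail loudly in a scheduled job if any load job did not complete.
load_info.raise_on_failed_jobs()

# Per-table row counts from the normalize step of this run.
print(pipeline.last_trace.last_normalize_info.row_counts)
```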
Schemas
dlt infers schemas automatically from your data. It detects types, normalizes nested JSON into relational tables, and handles schema evolution without manual intervention. When your source adds a new field, dlt triggers an automatic table migration. You don’t write ALTER TABLE statements.
Schema inference is the biggest workflow difference from writing raw pipeline code. Instead of maintaining column mappings as your source evolves, you let dlt discover and update them. For source APIs that change frequently (which is nearly all of them), this eliminates a large category of maintenance work.
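As a sketch of what that normalization looks like (default settings assumed, names are illustrative), a resource yielding a record with a nested list produces a parent table plus a child table:

```python
import dlt

@dlt.resource(primary_key="id")
def users():
    # One record with a nested list of addresses.
    yield {
        "id": 1,
        "name": "Ada",
        "addresses": [
            {"city": "London", "zip": "N1"},
            {"city": "Paris", "zip": "75001"},
        ],
    }

# Running this resource creates a `users` table and a `users__addresses` child
# table, with dlt-managed keys linking each address row back to its parent user.
```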
For production pipelines where you want to enforce rather than accept changes, schema contracts let you set explicit rules:
```python
@dlt.resource(schema_contract={"columns": "freeze"})
def orders():
    yield from get_orders()
```

Contract modes are evolve (accept any change), freeze (reject data that doesn’t fit the existing schema), discard_row (drop non-conforming rows), and discard_value (drop only the offending fields). A common pattern is to run with evolve during development and tighten to freeze once the schema stabilizes.
Write Dispositions
Write dispositions control how data lands in the destination table. There are three:
Replace drops and recreates the table each run. Simple and correct, but expensive at scale. Good for small reference tables or when you need a clean slate.
Append adds new records to existing data without touching what’s already there. The right choice for immutable event streams where records never update.
Merge upserts using a primary key or merge key. Matching rows update; new rows insert. This is what you want for mutable entities — users, orders, accounts — where records change over time:
```python
@dlt.resource(write_disposition="merge", primary_key="id")
def orders():
    yield from get_orders()
```

The choice of write disposition has significant performance and cost implications. Merge requires comparing incoming rows against existing data; on large tables this gets expensive. See dlt Incremental Loading for how write dispositions interact with incremental loading state.
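For contrast, the other two dispositions are declared the same way. A sketch reusing the article's hypothetical paginate_api helper plus an assumed fetch_countries() helper:

```python
# Immutable event stream: append-only, existing rows are never updated.
@dlt.resource(write_disposition="append")
def events():
    for page in paginate_api("/events"):
        yield page

# Small reference table, cheap to drop and rebuild on every run.
@dlt.resource(write_disposition="replace")
def countries():
    yield from fetch_countries()
```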
Why It Comes Together
The four concepts are designed to compose. A source bundles resources. Resources yield data incrementally. The pipeline runs them, infers schemas, and loads with the specified write disposition. You write Python functions — not YAML, not a proprietary DSL, not a UI-configured connector. The pipeline is version-controlled, testable, and runs anywhere Python runs.
A complete basic pipeline targeting BigQuery might be 20-30 lines:
```python
import dlt

# paginate_api stands in for a user-supplied helper that yields pages of records.
@dlt.resource(write_disposition="merge", primary_key="id")
def users():
    for page in paginate_api("/users"):
        yield page

pipeline = dlt.pipeline(
    pipeline_name="user_sync",
    destination="bigquery",
    dataset_name="raw",
)

load_info = pipeline.run(users())
print(load_info)
```

This handles pagination via the generator pattern, schema inference, BigQuery table creation, and upsert behavior, all without any additional framework configuration.
For more on how dlt handles state between runs and fetches only new data, see dlt Incremental Loading. For BigQuery-specific loading strategies and optimizations, see dlt and BigQuery Integration.