dlt: The Python-Native Data Loader That Changes the Build vs Buy Equation

Analytics engineers face an uncomfortable choice when loading data into warehouses. Pay $12,000+ per year for managed tools like Fivetran, or spend weeks building custom pipelines that require ongoing maintenance. dlt offers a third path that’s gaining traction among Python-proficient data teams.

The Problem dlt Solves

Adrian Brudaru founded dltHub in Berlin after a decade of freelance data engineering work. His insight was simple: every data project required recreating similar solutions from scratch. The wheel kept getting reinvented.

The data shows this isn’t just one person’s experience. According to Wakefield Research, data engineers spend 44% of their time building and maintaining pipelines, costing companies approximately $520,000 per year. Custom connector development alone takes 50-100 hours per connector.

dlt addresses what its creators call “the unsolved problem of Pythonic data ingestion.” The library fills a specific gap between expensive SaaS solutions that charge per row and building everything from scratch with requests and pandas.

The core philosophy is code-first and Python-native. Pipelines are standard Python scripts, not YAML configurations or UI-based builders. You install it with pip, write Python, and run pipelines wherever Python runs. No containers, orchestration servers, or external APIs required to get started.
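As a quick sketch of what "getting started" means in practice (the names and sample data are illustrative, loading into a local DuckDB file):

import dlt

# The entire pipeline is one ordinary Python script.
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",         # swap for "bigquery" later
    dataset_name="example_data",
)

# Any iterable of dicts works for a first test load.
load_info = pipeline.run(
    [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}],
    table_name="users",
)
print(load_info)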

Core Concepts for Analytics Engineers

dlt organizes data loading around four concepts: sources, resources, pipelines, and schemas.

Sources are logical groupings declared with the @dlt.source decorator. They define common configuration (base URL, authentication) that multiple endpoints share.
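A sketch of that grouping (the endpoints and the fetch_json helper are hypothetical):

import dlt

@dlt.source
def shop_api(api_key=dlt.secrets.value):
    # Shared configuration: one place for the base URL and auth.
    base_url = "https://api.example.com/v1"

    @dlt.resource
    def customers():
        yield from fetch_json(f"{base_url}/customers", api_key)  # hypothetical helper

    @dlt.resource
    def orders():
        yield from fetch_json(f"{base_url}/orders", api_key)  # hypothetical helper

    return customers, orders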

Resources are modular building blocks representing specific data extractions. Declared with @dlt.resource, they can be Python generators that yield data incrementally for memory efficiency:

@dlt.resource(write_disposition="merge", primary_key="id")
def users():
    for page in paginate_api("/users"):
        yield page

Pipelines execute the actual work. You create one with dlt.pipeline(), specifying the destination and dataset name. The pipeline handles extraction, normalization, and loading while tracking state between runs.
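A sketch of that step, reusing the users() resource above (the pipeline and dataset names are illustrative):

import dlt

pipeline = dlt.pipeline(
    pipeline_name="app_api",
    destination="bigquery",
    dataset_name="raw_app",
)

# Extract, normalize, and load in one call; run state is stored for next time.
load_info = pipeline.run(users())
print(load_info)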

Schemas are inferred automatically from your data. dlt detects types, normalizes nested structures into relational tables, and handles evolution automatically. When source data adds new fields, dlt triggers automatic table migrations. For production pipelines, schema contracts let you enforce data quality constraints instead of accepting any change.
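A hedged sketch of what a contract can look like on a resource (get_payments() is a hypothetical extraction helper; pick modes to suit your pipeline):

@dlt.resource(
    schema_contract={
        "tables": "evolve",     # new tables may still be created
        "columns": "freeze",    # unexpected new columns raise an error
        "data_type": "freeze",  # data type changes raise an error
    }
)
def payments():
    yield from get_payments()   # hypothetical helper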

Three write dispositions control how data lands:

  • replace: Full table replacement each run
  • append: Add new records to existing data
  • merge: Upsert using primary or merge keys

These building blocks combine well in practice. A basic dlt script might be 20 lines of Python, but it handles pagination, rate limiting, schema inference, and incremental loading automatically.

BigQuery Integration

For analytics engineers on BigQuery, dlt includes native optimizations worth understanding.

Installation adds the BigQuery extras:

pip install "dlt[bigquery]"

Configuration lives in .dlt/secrets.toml with your service account credentials:

[destination.bigquery]
project_id = "your-project"
private_key = "-----BEGIN PRIVATE KEY-----\n..."
client_email = "sa@your-project.iam.gserviceaccount.com"

dlt offers two loading strategies for BigQuery. Streaming inserts suit low-latency, append-only loads: data appears quickly, but you pay per row ingested. GCS staging handles large loads better: dlt uploads files to a Cloud Storage bucket, then loads them into BigQuery with batch load jobs. For anything beyond small datasets, staging is the right choice and can significantly reduce loading costs, since batch load jobs are free while streaming inserts are billed by data volume.
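A minimal sketch of enabling staging, assuming a GCS bucket you already control (the bucket name below is hypothetical):

import dlt

# Point the "filesystem" staging destination at GCS by setting
# [destination.filesystem] bucket_url = "gs://your-staging-bucket"
# in .dlt/config.toml.
pipeline = dlt.pipeline(
    pipeline_name="events_bq",
    destination="bigquery",
    staging="filesystem",    # stage load files in GCS, then load into BigQuery
    dataset_name="raw_events",
)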

The bigquery_adapter() function exposes partitioning and clustering:

from dlt.destinations.adapters import bigquery_adapter

@dlt.resource
def events():
    yield from get_events()

# Partition by date, cluster by user_id
bigquery_adapter(
    events,
    partition="event_date",
    cluster=["user_id"]
)

Nested JSON structures normalize into parent-child tables automatically. An API returning users with nested addresses creates both a users table and a users__addresses child table, linked by _dlt_id and _dlt_parent_id columns. You can control nesting depth with max_table_nesting. Setting it to 2-3 creates readable schemas without excessive table proliferation.
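As an illustrative sketch, nesting depth can be capped at the source level (the resources here are hypothetical):

@dlt.source(max_table_nesting=2)
def crm():
    # Structures nested deeper than two levels stay as JSON columns
    # instead of spawning further child tables.
    return [users, companies]  # hypothetical resources defined elsewhere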

dlt also creates metadata tables in your dataset: _dlt_loads tracks pipeline runs, _dlt_pipeline_state stores incremental loading state, and _dlt_version records schema versions.
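For example, a sketch of inspecting recent runs from Python, assuming a pipeline object like the one created earlier:

# Read the _dlt_loads table through the pipeline's SQL client.
with pipeline.sql_client() as client:
    rows = client.execute_sql(
        "SELECT load_id, status, inserted_at "
        "FROM _dlt_loads ORDER BY inserted_at DESC LIMIT 5"
    )
    for load_id, status, inserted_at in rows:
        print(load_id, status, inserted_at)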

Incremental Loading

Incremental loading separates toy pipelines from production ones, much like incremental models in dbt separate prototypes from production transformations. dlt handles this with cursor-based tracking that requires minimal configuration.

Cursor-based incremental loading uses dlt.sources.incremental() with a cursor path:

@dlt.resource(primary_key="id")
def orders(
    updated_at=dlt.sources.incremental(
        "updated_at",
        initial_value="2024-01-01T00:00:00Z"
    )
):
    for page in api.get_orders(since=updated_at.last_value):
        yield page

On first run, dlt uses the initial value. On subsequent runs, it automatically tracks the maximum cursor value seen and only fetches newer records. The state persists in your destination, so no external state store is needed.

For REST API sources, the configuration is declarative:

{
    "name": "orders",
    "endpoint": {
        "path": "/orders",
        "params": {
            "since": {
                "type": "incremental",
                "cursor_path": "updated_at",
                "initial_value": "2024-01-01T00:00:00Z"
            }
        }
    }
}
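A sketch of where that definition plugs in, using dlt's built-in REST API source (the base URL is hypothetical, and orders_config stands for the dictionary above):

import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {"base_url": "https://api.example.com/v1/"},  # hypothetical API
    "resources": [orders_config],  # the "orders" dictionary shown above
})

pipeline = dlt.pipeline(
    pipeline_name="orders_api",
    destination="bigquery",
    dataset_name="raw_orders",
)
pipeline.run(source)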

Memory efficiency comes from yielding pages instead of accumulating entire datasets. A generator-based resource can process millions of records without loading them all into memory.

Production Readiness Check

Numbers help assess whether a tool is ready for production workloads.

dlt’s GitHub repository shows ~4,700 stars, 400+ forks, 146 contributors, and over 4,000 commits. Version 1.19.1 shipped in December 2025. The library reached stable 1.0 status and continues active development. Over 1,300 repositories use dlt, with 113 total releases published.

More important than GitHub metrics are production deployments.

One practitioner reported “ETL cost down 182x per month, sync time improved 10x” after migrating from Fivetran to dlt.

The dlt Slack community provides active support, which helps when you’re troubleshooting a pipeline at 2 AM.

Limitations

dlt has real constraints that affect whether it’s right for your team.

No built-in monitoring dashboard. You either build observability yourself or wait for dltHub’s planned platform. In the meantime, teams integrate with Dagster or Airflow UIs, or build custom logging.
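In the simplest case, custom logging can be built on the load info that every run returns; a sketch:

# Fail loudly if any load package did not land, and log the summary.
load_info = pipeline.run(users())
load_info.raise_on_failed_jobs()  # raises if any job in this load failed
print(load_info)                  # human-readable run summary for your logs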

No visual interface. Everything is code. This is a feature for some teams and a blocker for others. Non-technical users can’t configure connectors themselves.

DIY operations. Teams handle deployment, scaling, security, and monitoring. dlt gives you the loading primitives, but running it in production requires infrastructure work. You’re responsible for scheduling, alerting, and failure recovery.

No native CDC connectors. If you need change data capture from databases with something like Debezium, you’ll need to build that integration or use another tool.

Limited pre-built connectors. dlt offers 60+ verified sources compared to Fivetran’s 700+. The REST API builder can generate pipelines for many more APIs, but you’re writing configuration rather than clicking “enable.”

These trade-offs shift costs from subscription fees to engineering time.

When dlt Is the Right Choice

dlt fits specific team profiles well.

Python-proficient teams who want control. If your data engineers are comfortable writing and maintaining Python, dlt feels natural. The code-first approach means pipelines are version-controlled, testable, and reviewable like any other code.

Budget-conscious teams willing to invest engineering time. dlt is Apache 2.0 licensed with no fees. You pay for compute and storage only. For teams with engineering capacity but limited tool budgets, the economics work out clearly.

Greenfield projects or custom sources. When you’re building something new and need to ingest from APIs without pre-built connectors, dlt’s REST API source and decorator patterns let you move fast.

Teams already orchestrating with Airflow or Dagster. dlt integrates well with existing orchestrators. If you’re running Airflow anyway, adding dlt pipelines as tasks is straightforward, and certainly simpler than running both Fivetran and an orchestrator. Teams using Google Cloud Functions for dbt can deploy dlt pipelines on the same infrastructure.

Rapid prototyping needs. dlt can run locally with DuckDB as a destination. You can prototype a pipeline in an afternoon, verify the data shape, then switch to BigQuery for production.
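A sketch of that promotion path; only the destination changes (names are illustrative):

import dlt

# Prototype locally: DuckDB needs no credentials or infrastructure.
dev = dlt.pipeline(pipeline_name="app_api", destination="duckdb", dataset_name="staging")
dev.run(users())

# Promote by switching the destination; BigQuery credentials
# are read from .dlt/secrets.toml as shown earlier.
prod = dlt.pipeline(pipeline_name="app_api", destination="bigquery", dataset_name="raw_app")
prod.run(users())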

When to Look Elsewhere

dlt isn’t universally appropriate.

Non-technical data teams. If your data practitioners don’t write Python, dlt creates a hard dependency on engineering for every connector change.

Immediate need for 700+ connectors. If you need to ingest from dozens of SaaS tools tomorrow, Fivetran or Airbyte gives you pre-built connectors that dlt doesn’t have.

Compliance requirements demanding SOC 2 out of the box. dlt inherits security from your infrastructure, which works for some organizations. Others need the attestation that comes with managed tools.

No engineering capacity for pipeline maintenance. APIs change, schemas evolve, edge cases appear. Someone needs to maintain dlt pipelines. If that capacity doesn’t exist, managed tools make more sense despite higher costs.

Very high sync frequency requirements. While dlt can run on any schedule your orchestrator supports, the operational overhead of sub-hourly syncs is yours to manage. Managed tools handle this for you.

The Strategic Picture

The data integration market is shifting. Fivetran’s March 2025 pricing change (moving from account-wide to per-connector MAR tiering) drove significant frustration among users, with reports of 2-8x cost increases. Marketing data with high update frequency is particularly affected.

This pricing pressure is pushing more teams to evaluate alternatives seriously. dlt occupies a specific position: production-grade data loading with zero licensing cost, requiring Python skills and engineering ownership.

For the right team, dlt eliminates the build-vs-buy false dichotomy. The library handles pagination, incremental loading, schema evolution, and destination specifics, so you’re not building from scratch. But the pipeline is your code, running on your infrastructure, evolving under your control.