dlt (data load tool) is a Python-native, declarative library for building custom data pipelines with no infrastructure required. Its design properties — standard Python, declarative config, well-structured documentation — make it suitable for AI-assisted pipeline generation.
Why dlt Maps Well to AI
dlt is standard Python — no proprietary DSLs, no YAML configuration formats, no backend infrastructure to manage.
Three properties are relevant to AI-assisted development:
Python is the language with the most LLM training data. Generating a dlt pipeline means working in Python rather than a proprietary DSL like Meltano YAML or Airbyte connector configs, which have far less representation in training data.
Declarative patterns are constrained. dlt’s REST API builder uses a declarative style where you describe endpoints, pagination, authentication, and write disposition, and the framework handles execution. Constrained output patterns produce more consistent AI-generated code.
The documentation is structured for machine consumption. AI assistants can navigate it to generate configurations from API documentation. One dlt user reported completing an entire pipeline “in five minutes using the library’s documentation,” with the workflow being: point AI at the source API docs and dlt’s reference, and generate the configuration.
The REST API Builder in Practice
dlt’s REST API source provides a declarative way to connect to any REST API. A typical marketing API pipeline targeting BigQuery takes about 30 lines:
```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Define a marketing API source with pagination
source = rest_api_source({
    "client": {
        "base_url": "https://api.marketing-platform.com/v1",
        "auth": {"type": "bearer", "token": dlt.secrets.value},
    },
    "resources": [
        {
            "name": "campaigns",
            "endpoint": {
                "path": "campaigns",
                "paginator": {"type": "offset", "limit": 100},
            },
            "write_disposition": "merge",
            "primary_key": "id",
        }
    ],
})

# Create pipeline targeting BigQuery
pipeline = dlt.pipeline(
    pipeline_name="marketing_data",
    destination="bigquery",
    dataset_name="marketing",
)

# Run it
load_info = pipeline.run(source)
```

This covers pagination, authentication, incremental loading via merge disposition, and BigQuery-specific optimizations, all in a configuration that an AI assistant can generate from an API spec.
The key features that eliminate boilerplate:
- Automatic schema inference. dlt inspects the data and creates schemas. No manual column-by-column mapping.
- Schema evolution. When source schemas change (a constant problem with ad platform APIs), dlt handles it automatically. New columns are added. Type changes are managed. This eliminates one of the largest maintenance burdens of custom pipelines.
- Incremental loading. The `write_disposition: "merge"` configuration handles state management declaratively. No manual tracking of what's been loaded, no checkpoint management. A sketch follows this list.
- Nested JSON handling. Nested JSON structures flatten automatically into child tables with configurable nesting depth. Marketing API responses are often deeply nested, and dlt handles this without custom flattening logic.
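For APIs that don't fit the declarative builder, the same merge-based incremental pattern can be written as a custom resource. This is a minimal sketch: the endpoint URL, the `modified_since` parameter, and the response shape are assumptions, while `dlt.sources.incremental`, `write_disposition`, and `primary_key` are dlt's actual API.

```python
import dlt
import requests

@dlt.resource(write_disposition="merge", primary_key="id")
def campaigns(
    updated_at=dlt.sources.incremental("updated_at", initial_value="2024-01-01T00:00:00Z")
):
    # dlt persists updated_at.last_value between runs, so each run
    # requests only rows modified since the previous load.
    params = {"modified_since": updated_at.last_value}  # hypothetical query param
    response = requests.get(
        "https://api.marketing-platform.com/v1/campaigns", params=params
    )
    response.raise_for_status()
    yield response.json()["results"]  # assumed response shape

pipeline = dlt.pipeline(
    pipeline_name="marketing_data", destination="bigquery", dataset_name="marketing"
)
pipeline.run(campaigns)
```

Rows with an existing primary key are updated in place; new rows are inserted. The cursor state travels with the pipeline, which is what makes the configuration declarative rather than hand-managed.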
BigQuery-Specific Features
dlt isn't warehouse-agnostic in the way some tools are; it ships optimizations specific to each destination. For BigQuery:
GCS staging for large loads. Instead of streaming data directly to BigQuery (which costs $0.01 per 200 MB), dlt can stage through Google Cloud Storage and use free batch loading. For marketing data pipelines that move gigabytes daily, this is a meaningful cost difference. See BigQuery Cost Model for why batch loading versus streaming insert costs matter.
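Enabling staging is a pipeline-level switch. A minimal sketch, assuming a GCS bucket configured for dlt's filesystem destination (the bucket URL is a placeholder and would normally live in `secrets.toml` or an environment variable):

```python
import dlt

# Point dlt's filesystem destination at a GCS bucket, e.g. via:
#   export DESTINATION__FILESYSTEM__BUCKET_URL="gs://my-staging-bucket"  # placeholder

# Stage load files in GCS, then use BigQuery's free batch load jobs
# instead of per-byte streaming inserts.
pipeline = dlt.pipeline(
    pipeline_name="marketing_data",
    destination="bigquery",
    staging="filesystem",  # dlt's filesystem destination doubles as a staging area
    dataset_name="marketing",
)
```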
Partitioning and clustering. The bigquery_adapter() function lets you configure partition columns and clustering keys as part of the pipeline definition:
```python
from dlt.destinations.adapters import bigquery_adapter

# Apply BigQuery-specific optimizations
bigquery_adapter(
    source.campaigns,
    partition="date",
    cluster=["campaign_id", "ad_group_id"],
)
```

For marketing data, partitioning by date and clustering on campaign or ad group IDs is the standard configuration. These optimizations are simple configuration, not custom engineering.
Streaming inserts. For low-latency scenarios where data needs to be queryable within seconds, dlt supports BigQuery streaming inserts. Most marketing data pipelines don’t need this (hourly or daily granularity is sufficient), but it’s available for real-time bidding or alerting use cases.
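Per dlt's BigQuery adapter options, switching a resource to streaming inserts is a one-argument change. A sketch, reusing the `source` defined earlier:

```python
from dlt.destinations.adapters import bigquery_adapter

# Rows become queryable within seconds, at BigQuery's
# streaming-insert price rather than free batch loading.
bigquery_adapter(source.campaigns, insert_api="streaming")
```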
Automatic table naming. Data lands in datasets with tables named after resources. The pipeline above creates a marketing.campaigns table. Nested JSON structures produce child tables like marketing.campaigns__tags with referential keys back to the parent.
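Nesting depth is adjustable on the source object. A one-line sketch, assuming the `source` from earlier:

```python
# Flatten only one level deep; anything nested further
# is stored as a JSON column instead of a child table.
source.max_table_nesting = 1
```

Child tables such as marketing.campaigns__tags carry a `_dlt_parent_id` column referencing the parent row's `_dlt_id`, which is how the referential keys back to the parent are implemented.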
Production Results
Artsy replaced a 10-year-old Ruby pipeline with dlt. Load times dropped from 2.5 hours to under 30 minutes. Some pipelines ran up to 98% faster, with cost savings of 96% or more.
One user reported a 182x reduction in monthly ETL costs after switching from Fivetran to dlt, reflecting the shift from per-row managed pricing to a library running on existing infrastructure.
In September 2024, users created 50,000 custom connectors, a 20x increase from the start of that year. Monthly downloads reached 3 million. The library passed its 1.0 stability milestone and now sits at version 1.19. PostHog runs it in production.
The AI + dlt Workflow
The practical workflow looks like this:
1. Identify the API. Read the source's API documentation. Understand the endpoints, authentication method, pagination style, and rate limits.
2. Generate with AI. Give the AI the API documentation and dlt's REST API reference. Ask it to generate the pipeline configuration. For well-documented APIs (Google Ads, Meta, most SaaS platforms), AI generates a working configuration in minutes.
3. Add BigQuery optimizations. Configure partitioning, clustering, and GCS staging. These are standard configurations that AI handles well.
4. Handle edge cases manually. Rate limiting nuances, undocumented API quirks, business-specific transformation logic. This is where human judgment matters most.
5. Test incrementally. Run against a small data slice. Verify schema inference. Check that incremental loading handles updates correctly. Promote to production. A sketch of this step follows.
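Step 5 in practice might look like the following sketch, reusing the `source` defined earlier. Using duckdb as a local scratch destination and the dev dataset name are assumptions; `add_limit` and schema inspection are dlt's own API.

```python
import dlt

# Dry-run style check: pull a small slice, then inspect what dlt inferred.
dev_pipeline = dlt.pipeline(
    pipeline_name="marketing_data_dev",
    destination="duckdb",        # assumption: local destination for testing
    dataset_name="marketing_dev",
)

load_info = dev_pipeline.run(source.add_limit(2))  # cap how much each resource yields
print(load_info)                                   # load summary, failed jobs, etc.
print(dev_pipeline.default_schema.to_pretty_yaml())  # review inferred columns and types
```

Running the limited load twice is a quick way to confirm that merge disposition updates existing rows instead of duplicating them before promoting the pipeline to production.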
Authentication handling, error management, and deployment scripts become reusable across pipelines. Each additional connector takes less time as patterns accumulate.
The build-vs-buy economics note covers how this workflow changes per-connector cost estimates from 50–100 hours to 10–20 hours for standard API patterns, and how the framework handles the ongoing maintenance that previously consumed 44% of engineering time.