
Building dlt Pipelines: From First Run to Incremental Loading

A reading path through the concepts in the hands-on dlt tutorial — environment setup, REST API Source config, dependent resources, and incremental loading.

Tags: dlt · data engineering · etl · incremental processing

The hands-on dlt tutorial — Loading Data Made Simple — walks through three progressively more complex pipelines built on the GitHub API: fetching an organization’s repositories, fetching commits from those repositories with dependent resources, and then adding incremental loading so that repeat runs fetch only new commits.

This hub maps the concepts from that tutorial to individual garden notes, so you can go deep on any piece without re-reading the full article.

The Tutorial Progression

The tutorial is structured as a ladder. Each step adds one concept:

  1. Environment and project setup → First pipeline running locally against DuckDB
  2. REST API Source config → Understanding the client and resources blocks
  3. Dependent resources → Using one endpoint’s output to drive another
  4. Incremental loading → Fetching only new data on repeat runs

Following that order is the right way to read it if you’re new to dlt. If you already have a running pipeline and need to understand one specific piece, the notes below are self-contained.

Reading Path

dlt Environment Setup — The steps before any pipeline code: Python virtual environment, pip install "dlt[duckdb]", and dlt init rest_api duckdb. The dlt init command creates the project scaffold including the .dlt/ configuration directory, a starter pipeline file, and a pre-configured .gitignore that excludes secrets.toml. Read this first if you’re starting from zero.
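
Once the scaffold exists, the quickest way to confirm the environment works is a throwaway run against local DuckDB. A minimal sketch, assuming the pipeline and dataset names from the diagram below (the smoke-test rows and table name are purely illustrative):

```python
import dlt

# Pipeline writing to a local DuckDB file; names mirror the tutorial's GitHub example.
pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_data",
)

# Any iterable of dicts can be loaded — handy for verifying the setup
# before wiring up the REST API Source.
load_info = pipeline.run(
    [{"id": 1, "name": "pokeapi"}, {"id": 2, "name": "sprites"}],
    table_name="repos_smoke_test",
)
print(load_info)
```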

dlt Core Concepts — Sources, resources, pipelines, and the three write dispositions (replace, append, merge). This is the vocabulary you need to understand everything else; the tutorial assumes it without explaining it inline.
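
To make the vocabulary concrete, here is a hedged sketch of the three write dispositions on hand-written resources (the resource names and sample rows are illustrative, not from the tutorial):

```python
import dlt

@dlt.resource(write_disposition="replace")  # full refresh: table is rebuilt every run
def repos():
    yield [{"id": 1, "name": "pokeapi"}]

@dlt.resource(write_disposition="append")  # new rows are added, existing rows kept
def commit_log():
    yield [{"sha": "abc123", "message": "init"}]

@dlt.resource(write_disposition="merge", primary_key="id")  # upsert on the primary key
def issues():
    yield [{"id": 42, "state": "closed"}]

pipeline = dlt.pipeline(
    pipeline_name="dispositions_demo", destination="duckdb", dataset_name="demo"
)
pipeline.run(repos())
pipeline.run(commit_log())
pipeline.run(issues())
```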

dlt REST API Source Configuration — The declarative config dictionary: the client block (base URL, auth, pagination), the resources block (endpoint paths, names), and what dlt does automatically — pagination, schema inference, nested JSON normalization into parent-child tables, metadata table creation. Covers the GitHub repository pipeline from the tutorial in detail.
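
As a reference point, a minimal sketch of the repos pipeline’s declarative config, assuming a recent dlt version where the REST API Source ships as dlt.sources.rest_api (endpoint paths follow the tutorial’s PokeAPI example):

```python
import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {
        # base URL, auth, and pagination live in the client block;
        # GitHub's link-header pagination is typically picked up automatically.
        "base_url": "https://api.github.com",
    },
    "resources": [
        {
            "name": "pokeapi_repos",
            "endpoint": {"path": "orgs/PokeAPI/repos"},
        },
    ],
})

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_data",
)
pipeline.run(source)
```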

dlt Dependent Resources — The {resources.parent.field} path template syntax that lets one endpoint’s output drive another endpoint’s URL. The tutorial’s second pipeline — fetching commits for each repository — uses this pattern. This note covers why it exists, how it composes with incremental loading, and what to watch for at scale (rate limits multiply with parent record count).
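
A sketch of what that looks like in the config, assuming the path-template form of the syntax (exact config keys can vary by dlt version): each repo yielded by the parent resource produces one commits request.

```python
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {"base_url": "https://api.github.com"},
    "resources": [
        {
            "name": "pokeapi_repos",
            "endpoint": {"path": "orgs/PokeAPI/repos"},
        },
        {
            "name": "pokeapi_repos_commits",
            "endpoint": {
                # {resources.pokeapi_repos.name} is filled in from each
                # row of the parent resource, one request per repo.
                "path": "repos/PokeAPI/{resources.pokeapi_repos.name}/commits",
            },
        },
    ],
})
```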

dlt Incremental Loading — How dlt tracks pipeline state between runs using cursor-based loading. The declarative config for the REST API Source uses "type": "incremental" with a cursor_path and initial_value. State is stored in _dlt_pipeline_state in your destination — visible and queryable. This note also covers how dlt’s extraction-layer incrementality complements dbt’s transformation-layer incremental models downstream.
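
A sketch of the incremental config on a commits endpoint, with a fixed repo path to isolate the concept (the repo name, cursor_path into GitHub’s commit payload, and the initial_value date are illustrative assumptions):

```python
from dlt.sources.rest_api import rest_api_source

source = rest_api_source({
    "client": {"base_url": "https://api.github.com"},
    "resources": [
        {
            "name": "pokeapi_repos_commits",
            "endpoint": {
                "path": "repos/PokeAPI/pokeapi/commits",
                "params": {
                    # dlt stores the highest seen cursor value in pipeline state
                    # and sends it as `since` on the next run.
                    "since": {
                        "type": "incremental",
                        "cursor_path": "commit.committer.date",
                        "initial_value": "2024-01-01T00:00:00Z",
                    },
                },
            },
        },
    ],
})
```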

dlt Secrets Management — The .dlt/secrets.toml file for local development, environment variables for CI/CD, and the naming convention dlt uses (SOURCES__GITHUB__API_KEY maps to sources.github.api_key). This is mentioned throughout the tutorial but explained in detail here.
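
A short sketch of the three ways the same secret can be supplied, using the sources.github.api_key key from the note as the example (the token value is a placeholder):

```python
import dlt

# 1. Local development — .dlt/secrets.toml:
#      [sources.github]
#      api_key = "ghp_..."
#
# 2. CI/CD — environment variable:
#      SOURCES__GITHUB__API_KEY=ghp_...
#    (double underscores map to the dots in sources.github.api_key)
#
# 3. Explicit lookup in code, if you need the value directly:
api_key = dlt.secrets["sources.github.api_key"]
```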

The Tutorial in One Diagram

GitHub API
├── /orgs/PokeAPI/repos                 ← pokeapi_repos resource
│       (fetches all repos, paginated)
└── /repos/PokeAPI/{name}/commits       ← pokeapi_repos_commits resource
        (dependent on pokeapi_repos)
        (incremental: since=last_run_date)

DuckDB (local) → BigQuery (production)
├── github_data.pokeapi_repos
├── github_data.pokeapi_repos_commits
└── github_data._dlt_* (metadata tables)

What the Tutorial Doesn’t Cover

The tutorial is intentionally scoped to the REST API Source and DuckDB. It doesn’t cover:

  • RESTClient — The imperative alternative for complex auth or custom pagination. See dlt RESTClient vs REST API Source for the decision.
  • BigQuery as destination — Staging vs. streaming inserts, partitioning config, cost implications. Covered in the dlt Python-Native Data Loading hub.
  • Production deployment — Cloud Run, GitHub Actions, Airflow, Modal. Covered in Building Custom API Pipelines with dlt.
  • Testing — Schema validation, incremental state testing, DuckDB-local test patterns.

The tutorial’s value is in showing a working end-to-end pipeline with minimal setup. The notes above let you go deeper on any individual piece.