Adrienne Vermorel

Loading Data Made Simple: A Hands-On Guide to dlt

As I work with data, I face a common challenge: moving data reliably from various sources into destinations where it can be analyzed. While this sounds straightforward, implementing proper data pipelines can involve complex infrastructure, specialized knowledge, and extensive maintenance.

Lately I have been using dlt (Data Load Tool), a Python framework that simplifies data loading while following data engineering best practices. What I enjoy about dlt is that it makes building robust data pipelines accessible, even for people like me who have only basic Python skills.

Why dlt Stands Out

  • Python-first approach: If you know basic Python, you can build production-ready data pipelines. No specialized data engineering expertise required.
  • Zero infrastructure setup: Run pipelines locally during development or deploy them to serverless environments like GCP Cloud Functions or AWS Lambda without complex configuration.
  • Built-in best practices: Benefit from incremental loading, data typing, normalization, and error handling without writing boilerplate code.

In this tutorial, we’ll build progressively more complex pipelines using the GitHub API. You’ll learn how to:

  1. Get a list of the repos belonging to the PokeAPI organization
  2. Get the commits from the repos listed previously
  3. Implement incremental loading to efficiently process only new commits from the repos

This should give you enough skills to use dlt to build your own data pipelines. 🎉

Let’s get started with setting up our environment and understanding the basics of dlt.

Prerequisites for dlt

Before diving into building data pipelines with dlt, let’s set up our environment. This tutorial assumes you have basic familiarity with Python and using a command line interface.

Python Installation

dlt requires Python 3.9 or newer. If you don’t have Python installed:

Download and install Python: Visit python.org and download the latest version for your operating system.

  • On Windows: Run the installer and make sure to check “Add Python to PATH”
  • On macOS: Use the official installer or install via Homebrew with brew install python
  • On Linux: Most distributions come with Python pre-installed, or use your package manager (e.g., apt install python3)

Verify your installation: Open a terminal or command prompt and run:

Terminal window
python --version

or

Terminal window
python3 --version

Setting Up a Virtual Environment

It’s best practice to use a virtual environment for Python projects to manage dependencies:

Create a new project directory:

Terminal window
mkdir dlt-tutorial
cd dlt-tutorial

Create a virtual environment:

Terminal window
python -m venv venv

or

Terminal window
python3 -m venv venv

Activate the virtual environment:

  • On Windows:
    Terminal window
    venv\Scripts\activate
  • On macOS/Linux:
    Terminal window
    source venv/bin/activate

You should see the environment name in your prompt, like (venv).

Installing dlt

With your virtual environment activated, install dlt and its dependencies:

Terminal window
pip install dlt

If you plan to use specific destinations (databases), you need to install the appropriate packages. In our case, we will use DuckDB as our destination, so we will need to run this:

Terminal window
pip install "dlt[duckdb]"

Core dlt Concepts

Pipelines

A pipeline is the main unit of work in dlt. It’s responsible for extracting data from sources, applying transformations, and loading the data into a destination. Pipelines handle:

  • Connecting to data sources
  • Managing schema evolution
  • Tracking load state for incremental loading
  • Error handling and retries
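
As a minimal sketch of these ideas in action (the pipeline name, dataset name, and sample rows below are purely illustrative), a pipeline can load even a plain Python list into DuckDB:

import dlt

# A minimal pipeline: the "source" here is just a list of dictionaries
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="example_data",
)

# dlt infers the schema, normalizes the rows, and loads them into DuckDB
load_info = pipeline.run(
    [{"id": 1, "name": "bulbasaur"}, {"id": 2, "name": "charmander"}],
    table_name="pokemon",
)
print(load_info)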

Sources

A source is where your data comes from. In dlt, sources can be:

  • APIs (like the GitHub API we’ll use)
  • Databases
  • Files (CSV, JSON, etc.)
  • Streaming platforms
  • Custom Python objects

dlt provides verified sources for many common systems, including a generic REST API source that we’ll use in this tutorial.

Resources

A resource is a specific dataset within a source. For example, if GitHub is our source, then “organizations”, “repositories” and “commits” could be separate resources.

Resources:

  • Define the structure of your data
  • Can be normalized into separate tables
  • May have parent-child relationships
  • Can be incrementally loaded
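
Besides the declarative REST API configuration we’ll use below, a resource can also be defined directly in Python with the @dlt.resource decorator. Here is a minimal sketch (the resource name and hard-coded rows are just for illustration); a resource defined this way can be passed straight to pipeline.run():

import dlt

@dlt.resource(name="repositories", write_disposition="replace")
def repositories():
    # In a real pipeline this generator would call an API or read files;
    # here we simply yield a couple of example rows
    yield {"id": 1, "name": "pokeapi", "stargazers_count": 4000}
    yield {"id": 2, "name": "pokebase", "stargazers_count": 300}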

Destinations

A destination is where your data gets loaded. dlt supports various data warehouses and databases:

  • Many SQL databases (powered by SQLAlchemy)
  • BigQuery
  • Snowflake
  • Redshift
  • DuckDB (great for development)
  • And more…
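
Switching destinations usually means changing a single argument and installing the matching extra (for example pip install "dlt[bigquery]"). As a sketch, assuming the corresponding credentials are already configured, only the destination string changes:

import dlt

# Same pipeline definition, different destination string ("duckdb" -> "bigquery")
pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="bigquery",
    dataset_name="github_data",
)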

A Simple GitHub Pipeline Using the REST API Source

Let’s build our first dlt pipeline to fetch repositories from the GitHub API using dlt’s built-in REST API source.

Step 1: Initialize dlt

First, we need to initialize dlt in the project directory. We will be using the REST API source and the DuckDB destination.

Terminal window
dlt init rest_api duckdb

This will create a .dlt directory in the project as well as a rest_api_pipeline.py file. It will also create a .gitignore file to avoid committing the secrets.toml file to the repo, along with some other files specific to dlt.
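
After initialization, the project layout should look roughly like this (the exact file list can vary between dlt versions):

dlt-tutorial/
├── .dlt/
│   ├── config.toml
│   └── secrets.toml
├── .gitignore
├── requirements.txt
├── rest_api_pipeline.py
└── venv/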

Step 2: Set up authentication

To set up authentication and increase the rate limit, we need a GitHub account and a personal access token. In your GitHub developer settings, create a new token: give it a name, select Public repositories for repository access, and click Generate token. Once you have the token, add it to the .dlt/secrets.toml file:

[sources.github]
api_key = "api-key"

Step 3: Create the Pipeline

In the rest_api_pipeline.py file, delete the existing code and replace it with the following:

import dlt
from dlt.sources.rest_api import rest_api_source
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator

github_source = rest_api_source(
    {
        "client": {
            "base_url": "https://api.github.com/",
            "paginator": HeaderLinkPaginator(links_next_key="next"),
            "auth": {
                "type": "bearer",
                "token": dlt.secrets["sources.github.api_key"],
            },
        },
        "resources": [
            {
                "name": "orgs-pokeapi-repos",
                "endpoint": {
                    "path": "orgs/PokeAPI/repos",
                },
            },
        ],
    }
)

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_data",
)

try:
    # Run the pipeline with our source
    load_info = pipeline.run(github_source.with_resources("orgs-pokeapi-repos"))
    # Print information about the load
    print(f"Load info: {load_info}")
except Exception as e:
    print(f"Error occurred: {str(e)}")

Step 4: Run the Pipeline

Run the pipeline:

Terminal window
python3 rest_api_pipeline.py

Step 5: Examining the Results

After running the pipeline, dlt will create a DuckDB database file in the project directory. You can explore it using the DuckDB CLI (installed separately from the duckdb Python package):

Terminal window
duckdb github_pipeline.duckdb

You can look at the tables using the following command:

Terminal window
SHOW ALL TABLES;

You should see the following tables:

  • _dlt_loads
  • _dlt_pipeline_state
  • _dlt_version
  • orgs_pokeapi_repos
  • orgs_pokeapi_repos__topics

To see the schema of the orgs_pokeapi_repos table, you can run this:

Terminal window
DESCRIBE github_data.orgs_pokeapi_repos;

And to query the data, you can run this:

Terminal window
SELECT id, name, owner__url, stargazers_count FROM github_data.orgs_pokeapi_repos LIMIT 5;

Understanding the REST API Source Configuration

Let’s break down the key components of our REST API source configuration:

1. Client Configuration

"client": {
"base_url": "https://api.github.com/",
"paginator": HeaderLinkPaginator(links_next_key="next"),
"auth": {
"type": "bearer",
"token": dlt.secrets["sources.github.api_key"],
},
},
  • base_url: The base URL for all API requests
  • paginator: Configuration for handling pagination.
  • auth: Authentication configuration. Here we are using a bearer token.

The HeaderLinkPaginator is used by default when the API returns a Link header, but it is best to specify it explicitly in the dlt configuration. Otherwise, when the Link header is missing, dlt may warn that no paginator was detected, which can happen when the API returns only one page of results.
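
If you prefer to keep the configuration fully declarative, recent versions of the REST API source also accept the paginator as a plain dictionary instead of a class instance. A sketch of the equivalent client block, assuming the header_link shorthand is available in your dlt version:

"client": {
    "base_url": "https://api.github.com/",
    # Declarative equivalent of HeaderLinkPaginator
    "paginator": {"type": "header_link"},
    # auth configuration unchanged
},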

2. Resources Configuration

"resources": [
{
"name": "orgs-pokeapi-repos",
"endpoint": {
"path": "orgs/PokeAPI/repos",
},
},
],
  • name: Defines the resource name and resulting table name
  • endpoint: Configures the API endpoint
    • path: The path to append to the base URL

What Happened Behind the Scenes?

When we ran the pipeline, dlt performed several tasks:

  1. API Requests: dlt sent HTTP requests to the GitHub API
  2. Pagination Handling: dlt automatically fetched all pages of data
  3. Schema Inference: dlt analyzed the data structure and created appropriate tables
  4. Data Normalization: Complex nested data was flattened into a relational structure (we got two tables: the main one -> orgs_pokeapi_repos, and a related table with the topics -> orgs_pokeapi_repos__topics)
  5. Data Loading: dlt loaded the data into the specified destination, in our case in a DuckDB file
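
You can see the normalization at work by joining the child table back to its parent with the _dlt_id and _dlt_parent_id keys that dlt adds. A sketch to run in the DuckDB CLI (the value column holds the individual topics):

Terminal window
SELECT r.name, t.value AS topic
FROM github_data.orgs_pokeapi_repos AS r
JOIN github_data.orgs_pokeapi_repos__topics AS t
  ON t._dlt_parent_id = r._dlt_id
LIMIT 5;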

Using the Result of One Resource to Configure Another

Next, we will list all the commits belonging to the repositories of the PokeAPI organization. To do that, we will need to use the result from the endpoint /orgs/PokeAPI/repos in order to call /repos/PokeAPI/{repo}/commits.

We will update the resources list, renaming the first resource to pokeapi_repos so that the second resource can reference it. (Remember to also update or remove the with_resources call in pipeline.run so that both resources are loaded.)

"resources": [
{
"name": "pokeapi_repos",
"endpoint": {
"path": "orgs/PokeAPI/repos",
},
},
{
"name": "pokeapi_repos_commits",
"endpoint": {
"path": "repos/PokeAPI/{resources.pokeapi_repos.name}/commits",
},
},
],

And when we look at the github_pipeline.duckdb file, we now have an additional table, github_data.pokeapi_repos_commits, with all the commits from those repositories.
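
To take a quick look at the commits, you can query the new table in the DuckDB CLI. The column names below assume dlt’s default double-underscore flattening of nested fields:

Terminal window
SELECT sha, commit__author__name, commit__author__date
FROM github_data.pokeapi_repos_commits
LIMIT 5;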

Implementing Incremental Loading with the GitHub API

So far, we’ve built a pipeline that extracts all the commits from the PokeAPI repositories. But what if we want to regularly update our database with only new commits, without processing everything again? For this, we can use incremental loading.

To do that, we need to determine two things about the API:

  • what date/timestamp parameters we can send to the API to exclude commits before a certain date (in GitHub, it is since)
  • what field returned by the endpoint specifies the date/timestamp of the commit (we can use the commit.author.date)

Understanding Incremental Loading in dlt

Incremental loading is a technique where only new or modified data is processed in each pipeline run. This approach:

  • Reduces processing time and resource usage
  • Minimizes API requests (which is important for APIs that have rate limits)
  • Creates an overall more efficient data pipeline

dlt handles the complexity of incremental loading through its state management capabilities:

  1. It tracks what data has already been loaded
  2. It uses resource state to determine what to load next
  3. It manages this automatically with just a few configuration changes

Adding Parameters to Our Resource

Let’s modify our pipeline to include an incremental loading pattern. First, we’ll add the configuration for the commits resource:

{
    "name": "pokeapi_repos_commits",
    "endpoint": {
        "path": "repos/PokeAPI/{resources.pokeapi_repos.name}/commits",
        "params": {
            "since": {
                "type": "incremental",
                "cursor_path": "commit.author.date",
                "initial_value": "2024-01-01T00:00:00Z"
            }
        }
    },
},
  • In the params object, we pass since, the parameter GitHub expects for filtering data by timestamp.
  • We specify that this parameter is of type incremental.
  • We provide the cursor_path commit.author.date to track the timestamp from the endpoint response.
  • For initial_value, we provide the value that initializes the incremental loading state. It is important that its type matches the type of the field in the data.

Running the pipeline

Now when we run the pipeline for the first time, dlt processes all commits since 2024-01-01 for the commits resource. On subsequent runs, it will only process commits created since the last run.
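
A simple way to verify this is to count the rows in the commits table before and after a second run; only commits newer than the stored cursor should be appended:

Terminal window
SELECT count(*) AS commit_rows FROM github_data.pokeapi_repos_commits;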

How dlt Manages Incremental State

Behind the scenes, dlt maintains state information in your destination. This includes:

  1. Load Metadata: Information about each pipeline run stored in _dlt_loads
  2. Pipeline State: The state parameters for each resource in _dlt_pipeline_state
  3. Data Versioning: Which version of each record was loaded in which run
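
If you want to inspect this state from the command line, the dlt CLI has a pipeline inspection command (the exact output varies by version):

Terminal window
dlt pipeline github_pipeline info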

With this, we can implement incremental data loading with minimal code.

To recap

In this article, we learned about dlt and its most important concepts. We built and ran a basic pipeline from the GitHub API to DuckDB. We then used the result of one endpoint to configure another, grabbing the repository names in order to fetch their commits. Finally, we implemented incremental loading to load only new data.