Adrienne Vermorel
Loading Data Made Simple: A Hands-on Guide to dlt
As I work with data, I face a common challenge: moving data reliably from various sources into destinations where it can be analyzed. While this sounds straightforward, implementing proper data pipelines can involve complex infrastructure, specialized knowledge, and extensive maintenance.
Lately I have been using dlt (Data Load Tool), a Python framework that simplifies data loading while following data engineering best practices. What I enjoy about dlt is that it makes building robust data pipelines accessible, even for people like me who have only basic Python skills.
Why dlt Stands Out
- Python-first approach: If you know basic Python, you can build production-ready data pipelines. No specialized data engineering expertise required.
- Zero infrastructure setup: Run pipelines locally during development or deploy them to serverless environments like GCP Cloud Functions or AWS Lambda without complex configuration.
- Built-in best practices: Benefit from incremental loading, data typing, normalization, and error handling without writing boilerplate code.
In this tutorial, we’ll build several progressively complex pipelines using the GitHub API. You’ll learn how to:
- Get a list of the repos belonging to the PokeAPI organization
- Get the commits from the repos listed previously
- Implement incremental loading to efficiently process only new activities from the repos
This should give you enough skills to use dlt to build your own data pipelines. 🎉
Let’s get started with setting up our environment and understanding the basics of dlt.
Prerequisites for dlt
Before diving into building data pipelines with dlt, let’s set up our environment. This tutorial assumes you have basic familiarity with Python and using a command line interface.
Python Installation
dlt requires Python 3.9 or newer. If you don’t have Python installed:
Download and install Python: Visit python.org and download the latest version for your operating system.
- On Windows: Run the installer and make sure to check “Add Python to PATH”
- On macOS: Use the official installer or install via Homebrew with brew install python
- On Linux: Most distributions come with Python pre-installed, or use your package manager (e.g., apt install python3)
Verify your installation: Open a terminal or command prompt and run:
python --version
or
python3 --version
Setting Up a Virtual Environment
It’s best practice to use a virtual environment for Python projects to manage dependencies:
Create a new project directory:
mkdir dlt-tutorial
cd dlt-tutorial
Create a virtual environment:
python -m venv venv
or
python3 -m venv venv
Activate the virtual environment:
- On Windows:
venv\Scripts\activate
- On macOS/Linux:
source venv/bin/activate
You should see the environment name in your prompt, like (venv).
Installing dlt
With your virtual environment activated, install dlt and its dependencies:
pip install dlt
If you plan to use specific destinations (databases), you need to install the appropriate packages. In our case, we will use DuckDB as our destination, so we will need to run this:
pip install "dlt[duckdb]"
Core dlt Concepts
Pipelines
A pipeline is the main unit of work in dlt. It’s responsible for extracting data from sources, applying transformations, and loading the data into a destination. Pipelines handle:
- Connecting to data sources
- Managing schema evolution
- Tracking load state for incremental loading
- Error handling and retries
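To make this concrete, here is a minimal sketch of a pipeline; the pipeline name, dataset name, and sample records are invented for illustration:
import dlt

# Create (or restore) a pipeline that loads into a local DuckDB file.
pipeline = dlt.pipeline(
    pipeline_name="example_pipeline",
    destination="duckdb",
    dataset_name="example_data",
)

# Any iterable of dictionaries can be loaded; dlt infers the schema.
load_info = pipeline.run(
    [{"id": 1, "name": "bulbasaur"}, {"id": 2, "name": "ivysaur"}],
    table_name="pokemon",
)
print(load_info)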
Sources
A source is where your data comes from. In dlt, sources can be:
- APIs (like the GitHub API we’ll use)
- Databases
- Files (CSV, JSON, etc.)
- Streaming platforms
- Custom Python objects
dlt provides verified sources for many common systems, including a generic REST API source that we’ll use in this tutorial.
Resources
A resource is a specific dataset within a source. For example, if GitHub is our source, then “organizations”, “repositories” and “commits” could be separate resources.
Resources:
- Define the structure of your data
- Can be normalized into separate tables
- May have parent-child relationships
- Can be incrementally loaded
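As an illustration, a custom resource can be as simple as a decorated generator function (the records below are made up):
import dlt

@dlt.resource(name="repositories", write_disposition="replace")
def repositories():
    # Yield records one by one; dlt handles typing and normalization,
    # including flattening the nested "topics" list into a child table.
    yield {"id": 1, "name": "pokeapi", "topics": ["api", "pokemon"]}
    yield {"id": 2, "name": "sprites", "topics": ["assets"]}
You would then pass this resource to pipeline.run() just like the sample data above.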
Destinations
A destination is where your data gets loaded. dlt supports various data warehouses and databases:
- Many SQL databases (powered by SQLAlchemy)
- BigQuery
- Snowflake
- Redshift
- DuckDB (great for development)
- And more…
A Simple GitHub Pipeline Using the REST API Source
Let’s build our first dlt pipeline to fetch repositories from the GitHub API using dlt’s built-in REST API source.
Step 1: Initialize dlt
First, we need to initialize dlt in the repo. We will be using the REST API source and the DuckDB destination.
dlt init rest_api duckdb
This will create a .dlt directory in the project as well as a rest_api_pipeline.py file. It will also create a .gitignore file to avoid committing the .dlt/secrets.toml file to the repo, along with some other files specific to dlt.
Step 2: Set up authentication
To set up authentication and increase the rate limit, we need a GitHub account and a personal access token. In GitHub, create a new fine-grained personal access token from the Developer settings page: give it a name, select Public repositories for repository access, and click Generate token.
Once you have the token, you can add it to the .dlt/secrets.toml file:
[sources.github]
api_key = "your-github-token"
Step 3: Create the Pipeline
In the rest_api_pipeline.py file, delete the existing code and replace it with the following:
import dlt
from dlt.sources.rest_api import rest_api_source
from dlt.sources.helpers.rest_client.paginators import HeaderLinkPaginator

github_source = rest_api_source(
    {
        "client": {
            "base_url": "https://api.github.com/",
            "paginator": HeaderLinkPaginator(links_next_key="next"),
            "auth": {
                "type": "bearer",
                "token": dlt.secrets["sources.github.api_key"],
            },
        },
        "resources": [
            {
                "name": "orgs-pokeapi-repos",
                "endpoint": {
                    "path": "orgs/PokeAPI/repos",
                },
            },
        ],
    }
)

pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",
    dataset_name="github_data",
)

try:
    # Run the pipeline with our source
    load_info = pipeline.run(github_source.with_resources("orgs-pokeapi-repos"))

    # Print information about the load
    print(f"Load info: {load_info}")

except Exception as e:
    print(f"Error occurred: {str(e)}")
Step 4: Run the Pipeline
Run the pipeline:
python3 rest_api_pipeline.py
Step 5: Examining the Results
After running the pipeline, dlt will create a DuckDB database in the main directory. You can explore it using the DuckDB CLI:
duckdb github_pipeline.duckdb
You can look at the tables using the following command:
SHOW ALL TABLES;
You should see the following tables:
- _dlt_loads
- _dlt_pipeline_state
- _dlt_version
- orgs_pokeapi_repos
- orgs_pokeapi_repos__topics
To see the schema of the orgs_pokeapi_repos table, you can run this:
DESCRIBE github_data.orgs_pokeapi_repos;
And to query the data, you can run this:
SELECT id, name, owner__url, stargazers_count FROM github_data.orgs_pokeapi_repos LIMIT 5;
Understanding the REST API Source Configuration
Let’s break down the key components of our REST API source configuration:
1. Client Configuration
"client": { "base_url": "https://api.github.com/", "paginator": HeaderLinkPaginator(links_next_key="next"), "auth": { "type": "bearer", "token": dlt.secrets["sources.github.api_key"], }, },
- base_url: The base URL for all API requests
- paginator: Configuration for handling pagination
- auth: Authentication configuration. Here we are using a bearer token.
The HeaderLinkPaginator is used by default when the API returns a Link header, but it is best to specify it explicitly in the dlt configuration. Otherwise, when the Link header is missing, dlt may warn that no paginator was detected, which can happen when the API returns only a single page of results.
2. Resources Configuration
"resources": [ { "name": "orgs-pokeapi-repos", "endpoint": { "path": "orgs/PokeAPI/repos", }, }, ],
name
: Defines the resource name and resulting table nameendpoint
: Configures the API endpointpath
: The path to append to the base URL
What Happened Behind the Scenes?
When we ran the pipeline, dlt performed several tasks:
- API Requests: dlt sent HTTP requests to the GitHub API
- Pagination Handling: dlt automatically fetched all pages of data
- Schema Inference: dlt analyzed the data structure and created appropriate tables
- Data Normalization: Complex nested data was flattened into a relational structure (we got two tables: the main one, orgs_pokeapi_repos, and a child table with the topics, orgs_pokeapi_repos__topics)
- Data Loading: dlt loaded the data into the specified destination, in our case in a DuckDB file
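If you are curious about how the two tables relate, you can join them on dlt’s bookkeeping columns; this sketch assumes the usual _dlt_id/_dlt_parent_id linkage and the value column dlt uses for lists of simple values:
SELECT r.name AS repo, t.value AS topic
FROM github_data.orgs_pokeapi_repos AS r
JOIN github_data.orgs_pokeapi_repos__topics AS t
  ON t._dlt_parent_id = r._dlt_id
LIMIT 10;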
Using the Result of One Resource to Configure Another
Next, we will list all the commits belonging to the repositories from the organization PokéAPI.
To do that, we will need to use the result from the /orgs/PokeAPI/repos endpoint in order to call /repos/PokeAPI/{repo}/commits.
We will update the resources list:
"resources": [ { "name": "pokeapi_repos", "endpoint": { "path": "orgs/PokeAPI/repos", }, }, { "name": "pokeapi_repos_commits", "endpoint": { "path": "repos/PokeAPI/{resources.pokeapi_repos.name}/commits", }, }, ],
And when we look at the github_pipeline.duckdb file, we now have a new table, github_data.pokeapi_repos_commits, with all the commits from those repositories.
Implementing Incremental Loading with the GitHub API
So far, we’ve built a pipeline that extracts all the commits from the PokeAPI repositories. But what if we want to regularly update our database with only new commits, without processing everything again? For this, we can use incremental loading.
To do that, we need to determine two things about the API:
- which date/timestamp parameter we can send to exclude commits before a certain date (for GitHub, it is since)
- which field in the endpoint response gives the date/timestamp of each commit (we can use commit.author.date); see the example request below
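For illustration, this is roughly the request dlt will end up making for each repository (the repository name pokeapi is just an example):
GET https://api.github.com/repos/PokeAPI/pokeapi/commits?since=2024-01-01T00:00:00Z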
Understanding Incremental Loading in dlt
Incremental loading is a technique where only new or modified data is processed in each pipeline run. This approach:
- Reduces processing time and resource usage
- Minimizes API requests (important for APIs that have rate limits)
- Creates an overall more efficient data pipeline
dlt handles the complexity of incremental loading through its state management capabilities:
- It tracks what data has already been loaded
- It uses resource state to determine what to load next
- It manages this automatically with just a few configuration changes
Adding Parameters to Our Resource
Let’s modify our pipeline to include an incremental loading pattern. First, we’ll add the configuration for the commits resource:
{ "name": "pokeapi_repos_commits", "endpoint": { "path": "/repos/PokeAPI/{resources.pokeapi_repos.name}/commits", "params": { "since": { "type": "incremental", "cursor_path": "commit.author.date", "initial_value": "2024-01-01T00:00:00Z" } } }, },
- In the params object, we pass since, the parameter GitHub expects to filter data based on a timestamp.
- We specify that this parameter is of type incremental.
- We provide the cursor_path commit.author.date to track the timestamp from the endpoint response.
- For initial_value, we provide the value that will initialize the state of incremental loading. It is important that this value's type matches the type of the field in the data.
Running the Pipeline
Now when we run the pipeline for the first time, dlt processes all the data since 2024-01-01 for the commit resource. But on subsequent runs, it will process only commits since the last run.
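The command is the same as before:
python3 rest_api_pipeline.py
Running it a second time should load far fewer rows, which you can confirm in the load info printed at the end.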
How dlt Manages Incremental State
Behind the scenes, dlt maintains state information in your destination. This includes:
- Load Metadata: Information about each pipeline run, stored in _dlt_loads
- Pipeline State: The state parameters for each resource, stored in _dlt_pipeline_state
- Data Versioning: Which version of each record was loaded in which run
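You can peek at this bookkeeping directly in DuckDB; the exact columns vary with the dlt version, so a simple SELECT * is the safest way to explore:
SELECT * FROM github_data._dlt_loads;
SELECT * FROM github_data._dlt_pipeline_state;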
With this, we can implement incremental data loading with minimal code.
To recap
In this article, we learned about dlt and its most important concepts. We deployed a basic pipeline from GitHub to DuckDB. We then used the result of one endpoint to configure another endpoint by grabbing the names of repositories in order to get the commits of those repositories. Finally, we implemented incremental loading to get only new data.