
dlt and BigQuery Integration

How dlt loads data into BigQuery — the two loading strategies (streaming vs. GCS staging), the bigquery_adapter for partitioning and clustering, nested JSON normalization, and the metadata tables dlt creates.


dlt includes destination-specific optimizations for BigQuery. The two loading strategies, the bigquery_adapter, nested JSON handling, and the metadata tables dlt creates all affect pipeline design and operation.

Setup

Install dlt with the BigQuery extras:

pip install "dlt[bigquery]"

Configuration lives in secrets.toml — dlt’s convention for keeping credentials out of code:

[destination.bigquery]
project_id = "your-project"
private_key = "-----BEGIN PRIVATE KEY-----\n..."
client_email = "sa@your-project.iam.gserviceaccount.com"

The service account needs BigQuery Data Editor and BigQuery Job User roles at minimum. For GCS staging (covered below), it also needs Storage Object Creator on the staging bucket.
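
Before building anything real, a throwaway pipeline that loads a single row is a quick way to confirm the credentials and roles are wired up — a sketch, with arbitrary pipeline and table names:

import dlt

# One-off pipeline: if this run succeeds, credentials and roles are working
pipeline = dlt.pipeline(
    pipeline_name="smoke_test",
    destination="bigquery",
    dataset_name="smoke_test",
)
load_info = pipeline.run([{"ok": True}], table_name="connectivity_check")
print(load_info)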

Two Loading Strategies

dlt offers two ways to get data into BigQuery. Which one you use has significant cost implications.

Streaming Inserts

Streaming inserts push rows directly into BigQuery tables as they’re loaded. Data is immediately queryable — rows appear in tables within seconds. This is the simpler configuration, and it works well for low-latency use cases where freshness matters more than cost.

The catch: BigQuery’s streaming insert API costs $0.01 per 200 MB ($0.05 per GB). For small datasets this is negligible. For anything moving gigabytes daily — marketing data, event streams, high-volume APIs — streaming insert costs accumulate: at 10 GB a day, that’s roughly $0.51 per day, or about $15 a month, for ingestion alone. This should factor into your decision. See BigQuery Cost Model for why batch loading versus streaming insert costs matter at scale.
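
If a specific low-latency resource does warrant streaming, dlt lets you opt it in per resource via the bigquery_adapter covered below — the insert_api option appears in dlt’s BigQuery destination docs, but verify it against your dlt version. A sketch with a hypothetical events resource:

import dlt
from dlt.destinations.adapters import bigquery_adapter

@dlt.resource
def events():
    yield {"event": "page_view", "user_id": 42}

# Route just this resource through the streaming insert API
bigquery_adapter(events, insert_api="streaming")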

GCS Staging

GCS staging avoids streaming insert costs entirely. Instead of pushing rows directly, dlt:

  1. Uploads the data to a Cloud Storage bucket as Parquet or JSONL files
  2. Issues a BigQuery load job that reads those files from GCS

BigQuery batch loading from GCS is free. You pay only for GCS storage of the staging files — trivial, especially if a lifecycle rule expires them after loading. For large datasets or pipelines that run frequently, the cost difference between streaming inserts and GCS staging is substantial.

Configure staging by specifying a staging bucket in your pipeline:

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="bigquery",
    staging="filesystem",  # use GCS staging
    dataset_name="raw_data",
)

And in config.toml:

[destination.filesystem]
bucket_url = "gs://your-staging-bucket"

For any pipeline beyond small datasets, GCS staging is the right default. The operational overhead is minimal and the cost savings are real.
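
Putting the pieces together, a minimal end-to-end run with staging might look like this — the orders resource is a stand-in for a real source:

import dlt

@dlt.resource(name="orders")
def orders():
    yield [{"id": 1, "amount": 9.99}]

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="bigquery",
    staging="filesystem",  # files land in gs://your-staging-bucket, then batch-load
    dataset_name="raw_data",
)

load_info = pipeline.run(orders())
print(load_info)  # summarizes the load packages and their status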

Partitioning and Clustering with bigquery_adapter

The bigquery_adapter() function exposes BigQuery-specific table optimizations that dlt can’t infer from your data alone:

import dlt
from dlt.destinations.adapters import bigquery_adapter

@dlt.resource
def events():
    yield from get_events()

# Partition by date, cluster by user_id
bigquery_adapter(
    events,
    partition="event_date",
    cluster=["user_id"],
)

This configures partitioning and clustering declaratively as part of the pipeline definition — not as a separate DDL step or post-load operation.

Partitioning divides the table into segments by a column value, typically a date. BigQuery only reads partitions that match your WHERE clause filters, which dramatically reduces scan costs for time-range queries. For any large fact table (events, transactions, API responses with timestamps), partitioning by date should be standard.

Clustering sorts data within partitions by specified columns. Queries that filter on clustered columns scan less data, even within a partition. For marketing data, clustering by campaign_id or ad_group_id reduces scan costs for the filtered queries analysts run most.

Both are configuration, not engineering work. AI assistants handle this correctly on the first try once you tell them what to partition and cluster on. See Partitioning vs. Clustering Decision Framework for the reasoning behind which columns to choose.
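
To see the effect, a dry-run query against the partitioned table reports how many bytes would be scanned without running anything — a sketch using the google-cloud-bigquery client, with project and table names as placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="your-project")
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# The event_date filter lets BigQuery prune to a single partition
job = client.query(
    """
    SELECT user_id, COUNT(*) AS event_count
    FROM `your-project.raw_data.events`
    WHERE event_date = "2024-01-15"
    GROUP BY user_id
    """,
    job_config=job_config,
)
print(f"Would scan {job.total_bytes_processed / 1e6:.1f} MB")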

Nested JSON Normalization

APIs return nested JSON. Relational tables don’t support nesting. dlt bridges this automatically.

An API response like:

{
  "id": 1,
  "name": "Alice",
  "addresses": [
    {"city": "Berlin", "country": "DE"},
    {"city": "Paris", "country": "FR"}
  ]
}

Becomes two tables in BigQuery:

  • users — with id, name, and dlt metadata columns
  • users__addresses — with city, country, _dlt_id (the child row’s own identifier), and _dlt_parent_id (which points back to the parent row’s _dlt_id)

The _dlt_id and _dlt_parent_id columns are dlt’s referential keys: each child row’s _dlt_parent_id references its parent’s _dlt_id, so you JOIN on them to reconstruct nested structures in SQL when you need them.
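
A quick way to see this normalization is to run the document above through a pipeline yourself — a sketch, with arbitrary pipeline and dataset names:

import dlt

user_doc = {
    "id": 1,
    "name": "Alice",
    "addresses": [
        {"city": "Berlin", "country": "DE"},
        {"city": "Paris", "country": "FR"},
    ],
}

pipeline = dlt.pipeline(
    pipeline_name="nesting_demo",
    destination="bigquery",
    dataset_name="raw_data",
)
# Creates users and users__addresses, linked by _dlt_id / _dlt_parent_id
pipeline.run([user_doc], table_name="users")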

You control nesting depth with max_table_nesting. The default creates a child table for every level of nesting, which for deeply nested responses can produce a sprawl of tables:

@dlt.resource(max_table_nesting=2)
def users():
    yield from get_users()

Setting max_table_nesting=2 or max_table_nesting=3 creates readable schemas without excessive table proliferation. Anything beyond 3 levels typically signals that the API response should be flattened before it reaches dlt.

Metadata Tables

dlt creates three metadata tables in your dataset. They’re prefixed with _dlt_ and contain operational information about the pipeline:

_dlt_loads tracks each load package — its load_id, when it ran, which schema it used, and whether the load completed. This is your first stop when debugging pipeline issues.

_dlt_pipeline_state stores incremental loading state — the cursor values that tell dlt where each resource left off. If you’re debugging why an incremental pipeline isn’t fetching new data, query this table first. See dlt Incremental Loading for how state tracking works.

_dlt_version records which version of dlt ran each load and the schema version at that point. Useful when troubleshooting schema evolution issues.

These tables are created automatically. You don’t need to provision them. They’re queryable like any other BigQuery table, and since they’re in the same dataset as your data, they’re immediately visible in any BigQuery console or SQL tool.

What This Means in Practice

For analytics engineers running dbt on top of dlt-loaded data, the practical setup looks like this:

  1. dlt loads raw API data into a raw dataset with GCS staging
  2. bigquery_adapter() configures partitioning and clustering at load time
  3. Nested JSON expands into parent-child tables automatically
  4. dbt sources reference the dlt-created tables, _dlt_ tables included if you need pipeline metadata in your models

The dlt metadata tables are particularly useful for freshness checks — query _dlt_loads to determine when a source was last successfully synced, and surface that in dbt freshness tests or Dagster freshness policies.
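
As a sketch of such a freshness check with the google-cloud-bigquery client — column names follow dlt’s documented _dlt_loads schema, where status 0 means the load completed, but verify against your dataset:

from google.cloud import bigquery

client = bigquery.Client(project="your-project")

# Most recent successful loads for the dataset
rows = client.query(
    """
    SELECT load_id, schema_name, inserted_at
    FROM `your-project.raw_data._dlt_loads`
    WHERE status = 0
    ORDER BY inserted_at DESC
    LIMIT 5
    """
).result()

for row in rows:
    print(row.load_id, row.schema_name, row.inserted_at)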

For the full picture of dlt pipeline construction, start with dlt Core Concepts. For how incremental loading interacts with BigQuery costs, see dlt Incremental Loading and BigQuery Cost Model.