
dbt Docker Containerization

Patterns for containerizing dbt Core for production — multi-stage Dockerfiles, version pinning, Artifact Registry, and the two-repository strategy that separates transformation logic from infrastructure.

Planted
dbt · gcp · data engineering · automation

Containerizing dbt Core for production means building a Docker image that encapsulates your dbt runtime, dependencies, and project code into a reproducible, portable artifact. Whether you deploy to Cloud Run Jobs, Kubernetes, or any container runtime, the containerization patterns are the same.

The container approach decouples dbt execution from any specific orchestration platform. Your dbt project runs identically in Cloud Run Jobs today and Kubernetes tomorrow. The orchestrator becomes interchangeable — it just decides when the container runs, not how.

Multi-Stage Dockerfile

A multi-stage build keeps images small while ensuring reproducibility. The first stage installs dependencies (including build tools like git); the second stage copies only the runtime artifacts:

# Build stage
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
        git \
    && rm -rf /var/lib/apt/lists/*

# Install dbt with pinned versions
RUN pip install --no-cache-dir \
        dbt-core==1.9.0 \
        dbt-bigquery==1.9.0

# Copy dbt project
COPY dbt_project/ /app/dbt_project/
COPY profiles.yml /app/profiles.yml

# Runtime stage
FROM python:3.11-slim

WORKDIR /app

# Copy installed packages and the dbt entrypoint from the builder
COPY --from=builder /usr/local/lib/python3.11/site-packages /usr/local/lib/python3.11/site-packages
COPY --from=builder /usr/local/bin/dbt /usr/local/bin/dbt

# Copy dbt project
COPY --from=builder /app/dbt_project /app/dbt_project
COPY --from=builder /app/profiles.yml /app/profiles.yml

# Set working directory to dbt project
WORKDIR /app/dbt_project

# Default command
CMD ["dbt", "build", "--profiles-dir", "/app"]

The build stage needs git for packages that install from Git repositories. The runtime stage doesn’t. Multi-stage builds let you include build tools without bloating the final image. The result is a smaller image that’s faster to pull and has a smaller attack surface.
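A `.dockerignore` complements the multi-stage build: it keeps local development artifacts out of the build context, so `COPY dbt_project/` doesn't drag compiled targets or virtual environments into the image. A sketch — the entries are illustrative and depend on your project layout:

```text
# .dockerignore — keep local dev artifacts out of the build context
.git
target/
dbt_packages/
logs/
.venv/
__pycache__/
```

This also speeds up builds, since Docker uploads the entire build context before the first instruction runs.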

Version Pinning

Pin exact versions of dbt-core and adapters. Always.

dbt-core==1.9.0
dbt-bigquery==1.9.0

Using latest or unpinned versions creates debugging nightmares when behavior changes between runs. A model that succeeded on Monday fails on Wednesday because a minor version bump changed how a macro resolves. The error message points to your SQL, not the version change. You spend hours debugging transformation logic when the issue is infrastructure drift.
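One way to keep the pins in a single, tooling-friendly place is a `requirements.txt` that the Dockerfile installs from — a sketch of the pattern, not a prescribed layout:

```text
# requirements.txt — exact pins, one file to review when bumping versions
dbt-core==1.9.0
dbt-bigquery==1.9.0
```

The Dockerfile then uses `COPY requirements.txt .` followed by `RUN pip install --no-cache-dir -r requirements.txt`, and automated dependency tools can propose version bumps as reviewable pull requests instead of silent drift.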

Pin the Python base image too. python:3.11-slim is better than python:3-slim because the latter silently upgrades when Python 3.12 or 3.13 becomes the default tag. For maximum reproducibility, pin the digest: python:3.11-slim@sha256:abc123.... Most teams find pinning to the minor version (3.11) strikes the right balance between reproducibility and maintenance burden.

The Two-Repository Strategy

Separate your dbt project from your Docker image definition. This two-repository approach enables independent development cycles:

dbt-project-repo/
├── models/
├── macros/
├── tests/
├── dbt_project.yml
└── profiles.yml

dbt-runner-repo/
├── Dockerfile
├── cloudbuild.yaml
└── scripts/
    └── run-dbt.sh

Data analysts update SQL models without touching infrastructure. Platform engineers update the container — Python versions, dbt versions, build configuration — without modifying transformation logic. The merge conflict surface area between these two groups drops to zero.

Two approaches for getting the dbt project into the container:

Bake models into the image during CI/CD. The CI pipeline clones the dbt project repository and copies models into the image at build time. This provides version control — every image is a snapshot of a specific commit. You can roll back to a previous image if a deployment introduces regressions. Most teams should start here.

Clone the dbt project at runtime. The container clones the repository when it starts. This adds flexibility — you can point to different branches or tags via environment variables — but sacrifices reproducibility. The same image might produce different results depending on what’s in the repository when it runs. Reserve this approach for development or staging environments where you need rapid iteration.
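The runtime-clone variant can be sketched as a Dockerfile whose command clones before running. `DBT_GIT_URL` and `DBT_GIT_REF` are hypothetical environment variables the orchestrator would set:

```dockerfile
# Runtime-clone sketch — the image ships dbt, not the project.
# DBT_GIT_URL and DBT_GIT_REF are illustrative env vars set at deploy time.
FROM python:3.11-slim
RUN apt-get update && apt-get install -y --no-install-recommends git \
    && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir dbt-core==1.9.0 dbt-bigquery==1.9.0
COPY profiles.yml /app/profiles.yml
WORKDIR /app
# Clone the requested ref at container start, then run dbt
CMD ["sh", "-c", "git clone --depth 1 --branch ${DBT_GIT_REF:-main} ${DBT_GIT_URL} project && cd project && dbt build --profiles-dir /app"]
```

Note that git stays in the final image here — the price of runtime flexibility is both a larger image and the loss of the image-equals-commit guarantee.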

Building and Pushing to Artifact Registry

GCP’s Artifact Registry stores your container images. Create a repository, then use Cloud Build to build and push:

# Create repository if it doesn't exist
gcloud artifacts repositories create dbt-images \
    --repository-format=docker \
    --location=us-central1 \
    --description="dbt Docker images"

# Build and push
gcloud builds submit \
    --tag us-central1-docker.pkg.dev/PROJECT_ID/dbt-images/dbt-runner:v1.0.0

Tag images with semantic versions (v1.0.0, v1.1.0) rather than latest. When a production issue occurs, you need to know exactly which image version is running. The latest tag is mutable — it points to whatever was pushed most recently — so it tells you nothing useful during incident response.

For automated builds, a cloudbuild.yaml in the runner repository triggers builds on push:

steps:
  - name: 'gcr.io/cloud-builders/docker'
    args: ['build', '-t', '${_IMAGE_TAG}', '.']
  - name: 'gcr.io/cloud-builders/docker'
    args: ['push', '${_IMAGE_TAG}']
substitutions:
  _IMAGE_TAG: 'us-central1-docker.pkg.dev/${PROJECT_ID}/dbt-images/dbt-runner:${SHORT_SHA}'

Using ${SHORT_SHA} as the tag ties every image to its source commit. Combined with the bake-in approach, this gives you full traceability: from a running container back to the exact code that built it.
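You can also bake the commit into the image metadata itself, so a running container reveals its source even if the tag is lost. A sketch using a build argument and the standard OCI revision label — `GIT_SHA` is an illustrative name:

```dockerfile
# Accept the commit SHA as a build argument (GIT_SHA is an illustrative name)
ARG GIT_SHA=unknown
# Record it as a standard OCI label in the image metadata
LABEL org.opencontainers.image.revision=${GIT_SHA}
```

In the Cloud Build step, pass it through with `--build-arg GIT_SHA=${SHORT_SHA}`; later, `docker inspect` on any image or running container shows the label.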

profiles.yml for Containerized dbt

The profiles.yml inside a container should be environment-agnostic. Use env_var() for anything that varies between environments:

dbt_project:
  target: prod
  outputs:
    prod:
      type: bigquery
      method: oauth
      project: "{{ env_var('GCP_PROJECT') }}"
      dataset: "{{ env_var('DBT_DATASET', 'analytics') }}"
      location: "{{ env_var('BQ_LOCATION', 'US') }}"
      threads: 4
      timeout_seconds: 300

The method: oauth setting tells dbt-bigquery to use whatever credentials the runtime environment provides. In Cloud Run Jobs, that’s the attached service account via Workload Identity. In a local Docker container, it’s your ADC credentials mounted into the container. The same image works everywhere.
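For the local case, that means mounting your gcloud configuration directory into the container. A sketch — the image name and project are illustrative, and it assumes you've run `gcloud auth application-default login` beforehand:

```shell
# Run the image locally with your Application Default Credentials mounted read-only
docker run --rm \
    -v "$HOME/.config/gcloud:/root/.config/gcloud:ro" \
    -e GCP_PROJECT=my-dev-project \
    us-central1-docker.pkg.dev/PROJECT_ID/dbt-images/dbt-runner:v1.0.0
```

The mount target is `/root/.config/gcloud` because the container runs as root and Google's client libraries look for `application_default_credentials.json` under the user's home directory.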

Default values in env_var('DBT_DATASET', 'analytics') prevent failures when an environment variable isn’t set, while still allowing override at deploy time. This keeps the image portable across dev, staging, and production without rebuilding.
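The fallback behavior roughly mirrors shell parameter expansion with a default, which is a quick way to sanity-check the pattern outside dbt:

```shell
# ${VAR:-default} roughly mirrors env_var('VAR', 'default'):
# use the variable if set, otherwise fall back to the default
unset DBT_DATASET
echo "dataset: ${DBT_DATASET:-analytics}"   # prints "dataset: analytics"

export DBT_DATASET=staging
echo "dataset: ${DBT_DATASET:-analytics}"   # prints "dataset: staging"
```

(The analogy isn't exact — `:-` also substitutes when the variable is set but empty, whereas dbt's `env_var` falls back only when the variable is unset.)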

When Containers Are Not Worth It

Not every dbt deployment needs containerization. If you’re running dbt locally or through dbt Cloud, the container overhead adds complexity without benefit. Containers earn their keep when:

  • You need reproducible production runs with pinned dependencies
  • Multiple environments (dev, staging, prod) must run from the same artifact
  • Your orchestration platform (Cloud Run, Kubernetes, Dagster) expects containers
  • You want to decouple dbt version management from developer machines

For a solo developer running dbt build from the command line against a dev dataset, a virtual environment with pip install dbt-core dbt-bigquery is simpler and sufficient.
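That simpler path, with the same version pins carried over so a later move to containers isn't a behavior change:

```shell
# Minimal local setup — same pins as the image, no container required
python3 -m venv .venv
. .venv/bin/activate
pip install dbt-core==1.9.0 dbt-bigquery==1.9.0
dbt build --target dev
```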