Adrienne Vermorel
Deploying dbt Core on a Google Cloud Function
In a previous article, I talked about the differences between dbt Core and dbt Cloud, and how dbt Core does not provide a way to schedule and orchestrate data transformations. So if dbt Cloud isn't your choice for orchestration, you will need another tool to do it.
My preferred way to do this is with a Google Cloud Function, as I usually use BigQuery as my data warehouse, which is also on Google Cloud Platform (GCP).
Other options would be to deploy on a virtual machine (probably overkill for a process that often takes only minutes) or in a container (which requires more advanced knowledge of coding and infrastructure management).
Below you will find a guide to deploying dbt Core on a Google Cloud Function.
Pre-requisites
- You need to have a dbt Core project repository
- You need to have access to a GCP project: I usually set up the orchestration in the project that already hosts my data transformation
Modify the dbt Core repository
First, we need to restructure the repository by putting all the current dbt folders and files in a sub-directory. I called it “dbt_transform” but you can name it however you want.
|-- dbt_transform
|   |-- analyses
|   |-- models
|   |-- seeds
|   |-- snapshots
|   |-- tests
|   |-- dbt_project.yml
|   |-- packages.yml
|   |-- profiles.yml
Note that I have a profiles.yml file in my dbt_transform folder. In this profiles.yml file, I use the oauth method:
dbt_project_name:
  outputs:
    dev:
      dataset: dbt
      job_execution_timeout_seconds: 3500
      job_retries: 1
      location: EU
      method: oauth
      priority: interactive
      project: gcp_project_name
      threads: 4
      type: bigquery
  target: dev
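If you want to sanity-check this profile before deploying anything, one option is dbt's built-in debug command, run locally from the repository root (this assumes dbt-core and dbt-bigquery are installed in your local environment and that you are authenticated against GCP):

dbt debug --project-dir dbt_transform --profiles-dir dbt_transform

If the connection test passes locally, the same profiles.yml will be picked up by the Cloud Function later on.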
Add the main.py and requirements.txt
At the root of the repository, create a main.py file:
import os
import subprocess
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)


def run_dbt(request):
    try:
        # Set your dbt profiles directory (assuming it's in /workspace)
        os.environ['DBT_PROFILES_DIR'] = '/workspace/dbt_transform'

        # Change into the dbt project directory
        dbt_project_dir = '/workspace/dbt_transform'
        os.chdir(dbt_project_dir)

        # Log the current working directory and list files
        logging.info(f"Current working directory: {os.getcwd()}")
        logging.info(f"Files in the current directory: {os.listdir('.')}")

        # Install dbt packages
        logging.info("Installing dbt packages...")
        subprocess.run(['dbt', 'deps'], check=True, capture_output=True, text=True)

        # Run the dbt command (here, dbt build)
        result = subprocess.run(
            ['dbt', 'build'],
            check=True,
            capture_output=True,
            text=True
        )

        # Return dbt output
        return result.stdout

    except subprocess.CalledProcessError as e:
        # If a command fails, log its output and error
        logging.error(
            f"Command '{e.cmd}' returned non-zero exit status {e.returncode}.")
        logging.error(f"stdout: {e.stdout}")
        logging.error(f"stderr: {e.stderr}")
        return f"Error running dbt: {e.stderr}"

    except Exception as e:
        logging.error(f"Error running dbt: {str(e)}")
        return f"Error running dbt: {str(e)}"
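If you want to smoke-test the function before deploying it, one way is the Functions Framework for Python. Note that the hard-coded /workspace paths in main.py match the Cloud Functions build environment, so for a purely local run you would need to temporarily point them at your local checkout:

# Install the Functions Framework (and dbt) in your local environment
pip install functions-framework dbt-core dbt-bigquery

# Serve the function locally on port 8080
functions-framework --target=run_dbt --debug

# In another terminal, trigger a run
curl localhost:8080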
Then, add the requirements.txt to manage the dependencies:
dbt-core
dbt-bigquery
Now our repository looks like this:
|-- dbt_transform
|   |-- analyses
|   |-- models
|   |-- seeds
|   |-- snapshots
|   |-- tests
|   |-- dbt_project.yml
|   |-- packages.yml
|   |-- profiles.yml
|-- main.py
|-- requirements.txt
Set up the service account for dbt
You will need to set up a service account for dbt with the following roles:
- BigQuery Data Viewer and BigQuery Job User on the projects/datasets that host the sources of your dbt project
- BigQuery Data Editor and BigQuery User on the project where your data transformations take place
- Cloud Function Invoker on the project where your Cloud Function is going to run
To create this service account, you can run the Cloud Shell script (shortcut G + S) below.
You will need to set up the following variables: TRANSFORM_PROJECT_ID, FUNCTION_PROJECT_ID and SOURCE_PROJECT_IDS.
# Variables to set up
TRANSFORM_PROJECT_ID="transform-project-id"
FUNCTION_PROJECT_ID="function-project-id"
SOURCE_PROJECT_IDS=("source-project-id-1" "source-project-id-2" "source-project-id-N")

SERVICE_ACCOUNT_NAME="dbt-transform-sa"
SERVICE_ACCOUNT_DISPLAY_NAME="dbt Transformation Service Account"

# Create the service account in the data transformations project
gcloud iam service-accounts create "$SERVICE_ACCOUNT_NAME" \
  --description="Service account for dbt data transformations" \
  --display-name="$SERVICE_ACCOUNT_DISPLAY_NAME" \
  --project="$TRANSFORM_PROJECT_ID"

# Format the service account email
SERVICE_ACCOUNT_EMAIL="${SERVICE_ACCOUNT_NAME}@${TRANSFORM_PROJECT_ID}.iam.gserviceaccount.com"

echo "Service account $SERVICE_ACCOUNT_EMAIL created."

# Loop through the source project IDs to assign roles
for PROJECT_ID in "${SOURCE_PROJECT_IDS[@]}"
do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
    --role="roles/bigquery.dataViewer"

  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
    --role="roles/bigquery.jobUser"

  echo "Roles assigned in the source project $PROJECT_ID."
done

# Assign roles in the transformation project
gcloud projects add-iam-policy-binding "$TRANSFORM_PROJECT_ID" \
  --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding "$TRANSFORM_PROJECT_ID" \
  --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
  --role="roles/bigquery.user"

echo "Roles assigned in the transformation project."

# Assign Cloud Function Invoker role in the project where the Cloud Function will run
gcloud projects add-iam-policy-binding "$FUNCTION_PROJECT_ID" \
  --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
  --role="roles/cloudfunctions.invoker"

echo "Cloud Function Invoker role assigned."

echo "Setup complete."
Deploy the Cloud Function
To deploy the Cloud Function, you will need to have the gcloud CLI installed.
Once it is installed, you can make sure that you are logged in with the correct Google user account by running the following command in your local terminal.
gcloud auth login
Now, make sure that you are at the root of your project and deploy the Cloud Function by running the script below.
You will need to update the email address of your service account and the region you want to use.
Here is a list of available regions; choose one where 2nd gen functions are available.
gcloud functions deploy dbt_run \
  --region=europe-west1 \
  --gen2 \
  --runtime=python310 \
  --entry-point=run_dbt \
  --trigger-http \
  --service-account=dbt-transform-sa@transform-project-id.iam.gserviceaccount.com \
  --timeout=3500 \
  --memory=1G
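Once the deployment finishes, you can retrieve the function's trigger URL (you will need it for the scheduler in the next step) and, if your account has invoker permissions on the function, trigger a test run with an authenticated request. Keep in mind that the request stays open until the dbt build finishes:

# Retrieve the trigger URL of the 2nd gen function
FUNCTION_URL=$(gcloud functions describe dbt_run \
  --gen2 --region=europe-west1 \
  --format='value(serviceConfig.uri)')
echo "$FUNCTION_URL"

# Optional: trigger a test run with your own identity token
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" "$FUNCTION_URL"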
Now that the Cloud Function is deployed, you can schedule its run.
Set up Cloud Scheduler
You can set up the Cloud Scheduler by running the script below in Google Cloud Shell.
Configure FUNCTION_URL, SCHEDULER_JOB_NAME, PROJECT_ID, LOCATION, SERVICE_ACCOUNT_EMAIL, TIME_ZONE and CRON_SCHEDULE with the right values.
You can find a list of time zones in tz format here, as well as a tool to create CRON schedules.
# Variables
FUNCTION_URL="YOUR_CLOUD_FUNCTION_TRIGGER_URL"
SCHEDULER_JOB_NAME="daily-dbt-run-job"
PROJECT_ID="YOUR_PROJECT_ID"
LOCATION="YOUR_LOCATION_ID"
SERVICE_ACCOUNT_EMAIL="YOUR_SERVICE_ACCOUNT_EMAIL"
TIME_ZONE="YOUR_TIME_ZONE"
CRON_SCHEDULE="0 7 * * *" # Change this to your desired schedule (daily at 7 AM by default)

# Step 1: Ensure the service account has the "Service Account Token Creator" role
gcloud iam service-accounts add-iam-policy-binding $SERVICE_ACCOUNT_EMAIL \
  --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
  --role="roles/iam.serviceAccountTokenCreator" \
  --project=$PROJECT_ID

echo "Added Service Account Token Creator role to $SERVICE_ACCOUNT_EMAIL."

# Step 2: Create the Cloud Scheduler job
gcloud scheduler jobs create http $SCHEDULER_JOB_NAME \
  --location=$LOCATION \
  --schedule="$CRON_SCHEDULE" \
  --http-method=GET \
  --uri=$FUNCTION_URL \
  --oidc-service-account-email=$SERVICE_ACCOUNT_EMAIL \
  --oidc-token-audience=$FUNCTION_URL \
  --time-zone="$TIME_ZONE" \
  --project=$PROJECT_ID
echo "Cloud Scheduler job created: $SCHEDULER_JOB_NAME"
Congratulations, you are done setting up dbt Core orchestration with a Cloud Function. 💪