Adrienne Vermorel
Deploying dbt Core on a Google Cloud Function
In a previous article, I talked about the differences between dbt Core and dbt Cloud, and how dbt Core does not provide a way to schedule and orchestrate data transformations. So if dbt Cloud isn't your choice for orchestration, you will need another tool to do it.
My preferred way to do this is with a Google Cloud Function, as I usually use BigQuery as my data warehouse, which is also on Google Cloud Platform (GCP).
Other options would be to deploy on a virtual machine (probably overkill for a process that often takes only minutes) or in a container (which requires more advanced knowledge of coding and infrastructure management).
Below you will find a guide to deploying dbt Core on a Google Cloud Function.
Pre-requisites
- You need to have a dbt Core project repository
- You need to have access to a GCP project: I usually set up the orchestration in the project that already hosts my data transformation
Modify the dbt Core repository
First, we need to restructure the repository by putting all the current dbt folders and files in a sub-directory. I called it “dbt_transform” but you can name it however you want.
|-- dbt_transform
|   |-- analyses
|   |-- models
|   |-- seeds
|   |-- snapshots
|   |-- tests
|   |-- dbt_project.yml
|   |-- packages.yml
|   |-- profiles.yml
Note that I have a profiles.yml file in my dbt_transform folder. In this profiles.yml file, I use the oauth method:
dbt_project_name:
  outputs:
    dev:
      dataset: dbt
      job_execution_timeout_seconds: 3500
      job_retries: 1
      location: EU
      method: oauth
      priority: interactive
      project: gcp_project_name
      threads: 4
      type: bigquery
  target: dev
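If you want to sanity-check this profile before deploying anything, one option is dbt's built-in debug command, run locally from the repository root (this assumes dbt-core and dbt-bigquery are installed in your local environment and that you are authenticated against GCP):

dbt debug --project-dir dbt_transform --profiles-dir dbt_transform

If the connection test passes locally, the same profiles.yml will be picked up by the Cloud Function later on.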
Add the main.py and requirements.txt
At the root of the repository, create a main.py file:
import os
import subprocess
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)


def run_dbt(request):
    try:
        # Set your dbt profiles directory (assuming it's in /workspace)
        os.environ['DBT_PROFILES_DIR'] = '/workspace/dbt_transform'

        # Change into the dbt project directory
        dbt_project_dir = '/workspace/dbt_transform'
        os.chdir(dbt_project_dir)

        # Log the current working directory and list files
        logging.info(f"Current working directory: {os.getcwd()}")
        logging.info(f"Files in the current directory: {os.listdir('.')}")

        # Install dbt packages
        logging.info("Installing dbt packages...")
        subprocess.run(['dbt', 'deps'], check=True, capture_output=True, text=True)

        # Run the dbt command (here, dbt build)
        result = subprocess.run(
            ['dbt', 'build'],
            check=True,
            capture_output=True,
            text=True
        )

        # Return dbt output
        return result.stdout

    except subprocess.CalledProcessError as e:
        # If a command fails, log its output and error
        logging.error(
            f"Command '{e.cmd}' returned non-zero exit status {e.returncode}.")
        logging.error(f"stdout: {e.stdout}")
        logging.error(f"stderr: {e.stderr}")
        return f"Error running dbt: {e.stderr}"

    except Exception as e:
        logging.error(f"Error running dbt: {str(e)}")
        return f"Error running dbt: {str(e)}"
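If you want to smoke-test the function before deploying it, one way is the Functions Framework for Python. Note that the hard-coded /workspace paths in main.py match the Cloud Functions build environment, so for a purely local run you would need to temporarily point them at your local checkout:

# Install the Functions Framework (and dbt) in your local environment
pip install functions-framework dbt-core dbt-bigquery

# Serve the function locally on port 8080
functions-framework --target=run_dbt --debug

# In another terminal, trigger a run
curl localhost:8080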
Then, add the requirements.txt to manage the dependencies:
dbt-core
dbt-bigquery
Now our repository looks like this:
|-- dbt_transform
|   |-- analyses
|   |-- models
|   |-- seeds
|   |-- snapshots
|   |-- tests
|   |-- dbt_project.yml
|   |-- packages.yml
|   |-- profiles.yml
|-- main.py
|-- requirements.txt
Set up the service account for dbt
You will need to set up a service account for dbt with the following roles:
- BigQuery Data Viewer and BigQuery Job User on the projects/datasets that host the sources of your dbt project
- BigQuery Data Editor and BigQuery User on the project where your data transformations take place
- Cloud Function Invoker on the project where your Cloud Function is going to run
To create this service account, you can run the Cloud Shell script (shortcut G + S) below.
You will need to set up the following variables: TRANSFORM_PROJECT_ID, FUNCTION_PROJECT_ID and SOURCE_PROJECT_IDS.
# Variables to set up
TRANSFORM_PROJECT_ID="transform-project-id"
FUNCTION_PROJECT_ID="function-project-id"
SOURCE_PROJECT_IDS=("source-project-id-1" "source-project-id-2" "source-project-id-N")

SERVICE_ACCOUNT_NAME="dbt-transform-sa"
SERVICE_ACCOUNT_DISPLAY_NAME="dbt Transformation Service Account"

# Create the service account in the data transformations project
gcloud iam service-accounts create "$SERVICE_ACCOUNT_NAME" \
  --description="Service account for dbt data transformations" \
  --display-name="$SERVICE_ACCOUNT_DISPLAY_NAME" \
  --project="$TRANSFORM_PROJECT_ID"

# Format the service account email
SERVICE_ACCOUNT_EMAIL="${SERVICE_ACCOUNT_NAME}@${TRANSFORM_PROJECT_ID}.iam.gserviceaccount.com"

echo "Service account $SERVICE_ACCOUNT_EMAIL created."

# Loop through the source project IDs to assign roles
for PROJECT_ID in "${SOURCE_PROJECT_IDS[@]}"
do
  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
    --role="roles/bigquery.dataViewer"

  gcloud projects add-iam-policy-binding "$PROJECT_ID" \
    --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
    --role="roles/bigquery.jobUser"

  echo "Roles assigned in the source project $PROJECT_ID."
done

# Assign roles in the transformation project
gcloud projects add-iam-policy-binding "$TRANSFORM_PROJECT_ID" \
  --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
  --role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding "$TRANSFORM_PROJECT_ID" \
  --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
  --role="roles/bigquery.user"

echo "Roles assigned in the transformation project."

# Assign Cloud Function Invoker role in the project where the Cloud Function will run
gcloud projects add-iam-policy-binding "$FUNCTION_PROJECT_ID" \
  --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
  --role="roles/cloudfunctions.invoker"

echo "Cloud Function Invoker role assigned."

echo "Setup complete."
Deploy the Cloud Function
To deploy the Cloud Function, you will need to have the gcloud CLI installed.
Once it is installed, you can make sure that you are logged in with the correct Google user account by running the following command in your local terminal.
gcloud auth login
Now, make sure that you are at the root of your project and deploy the Cloud Function by running the script below.
You will need to update the email address of your service account and the region you want to use.
Here is a list of available regions; choose one where 2nd gen functions are available.
gcloud functions deploy dbt_run \
  --region=europe-west1 \
  --gen2 \
  --runtime=python310 \
  --entry-point=run_dbt \
  --trigger-http \
  --service-account=dbt-transform-sa@transform-project-id.iam.gserviceaccount.com \
  --timeout=3500 \
  --memory=1G
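Once the deployment finishes, you can retrieve the function's trigger URL (you will need it for the scheduler in the next step) and, if your account has invoker permissions on the function, trigger a test run with an authenticated request. Keep in mind that the request stays open until the dbt build finishes:

# Retrieve the trigger URL of the 2nd gen function
FUNCTION_URL=$(gcloud functions describe dbt_run \
  --gen2 --region=europe-west1 \
  --format='value(serviceConfig.uri)')
echo "$FUNCTION_URL"

# Optional: trigger a test run with your own identity token
curl -H "Authorization: Bearer $(gcloud auth print-identity-token)" "$FUNCTION_URL"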
Now that the Cloud Function is deployed, you can schedule its run.
Set up Cloud Scheduler
You can set up the Cloud Scheduler by running the script below in Google Cloud Shell.
Configure FUNCTION_URL, SCHEDULER_JOB_NAME, PROJECT_ID, LOCATION, SERVICE_ACCOUNT_EMAIL, TIME_ZONE and CRON_SCHEDULE with the right values.
You can find a list of time zones in tz format here, as well as a tool to create CRON schedules.
# Variables
FUNCTION_URL="YOUR_CLOUD_FUNCTION_TRIGGER_URL"
SCHEDULER_JOB_NAME="daily-dbt-run-job"
PROJECT_ID="YOUR_PROJECT_ID"
LOCATION="YOUR_LOCATION_ID"
SERVICE_ACCOUNT_EMAIL="YOUR_SERVICE_ACCOUNT_EMAIL"
TIME_ZONE="YOUR_TIME_ZONE"
CRON_SCHEDULE="0 7 * * *" # Change this to your desired schedule (daily at 7 AM by default)

# Step 1: Ensure the service account has the "Service Account Token Creator" role
gcloud iam service-accounts add-iam-policy-binding $SERVICE_ACCOUNT_EMAIL \
  --member="serviceAccount:$SERVICE_ACCOUNT_EMAIL" \
  --role="roles/iam.serviceAccountTokenCreator" \
  --project=$PROJECT_ID

echo "Added Service Account Token Creator role to $SERVICE_ACCOUNT_EMAIL."

# Step 2: Create the Cloud Scheduler job
gcloud scheduler jobs create http $SCHEDULER_JOB_NAME \
  --location=$LOCATION \
  --schedule="$CRON_SCHEDULE" \
  --http-method=GET \
  --uri=$FUNCTION_URL \
  --oidc-service-account-email=$SERVICE_ACCOUNT_EMAIL \
  --oidc-token-audience=$FUNCTION_URL \
  --time-zone="$TIME_ZONE" \
  --project=$PROJECT_ID
echo "Cloud Scheduler job created: $SCHEDULER_JOB_NAME"
Congratulations, you are done setting up dbt Core orchestration with a Cloud Function. 💪