The dbt-bigquery adapter maps dbt concepts to BigQuery resources. Understanding this mapping helps you configure dbt correctly for your multi-environment architecture.
The key translation: dbt’s database (which the BigQuery adapter calls project) maps to BigQuery’s project, and dbt’s schema maps to BigQuery’s dataset. When you set schema: base in dbt, you’re targeting a BigQuery dataset called base.
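As a quick illustration (the model, source, and project names here are made up), a model with a custom schema resolves to a fully qualified BigQuery relation:

```sql
-- models/base/orders.sql (illustrative)
{{ config(schema='base') }}  -- dbt schema -> BigQuery dataset

select * from {{ source('shopify', 'orders') }}

-- In a project called analytics-prod this builds analytics-prod.<dataset>.orders,
-- where <dataset> is derived from 'base' by generate_schema_name (covered later in this note).
```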
profiles.yml for Multiple Environments
A typical BigQuery setup has separate targets for development and production:
```yaml
my_analytics:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: oauth  # Uses your gcloud credentials
      project: analytics-dev
      dataset: "dbt_{{ env_var('USER', 'developer') }}"
      location: US
      threads: 4

    prod:
      type: bigquery
      method: service-account
      project: analytics-prod
      dataset: analytics
      location: US
      threads: 8
      keyfile: /secrets/sa-dbt-prod.json
      job_execution_timeout_seconds: 600
      maximum_bytes_billed: 10000000000  # 10 GB limit
```

Several configurations matter significantly:
location must match your existing datasets. BigQuery doesn’t allow cross-region operations, so this must align with your regional architecture decisions. If your datasets are in EU, set location: EU. Getting this wrong usually surfaces as cryptic “dataset not found” errors that mention the wrong location rather than pointing at the region mismatch.
execution_project (not shown above) bills query costs to a different project than where models materialize. Useful for the central data lake + department marts pattern where you want compute costs flowing to department budgets.
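A sketch of how that looks in a target (project and dataset names are illustrative):

```yaml
finance:
  type: bigquery
  method: oauth
  project: central-data-lake       # models materialize in this project
  execution_project: dept-finance  # query jobs run here, so compute is billed to this project
  dataset: finance_marts
  location: US
  threads: 4
```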
job_execution_timeout_seconds prevents runaway queries. BigQuery itself only cancels a query job after roughly six hours, so an unconstrained broken model can consume slots for a very long time. Set an explicit limit based on your longest expected legitimate query. 600 seconds (10 minutes) is reasonable for most dbt models; increase to 1800 (30 minutes) for particularly heavy transformations.
maximum_bytes_billed sets a hard cap on bytes scanned per query. If a query would exceed this limit, it fails before scanning. This catches malformed queries that would scan entire tables unexpectedly — particularly valuable for incremental models that might accidentally trigger a full refresh. 10-50 GB is a reasonable range for most workflows.
Authentication Methods by Deployment Context
Different deployment contexts call for different authentication approaches:
Local development: Use OAuth via gcloud auth application-default login. Developers authenticate as themselves, and their personal identity appears in audit logs.
dbt Cloud: Upload a service account key file in connection settings. dbt Cloud manages the credential securely.
Cloud Run or GKE: Use Workload Identity. The service gets a GCP identity automatically with no key files to manage. In profiles.yml, set method: oauth and the environment provides credentials.
GitHub Actions: Store the service account JSON as a repository secret. Inject it as an environment variable during workflow execution.
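A minimal sketch of that wiring (the secret name, file path, and target are illustrative, and profiles.yml is assumed to read the keyfile path via env_var('DBT_KEYFILE')):

```yaml
# .github/workflows/dbt.yml (excerpt from a job's steps)
- name: Write service account keyfile from the repository secret
  run: echo "$DBT_SA_KEY_JSON" > "$RUNNER_TEMP/sa-dbt-prod.json"
  env:
    DBT_SA_KEY_JSON: ${{ secrets.DBT_SA_KEY_JSON }}

- name: Run dbt against production
  run: dbt build --target prod
  env:
    DBT_KEYFILE: ${{ runner.temp }}/sa-dbt-prod.json
```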
Service account impersonation: The recommended pattern for sophisticated local development. Developers authenticate via OAuth but impersonate a service account with appropriate permissions:
```yaml
dev:
  type: bigquery
  method: oauth
  project: analytics-dev
  dataset: dbt_dev
  impersonate_service_account: sa-dbt-dev@analytics-dev.iam.gserviceaccount.com
```

This avoids distributing service account keys while maintaining clear permission boundaries. The developer’s identity appears in audit logs as the actor, with the impersonated service account noted. You get the auditability of personal authentication with the permission constraints of a service account.
The generate_schema_name Macro
By default, dbt creates dataset names by combining your target schema with any custom schema: <target_schema>_<custom_schema>. If your target schema is dbt_alice and a model specifies schema: base, the model lands in dbt_alice_base.
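For reference, the custom schema in that example comes from a model or folder config like this (project and folder names are illustrative):

```yaml
# dbt_project.yml (excerpt)
models:
  my_analytics:
    base:
      +schema: base  # combined with the target schema by generate_schema_name
```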
This works for development but creates awkward production dataset names. You probably want production models in base, not analytics_base.
Override the default behavior with dbt’s built-in alternative:
```sql
-- macros/generate_schema_name.sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {{ generate_schema_name_for_env(custom_schema_name, node) }}
{%- endmacro %}
```

In production (target.name == 'prod'), models land in just the custom schema — base rather than analytics_base, with no prefix. In dev, everything stays in your personal target schema (dbt_alice).
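The built-in macro’s logic is roughly the following (paraphrased; check your dbt version’s source for the exact definition):

```sql
-- Approximation of dbt's built-in generate_schema_name_for_env
{% macro generate_schema_name_for_env(custom_schema_name, node) -%}
    {%- if target.name == 'prod' and custom_schema_name is not none -%}
        {{ custom_schema_name | trim }}  {# prod: custom schema only, no prefix #}
    {%- else -%}
        {{ target.schema }}              {# everywhere else: the target schema #}
    {%- endif -%}
{%- endmacro %}
```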
This is one of those macros you set up once and never think about again — but forgetting to set it up means production datasets get ugly prefixed names, or dev datasets collide with production. The dbt Project Structure and Naming note covers the broader naming conventions that this macro enables.
Job Labels for Cost Attribution
BigQuery jobs can carry labels that appear in billing exports and INFORMATION_SCHEMA. dbt can automatically apply labels identifying which model generated each query:
```yaml
# dbt_project.yml
query-comment:
  comment: "{{ query_comment(node) }}"
  job-label: true
```

With this enabled, every query dbt runs carries labels like dbt_model: base__shopify__orders. This is what makes cost-per-model analysis possible.
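The query_comment macro referenced in that config is one you define yourself. A minimal sketch that emits the two label keys assumed by the query below (the key names are an assumption and must match whatever your macro actually produces):

```sql
-- macros/query_comment.sql (illustrative)
{% macro query_comment(node) %}
    {%- set comment_dict = {
        'dbt_invocation_id': invocation_id,
        'dbt_model': node.name if node is not none else 'none'
    } -%}
    {{ return(tojson(comment_dict)) }}
{% endmacro %}
```

Because the comment is valid JSON and job-label is true, dbt attaches each key/value pair as a BigQuery job label.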
Query INFORMATION_SCHEMA.JOBS_BY_PROJECT to analyze costs by model:
```sql
SELECT
  (SELECT value FROM UNNEST(labels) WHERE key = 'dbt_invocation_id') AS run_id,
  (SELECT value FROM UNNEST(labels) WHERE key = 'dbt_model') AS model,
  COUNT(*) AS query_count,
  SUM(total_bytes_billed) / POW(10, 9) AS total_gb_billed,
  SUM(total_slot_ms) / 1000 AS total_slot_seconds
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
GROUP BY 1, 2
ORDER BY total_gb_billed DESC
LIMIT 50;
```

The results show which models consume the most bytes and slot time, so optimization effort can be targeted. Without job labels, you can only analyze total spend; there is no way to attribute cost to individual models.
Cost Controls Checklist
maximum_bytes_billed is unlimited unless you set it, and the job timeout default is not something to rely on either. Set explicit limits on every target:
```yaml
prod:
  type: bigquery
  # ... other settings ...
  maximum_bytes_billed: 50000000000    # 50 GB
  job_execution_timeout_seconds: 1800  # 30 minutes
```

Also consider project-level quotas in the GCP console. You can set daily limits on bytes scanned, providing a backstop even if individual query limits fail.
The hierarchy of cost controls, from most to least granular:
- maximum_bytes_billed per target — catches individual runaway queries
- job_execution_timeout_seconds per target — catches infinite loops and hung queries
- Project-level daily quotas — catches sustained overspend from many queries
- Budget alerts — catches everything else, but only after the money is spent
Layer all four. Each catches failures that the others miss.