dbt-to-Dataform Migration Process

The step-by-step process for migrating a dbt project to Dataform — auditing what you have, running the automated tool, converting macros to JavaScript includes, recreating tests as assertions, and setting up orchestration.

Tags: dbt, dataform, bigquery, data engineering, data modeling

The migration from dbt to Dataform follows a predictable sequence, but the effort is not uniformly distributed. Steps 1 and 2 — auditing and running the automated tool — are fast. Step 3 — converting macros to JavaScript includes — is where projects slow down or stall entirely. Steps 4 through 6 are mechanical but require careful execution.

Before starting, verify that migration makes sense for your project. Check whether you rely on features that have no Dataform equivalent: snapshots, heavy package usage, microbatch incremental strategy, or Slim CI. If any of these apply significantly, migration may cost more than it saves.

Step 1: Audit Your Project

Before touching any code, understand exactly what you have. Surprises during migration are expensive.

Terminal window
# Count models by type
find models -name "*.sql" | wc -l
# List macro files
find macros -name "*.sql"
# Check packages.yml for dependencies
cat packages.yml

Document:

  • Total model count — scales the overall effort estimate
  • Number of custom macros — the primary cost driver; budget 40-60% of total migration time if you have 20+ custom macros
  • External packages used — each package dependency is a conversion task; dbt_utils is manageable, dbt_expectations is significant, source-specific packages (dbt_ga4, dbt_shopify) are substantial
  • Incremental model strategies — standard updated_at > last_run is low complexity; anything involving microbatch or complex merge logic needs manual work
  • Snapshot tables — each snapshot needs a manual SCD2 implementation in Dataform
  • CI/CD setup — if you’re relying on dbt Cloud’s Slim CI, budget time to rebuild CI automation
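
A few more commands help quantify the items above (a sketch, assuming the standard dbt layout of models/, macros/, and snapshots/ directories):

Terminal window
# Count incremental models
grep -rl "is_incremental()" models | wc -l
# Count snapshots (each needs a manual SCD2 rebuild in Dataform)
find snapshots -name "*.sql" | wc -l
# Flag macros that query the warehouse at compile time (the hardest to convert)
grep -rl "run_query" macros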

This audit determines whether you’re looking at a 1-2 week project or a 2-3 month one. Don’t skip it.

Step 2: Run the Automated Migration Tool

The ra_dbt_to_dataform tool handles the mechanical parts of conversion:

Terminal window
# Clone the migration tool
git clone https://github.com/rittmananalytics/ra_dbt_to_dataform.git
cd ra_dbt_to_dataform
# Install dependencies
pip install -r requirements.txt
# Run migration
python migrate.py --dbt-project /path/to/dbt --output /path/to/dataform

The tool converts:

  • Model references — {{ ref('model') }} to ${ref("model")}
  • Source declarations — generating Dataform declaration files from your sources.yml
  • Common dbt_utils functions — surrogate_key, star, and a subset of date utilities
  • Basic incremental logic — {% if is_incremental() %} to ${when(incremental(), ...)}

It won’t handle:

  • Seeds (CSV files become BigQuery tables plus declaration files, requiring manual work; see the sketch after this list)
  • Snapshots (no equivalent; manual SCD2 implementation required)
  • Complex custom macros (the tool uses GPT-4 for macro translation — an acknowledgment that Jinja-to-JavaScript conversion is too irregular for deterministic tooling)
  • Semantic layer definitions
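
The seed replacement, for example, is manual but mechanical: load each CSV into BigQuery once (with bq load or the console), then add a declaration so downstream models can ${ref()} it. A sketch with illustrative dataset and table names:

-- definitions/sources/country_codes.sqlx
config {
  type: "declaration",
  schema: "seeds",
  name: "country_codes"
}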

Treat the tool output as a starting point, not a finished product, and plan to review every converted file. The tool handles the structural conversion reliably but can produce JavaScript include files that compile yet behave differently from the original Jinja. Numeric precision, null handling, and string coercion differ between the two templating systems — see JavaScript vs Jinja in Analytics Engineering for the specifics.
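
As a point of reference, here is roughly the shape of the conversion on a simple incremental model (a sketch with illustrative names; the tool's actual output will differ in detail):

dbt model:

{{ config(materialized='incremental', unique_key='order_id') }}

SELECT order_id, customer_id, order_total, updated_at
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
{% endif %}

Dataform SQLX:

config { type: "incremental", uniqueKey: ["order_id"] }

SELECT order_id, customer_id, order_total, updated_at
FROM ${ref("stg_orders")}
${when(incremental(), `WHERE updated_at > (SELECT MAX(updated_at) FROM ${self()})`)}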

Step 3: Convert Macros to JavaScript Includes

This step dominates the migration timeline for any non-trivial project. Every custom Jinja macro becomes a JavaScript function in Dataform’s includes/ directory.

The conversion pattern is consistent for utility functions:

dbt macro (macros/generate_surrogate_key.sql):

{% macro generate_surrogate_key(field_list) %}
  TO_HEX(MD5(CONCAT(
    {% for field in field_list %}
      COALESCE(CAST({{ field }} AS STRING), '')
      {% if not loop.last %}, '|', {% endif %}
    {% endfor %}
  )))
{% endmacro %}

Dataform JavaScript (includes/utils.js):

function generateSurrogateKey(fields) {
  const fieldExpressions = fields
    .map(f => `COALESCE(CAST(${f} AS STRING), '')`)
    .join(", '|', ");
  return `TO_HEX(MD5(CONCAT(${fieldExpressions})))`;
}

module.exports = { generateSurrogateKey };

Usage in SQLX:

config { type: "table" }

js {
  const { generateSurrogateKey } = require("includes/utils");
}

SELECT
  ${generateSurrogateKey(["customer_id", "order__created_at"])} AS surrogate_key,
  customer_id,
  order__created_at,
  order__total_usd
FROM ${ref("base__source__orders")}

Simple utility functions like this translate cleanly. The painful cases:

  • Macros that call run_query() — Jinja macros can execute SQL at compile time. Dataform has no equivalent; the logic must be restructured (a sketch follows this list).
  • Macros using dbt context objects — this, target, model, and graph have no direct Dataform equivalents.
  • Package macros — Any macro from dbt_utils, dbt_expectations, or other packages must be reimplemented or replaced with custom assertions.
  • JavaScript methods with no Jinja equivalent — The reverse problem: Jinja has for loops, if blocks, set, and namespace, but no .map(), .filter(), or .reduce(). If your macros use complex array or object manipulation, you rewrite algorithms, not just syntax.
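
For the run_query() case, the usual restructuring is to move the compile-time lookup into a constant maintained in includes/ (or regenerated by an external script before deployment), since Dataform cannot query the warehouse during compilation. A sketch with hypothetical names:

// includes/pivot_categories.js
// Replaces a dbt macro that used run_query() to fetch distinct categories
// at compile time; the list is now maintained statically (or rewritten by
// an external job before each deployment).
const PIVOT_CATEGORIES = ["online", "retail", "wholesale"];

function pivotCategorySums(valueColumn) {
  // Build one SUM(IF(...)) column per category, mirroring the original Jinja loop.
  return PIVOT_CATEGORIES
    .map(c => `SUM(IF(category = '${c}', ${valueColumn}, 0)) AS ${c}_total`)
    .join(",\n  ");
}

module.exports = { PIVOT_CATEGORIES, pivotCategorySums };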

Budget conservatively. A project with 20 custom macros and several package dependencies can spend most of its migration timeline on this step alone.

Step 4: Recreate Tests as Assertions

Dataform’s built-in assertions handle three cases inline in the config block:

config {
  type: "table",
  assertions: {
    uniqueKey: ["order_id"],
    nonNull: ["order_id", "customer_id", "order__created_at"],
    rowConditions: [
      "order__total_usd >= 0",
      "order__created_at <= CURRENT_DATE()"
    ]
  }
}

This covers uniqueness, nulls, and simple row-level conditions — equivalent to dbt’s unique and not_null tests and to dbt_utils.expression_is_true.
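
For comparison, the dbt schema.yml block these assertions replace typically looked something like this (illustrative):

version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null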

For anything beyond these three types — referential integrity, regex pattern matching, distribution checks, cross-table comparisons — you need separate assertion files:

-- definitions/assertions/assert_valid_customer_emails.sqlx
config { type: "assertion" }

SELECT
  customer_id,
  customer__email
FROM ${ref("mrt__marketing__customers")}
WHERE customer__email NOT LIKE '%@%.%'
  OR customer__email IS NULL

If the query returns rows, the assertion fails. This is Dataform’s equivalent of dbt’s singular tests. It works, but every non-standard test is a new file with handwritten SQL. There’s no parameterization or test library to draw from.

Tests from dbt_expectations that need custom assertion files include:

  • Distribution checks (mean/median within range)
  • Regex pattern validation
  • Cross-table comparisons (referential integrity; sketched below)
  • Freshness checks beyond what’s in the assertions block
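
The cross-table case, dbt’s relationships test, becomes a handwritten assertion along these lines (table and column names are illustrative):

-- definitions/assertions/assert_orders_have_valid_customers.sqlx
config { type: "assertion" }

SELECT o.order_id
FROM ${ref("mrt__sales__orders")} o
LEFT JOIN ${ref("mrt__marketing__customers")} c
  ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL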

See Dataform Testing Limitations for the full scope of what you lose on the testing side.

Step 5: Set Up Orchestration

dbt Cloud has built-in scheduling. Dataform does not — or rather, it has basic scheduling via workflow configurations, but triggering from git events, building PR environments, and running on custom schedules requires external orchestration.

Two GCP-native options:

Cloud Composer (managed Airflow) — For complex pipelines with conditional branching, cross-system dependencies, or advanced retry logic:

from airflow.providers.google.cloud.operators.dataform import (
    DataformCreateCompilationResultOperator,
    DataformCreateWorkflowInvocationOperator,
)

compile_task = DataformCreateCompilationResultOperator(
    task_id="compile",
    project_id="my-project",
    region="us-central1",
    repository_id="my-repo",
)

run_task = DataformCreateWorkflowInvocationOperator(
    task_id="run",
    project_id="my-project",
    region="us-central1",
    repository_id="my-repo",
    compilation_result="{{ task_instance.xcom_pull('compile') }}",
)

Composer runs $300-400/month minimum. See Cloud Composer Cost and Capabilities for when that cost is justified.

Cloud Scheduler + Cloud Workflows — For simpler scheduling needs without Composer’s overhead:

main:
  steps:
    - compile:
        call: http.post
        args:
          url: https://dataform.googleapis.com/v1beta1/projects/PROJECT/locations/REGION/repositories/REPO/compilationResults
          auth:
            type: OAuth2
    - run:
        call: http.post
        args:
          url: https://dataform.googleapis.com/v1beta1/projects/PROJECT/locations/REGION/repositories/REPO/workflowInvocations

Cloud Workflows Orchestration costs $0.01 per 1,000 steps — essentially free for typical transformation pipelines. Cloud Scheduler handles the cron trigger. This combination covers most Dataform scheduling needs without Composer’s fixed cost.
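
The Scheduler side is a single job that invokes the workflow on a cron schedule (a sketch; the job name, schedule, and service account are illustrative):

Terminal window
# Trigger the Cloud Workflow every morning at 06:00 UTC
gcloud scheduler jobs create http dataform-nightly \
  --schedule="0 6 * * *" \
  --uri="https://workflowexecutions.googleapis.com/v1/projects/PROJECT/locations/REGION/workflows/WORKFLOW/executions" \
  --http-method=POST \
  --oauth-service-account-email="dataform-runner@PROJECT.iam.gserviceaccount.com" \
  --location=REGION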

If you need CI-like behavior (build only changed models on PRs), you’ll need GitHub Actions or Cloud Build calling the Dataform REST API. Budget additional setup time for that infrastructure.

Step 6: Parallel Run and Validate

Don’t cut over immediately. Run both systems simultaneously until you’re confident the outputs match.

  1. Deploy the Dataform project to a separate dataset (e.g., analytics_dataform)
  2. Schedule both dbt and Dataform to run on the same cadence
  3. Compare outputs using row counts and checksums

A practical validation query:

SELECT
  'dbt' AS source,
  COUNT(*) AS row_count,
  FARM_FINGERPRINT(TO_JSON_STRING(ARRAY_AGG(t ORDER BY order_id))) AS checksum
FROM `analytics.mrt__sales__orders` t

UNION ALL

SELECT
  'dataform' AS source,
  COUNT(*) AS row_count,
  FARM_FINGERPRINT(TO_JSON_STRING(ARRAY_AGG(t ORDER BY order_id))) AS checksum
FROM `analytics_dataform.mrt__sales__orders` t
  4. Validate downstream dashboards against both sources
  5. Monitor for 2-4 weeks before decommissioning dbt

The dbt Migration Validation Patterns note covers this step in depth — comparison query levels (row counts, aggregates, row-level EXCEPT queries), ML pipeline regression testing, and phased cutover strategy. Read that before declaring the migration complete.
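
The row-level comparison mentioned there is the bluntest but most conclusive check, assuming both tables expose the same column set:

-- Rows produced by dbt that are missing or different in the Dataform output;
-- run the reverse direction as well
SELECT * FROM `analytics.mrt__sales__orders`
EXCEPT DISTINCT
SELECT * FROM `analytics_dataform.mrt__sales__orders`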

The most common surprises during parallel running: incremental model state doesn’t transfer (the first Dataform run must be a full refresh on any incremental model, which can be expensive for large tables), and subtle behavioral differences in JavaScript vs Jinja arithmetic that produce slightly different results on calculated fields.