dbt-audit-helper supports a progressive validation approach: start with the cheapest checks and escalate only when they reveal problems. Each step is cheaper than the next. If an early step already shows a mismatch, there is no need to run a full row comparison.
The Sequence
Step 1: Schema Check
Use compare_relation_columns to verify that column names, ordinal positions, and data types match between your two relations. This is metadata-only — no data scanning.
```sql
{{ audit_helper.compare_relation_columns(
    a_relation=ref('mrt__finance__revenue'),
    b_relation=api.Relation.create(
        database='analytics_db',
        schema='production',
        identifier='mrt__finance__revenue'
    )
) }}
```

If schemas don’t match, stop here. Missing columns, type changes, or reordering will make every downstream comparison unreliable. Fix the schema first.
Step 2: Row Count
Use compare_row_counts for a fast sanity check. Mismatched counts point to filtering differences or join fan-outs.
```sql
{{ audit_helper.compare_row_counts(
    a_relation=ref('mrt__finance__revenue'),
    b_relation=api.Relation.create(
        database='analytics_db',
        schema='production',
        identifier='mrt__finance__revenue'
    )
) }}
```

Row count mismatches almost always trace back to WHERE clause differences, deduplication logic that behaves differently, or JOINs that produce different cardinalities.
Step 3: Quick Hash (Snowflake/BigQuery Only)
quick_are_relations_identical uses hash comparison for a fast pass/fail. If it passes, you’re done. No need for any deeper comparison.
```sql
{{ audit_helper.quick_are_relations_identical(
    a_relation=ref('mrt__finance__revenue'),
    b_relation=api.Relation.create(
        database='analytics_db',
        schema='production',
        identifier='mrt__finance__revenue'
    )
) }}
```

The query variant, quick_are_queries_identical, does the same thing with raw SQL inputs.
Step 4: Row-Level Comparison
If the hash check fails (or you’re not on Snowflake/BigQuery), compare_relations performs row-by-row validation using a UNION-based approach.
```sql
{{ audit_helper.compare_relations(
    a_relation=ref('mrt__finance__revenue'),
    b_relation=api.Relation.create(
        database='analytics_db',
        schema='production',
        identifier='mrt__finance__revenue'
    ),
    exclude_columns=["_fivetran_synced"],
    primary_key="revenue__id",
    summarize=true
) }}
```

With summarize=true, the output shows counts: true/true (matches), true/false (only in A), false/true (only in B). Set summarize=false to see the actual divergent rows.
Step 5: Isolate Problem Columns
compare_which_relation_columns_differ identifies which columns are causing mismatches. Run this before manually checking columns one by one.
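A call might look like the sketch below, mirroring the relations used in earlier steps. Note that the exact parameter names (for example, primary_key_columns as a list in newer releases) vary between package versions, so check the README for the version you have installed.

```sql
{# Sketch: identify which columns contain mismatches.
   primary_key_columns assumed to be a list in newer audit_helper releases. #}
{{ audit_helper.compare_which_relation_columns_differ(
    a_relation=ref('mrt__finance__revenue'),
    b_relation=api.Relation.create(
        database='analytics_db',
        schema='production',
        identifier='mrt__finance__revenue'
    ),
    primary_key_columns=['revenue__id']
) }}
```

The output tells you where to focus Step 6, instead of deep-diving every column in turn.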
Step 6: Deep-Dive Specific Columns
compare_column_values investigates a single column, categorizing each row as: perfect match, both null, missing from A, missing from B, null in A only, null in B only, or values do not match.
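A sketch of a call, following the query-based pattern this macro uses: both inputs are SQL strings, and amount_usd is a hypothetical column standing in for whichever column Step 5 flagged.

```sql
{# Sketch: deep-dive one column. a_query/b_query are raw SQL strings;
   "amount_usd" is a placeholder for the column under investigation. #}
{% set a_query %}
    select * from analytics_db.production.mrt__finance__revenue
{% endset %}

{% set b_query %}
    select * from {{ ref('mrt__finance__revenue') }}
{% endset %}

{{ audit_helper.compare_column_values(
    a_query=a_query,
    b_query=b_query,
    primary_key="revenue__id",
    column_to_compare="amount_usd"
) }}
```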
Step 7: Full Column Audit
compare_all_columns produces a per-column summary. This is the macro best suited for CI integration because it gives a complete picture in a single pass.
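A sketch of a call, assuming the same signature conventions as compare_relations; verify parameter names against your installed version.

```sql
{# Sketch: per-column match summary across the whole table in one pass. #}
{{ audit_helper.compare_all_columns(
    a_relation=ref('mrt__finance__revenue'),
    b_relation=api.Relation.create(
        database='analytics_db',
        schema='production',
        identifier='mrt__finance__revenue'
    ),
    primary_key="revenue__id",
    summarize=true
) }}
```

Wrapping a call like this in a dbt test is one way to make the comparison repeatable in CI rather than a one-off notebook exercise.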
Step 8: Row Classification
compare_and_classify_relation_rows labels each row as identical, modified, added, or removed. The sample_limit parameter (default 20) caps the number of sample rows per classification.
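A sketch of a call; the primary_key_columns and sample_limit parameter names are assumptions based on the newer-style macros, so confirm them against the package README.

```sql
{# Sketch: classify each row as identical, modified, added, or removed.
   sample_limit caps sample rows returned per classification (default 20). #}
{{ audit_helper.compare_and_classify_relation_rows(
    a_relation=ref('mrt__finance__revenue'),
    b_relation=api.Relation.create(
        database='analytics_db',
        schema='production',
        identifier='mrt__finance__revenue'
    ),
    primary_key_columns=['revenue__id'],
    sample_limit=20
) }}
```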
Why This Order Matters
The economics are straightforward. Steps 1-3 are cheap: metadata queries, single-pass counts, and hash aggregations. Steps 4-8 involve full table scans, FULL OUTER JOINs, and UNION ALLs. On large tables, the difference between “schema check” and “row-level comparison” can be the difference between a 2-second query and a 20-minute warehouse burn.
The other benefit is diagnostic. If your schema doesn’t match, a row-level diff will be noisy and misleading. If your row counts diverge, you already know the nature of the problem (missing/extra rows) without needing to inspect individual values. Each step narrows the search space for the next.
Applying the Workflow
For Refactoring
You’re rewriting a CTE, changing a join strategy, or restructuring a model. Point ref() at your dev schema and api.Relation.create at production. Run the progressive sequence. Any differences surface immediately against a known-good baseline.
For SQL-to-dbt Migration
Load your existing tables into the same warehouse, build dbt equivalents, and compare systematically. A practical tip from the Indicium Tech team: compare 5 columns at a time, confirm they match, then expand. This keeps the feedback loop tight. Trying to debug all mismatches simultaneously usually leads to confusion about which change fixed what.
This approach extends the manual comparison strategies in dbt Migration Validation Patterns with automated, repeatable macros instead of ad-hoc EXCEPT queries.
For Large-Scale ETL Migrations
For enterprise-scale migrations (Informatica to dbt, legacy ETL to dbt), the dbt-audit-helper-ext package by Infinite Lambda extends audit-helper with a validation log table, codegen scripts for auto-generating validation macros, and threshold-based acceptance: 100% match = pass, 99%+ = warning, below 99% = fail.