
dbt Source Schema Validation

How to validate source schema in dbt when contracts can't reach — using dbt-expectations on sources to catch column drift before transformation runs.

Planted
dbt · data quality · testing

dbt model contracts are explicit about their scope: they don’t apply to sources. Jeremy Cohen was direct about this when contracts shipped: “I’m also not envisioning the application of contracts to data sources.” Contracts run at compile time by comparing your SQL’s output against a YAML declaration — but sources are external tables outside dbt’s build process entirely.

This creates a real gap. The upstream data that feeds your models can change its schema, and nothing in dbt’s native contract system will catch it.

The Gap Contracts Leave Open

The standard workaround — placing a contracted base model directly on top of a source — does close part of the gap. If Fivetran renames a column in the source, the base model’s contract fails at compile time, and the build stops. You find out early.

But it only catches structural issues, and only at transformation time. The data has already landed in your warehouse before anything runs. More importantly: if a source column that used to contain five distinct values now contains fifteen, the contract on the base model won’t notice. The column name is right, the type is right, the contract is satisfied — but the semantics have changed, and that difference will propagate through your transformations silently.
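To make that failure mode concrete, here is a minimal Python sketch (not dbt code; the column names mirror the orders example used in this note) showing how a purely structural check passes while the value set has drifted:

```python
# Sketch only: dbt contracts compare the SQL output schema against YAML at
# compile time. This mimics that structural check, plus a value-set check.
expected_schema = {"order_id": str, "customer_id": str, "amount": float, "status": str}
allowed_statuses = {"pending", "shipped", "delivered", "cancelled"}

rows = [
    {"order_id": "1", "customer_id": "9", "amount": 12.5, "status": "shipped"},
    # Upstream added a new status value: names and types are unchanged.
    {"order_id": "2", "customer_id": "7", "amount": 40.0, "status": "returned"},
]

def structural_check(rows):
    """Passes as long as column names and types match, like a contract."""
    return all(
        set(r) == set(expected_schema)
        and all(isinstance(r[c], t) for c, t in expected_schema.items())
        for r in rows
    )

def value_set_check(rows):
    """Fails on semantic drift, like expect_column_values_to_be_in_set."""
    return all(r["status"] in allowed_statuses for r in rows)

print(structural_check(rows))  # True: the contract is satisfied
print(value_set_check(rows))   # False: 'returned' is a new, unexpected value
```

The contract-style check has no opinion about values, only shape; that is exactly the blind spot the content tests below are for.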

dbt-expectations on Sources

dbt-expectations fills this gap. The package’s schema-level tests work on sources, giving you structural validation that contracts can’t provide there:

sources:
  - name: raw_data
    tables:
      - name: raw_orders
        data_tests:
          - dbt_expectations.expect_table_columns_to_match_set:
              column_list: ["order_id", "customer_id", "amount", "status"]
          - dbt_expectations.expect_table_column_count_to_equal:
              value: 4
        columns:
          - name: status
            data_tests:
              - dbt_expectations.expect_column_values_to_be_in_set:
                  value_set: ['pending', 'shipped', 'delivered', 'cancelled']
          - name: amount
            data_tests:
              - dbt_expectations.expect_column_values_to_be_between:
                  min_value: 0
                  max_value: 1000000

expect_table_columns_to_match_set catches added or removed columns. expect_table_column_count_to_equal catches unexpected structural changes even when you haven’t enumerated the column names. The column-level tests validate content constraints that contracts can’t express even on models.

This is specifically valuable for sources because:

  • Schema changes in sources happen outside your control (the EL tool does what the API sends)
  • You often don’t own the upstream system and can’t add contracts there
  • Problems caught at the source surface before any transformation runs, which is earlier and cheaper

Where to Put Source Tests

There’s a real tradeoff between testing at the source versus testing on the base model. Neither option is obviously correct.

Testing on sources directly:

  • Problems surface before any transformation runs
  • Clear signal: this is a source issue, not a transformation bug
  • Tests run against raw tables that may be large and expensive to scan

Testing on base models:

  • Lets you combine structural checks (via contracts) with content checks in one place
  • Base models often filter or deduplicate, making queries cheaper
  • Downstream from any EL transformation that might reformat source data

A practical split that works well: put the cheap structural tests on sources and the more expensive content tests on base models.

On sources:

  • expect_table_columns_to_match_set — this is cheap because it’s a metadata check against INFORMATION_SCHEMA, not a full table scan
  • expect_table_column_count_to_equal — same, metadata-only

On base models:

  • expect_column_values_to_be_in_set — a SQL query across all rows; cheaper on a filtered base model than on the raw source
  • expect_column_values_to_be_between — same reasoning; filtered data means fewer rows scanned

The metadata checks are legitimately cheap to run directly on sources. They’ll tell you about structural drift without touching the data. The content checks hit every row, so they belong where the data is already filtered.
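The metadata-only nature of the structural checks can be illustrated with Python's built-in sqlite3 as a stand-in for the warehouse (the actual dbt-expectations tests compile to SQL against the warehouse's INFORMATION_SCHEMA, but the principle is the same):

```python
import sqlite3

# Stand-in warehouse table; real sources live in your warehouse.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE raw_orders (order_id TEXT, customer_id TEXT, amount REAL, status TEXT)"
)

# Structural check: read column metadata only. No row is ever scanned,
# which is why this stays cheap even on a very large raw table.
expected = {"order_id", "customer_id", "amount", "status"}
actual = {row[1] for row in conn.execute("PRAGMA table_info(raw_orders)")}

print(actual == expected)   # True: no drift
print(len(actual) == 4)     # True: column count matches

# Simulate EL-tool drift: a renamed column changes the metadata.
conn.execute("ALTER TABLE raw_orders RENAME COLUMN amount TO amount_usd")
actual = {row[1] for row in conn.execute("PRAGMA table_info(raw_orders)")}
print(actual == expected)   # False: drift caught without touching the data
```

A value-set or range check has no such shortcut: it has to read rows, which is why it belongs on the filtered base model.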

The Contracted Base Model Pattern

Combining the contracted base model with source-level expectations gives you two layers of protection with different detection points:

# Source: schema-level expectations catch drift early
sources:
  - name: raw_data
    tables:
      - name: raw_orders
        data_tests:
          - dbt_expectations.expect_table_columns_to_match_set:
              column_list: ["order_id", "customer_id", "amount", "status"]

# Base model: contract catches structural mismatch at compile time
models:
  - name: base__orders
    config:
      contract:
        enforced: true
    columns:
      - name: order_id
        data_type: string
      - name: customer_id
        data_type: string
      - name: amount
        data_type: numeric
      - name: status
        data_type: string
        data_tests:
          - dbt_expectations.expect_column_values_to_be_in_set:
              value_set: ['pending', 'shipped', 'delivered', 'cancelled']

The source test runs after the EL job has landed data but before any transformation SQL fires (with dbt build, source tests execute ahead of the models that depend on them). The base model contract catches the same structural problems at compile time: the model doesn't build. The base model's content tests catch semantic drift that neither structural check can see.

This matters most when your source schema changes frequently. If you’re ingesting from a volatile API or an upstream team that ships changes without notice, this two-layer approach gives you the earliest possible signal — which means the fewest wasted pipeline runs and the least confusion about what broke and where.

What This Still Doesn’t Cover

Source validation with dbt-expectations runs post-load, after data has already arrived in your warehouse. It’s not a pre-ingestion check. If you need to validate before loading, that belongs in your EL layer — dlt has native schema contracts that can enforce structure before data lands.

For anomalies in source data that aren’t schema problems — volume drops, distribution shifts, unexpected null rates — you’re looking at Elementary’s anomaly detection or similar tools. Schema tests catch structural changes; they don’t catch “the data looks structurally fine but 30% of records are missing.”