Note

Soda Data Contract Verification

How Soda's contract engine validates schema, freshness, and quality rules against warehouse tables after loading but before transformation — filling the gap between EL and dbt.

Planted
Tags: dbt, data quality, data engineering, testing

Soda Data Contracts provides a YAML-based contract engine that runs programmatically after data lands in the warehouse but before dbt runs, enabling pre-transformation contract verification.

Why This Layer Matters

dbt-expectations on sources and model contracts operate during the dbt run — after data has landed. For most teams, source tests catch problems early enough that downstream mart models don’t build on corrupted data.

Verifying after load but before transformation matters most in two scenarios:

  1. Source schema changes cascade into multiple projects. If three dbt projects consume the same source table and a column disappears, then without a shared gate each of the three projects has to fail and alert on its own. Post-load verification catches the problem once, at the boundary, before any project runs.

  2. The cost of bad data existing in the warehouse at all is too high. Some regulatory or operational contexts can’t tolerate even briefly having non-conforming data persisted. Post-load verification cannot undo the load itself, but paired with a pipeline gate it prevents downstream processes from touching data that hasn’t passed validation.
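
The first scenario amounts to verifying each source table once and letting that single result decide which projects may run. A minimal sketch of that idea, assuming hypothetical table and project names and an illustrative `runnable_projects` helper (none of this is part of Soda or dbt):

```python
from typing import Dict, List, Set


def runnable_projects(
    source_ok: Dict[str, bool],
    project_sources: Dict[str, List[str]],
) -> Set[str]:
    """Return the projects whose source tables all passed verification.

    A table with no recorded result is treated as failed, so a project
    never runs on a source that was not verified.
    """
    return {
        project
        for project, sources in project_sources.items()
        if all(source_ok.get(table, False) for table in sources)
    }


# Hypothetical verification results: one entry per source table, not per project.
source_ok = {"raw.orders": False, "raw.customers": True}

# Hypothetical mapping of dbt projects to the source tables they read.
project_sources = {
    "finance": ["raw.orders", "raw.customers"],
    "marketing": ["raw.orders"],
    "support": ["raw.customers"],
}

allowed = runnable_projects(source_ok, project_sources)
```

Here the failure on `raw.orders` is detected once, and only the project that does not read that table is cleared to run.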

Soda Contract YAML

Soda v4 provides a purpose-built contract format with schema validation, freshness checks, missing value detection, validity rules, and duplicate detection:

dataset: datasource/db/schema/orders
checks:
  - schema:
  - row_count:
columns:
  - name: order_id
    data_type: VARCHAR
    checks:
      - type: duplicate_count
  - name: status
    data_type: VARCHAR
    checks:
      - type: invalid_count
        valid_values: ['pending', 'shipped', 'delivered', 'cancelled']

The schema check validates that the table’s actual columns match what the contract declares — if a column is missing or a new column appears, the check fails. The column-level checks validate content constraints: duplicate_count on order_id catches primary key violations, and invalid_count on status catches unexpected values.
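
To make the semantics of these checks concrete, here is a plain-Python sketch of what each one measures. It is not Soda's implementation: the function names are illustrative, the rows are made up, and counting duplicates as "distinct values that occur more than once" is an assumption about the definition.

```python
from collections import Counter
from typing import Iterable, List, Optional, Tuple


def schema_mismatches(
    declared: Iterable[str], actual: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (missing, unexpected) column names relative to the contract."""
    declared_set, actual_set = set(declared), set(actual)
    return sorted(declared_set - actual_set), sorted(actual_set - declared_set)


def duplicate_count(values: Iterable[str]) -> int:
    """Count distinct values that occur more than once (assumed definition)."""
    return sum(1 for occurrences in Counter(values).values() if occurrences > 1)


def invalid_count(values: Iterable[Optional[str]], valid_values: Iterable[str]) -> int:
    """Count non-null values that fall outside the allowed set."""
    allowed = set(valid_values)
    return sum(1 for value in values if value is not None and value not in allowed)


# Made-up rows standing in for the orders table.
rows = [
    {"order_id": "A1", "status": "pending"},
    {"order_id": "A2", "status": "shipped"},
    {"order_id": "A2", "status": "refunded"},
]

missing, unexpected = schema_mismatches(["order_id", "status"], rows[0].keys())
dupes = duplicate_count(row["order_id"] for row in rows)
invalid = invalid_count(
    (row["status"] for row in rows),
    ["pending", "shipped", "delivered", "cancelled"],
)

# A renamed column shows up as one missing and one unexpected column.
drift = schema_mismatches(["order_id", "status"], ["order_id", "state"])
```

A contract passes only when the schema lists are empty and the counts are zero; in this sample the repeated `A2` and the `refunded` status would each fail their check.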

This format is deliberately more focused than a full ODCS contract. It doesn’t cover SLAs, ownership metadata, or governance properties — it’s purely about runtime validation. If you’re using ODCS for the organizational layer, Soda’s contract verification handles the runtime execution of the quality rules embedded in that broader contract.

Integration with Pipeline Orchestration

Soda verification runs programmatically via Python, which makes it straightforward to slot into orchestration between EL and transformation:

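# Import path and method names follow this note; the contracts API has changed across Soda releases.
from soda.contracts.contract import Contract, ContractResult

# Load the contract definition and run its checks against the warehouse table.
contract = Contract.from_file("contracts/orders.yml")
result: ContractResult = contract.verify()

# Stop the pipeline before dbt runs if any check failed.
if not result.is_ok():
    raise RuntimeError(f"Contract verification failed: {result.errors}")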

The orchestration pattern looks like this:

EL tool finishes → Soda contract verification → dbt build

If Soda verification fails, the pipeline stops. dbt never runs. The failed contract triggers an alert, and someone investigates the source change before any transformation touches the data.

In Airflow, Dagster, or Prefect, this is a task dependency: the Soda verification task must succeed before the dbt task starts. In simpler setups (cron-based or GitHub Actions), it’s a sequential step with an exit code check.
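
For the simpler setups, the sequential step with an exit code check can be sketched as a small runner that stops at the first failing command. This is an illustration under assumptions: the step names are hypothetical, and the placeholder commands stand in for the real EL job, the Soda verification script, and `dbt build`.

```python
import subprocess
import sys
from typing import List, Sequence, Tuple


def run_pipeline(steps: Sequence[Tuple[str, List[str]]]) -> Tuple[int, List[str]]:
    """Run steps in order and stop at the first non-zero exit code.

    Returns the exit code of the last step that ran and the names of
    the steps that were actually started.
    """
    executed: List[str] = []
    for name, command in steps:
        executed.append(name)
        exit_code = subprocess.run(command).returncode
        if exit_code != 0:
            return exit_code, executed
    return 0, executed


# Placeholder commands: each one just exits with a fixed code.
# Here the verification step is made to fail to show the gate closing.
steps = [
    ("load", [sys.executable, "-c", "raise SystemExit(0)"]),
    ("verify_contracts", [sys.executable, "-c", "raise SystemExit(1)"]),
    ("dbt_build", [sys.executable, "-c", "raise SystemExit(0)"]),
]

exit_code, executed = run_pipeline(steps)
```

In a cron job or CI workflow the script would end with `sys.exit(exit_code)` so the scheduler sees the failure; the point of the sketch is that `dbt_build` never starts once verification returns a non-zero code.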

Soda vs. Elementary

Soda and Elementary serve different points in the pipeline, and understanding the distinction prevents both overlap and gaps:

Soda sits outside dbt. It can verify data before dbt touches it. It runs as a standalone Python process against warehouse tables, independent of any dbt project. This makes it the right tool for the post-load, pre-transformation boundary.

Elementary runs inside dbt. It’s a dbt package that executes during dbt test or dbt run. It catches changes during the transformation run — schema changes, volume anomalies, freshness anomalies — but the data has already been loaded and dbt has already started processing.

The coverage differences:

| Capability | Soda Contracts | Elementary |
| --- | --- | --- |
| Pre-dbt verification | Yes | No |
| Schema validation | Yes (contract-based) | Yes (baseline comparison) |
| Volume anomalies | Limited (row_count check) | Yes (statistical, Z-score) |
| Freshness monitoring | Yes | Yes |
| Column-level anomaly detection | Limited | Yes (full statistical) |
| dbt integration | External, via orchestration | Native dbt package |
| Historical baselines | No (rule-based) | Yes (training period) |

Using both gives you coverage at two enforcement points: Soda catches problems at the warehouse boundary before transformation starts, and Elementary catches anomalies during transformation that rule-based checks miss. They’re complementary, not competing.
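
The difference between a rule-based volume check and a statistical one can be shown with a small example. This is a generic Z-score sketch, not Elementary's actual implementation; the history values, the minimum of 1 row, and the threshold of 3 standard deviations are all assumptions chosen for illustration.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def rule_based_ok(row_count: int, minimum: int) -> bool:
    """Fixed-threshold check: passes as long as the table has enough rows."""
    return row_count >= minimum


def z_score(row_count: int, history: Sequence[int]) -> float:
    """Distance of today's count from the historical mean, in standard deviations."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 points
    if sigma == 0:
        # Constant history: any deviation at all is treated as extreme.
        return 0.0 if row_count == mu else math.copysign(math.inf, row_count - mu)
    return (row_count - mu) / sigma


# Made-up daily row counts from a training period, then a sudden drop.
history = [1000, 1020, 980, 1010, 990]
today = 600

passes_rule = rule_based_ok(today, minimum=1)
score = z_score(today, history)
is_anomaly = abs(score) > 3.0
```

The fixed rule passes because 600 rows is still more than the minimum, while the Z-score flags the drop because it sits far outside the recent range. That gap is what the historical baseline buys you.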

When to Adopt Soda Contracts

For most teams, dbt-expectations on sources provides sufficient post-load validation without adding another tool. The tests run during dbt build, catch structural and content drift, and integrate with your existing dbt testing workflow.

Soda contracts earn their place when:

  • You need a hard gate between loading and transformation — not “test during dbt” but “verify before dbt starts”
  • Multiple dbt projects consume the same source tables, and you want centralized verification instead of redundant source tests
  • You’re implementing ODCS and want a runtime engine that executes quality rules from your contract specification
  • Your pipeline orchestrator supports task dependencies and you want explicit pass/fail gates between pipeline stages

If you aren’t hitting the post-load gaps described above, stay with dbt-expectations source tests. Soda contracts are the next step in the layered enforcement model once you need pre-dbt gates, centralized multi-project verification, or ODCS integration.

Soda Cloud vs. OSS

Soda Core (the open-source library) handles contract verification, schema checks, and basic quality rules. It runs wherever Python runs and stores no state — each verification is independent.

Soda Cloud adds historical tracking (how has this contract’s pass rate changed over time?), team collaboration features, incident management, and integrations with data catalogs and alerting systems. The pricing is usage-based.
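
To see what "no state" means in practice, consider what it takes to answer the pass-rate question yourself: something has to persist each run's outcome. The sketch below appends results to a local JSON Lines file and computes a pass rate from it. The file name, record fields, and sample outcomes are hypothetical, and this is not how Soda Cloud stores or reports history.

```python
import json
import tempfile
from pathlib import Path


def record_result(log_path: Path, contract: str, run_date: str, passed: bool) -> None:
    """Append one verification outcome as a JSON line."""
    record = {"contract": contract, "date": run_date, "passed": passed}
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")


def pass_rate(log_path: Path, contract: str) -> float:
    """Fraction of recorded runs for a contract that passed."""
    with log_path.open(encoding="utf-8") as handle:
        runs = [json.loads(line) for line in handle if line.strip()]
    outcomes = [run["passed"] for run in runs if run["contract"] == contract]
    if not outcomes:
        raise ValueError(f"No recorded runs for contract {contract!r}")
    return sum(outcomes) / len(outcomes)


# Hypothetical outcomes from four daily runs of an "orders" contract.
log_path = Path(tempfile.mkdtemp()) / "contract_runs.jsonl"
sample_runs = [
    ("2024-01-01", True),
    ("2024-01-02", False),
    ("2024-01-03", True),
    ("2024-01-04", True),
]
for run_date, passed in sample_runs:
    record_result(log_path, "orders", run_date, passed)

rate = pass_rate(log_path, "orders")
```

A homegrown log like this covers a single trend line; the collaboration, incident, and catalog features listed above are where a managed layer starts to pay for itself.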

For teams that just need the pre-dbt gate, Soda Core is sufficient. The Cloud layer becomes valuable when you want visibility into contract compliance trends across your organization — particularly useful when making the organizational case for expanding contract coverage.