Soda Data Contract Verification

Soda Data Contracts provides a YAML-based contract engine that runs programmatically after data lands in the warehouse but before dbt runs, enabling pre-transformation contract verification.

Why This Layer Matters

Both dbt-expectations tests on sources and dbt model contracts operate during the dbt run, after data has landed. For most teams, source tests catch problems early enough that downstream mart models don’t build on corrupted data.

Verifying after load but before dbt runs matters more in two scenarios:

Source schema changes cascade into multiple projects. If three dbt projects consume the same source table and a column disappears, tests that live inside dbt make all three projects fail and alert independently. Post-load verification catches the problem once, at the boundary, before any project runs.
The cost of bad data existing in the warehouse at all is too high. Some regulatory or operational contexts can’t tolerate even briefly having non-conforming data persisted. Post-load verification with a pipeline gate prevents downstream processes from touching data that hasn’t passed validation.

Soda Contract YAML

Soda v4 provides a purpose-built contract format with schema validation, freshness checks, missing value detection, validity rules, and duplicate detection:

dataset: datasource/db/schema/orders
checks:
  - schema:
  - row_count:
columns:
  - name: order_id
    data_type: VARCHAR
    checks:
      - type: duplicate_count
  - name: status
    data_type: VARCHAR
    checks:
      - type: invalid_count
        valid_values: ['pending', 'shipped', 'delivered', 'cancelled']

The schema check validates that the table’s actual columns match what the contract declares — if a column is missing or a new column appears, the check fails. The column-level checks validate content constraints: duplicate_count on order_id catches primary key violations, and invalid_count on status catches unexpected values.
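To make the semantics of these three checks concrete, here is a minimal pure-Python sketch of the comparisons they describe. It is an illustration only, not Soda’s implementation: the function names (schema_diff, duplicate_count, invalid_count), the sample rows, and the extra coupon column are all hypothetical, and the duplicate definition used here (distinct values that occur more than once) is an assumption.

```python
from collections import Counter


def schema_diff(declared, actual):
    """Return (missing, unexpected) column names, compared as sets."""
    declared_set, actual_set = set(declared), set(actual)
    missing = sorted(declared_set - actual_set)
    unexpected = sorted(actual_set - declared_set)
    return missing, unexpected


def duplicate_count(values):
    """Count distinct values that occur more than once (assumed definition)."""
    return sum(1 for n in Counter(values).values() if n > 1)


def invalid_count(values, valid_values):
    """Count non-null values that fall outside the allowed set."""
    allowed = set(valid_values)
    return sum(1 for v in values if v is not None and v not in allowed)


# Hypothetical sample data standing in for the orders table.
rows = [
    {"order_id": "A1", "status": "pending"},
    {"order_id": "A2", "status": "shipped"},
    {"order_id": "A2", "status": "refunded"},
]

missing, unexpected = schema_diff(
    declared=["order_id", "status"],
    actual=["order_id", "status", "coupon"],
)
dupes = duplicate_count([r["order_id"] for r in rows])
invalid = invalid_count(
    [r["status"] for r in rows],
    ["pending", "shipped", "delivered", "cancelled"],
)
```

In this sample the schema comparison flags the extra coupon column, the repeated order_id produces a non-zero duplicate count, and the refunded status is counted as invalid. Any one of those would fail the contract.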

This format is deliberately more focused than a full Open Data Contract Standard (ODCS) contract. It doesn’t cover SLAs, ownership metadata, or governance properties; it’s purely about runtime validation. If you’re using ODCS for the organizational layer, Soda’s contract verification handles the runtime execution of the quality rules embedded in that broader contract.

Integration with Pipeline Orchestration

Soda verification runs programmatically via Python, which makes it straightforward to slot into orchestration between EL and transformation:

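# Confirm the import path and method names against your installed Soda version.
from soda.contracts.contract import Contract, ContractResult

# Load the contract definition and verify it against the warehouse.
contract = Contract.from_file("contracts/orders.yml")
result: ContractResult = contract.verify()

# Fail loudly so the orchestrator stops before dbt runs.
if not result.is_ok():
    raise RuntimeError(f"Contract verification failed: {result.errors}")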

The orchestration pattern looks like this:

EL tool finishes → Soda contract verification → dbt build

If Soda verification fails, the pipeline stops. dbt never runs. The failed contract triggers an alert, and someone investigates the source change before any transformation touches the data.

In Airflow, Dagster, or Prefect, this is a task dependency: the Soda verification task must succeed before the dbt task starts. In simpler setups (cron-based or GitHub Actions), it’s a sequential step with an exit code check.
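For the simpler cron or CI case, the exit code check can be sketched with nothing but the standard library. This is a minimal sketch under stated assumptions: run_gated and the step names are hypothetical, and the two commands are stand-ins that only exit with a chosen code. In a real pipeline they would be whatever script runs your contract verification and the dbt build command.

```python
import subprocess
import sys


def run_gated(steps):
    """Run (name, command) steps in order and stop at the first non-zero exit code.

    Returns the name of the step that failed, or None if every step succeeded.
    """
    for name, command in steps:
        completed = subprocess.run(command)
        if completed.returncode != 0:
            # Later steps are never started, which is the gate.
            return name
    return None


# Stand-in commands so the sketch runs anywhere (hypothetical placeholders).
ok = [sys.executable, "-c", "raise SystemExit(0)"]
fail = [sys.executable, "-c", "raise SystemExit(1)"]

# Verification fails, so the dbt step is never launched.
failed_step = run_gated([("verify_contracts", fail), ("dbt_build", ok)])
```

An orchestrator’s task dependency expresses the same rule declaratively: the downstream task is only scheduled when the upstream one reports success.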

Soda vs. Elementary

Soda and Elementary serve different points in the pipeline, and understanding the distinction prevents both overlap and gaps:

Soda sits outside dbt. It can verify data before dbt touches it. It runs as a standalone Python process against warehouse tables, independent of any dbt project. This makes it the right tool for the post-load, pre-transformation boundary.

Elementary runs inside dbt. It’s a dbt package that executes during dbt test or dbt run. It catches changes during the transformation run — schema changes, volume anomalies, freshness anomalies — but the data has already been loaded and dbt has already started processing.

The coverage differences:

Capability	Soda Contracts	Elementary
Pre-dbt verification	Yes	No
Schema validation	Yes (contract-based)	Yes (baseline comparison)
Volume anomalies	Limited (row_count check)	Yes (statistical, Z-score)
Freshness monitoring	Yes	Yes
Column-level anomaly detection	Limited	Yes (full statistical)
dbt integration	External, via orchestration	Native dbt package
Historical baselines	No (rule-based)	Yes (training period)

Using both gives you coverage at two enforcement points: Soda catches problems at the warehouse boundary before transformation starts, and Elementary catches anomalies during transformation that rule-based checks miss. They’re complementary, not competing.
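The difference between a rule-based volume check and a statistical one can be shown with a small sketch. It is not Elementary’s or Soda’s implementation: the function names, the daily row counts, the minimum of one row, and the threshold of 3 standard deviations are all assumptions chosen for illustration.

```python
from statistics import mean, stdev


def rule_based_ok(row_count, minimum=1):
    """Fixed-threshold rule: pass when the table has at least `minimum` rows."""
    return row_count >= minimum


def z_score(history, today):
    """Standard score of today's value against a training window of past values."""
    return (today - mean(history)) / stdev(history)


# Hypothetical daily row counts from a training period, then a sharp drop.
history = [1000, 1020, 980, 1010, 990]
today = 400

passes_rule = rule_based_ok(today)  # the table is not empty, so the rule passes
score = z_score(history, today)     # far below the historical mean
is_anomaly = abs(score) > 3         # threshold of 3 is an assumption
```

Here the fixed rule passes because the table still has rows, while the Z-score flags the drop as far outside the historical range. That is the kind of gap the second enforcement point is meant to cover.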

When to Adopt Soda Contracts

For most teams, dbt-expectations on sources provides sufficient post-load validation without adding another tool. The tests run during dbt build, catch structural and content drift, and integrate with your existing dbt testing workflow.

Soda contracts earn their place when:

You need a hard gate between loading and transformation — not “test during dbt” but “verify before dbt starts”
Multiple dbt projects consume the same source tables, and you want centralized verification instead of redundant source tests
You’re implementing ODCS and want a runtime engine that executes quality rules from your contract specification
Your pipeline orchestrator supports task dependencies and you want explicit pass/fail gates between pipeline stages

If none of these apply, dbt-expectations source tests remain sufficient. Soda contracts are the next step in the layered enforcement model once a pre-dbt gate, centralized multi-project verification, or ODCS integration becomes a requirement.

Soda Cloud vs. OSS

Soda Core (the open-source library) handles contract verification, schema checks, and basic quality rules. It runs wherever Python runs and stores no state — each verification is independent.

Soda Cloud adds historical tracking (how has this contract’s pass rate changed over time?), team collaboration features, incident management, and integrations with data catalogs and alerting systems. The pricing is usage-based.

For teams that just need the pre-dbt gate, Soda Core is sufficient. The Cloud layer becomes valuable when you want visibility into contract compliance trends across your organization — particularly useful when making the organizational case for expanding contract coverage.