A data contract is a formal, versioned agreement between data producers and consumers that defines the expected structure, semantics, quality standards, and delivery guarantees of a dataset. The concept originated with Andrew Jones at GoCardless in 2020 and has since matured into a Linux Foundation-backed specification.
Jones’s framing centers on the idea of databases being treated as “non-consensual APIs.” ELT tools connect to production databases to extract data for analytics, but the engineers who own those databases never agreed to provide data in that format, at that frequency, or with any guarantees about stability. A software engineer optimizing a production table has no reason to think about the analytics pipeline downstream — not until a contract makes that dependency explicit.
## The Core Question
In practice, a data contract answers one question: what can I depend on about this data?
That question covers several dimensions:
- **Structure.** Which columns exist, what their types are, and whether the schema is stable.
- **Semantics.** What each field means. Is `revenue` gross or net? Does `created_at` use UTC or local time?
- **Quality.** Acceptable null rates, value ranges, freshness guarantees.
- **Delivery.** How often the data updates. What the SLA is for availability.
- **Ownership.** Who is responsible when the data doesn't meet the agreement.
Without contracts, the answers to these questions are implicit. They exist as tribal knowledge, buried in Slack threads, or — most commonly — nowhere. The analytics engineer discovers the answers through failure: a model breaks, and the investigation reveals that an upstream team changed something three days ago without telling anyone.
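The five dimensions above map naturally onto the sections of a contract file. A hypothetical sketch (the section layout and field names here are illustrative, not taken from any specific standard):

```yaml
# Hypothetical contract sketch -- section and field names are illustrative
dataset: payments.seller_payments
structure:
  columns:
    - {name: amount, type: decimal, nullable: false}
    - {name: currency, type: string, nullable: false}
semantics:
  amount: "Gross charge amount, before fees"
  created_at: "Event timestamp, UTC"
quality:
  max_null_rate: {currency: 0.0}
  freshness: "updated within 24h"
delivery:
  frequency: hourly
ownership:
  team: payments-platform
  escalation: "#payments-data"
```

Each dimension becomes something a machine can check and a person can be paged about, rather than a question answered by asking around.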
## How Contracts Differ from What You Already Have
If you’re using dbt, you likely have schema tests, data quality checks, and possibly model contracts. Data contracts as a broader concept are not a replacement for any of these — they’re an additional layer that operates at a different point in the data lifecycle.
Schema tests (dbt generic tests like `unique`, `not_null`, `accepted_values`) validate column properties after a model builds. They're reactive and structural: you find out about the problem after the bad data has already landed in your warehouse.
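For reference, these generic tests are declared in a model's properties YAML. A standard dbt `schema.yml` sketch (model and column names are illustrative):

```yaml
# models/schema.yml -- model and column names are illustrative
version: 2
models:
  - name: stg_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["completed", "failed", "refunded"]
```

These run when you execute `dbt test`, after the model has already been built.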
Data quality checks (null rate monitoring, value range validation, distribution analysis via tools like Elementary or dbt-expectations) are also reactive but cover content rather than structure. They tell you something is wrong with the data itself, not just the schema.
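To make the distinction concrete, here is a minimal sketch of the kind of reactive content check such tools run, written in plain Python (the thresholds, field names, and sample rows are illustrative):

```python
# Sketch of reactive data quality checks: they run AFTER data has landed
# and report violations rather than preventing them. Thresholds and field
# names are illustrative.

def check_null_rate(rows, field, max_null_rate):
    """Return (passed, observed_rate) for a null-rate check on one field."""
    if not rows:
        return True, 0.0
    nulls = sum(1 for r in rows if r.get(field) is None)
    rate = nulls / len(rows)
    return rate <= max_null_rate, rate

def check_value_range(rows, field, lo, hi):
    """Return (passed, offending_rows) for a value-range check; ignores nulls."""
    bad = [r for r in rows
           if r.get(field) is not None and not (lo <= r[field] <= hi)]
    return len(bad) == 0, bad

# Sample landed data with two problems: a null currency and a negative amount.
rows = [
    {"amount": 12.5, "currency": "USD"},
    {"amount": -3.0, "currency": None},
]
ok_nulls, rate = check_null_rate(rows, "currency", max_null_rate=0.0)
ok_range, bad = check_value_range(rows, "amount", lo=0, hi=1_000_000)
```

Both checks can only tell you the data is already wrong; neither stops the producer from shipping the change that caused it.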
Data contracts are proactive. They establish agreements before data flows, so that violations are caught or prevented at the source rather than discovered downstream. A contract might specify that a payments event will always include `amount`, `currency`, and `customer_id` fields, with `amount` as a positive decimal and `currency` as a three-letter ISO code. If a software engineer tries to deploy a change that removes the `currency` field, the contract enforcement blocks the deployment: the analytics pipeline never sees the breaking change.
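That enforcement step can be sketched as a CI check that diffs the producer's proposed schema against the contract before deployment. This is a simplified illustration (real implementations typically hook into schema registries or CI pipelines), with the contract contents mirroring the payments example above:

```python
# Sketch of contract enforcement at the source: a CI step compares the
# producer's proposed schema against the contract and fails the build on
# breaking changes, before any data flows. Simplified illustration.

CONTRACT = {
    "amount": "decimal",      # positive decimal per the contract
    "currency": "string",     # three-letter ISO code
    "customer_id": "string",
}

def breaking_changes(contract, proposed_schema):
    """Return a list of violations: removed fields or changed types."""
    violations = []
    for field, ftype in contract.items():
        if field not in proposed_schema:
            violations.append(f"removed required field: {field}")
        elif proposed_schema[field] != ftype:
            violations.append(
                f"type change on {field}: {ftype} -> {proposed_schema[field]}")
    return violations

# A deployment that drops `currency` is blocked before data ever flows.
proposed = {"amount": "decimal", "customer_id": "string"}
violations = breaking_changes(CONTRACT, proposed)
deploy_allowed = not violations
```

The key property is where the check runs: in the producer's deployment path, not in the consumer's warehouse.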
As Soda’s documentation puts it: “dbt contracts are best understood as schema contracts for transformations. They protect downstream models within a dbt DAG, providing valuable local safety. But they are not designed to function as full data contracts across the data lifecycle.”
## The Three Layers Together
You need all three layers — they aren’t competing approaches:
| Layer | When it acts | What it catches | Example |
|---|---|---|---|
| Data contracts | Before data flows | Breaking changes at the source | Producer renames a column; contract blocks deployment |
| Schema tests | After model builds | Structural violations in transformed data | Primary key has duplicates after a join |
| Data quality checks | After model builds | Content anomalies in transformed data | Revenue drops 40% versus historical baseline |
Contracts prevent certain categories of problems from occurring in the first place. Tests catch what contracts don’t cover and what slips through despite contracts. The combination is what gives you genuine confidence in your data.
## A Minimal Contract Example
Using the Open Data Contract Standard (ODCS) format:
```yaml
apiVersion: v3.1.0
kind: DataContract
id: 53581432-6c55-4ba2-a65f-72344a91553a
name: seller_payments_v1
version: 1.1.0
status: active
domain: seller
dataProduct: payments
description:
  purpose: Views built on top of the seller tables.
```

If you've written dbt YAML, this format feels familiar. The full specification covers more than dbt contracts handle (SLAs, ownership, pricing, governance metadata), which explains both the power and the added complexity of adopting the full standard versus dbt's focused implementation.
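For comparison, dbt's narrower model contract is also declared in YAML. A standard dbt contract sketch (dbt 1.5+; the model and column names are illustrative):

```yaml
# dbt model contract -- enforces column names and types when the model
# builds. Model and column names are illustrative.
version: 2
models:
  - name: seller_payments
    config:
      contract:
        enforced: true
    columns:
      - name: amount
        data_type: numeric
      - name: currency
        data_type: varchar
```

With `enforced: true`, dbt fails the build if the model's output doesn't match the declared columns and types, which is exactly the "local safety within a dbt DAG" scope described above.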
## The Origin Story
Jones conceived the idea at GoCardless, a direct debit payments processor, where schema changes in one service would silently break consumers of that data. He published the first description in April 2021 and a detailed engineering post in December 2021. Within six months of implementation, GoCardless had deployed around 30 contracts covering 50-60% of their async communication events.
Chad Sanderson, then Head of Data at Convoy, catalyzed broader adoption with his August 2022 post “The Rise of Data Contracts,” framing the problem as the “GIGO Cycle”: producers have no idea how their data is used downstream, engineers lack incentive to maintain data quality beyond operational needs, and data engineers become firefighting intermediaries.
By 2025, two books had been published (Jones's *Driving Data Quality with Data Contracts* and Sanderson, Freeman, and Schmidt's *Data Contracts*), dbt had shipped native contract enforcement, and the specifications had started to consolidate under the Linux Foundation.
The concept progressed from fringe experimentation (2021-2022) through mainstream discussion (2023-2024) to early-majority adoption (2025-2026). Tooling and standards are mature; the remaining challenge is organizational, not technical.