A data contract is a formal, versioned agreement between data producers and consumers that defines the expected structure, semantics, quality standards, and delivery guarantees of a dataset. The concept originated with Andrew Jones at GoCardless in 2020 and has since matured into a Linux Foundation-backed specification.
Jones’s framing centers on the idea of databases being treated as “non-consensual APIs.” ELT tools connect to production databases to extract data for analytics, but the engineers who own those databases never agreed to provide data in that format, at that frequency, or with any guarantees about stability. A software engineer optimizing a production table has no reason to think about the analytics pipeline downstream — not until a contract makes that dependency explicit.
## The Core Question
In practice, a data contract answers one question: what can I depend on about this data?
That question covers several dimensions:
- **Structure.** Which columns exist, what their types are, and whether the schema is stable.
- **Semantics.** What each field means. Is `revenue` gross or net? Does `created_at` use UTC or local time?
- **Quality.** Acceptable null rates, value ranges, freshness guarantees.
- **Delivery.** How often the data updates. What the SLA is for availability.
- **Ownership.** Who is responsible when the data doesn't meet the agreement.
Without contracts, the answers to these questions are implicit. They exist as tribal knowledge, buried in Slack threads, or — most commonly — nowhere. The analytics engineer discovers the answers through failure: a model breaks, and the investigation reveals that an upstream team changed something three days ago without telling anyone.
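The five dimensions above map naturally onto the sections of a contract file. A hypothetical sketch (the section layout and field names here are illustrative, not taken from any specific standard):

```yaml
# Hypothetical contract sketch -- section and field names are illustrative
dataset: payments.seller_payments
structure:
  columns:
    - {name: amount, type: decimal, nullable: false}
    - {name: currency, type: string, nullable: false}
semantics:
  amount: "Gross charge amount, before fees"
  created_at: "Event timestamp, UTC"
quality:
  max_null_rate: {currency: 0.0}
  freshness: "updated within 24h"
delivery:
  frequency: hourly
ownership:
  team: payments-platform
  escalation: "#payments-data"
```

Each dimension becomes something a machine can check and a person can be paged about, rather than a question answered by asking around.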
## How Contracts Differ from What You Already Have
If you’re using dbt, you likely have schema tests, data quality checks, and possibly model contracts. Data contracts as a broader concept are not a replacement for any of these — they’re an additional layer that operates at a different point in the data lifecycle.
Schema tests (dbt generic tests like `unique`, `not_null`, `accepted_values`) validate column properties after a model builds. They're reactive and structural: you find out about the problem after the bad data has already landed in your warehouse.
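For reference, these generic tests are declared in a model's properties YAML. A standard dbt `schema.yml` sketch (model and column names are illustrative):

```yaml
# models/schema.yml -- model and column names are illustrative
version: 2
models:
  - name: stg_payments
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ["completed", "failed", "refunded"]
```

These run when you execute `dbt test`, after the model has already been built.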
Data quality checks (null rate monitoring, value range validation, distribution analysis via tools like Elementary or dbt-expectations) are also reactive but cover content rather than structure. They tell you something is wrong with the data itself, not just the schema.
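To make the distinction concrete, here is a minimal sketch of the kind of reactive content check such tools run, written in plain Python (the thresholds, field names, and sample rows are illustrative):

```python
# Sketch of reactive data quality checks: they run AFTER data has landed
# and report violations rather than preventing them. Thresholds and field
# names are illustrative.

def check_null_rate(rows, field, max_null_rate):
    """Return (passed, observed_rate) for a null-rate check on one field."""
    if not rows:
        return True, 0.0
    nulls = sum(1 for r in rows if r.get(field) is None)
    rate = nulls / len(rows)
    return rate <= max_null_rate, rate

def check_value_range(rows, field, lo, hi):
    """Return (passed, offending_rows) for a value-range check; ignores nulls."""
    bad = [r for r in rows
           if r.get(field) is not None and not (lo <= r[field] <= hi)]
    return len(bad) == 0, bad

# Sample landed data with two problems: a null currency and a negative amount.
rows = [
    {"amount": 12.5, "currency": "USD"},
    {"amount": -3.0, "currency": None},
]
ok_nulls, rate = check_null_rate(rows, "currency", max_null_rate=0.0)
ok_range, bad = check_value_range(rows, "amount", lo=0, hi=1_000_000)
```

Both checks can only tell you the data is already wrong; neither stops the producer from shipping the change that caused it.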
Data contracts are proactive. They establish agreements before data flows, so that violations are caught or prevented at the source rather than discovered downstream. A contract might specify that a payments event will always include `amount`, `currency`, and `customer_id` fields, with `amount` as a positive decimal and `currency` as a three-letter ISO code. If a software engineer tries to deploy a change that removes the `currency` field, the contract enforcement blocks the deployment: the analytics pipeline never sees the breaking change.
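That enforcement step can be sketched as a CI check that diffs the producer's proposed schema against the contract before deployment. This is a simplified illustration (real implementations typically hook into schema registries or CI pipelines), with the contract contents mirroring the payments example above:

```python
# Sketch of contract enforcement at the source: a CI step compares the
# producer's proposed schema against the contract and fails the build on
# breaking changes, before any data flows. Simplified illustration.

CONTRACT = {
    "amount": "decimal",      # positive decimal per the contract
    "currency": "string",     # three-letter ISO code
    "customer_id": "string",
}

def breaking_changes(contract, proposed_schema):
    """Return a list of violations: removed fields or changed types."""
    violations = []
    for field, ftype in contract.items():
        if field not in proposed_schema:
            violations.append(f"removed required field: {field}")
        elif proposed_schema[field] != ftype:
            violations.append(
                f"type change on {field}: {ftype} -> {proposed_schema[field]}")
    return violations

# A deployment that drops `currency` is blocked before data ever flows.
proposed = {"amount": "decimal", "customer_id": "string"}
violations = breaking_changes(CONTRACT, proposed)
deploy_allowed = not violations
```

The key property is where the check runs: in the producer's deployment path, not in the consumer's warehouse.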
As Soda’s documentation puts it: “dbt contracts are best understood as schema contracts for transformations. They protect downstream models within a dbt DAG, providing valuable local safety. But they are not designed to function as full data contracts across the data lifecycle.”
## The Three Layers Together
You need all three layers — they aren’t competing approaches:
| Layer | When it acts | What it catches | Example |
|---|---|---|---|
| Data contracts | Before data flows | Breaking changes at the source | Producer renames a column; contract blocks deployment |
| Schema tests | After model builds | Structural violations in transformed data | Primary key has duplicates after a join |
| Data quality checks | After model builds | Content anomalies in transformed data | Revenue drops 40% versus historical baseline |
Contracts prevent certain categories of problems from occurring in the first place. Tests catch what contracts don’t cover and what slips through despite contracts. The combination is what gives you genuine confidence in your data.
## A Minimal Contract Example
Using the Open Data Contract Standard (ODCS) format:
```yaml
apiVersion: v3.1.0
kind: DataContract
id: 53581432-6c55-4ba2-a65f-72344a91553a
name: seller_payments_v1
version: 1.1.0
status: active
domain: seller
dataProduct: payments
description:
  purpose: Views built on top of the seller tables.
```

If you've written dbt YAML, this format feels familiar. The full specification covers more than dbt contracts handle (SLAs, ownership, pricing, governance metadata), which explains both the power and the added complexity of adopting the full standard versus dbt's focused implementation.
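For comparison, dbt's narrower model contract is also declared in YAML. A standard dbt contract sketch (dbt 1.5+; the model and column names are illustrative):

```yaml
# dbt model contract -- enforces column names and types when the model
# builds. Model and column names are illustrative.
version: 2
models:
  - name: seller_payments
    config:
      contract:
        enforced: true
    columns:
      - name: amount
        data_type: numeric
      - name: currency
        data_type: varchar
```

With `enforced: true`, dbt fails the build if the model's output doesn't match the declared columns and types, which is exactly the "local safety within a dbt DAG" scope described above.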
## The Origin Story
Jones conceived the idea at GoCardless, a direct debit payments processor, where schema changes in one service would silently break consumers of that data. He published the first description in April 2021 and a detailed engineering post in December 2021. Within six months of implementation, GoCardless had deployed around 30 contracts covering 50-60% of their async communication events.
Chad Sanderson, then Head of Data at Convoy, catalyzed broader adoption with his August 2022 post “The Rise of Data Contracts,” framing the problem as the “GIGO Cycle”: producers have no idea how their data is used downstream, engineers lack incentive to maintain data quality beyond operational needs, and data engineers become firefighting intermediaries.
By 2025, two books had been published (Jones's *Driving Data Quality with Data Contracts* and Sanderson, Freeman, and Schmidt's *Data Contracts*), dbt had shipped native contract enforcement, and the specifications had started to consolidate under the Linux Foundation.
The concept progressed from fringe experimentation (2021-2022) through mainstream discussion (2023-2024) to early-majority adoption (2025-2026). Tooling and standards are mature; the remaining challenge is organizational, not technical.