Data quality breaks into three distinct problems that operate at different points in the data lifecycle, require different tools, and catch different categories of failures. Treating “data quality” as a single concern produces either over-engineered solutions or gaps where entire categories of issues go undetected.
## The Three Layers
### Layer 1: Proactive Prevention (Contracts)
Data contracts act before data flows. They establish agreements between producers and consumers that prevent certain categories of problems from occurring at all.
A contract might specify that a payments event always includes amount as a positive decimal, currency as a three-letter ISO code, and customer_id as a non-null string. If a software engineer tries to deploy a change that removes the currency field or changes amount to a string, the contract enforcement blocks the deployment. The analytics pipeline never sees the breaking change.
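To make this concrete, here is a minimal sketch of such a contract, loosely following the Data Contract Specification format that Data Contract CLI consumes (the id, owner, and exact keys are illustrative, not copied from a real spec):

```yaml
# Illustrative contract sketch in the style of the Data Contract
# Specification; key names and spec version may vary.
dataContractSpecification: 1.1.0
id: payments-events            # hypothetical contract id
info:
  title: Payments Events
  version: 1.0.0
  owner: payments-team         # hypothetical producer team
models:
  payments:
    type: table
    fields:
      amount:
        type: decimal
        required: true
      currency:
        type: string
        required: true
        pattern: "^[A-Z]{3}$"  # three-letter ISO 4217 code
      customer_id:
        type: string
        required: true
```

A CI step can then validate proposed changes against this file (Data Contract CLI, for example, provides commands to test data against a contract and to detect breaking changes between contract versions) and block the deployment before consumers are affected.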
What this layer catches:
- Schema-breaking changes at the source
- Field removals, type changes, renames
- SLA violations (data too stale, pipeline down)
- Unauthorized changes to agreed-upon data structures
What this layer misses:
- Data that’s structurally correct but semantically wrong (amount is a positive decimal, but it’s in the wrong currency)
- Gradual quality degradation (null rates creeping up within tolerance)
- Transformation bugs in your own pipeline
- Novel anomalies with no prior pattern
Tools: ODCS contracts, Schema Registry (Kafka, PubSub), Data Contract CLI, Gable.ai
### Layer 2: Reactive Validation (Tests)
Schema tests and data quality checks validate data after it lands — after extraction, after transformation, after loading. They’re reactive: they detect problems that already exist rather than preventing them.
This layer has two sub-layers:
Structural validation checks whether the data’s shape is correct. Primary keys are unique. Foreign keys reference valid parents. Required fields are populated. Categorical fields contain expected values. These are the generic tests you apply to every model.
```yaml
models:
  - name: mrt__finance__payments
    columns:
      - name: payment_id
        data_tests:
          - unique
          - not_null
      - name: customer_id
        data_tests:
          - relationships:
              to: ref('mrt__core__customers')
              field: customer_id
      - name: status
        data_tests:
          - accepted_values:
              values: ['pending', 'completed', 'failed', 'refunded']
```

Content validation checks whether the data’s values are reasonable. Revenue is positive. Dates aren’t in the future. Email addresses match a regex. Conversion rates fall within expected ranges. These require domain knowledge to define and typically use packages like dbt-expectations:
```yaml
columns:
  - name: amount
    data_tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 0
          strictly: true
      - dbt_expectations.expect_column_mean_to_be_between:
          min_value: 10
          max_value: 10000
  - name: email
    data_tests:
      - dbt_expectations.expect_column_values_to_match_regex:
          regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
          row_condition: "email IS NOT NULL"
```

What this layer catches:
- Structural violations (duplicates, nulls, orphaned records)
- Business rule violations (invalid values, impossible combinations)
- Transformation bugs that produce incorrect output
- Cross-table inconsistencies
What this layer misses:
- Problems that haven’t been anticipated (you only test for what you think to test for)
- Gradual drift that stays within thresholds
- Upstream changes that produce valid but different data
- Problems that don’t violate any explicit rule
Tools: dbt generic tests, dbt unit tests, dbt-expectations, dbt-utils, Great Expectations, Soda
### Layer 3: Anomaly Detection (Monitoring)
Anomaly detection catches what you didn’t think to test for. Instead of defining explicit rules, you train a baseline from historical data and alert when current data deviates significantly.
Elementary is the most common dbt-native tool for this layer. It uses Z-score statistics: if a metric falls more than N standard deviations from its historical mean, it fires an alert.
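Stated abstractly, for a monitored metric value $x$ with mean $\mu$ and standard deviation $\sigma$ computed over the training window, the alert condition is:

$$
z = \frac{x - \mu}{\sigma}, \qquad \text{alert when } |z| > \text{sensitivity}
$$

In the configuration below, `anomaly_sensitivity: 3` sets that threshold to three standard deviations, and `training_period` controls how much history feeds the mean and standard deviation.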
```yaml
models:
  - name: mrt__finance__payments
    tests:
      - elementary.volume_anomalies:
          time_bucket:
            period: day
            count: 1
      - elementary.freshness_anomalies
    columns:
      - name: amount
        tests:
          - elementary.column_anomalies:
              column_anomalies:
                - average
                - null_count
                - zero_count
              anomaly_sensitivity: 3
              training_period:
                period: day
                count: 14
```

What this layer catches:
- Volume drops or spikes (a source stops sending data, or sends 10x more than usual)
- Distribution shifts (average order value suddenly drops 40%)
- Freshness degradation (data that normally updates hourly hasn’t updated in 6 hours)
- Schema drift (new columns appearing, columns disappearing)
- Seasonal anomalies that static thresholds miss
What this layer misses:
- Slow, gradual changes that fall within the training window
- First-occurrence problems with no baseline to compare against
- Issues in newly created datasets with insufficient history
- Problems where the anomalous state becomes the new normal (the baseline adapts)
Tools: Elementary, Monte Carlo, Bigeye, Anomalo, Metaplane
## Why You Need All Three
Each layer has blind spots that another layer covers:
| Scenario | Contracts | Tests | Anomaly Detection |
|---|---|---|---|
| Producer renames a column | Catches it | Catches it (after the fact) | Catches it (schema change) |
| Someone deploys a bug that produces negative revenue | Misses it (schema is valid) | Catches it (range check) | Catches it (distribution shift) |
| A source silently stops sending 30% of records | Misses it (no schema change) | Misses it (no rule violation per row) | Catches it (volume anomaly) |
| Null rate creeps from 1% to 4% over two months | Misses it (within tolerance) | Misses it (within threshold) | May catch it (depends on training window) |
| New field added upstream without notice | Catches it (contract violation) | Misses it (tests only check existing fields) | Catches it (schema change detection) |
The three-layer model maps to different organizational roles:
- Contracts require cross-team coordination (producers and consumers agree on terms)
- Tests require domain knowledge (analytics engineers define business rules)
- Anomaly detection requires initial setup but then runs autonomously
## The Maturity Path
Most teams should build these layers in a different order than they were presented above: tests first, then anomaly detection, then contracts.
Start with tests. Add unique and not_null to every primary key. Add relationships on critical foreign keys. Install dbt-expectations for range and pattern checks. This is day-one work with immediate value.
Add anomaly detection. Install Elementary and enable volume, freshness, and column anomaly tests on your most critical models. This catches the “unknown unknowns” that explicit tests miss. Setup takes 2-5 days.
Then add contracts. Start with dbt’s native model contracts on public-facing mart models. Expand to ODCS contracts when you need to formalize agreements with teams outside your dbt project. This is the highest-effort, highest-value layer, and it requires organizational buy-in beyond the data team.
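As a concrete starting point, a dbt model contract is plain schema configuration. A minimal sketch, reusing the payments model from the earlier examples (data types are warehouse-specific):

```yaml
# Minimal sketch of a dbt-native model contract. Data types are
# warehouse-specific, and every column the model emits must be
# declared once the contract is enforced.
models:
  - name: mrt__finance__payments
    config:
      contract:
        enforced: true   # dbt fails the build on any mismatch between
                         # this declared schema and the model's output
    columns:
      - name: payment_id
        data_type: string
        constraints:
          - type: not_null
      - name: customer_id
        data_type: string
        constraints:
          - type: not_null
      - name: amount
        data_type: numeric
      - name: currency
        data_type: string
```

With `enforced: true`, dbt refuses to build the model if its output drifts from the declared column names and types, which gives downstream consumers a stable interface.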
This ordering, with contracts as the most powerful layer but the last adopted, reflects a practical constraint: contracts require cross-team coordination, which requires organizational credibility, which requires demonstrated data quality practices. Teams that start with contracts before establishing tests and monitoring often fail because they cannot demonstrate the value they are asking other teams to invest in.