Note

Data Quality Validation Layers

The three-layer model for data quality — proactive contracts, reactive schema tests, and anomaly detection — and why you need all three.

Planted
dbt · data quality · data engineering

Data quality breaks into three distinct problems that operate at different points in the data lifecycle, require different tools, and catch different categories of failures. Treating “data quality” as a single concern produces either over-engineered solutions or gaps where entire categories of issues go undetected.

The Three Layers

Layer 1: Proactive Prevention (Contracts)

Data contracts act before data flows. They establish agreements between producers and consumers that prevent certain categories of problems from occurring at all.

A contract might specify that a payments event always includes amount as a positive decimal, currency as a three-letter ISO code, and customer_id as a non-null string. If a software engineer tries to deploy a change that removes the currency field or changes amount to a string, the contract enforcement blocks the deployment. The analytics pipeline never sees the breaking change.
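Such an agreement can be written down as a machine-readable contract. A minimal sketch in the spirit of ODCS (the Open Data Contract Standard); the exact field names vary by spec version, so treat them as illustrative:

```yaml
# Illustrative contract for the payments event described above.
# Field names loosely follow ODCS conventions -- check them against
# the version of the spec you adopt before relying on this shape.
apiVersion: v3.0.0
kind: DataContract
id: payments-event
status: active
schema:
  - name: payments_event
    properties:
      - name: amount
        logicalType: number
        required: true
        description: Positive decimal amount.
      - name: currency
        logicalType: string
        required: true
        description: Three-letter ISO 4217 code.
      - name: customer_id
        logicalType: string
        required: true
slaProperties:
  - property: frequency   # data must arrive at least daily
    value: 1
    unit: d
```

With a contract like this wired into CI on the producer side, removing `currency` or retyping `amount` fails the producer's deployment rather than the consumer's pipeline.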

What this layer catches:

  • Schema-breaking changes at the source
  • Field removals, type changes, renames
  • SLA violations (data too stale, pipeline down)
  • Unauthorized changes to agreed-upon data structures

What this layer misses:

  • Data that’s structurally correct but semantically wrong (amount is a positive decimal, but it’s in the wrong currency)
  • Gradual quality degradation (null rates creeping up within tolerance)
  • Transformation bugs in your own pipeline
  • Novel anomalies with no prior pattern

Tools: ODCS contracts, Schema Registry (Kafka, PubSub), Data Contract CLI, Gable.ai

Layer 2: Reactive Validation (Tests)

Schema tests and data quality checks validate data after it lands — after extraction, after transformation, after loading. They’re reactive: they detect problems that already exist rather than preventing them.

This layer has two sub-layers:

Structural validation checks whether the data’s shape is correct. Primary keys are unique. Foreign keys reference valid parents. Required fields are populated. Categorical fields contain expected values. These are the generic tests you apply to every model.

```yaml
models:
  - name: mrt__finance__payments
    columns:
      - name: payment_id
        data_tests:
          - unique
          - not_null
      - name: customer_id
        data_tests:
          - relationships:
              to: ref('mrt__core__customers')
              field: customer_id
      - name: status
        data_tests:
          - accepted_values:
              values: ['pending', 'completed', 'failed', 'refunded']
```

Content validation checks whether the data’s values are reasonable. Revenue is positive. Dates aren’t in the future. Email addresses match a regex. Conversion rates fall within expected ranges. These require domain knowledge to define and typically use packages like dbt-expectations:

```yaml
columns:
  - name: amount
    data_tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 0
          strictly: true
      - dbt_expectations.expect_column_mean_to_be_between:
          min_value: 10
          max_value: 10000
  - name: email
    data_tests:
      - dbt_expectations.expect_column_values_to_match_regex:
          regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
          row_condition: "email IS NOT NULL"
```

What this layer catches:

  • Structural violations (duplicates, nulls, orphaned records)
  • Business rule violations (invalid values, impossible combinations)
  • Transformation bugs that produce incorrect output
  • Cross-table inconsistencies

What this layer misses:

  • Problems that haven’t been anticipated (you only test for what you think to test for)
  • Gradual drift that stays within thresholds
  • Upstream changes that produce valid but different data
  • Problems that don’t violate any explicit rule

Tools: dbt generic tests, dbt unit tests, dbt-expectations, dbt-utils, Great Expectations, Soda

Layer 3: Anomaly Detection (Monitoring)

Anomaly detection catches what you didn’t think to test for. Instead of defining explicit rules, you train a baseline from historical data and alert when current data deviates significantly.

Elementary is the most common dbt-native tool for this layer. It uses Z-score statistics: if a metric falls more than N standard deviations from its historical mean, it fires an alert.
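Concretely, for a monitored metric with historical mean $\mu$ and standard deviation $\sigma$ over the training window, the Z-score of the current value $x$ is

\[
z = \frac{x - \mu}{\sigma}
\]

and an alert fires when $|z|$ exceeds the configured sensitivity threshold.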

```yaml
models:
  - name: mrt__finance__payments
    tests:
      - elementary.volume_anomalies:
          time_bucket:
            period: day
            count: 1
      - elementary.freshness_anomalies
    columns:
      - name: amount
        tests:
          - elementary.column_anomalies:
              column_anomalies:
                - average
                - null_count
                - zero_count
              anomaly_sensitivity: 3
              training_period:
                period: day
                count: 14
```

What this layer catches:

  • Volume drops or spikes (a source stops sending data, or sends 10x more than usual)
  • Distribution shifts (average order value suddenly drops 40%)
  • Freshness degradation (data that normally updates hourly hasn’t updated in 6 hours)
  • Schema drift (new columns appearing, columns disappearing)
  • Seasonal anomalies that static thresholds miss

What this layer misses:

  • Slow, gradual changes that fall within the training window
  • First-occurrence problems with no baseline to compare against
  • Issues in newly created datasets with insufficient history
  • Problems where the anomalous state becomes the new normal (the baseline adapts)

Tools: Elementary, Monte Carlo, Bigeye, Anomalo, Metaplane

Why You Need All Three

Each layer has blind spots that another layer covers:

| Scenario | Contracts | Tests | Anomaly Detection |
| --- | --- | --- | --- |
| Producer renames a column | Catches it | Catches it (after the fact) | Catches it (schema change) |
| Someone deploys a bug that produces negative revenue | Misses it (schema is valid) | Catches it (range check) | Catches it (distribution shift) |
| A source silently stops sending 30% of records | Misses it (no schema change) | Misses it (no rule violation per row) | Catches it (volume anomaly) |
| Null rate creeps from 1% to 4% over two months | Misses it (within tolerance) | Misses it (within threshold) | May catch it (depends on training window) |
| New field added upstream without notice | Catches it (contract violation) | Misses it (tests only check existing fields) | Catches it (schema change detection) |

The three-layer model maps to different organizational roles:

  • Contracts require cross-team coordination (producers and consumers agree on terms)
  • Tests require domain knowledge (analytics engineers define business rules)
  • Anomaly detection requires initial setup but then runs autonomously

The Maturity Path

Most teams should build these layers in close to the opposite order from how they were introduced above: tests first, then anomaly detection, then contracts last:

Start with tests. Add unique and not_null to every primary key. Add relationships on critical foreign keys. Install dbt-expectations for range and pattern checks. This is day-one work with immediate value.

Add anomaly detection. Install Elementary and enable volume, freshness, and column anomaly tests on your most critical models. This catches the “unknown unknowns” that explicit tests miss. Setup takes 2-5 days.

Then add contracts. Start with dbt’s native model contracts on public-facing mart models. Expand to ODCS contracts when you need to formalize agreements with teams outside your dbt project. This is the highest-effort, highest-value layer, and it requires organizational buy-in beyond the data team.
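With dbt's native contracts, enforcement is a model-level config plus typed columns. A minimal sketch for the payments mart (column list abbreviated):

```yaml
models:
  - name: mrt__finance__payments
    config:
      contract:
        enforced: true
    columns:
      - name: payment_id
        data_type: string
        constraints:
          - type: not_null
      - name: amount
        data_type: numeric
      - name: currency
        data_type: string
        constraints:
          - type: not_null
```

With the contract enforced, the build fails if the model's output no longer matches the declared columns and types. Note that which column constraints are actually enforced, rather than merely documented in metadata, depends on the target warehouse.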

The reverse ordering — contracts are the most powerful layer but adopted last — reflects a practical constraint: contracts require cross-team coordination, which requires organizational credibility, which requires demonstrated data quality practices. Teams that start with contracts before establishing tests and monitoring often fail because they cannot demonstrate the value they are asking other teams to invest in.