Structural tests verify that your data has the right shape — primary keys are unique, foreign keys reference valid parents, required fields aren’t null. Anomaly detection catches statistical deviations from historical patterns. But between these two layers sits a gap: business rules that are domain-specific, deterministic, and invisible to both structural checks and statistical methods.
A status column that contains "active" where the business expects "Active" passes every structural test. An email column filled with "test@test.com" passes not_null. A contract with a signed_date in the year 2087 passes every type check. These are semantic violations — the data is structurally valid but meaningfully wrong.
Regex Pattern Validation
The most common semantic validation encodes format constraints as regular expressions. dbt-expectations provides expect_column_values_to_match_regex for this purpose.
```yaml
columns:
  - name: customer__email
    data_tests:
      - dbt_expectations.expect_column_values_to_match_regex:
          regex: "^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$"
          row_condition: "customer__email IS NOT NULL"
  - name: product__sku
    data_tests:
      - dbt_expectations.expect_column_values_to_match_regex:
          regex: "^[A-Z]{2}-[0-9]{4}-[A-Z]{3}$"
  - name: customer__phone
    data_tests:
      - dbt_expectations.expect_column_values_to_match_regex:
          regex: "^\\+?[1-9]\\d{1,14}$"
```
The row_condition parameter is essential for nullable columns. Without it, NULL values are tested against the regex and fail, which produces false positives when NULLs are acceptable (the field is optional) but malformed values are not.
Regex tests are deterministic, fast, and easy to reason about. They work well for:
- Email formats where you want to catch obviously invalid entries like "n/a" or "none" without building a full RFC 5322 parser
- Product codes and SKUs with known structures like XX-1234-ABC
- Phone numbers in E.164 or regional formats
- Postal codes with country-specific patterns
- Currency codes (three uppercase letters matching ISO 4217)
The limitation is that regex validates format, not content. An email like nobody@example.com passes the regex but isn’t a real person’s address. For content-level validation, you need different tools.
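One step toward content-level validation is a set-exclusion test for placeholder values you already know about, like the test@test.com example above. A minimal sketch using dbt_utils.not_accepted_values; the blocklist here is an assumption to adapt to your own domain:

```yaml
columns:
  - name: customer__email
    data_tests:
      # Flag known placeholder addresses; valid format, invalid content.
      - dbt_utils.not_accepted_values:
          values: ['test@test.com', 'nobody@example.com', 'n/a', 'none']
```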
Range and Distribution Checks
Business rules often constrain values to expected ranges that aren’t captured by data types alone. A NUMERIC column for revenue allows negative values that the business considers invalid. A FLOAT column for conversion rates should never exceed 1.0, but the type doesn’t enforce that.
```yaml
columns:
  - name: order__revenue
    data_tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 0
          strictly: true
      - dbt_expectations.expect_column_mean_to_be_between:
          min_value: 10
          max_value: 10000
  - name: conversion_rate
    data_tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: 0
          max_value: 1
  - name: event__timestamp
    data_tests:
      - dbt_expectations.expect_column_values_to_be_between:
          min_value: "2020-01-01"
          max_value: "{{ dbt.dateadd('day', 1, dbt.current_timestamp()) }}"
```
The distinction between row-level range checks (expect_column_values_to_be_between) and aggregate checks (expect_column_mean_to_be_between) matters. Row-level checks catch individual bad values. Aggregate checks catch distribution shifts where every individual value might be valid but the overall pattern is wrong. A revenue column where every value is between $0 and $100,000 can still have a problem if the average drops from $500 to $50.
Cross-Column Validation
Some business rules span multiple columns. An end_date must be after start_date. A discount_amount can’t exceed subtotal. A shipping_status of "delivered" requires a non-null delivery_date.
dbt-utils provides expression_is_true for arbitrary SQL assertions:
```yaml
models:
  - name: mrt__sales__orders
    data_tests:
      - dbt_utils.expression_is_true:
          expression: "end_date >= start_date"
          config:
            where: "end_date IS NOT NULL AND start_date IS NOT NULL"
      - dbt_utils.expression_is_true:
          expression: "discount_amount <= subtotal"
      - dbt_utils.expression_is_true:
          expression: >
            (shipping_status != 'delivered')
            OR (delivery_date IS NOT NULL)
```
For complex multi-table assertions or rules that don’t fit neatly into a single SQL expression, singular tests (standalone SQL files in the tests/ directory) provide full flexibility. The query returns rows that violate the rule; zero rows means a pass.
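As a sketch of that pattern, the singular test below flags orders whose total refunds exceed the order total, a rule that spans two tables. The refunds model (mrt__sales__refunds) and the column names are hypothetical:

```sql
-- tests/assert_refunds_do_not_exceed_order_total.sql
-- Hypothetical models and columns. The test passes when zero rows return.
select
    o.order_id,
    o.order_total,
    sum(r.refund_amount) as total_refunded
from {{ ref('mrt__sales__orders') }} as o
inner join {{ ref('mrt__sales__refunds') }} as r
    on r.order_id = o.order_id
group by o.order_id, o.order_total
having sum(r.refund_amount) > o.order_total
```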
AI-Powered Validation
Elementary introduced ai_data_validation tests that use natural language prompts to express business rules:
```yaml
tests:
  - elementary.ai_data_validation:
      prompt: "There should be no contract date in the future"
  - elementary.ai_data_validation:
      prompt: "Every order with status 'shipped' should have a tracking number"
  - elementary.ai_data_validation:
      prompt: "Customer names should not contain numeric characters"
```
The test sends data samples to an LLM with the natural language prompt and interprets the response as pass or fail. This is genuinely useful for business rules that are hard to express in SQL: rules that involve fuzzy matching, context-dependent logic, or domain knowledge that an LLM can approximate.
The trade-offs are significant:
- Non-deterministic. The same data might produce different results on different runs, depending on the LLM’s interpretation. This breaks the baseline expectation that running a test twice against the same data returns the same answer.
- Cost. Each test execution incurs LLM API costs, which scale with data volume and test count.
- Latency. LLM calls are orders of magnitude slower than SQL-based validation.
- Opaque reasoning. When the test fails, you can’t inspect the logic the way you can with a regex or SQL expression.
AI validation is best treated as a complement to deterministic tests, not a replacement. Use it for exploratory validation — discovering rules that should eventually be encoded as SQL — or for genuinely fuzzy checks where deterministic encoding is impractical. Don’t use it as a substitute for a regex when a regex would work.
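For instance, once the first prompt above is confirmed to reflect a real rule, "no contract date in the future" graduates to a deterministic assertion. A minimal sketch using expression_is_true, with the contracts model name assumed:

```yaml
models:
  - name: contracts  # hypothetical model name
    data_tests:
      # Deterministic replacement for the AI prompt
      # "There should be no contract date in the future".
      - dbt_utils.expression_is_true:
          expression: "signed_date <= current_date"
```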
Building a Semantic Test Strategy
The practical approach to semantic validation follows a priority order:
First, structural semantics. Primary keys, foreign keys, not-null constraints, accepted values. These are the generic tests that every model should have. They catch the most common violations with the least configuration effort.
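In dbt terms, these are the built-in and generic tests. A minimal sketch, with the parent model and the accepted value set assumed:

```yaml
models:
  - name: mrt__sales__orders
    columns:
      - name: order_id
        data_tests:
          - unique
          - not_null
      - name: customer_id
        data_tests:
          - relationships:
              to: ref('mrt__sales__customers')  # hypothetical parent model
              field: customer_id
      - name: shipping_status
        data_tests:
          - accepted_values:
              values: ['pending', 'shipped', 'delivered']
```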
Second, format semantics. Regex patterns for columns with known structures. Apply these to any column where invalid formats have caused downstream issues or where the source is known to produce malformed data.
Third, range semantics. Value bounds and distribution checks for numeric and date columns. Focus on columns that feed metrics, reports, or ML models where incorrect values have measurable business impact.
Fourth, relational semantics. Cross-column rules and multi-table assertions for business logic that spans fields. These require domain knowledge to define and are typically informed by past incidents.
Fifth, exploratory semantics. AI-powered validation for rules that are hard to formalize or for initial discovery of data quality patterns. Graduate validated rules to deterministic SQL tests once the pattern is confirmed.
Targeted coverage — focused on columns where a semantic violation has measurable downstream impact — is more practical than exhaustive coverage. A customer__email column feeding a marketing automation system warrants regex validation; a notes free-text column with no structured downstream use generally does not.