The Observe & Fix pattern, described by Baruch Jacob, embeds data quality remediation directly in the dbt DAG. The base layer detects problems and exposes them through tests. Intermediate and mart layers apply the fixes. Remediation logic is explicit, testable, and version-controlled — distinct from retry mechanisms in the orchestrator or external tools.
The core idea
Separate detection from remediation. Base models and their tests catch the problem. Downstream models apply the fix. This keeps each layer’s responsibility clear: base models are a faithful representation of the source (with problems included), and downstream models are where you decide what to do about those problems.
The alternative, fixing data quality issues in the base model, obscures what the source actually looks like. If your base model silently drops null records, you lose visibility into how often the source sends nulls. If you coalesce bad values in the base layer, downstream models can’t distinguish “the source sent a valid value” from “the source sent garbage and we patched it.” The observe-fix pattern preserves that distinction.
Common remediation techniques
Deduplication
Source systems sometimes send duplicate records. CRM systems with replication lag, event streams with at-least-once delivery, API responses that include overlapping time windows. Rather than deduplicating in the base model (which hides the problem), let the base model pass through all records, add a test that detects duplicates, and deduplicate in the intermediate layer.
The base model keeps all records:
```sql
-- base__crm__contacts.sql
SELECT
    id AS contact__id,
    email AS contact__email,
    updated_at AS contact__updated_at,
    _loaded_at
FROM {{ source('crm', 'contacts') }}
```

A test detects duplicates in the base model:
```yaml
models:
  - name: base__crm__contacts
    data_tests:
      - dbt_utils.recency:
          datepart: day
          field: _loaded_at
          interval: 2
    columns:
      - name: contact__id
        data_tests:
          - unique
          - not_null
```

The intermediate model deduplicates:
```sql
-- int__crm__contacts_deduplicated.sql
SELECT *
FROM {{ ref('base__crm__contacts') }}
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY contact__id
    ORDER BY contact__updated_at DESC
) = 1
```

The test on the base model fires when duplicates appear, giving you visibility. The intermediate model handles them gracefully regardless. If the source fixes its duplication problem, the intermediate model still works correctly (the QUALIFY clause is a no-op when there are no duplicates).
Null coalescing
Some columns arrive null when they shouldn’t. Rather than filtering these records out at the base layer (losing data) or failing the pipeline (blocking downstream models), coalesce with sensible defaults in the intermediate or mart layer.
```sql
-- int__orders__enriched.sql
SELECT
    order__id,
    COALESCE(order__currency, 'USD') AS order__currency,
    COALESCE(order__status, 'unknown') AS order__status,
    order__total,
    order__created_at
FROM {{ ref('base__payments__orders') }}
```

The choice of default matters. 'unknown' for a status field is honest: it tells downstream consumers "we don't know." Coalescing to a specific valid value like 'pending' is dangerous because it introduces data that looks real but isn't. The default should be visibly a default, not something that could be mistaken for actual source data.
Add a test at the base layer to track how often the coalescing kicks in:
```yaml
models:
  - name: base__payments__orders
    columns:
      - name: order__currency
        data_tests:
          - not_null:
              config:
                severity: warn
```

Using `severity: warn` instead of `severity: error` means the pipeline continues (the downstream model handles the null), but you still get visibility through Elementary's alert routing.
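To keep the default honest over time, you can also pin the set of allowed values on the intermediate model so that any new status slipping through the coalesce fails loudly. A sketch using dbt's built-in `accepted_values` test; the status values other than 'unknown' are illustrative:

```yaml
models:
  - name: int__orders__enriched
    columns:
      - name: order__status
        data_tests:
          - accepted_values:
              # 'unknown' is the documented coalesce default;
              # the other values are hypothetical examples
              values: ['pending', 'paid', 'shipped', 'unknown']
```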
Invalid record filtering
Some records are genuinely invalid and should be excluded from downstream analysis. Records with impossible dates, negative quantities where only positive values make sense, test records from QA environments. Filter these in the intermediate layer with explicit, testable conditions.
```sql
-- int__events__validated.sql
SELECT *
FROM {{ ref('base__analytics__events') }}
WHERE event__timestamp IS NOT NULL
  AND event__timestamp <= CURRENT_TIMESTAMP()
  AND event__timestamp >= '2020-01-01'
  AND user__id IS NOT NULL
```

Each WHERE condition is a remediation decision. Document them, either in the model's description or as inline comments, so the next person understands why future-dated events are excluded.
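One way to record those decisions is in the model's YAML description, next to the tests that enforce them. A sketch; the stated reasons are illustrative, not taken from a real source system:

```yaml
models:
  - name: int__events__validated
    description: >
      Events filtered to analyzable records. Excludes events with
      missing timestamps, future-dated timestamps (hypothetically,
      client clock skew), events before 2020-01-01 (hypothetically,
      pre-launch test data), and events without a user id.
```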
Why this beats silent remediation
The observe-fix pattern has three advantages over approaches that fix problems silently.
Visibility. When the base layer test fires a warning, you know the source is misbehaving. If you silently fix things in the base model, the problem could persist for months before anyone notices the source is degrading. The test gives you a leading indicator.
Testability. Each remediation technique (deduplication, coalescing, filtering) is a SQL pattern you can unit test independently. dbt’s unit testing framework lets you verify that the deduplication logic picks the right record, that the coalescing uses the right default, that the filtering conditions match your intent. Remediation hidden in retry logic or external tools is much harder to test.
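For example, the deduplication logic from earlier can be pinned down with a dbt unit test (a sketch assuming dbt 1.8+'s `unit_tests` syntax; the test name and sample rows are illustrative):

```yaml
unit_tests:
  - name: dedup_keeps_most_recent_record
    model: int__crm__contacts_deduplicated
    given:
      - input: ref('base__crm__contacts')
        rows:
          # two versions of the same contact; only the newer should survive
          - {contact__id: 1, contact__email: "old@example.com", contact__updated_at: "2024-01-01"}
          - {contact__id: 1, contact__email: "new@example.com", contact__updated_at: "2024-01-02"}
    expect:
      rows:
        - {contact__id: 1, contact__email: "new@example.com", contact__updated_at: "2024-01-02"}
```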
Auditability. If a stakeholder asks “why does this customer have currency ‘USD’ when they’re in Europe?”, you can trace it: the base model shows the source sent NULL for currency, the intermediate model coalesced it to USD per the documented default. The data lineage tells the complete story.
The limits
The observe-fix pattern handles data quality problems that are predictable and rule-based. Duplicate records, null values, out-of-range dates. You know what the problem looks like in advance and you write SQL to handle it.
It doesn’t handle novel failure modes: a file format change, a new column that breaks a join, an encoding issue that corrupts text data. Those need external remediation, either through the higher levels of the self-healing spectrum or through human investigation.
It also doesn’t replace proper data quality validation. The observe-fix pattern is about graceful degradation: the pipeline keeps producing output when the source has known issues. It’s not a substitute for comprehensive testing that catches problems you haven’t anticipated.
The pattern works best as one layer in a defense-in-depth approach. Retries handle transient errors. Schema drift adaptation handles structural changes. The observe-fix pattern handles data content issues. Anomaly detection catches the things you didn’t think to handle explicitly. Each layer covers different failure modes, and the combination handles the majority of production incidents without AI involvement or manual intervention.