ServicesAboutNotesContact Get in touch →
EN FR
Note

dbt documentation drift detection

Techniques for detecting when dbt documentation falls out of sync with reality — column-level drift, git-based staleness signals, and schema drift for sources

Planted
dbtdata qualityautomation

Coverage metrics tell you whether descriptions exist. Drift detection tells you whether they’re still accurate. A project can report 95% documentation coverage while half those descriptions silently describe the wrong thing because the underlying models changed.

Drift takes three forms, each requiring a different detection approach: columns that exist in the warehouse but not in YAML, SQL that changed without corresponding YAML updates, and source schemas that shifted upstream.

Column-level drift

The most concrete form of drift: the warehouse has columns that your YAML doesn’t know about, or your YAML describes columns that no longer exist.

dbt-checkpoint at commit time

The check-model-has-all-columns hook from dbt-checkpoint catches this during development. It compares the columns in your compiled model against the columns in your YAML file and blocks the commit if they don’t match:

.pre-commit-config.yaml
repos:
- repo: https://github.com/dbt-checkpoint/dbt-checkpoint
rev: v2.0.6
hooks:
- id: dbt-parse
- id: check-model-has-all-columns
files: ^models/marts

The files: regex scopes this to marts only. Requiring all-column documentation for every staging model is usually too aggressive — those models change frequently and serve as internal implementation details.

Most dbt-checkpoint hooks require a fresh manifest.json, which is why dbt-parse runs first. This adds a few seconds to each commit but catches column drift that would otherwise surface in code review (or never surface at all).

dbt-osmosis for project-wide audit

For a project-wide check beyond what pre-commit hooks catch, dbt-osmosis can run in audit mode:

Terminal window
dbt-osmosis yaml audit

This exits non-zero if any model’s YAML is out of sync with the warehouse — columns added, removed, or reordered. Running it as a scheduled CI job (not just on PRs) catches drift that accumulates between pull requests, like when someone modifies a source table outside the dbt workflow or when a model’s compiled output changes due to an upstream dependency change.

The refactor command (dbt-osmosis yaml refactor) goes further and actually fixes the drift by adding missing columns, removing stale ones, and propagating descriptions. But for detection purposes, yaml audit gives you the signal without making changes.

Git-based staleness

A subtler form of drift happens when model SQL changes but its documentation doesn’t. The YAML might describe the correct columns, but the descriptions are outdated because the transformation logic changed.

Consider a mart model mrt__marketing__customer_ltv.sql that was modified in five PRs since its schema YAML was last touched. There’s a reasonable chance the descriptions are outdated — a column that was “total revenue” might now be “total revenue net of refunds” after a SQL change, but the description still says the old thing.

You can detect this with a Git-based comparison. A script that compares the last-modified dates of .sql files against their corresponding YAML entries flags models where the SQL changed more recently than the documentation:

#!/bin/bash
# Find models where SQL was modified more recently than schema YAML
for sql_file in models/**/*.sql; do
model_name=$(basename "$sql_file" .sql)
yaml_file=$(dirname "$sql_file")/schema.yml
if [ -f "$yaml_file" ]; then
sql_date=$(git log -1 --format="%at" -- "$sql_file")
yaml_date=$(git log -1 --format="%at" -- "$yaml_file")
if [ "$sql_date" -gt "$yaml_date" ]; then
echo "STALE: $model_name (SQL updated $(date -d @$sql_date +%Y-%m-%d), YAML last touched $(date -d @$yaml_date +%Y-%m-%d))"
fi
fi
done

This is a heuristic, not a proof of staleness. A SQL change that only reformats whitespace does not make documentation stale. Models with the largest gap between SQL and YAML modification dates are the most likely candidates for stale descriptions. Running this as a weekly scheduled job and posting results to Slack or a dashboard keeps staleness visible without blocking any workflow.

Limitations of the git approach

The script above treats each schema YAML file as a unit. If your YAML files contain multiple models (which is common), a change to one model’s section updates the file timestamp for all models in that file. This means the script can miss staleness for models that share a YAML file with a recently-documented model.

More precise detection would parse the YAML and compare per-model sections, but the simple file-level comparison catches the most common cases and is trivial to set up.

Schema drift for sources

Source tables sit outside your dbt project’s control. Someone on the application team adds a column to the users table, changes a column type, or drops a field — and your dbt source definitions don’t know about it.

Source freshness as a proxy signal

dbt source freshness doesn’t directly detect schema changes, but it provides a useful proxy. If a source hasn’t been refreshed on schedule, the data flowing through your models may not match what the documentation describes. Freshness failures are a signal to investigate whether the source schema changed too.

Dedicated schema drift detection

dbt_schema_drift detects when source schemas change upstream. It compares the current warehouse schema against your declared source definitions and flags differences — new columns, removed columns, changed types.

This is particularly important for sources because:

  • Source changes happen outside your team’s PR process
  • There’s no pre-commit hook that can catch external schema changes
  • The gap between a source schema change and its discovery can be weeks or months
  • Documentation written against the old schema becomes misleading immediately

Running schema drift detection as a scheduled CI job (daily or on each pipeline run) gives you early warning when sources change. When drift is detected, you can update source YAML, propagate changes downstream with dbt-osmosis, and review whether any descriptions need updating.

Layering drift detection

No single tool catches all forms of drift. A practical setup combines them:

Drift typeToolWhen it runsWhat it catches
Column mismatchdbt-checkpointPre-commitColumns added/removed in changed models
Column mismatchdbt-osmosis auditScheduled CIProject-wide column drift
Description stalenessGit date comparisonScheduled (weekly)SQL changes without YAML updates
Source schema changesdbt_schema_driftScheduled CIUpstream schema changes

Pre-commit hooks catch drift as it is introduced. Scheduled jobs catch drift that accumulates between PRs or from external schema changes.