
dbt documentation automation strategy

A graduated approach to automating dbt documentation freshness — from a single pre-commit hook to comprehensive drift detection, coverage tracking, and AI remediation


This guide covers a graduated approach to dbt documentation automation. Each layer addresses a different failure mode. Prerequisites: a working dbt project with a manifest.json.

Layer 1: Prevent new gaps (any team size)

Install dbt-checkpoint with check-model-has-description as a pre-commit hook. This single step prevents new undocumented models from entering the project without touching existing debt.

.pre-commit-config.yaml
repos:
  - repo: https://github.com/dbt-checkpoint/dbt-checkpoint
    rev: v2.0.6
    hooks:
      - id: dbt-parse
      - id: check-model-has-description

New models get descriptions enforced at commit time. Existing models are unaffected. This is the minimum viable documentation automation — no infrastructure required beyond the hook configuration.

The hook requires a fresh manifest.json (hence dbt-parse running first), which adds a few seconds to each commit.

FINN, a global car subscription company, adopted this approach and reported that automated checks eliminated the trade-off between developer velocity and data quality. Documentation standards didn’t depend on reviewers catching every gap manually.

Layer 2: Measure and enforce (growing projects)

Add dbt-project-evaluator to CI. The coverage metrics from fct_documentation_coverage give you a baseline and show whether you’re gaining or losing ground.

dbt_project.yml
vars:
  dbt_project_evaluator:
    documentation_coverage_target: 100

Start with severity set to “warn” so it doesn’t block deployments while you address existing gaps. This gives you visibility without friction. Once coverage is at a reasonable level, switch to “error” to prevent backsliding.
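The warn-then-error progression can be sketched as a test-severity override in dbt_project.yml. This follows dbt's standard config hierarchy; check the dbt-project-evaluator docs for the exact resource path in your version:

```yaml
# dbt_project.yml — downgrade package tests while clearing existing gaps
tests:
  dbt_project_evaluator:
    +severity: warn  # flip to 'error' once coverage reaches your target
```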

For more granular control, add dbt_meta_testing to enforce documentation requirements per folder:

dbt_project.yml
models:
  your_project:
    marts:
      +required_docs: true

This reflects the practical reality that mart models (which analysts query directly) need thorough documentation, while staging models (internal implementation details) can be documented more lightly.

Add coverage tracking

Layer coverage tracking on top of enforcement. Record the dbt-coverage or dbt-project-evaluator output from each CI run so you can spot trends. A project losing 1-2% coverage per month has a process problem that individual PR checks won’t catch.
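The recording step can be a short script in CI. A minimal sketch, assuming you compute coverage directly from manifest.json (the node fields below match dbt's manifest schema; the CSV filename is an arbitrary choice):

```python
import csv
import json
from datetime import date


def doc_coverage(manifest: dict) -> float:
    """Fraction of models in a dbt manifest with a non-empty description."""
    models = [
        node for node in manifest["nodes"].values()
        if node["resource_type"] == "model"
    ]
    if not models:
        return 1.0
    documented = [m for m in models if (m.get("description") or "").strip()]
    return len(documented) / len(models)


def record(manifest_path: str, log_path: str = "coverage_history.csv") -> float:
    """Append today's coverage to a CSV so trends are visible across runs."""
    with open(manifest_path) as f:
        cov = doc_coverage(json.load(f))
    with open(log_path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), f"{cov:.4f}"])
    return cov
```

Plotting that CSV (or just diffing the last few rows in CI) is enough to surface the slow 1-2%-per-month erosion that per-PR checks miss.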

Layer 3: Address existing debt (projects with documentation gaps)

Use doc blocks to define common column descriptions once and reference them everywhere. The 10-15 column names that appear across the most models (customer_id, created_at, order_id) account for the bulk of duplicated and inconsistent descriptions.
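A doc block lives in any .md file inside your models directory; the customer_id wording here is illustrative:

```markdown
{% docs customer_id %}
Unique identifier for a customer. Stable across orders; sourced from the
customers table and safe to join on anywhere it appears.
{% enddocs %}
```

Every schema file then references it instead of restating it:

```yaml
columns:
  - name: customer_id
    description: '{{ doc("customer_id") }}'
```

Updating the block updates the description everywhere it is referenced, which removes the inconsistency problem outright.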

Combine with dbt-osmosis to propagate descriptions from parent models. When you document customer_id in your base model, osmosis copies that description to every downstream model that uses the same column. Teams report this saves 50-80% of documentation time for new models.

Terminal window
# Propagate descriptions through lineage and fix missing columns
dbt-osmosis yaml refactor

Running osmosis as a pre-commit hook keeps YAML files in sync with your actual schema going forward. It catches the common drift problem: someone adds a column, the YAML doesn’t get updated, documentation slowly diverges from reality.
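One way to wire this up, sketched as a local pre-commit hook that wraps the CLI directly (dbt-osmosis also ships its own hook definitions; check its docs for the exact id and arguments before relying on this):

```yaml
# .pre-commit-config.yaml — assumption: local hook, dbt-osmosis on PATH
repos:
  - repo: local
    hooks:
      - id: dbt-osmosis-refactor
        name: dbt-osmosis yaml refactor
        entry: dbt-osmosis yaml refactor
        language: system
        files: \.(sql|yml)$
        pass_filenames: false
```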

Layer 4: Comprehensive automation (mature projects)

Add drift detection to catch documentation that exists but has become stale:

  • dbt-osmosis yaml audit for project-wide column drift
  • Git-based date comparison for SQL that changed without YAML updates
  • dbt_schema_drift for upstream source changes
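The git-based date comparison can be sketched in a few lines. The pairing convention below (orders.sql next to orders.yml) is an assumption; adjust it to your project's layout. Last-commit dates come from `git log -1 --format=%cI -- <path>` and are fed in as a plain mapping so the comparison itself stays testable:

```python
from pathlib import PurePosixPath


def stale_yaml(last_commit: dict[str, str]) -> list[str]:
    """Return model names whose .sql changed more recently than the paired .yml.

    last_commit maps file path -> ISO-8601 timestamp of its last commit,
    e.g. the output of: git log -1 --format=%cI -- models/marts/orders.sql
    ISO-8601 strings with the same UTC offset compare correctly as strings.
    """
    stale = []
    for path, sql_date in last_commit.items():
        if not path.endswith(".sql"):
            continue
        yml_date = last_commit.get(path[:-4] + ".yml")
        # A missing YAML file is also drift: the model has no schema entry.
        if yml_date is None or sql_date > yml_date:
            stale.append(PurePosixPath(path).stem)
    return sorted(stale)
```

A CI step that fails (or comments on the PR) when this list is non-empty turns silent divergence into a visible review item.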

Layer dbt-coverage or dbt-score into CI for trend tracking with PR-level visibility. The --cov-format markdown flag generates coverage tables you can post as PR comments.

Consider AI-assisted generation for bulk remediation. One practical pattern is a scheduled job that opens PRs for review: the automation flags models with missing or thin descriptions, generates drafts, and submits them through the normal review process. No documentation merges without human approval, but the human starts from a draft instead of a blank field.
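The flag-draft-review loop starts with finding the thin descriptions. A minimal sketch of the flagging step, again reading manifest.json (the 20-character threshold is an arbitrary assumption; tune it to your style guide):

```python
import json


def thin_models(manifest: dict, min_chars: int = 20) -> list[str]:
    """Model names whose description is missing or shorter than min_chars."""
    flagged = []
    for node in manifest["nodes"].values():
        if node["resource_type"] != "model":
            continue
        description = (node.get("description") or "").strip()
        if len(description) < min_chars:
            flagged.append(node["name"])
    return sorted(flagged)


def load_flagged(manifest_path: str = "target/manifest.json") -> list[str]:
    with open(manifest_path) as f:
        return thin_models(json.load(f))
```

A scheduled job would feed this list to the draft generator and open one PR per batch; the generation and PR-creation steps depend on your tooling and are omitted here.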

AI works best as an accelerator, not a replacement. LLMs generate solid first drafts for the mechanical parts — column types, basic relationships, data lineage descriptions. Humans add the business context that makes documentation actually useful: why this metric exists, what decisions it drives, and which edge cases to watch for.

The full stack

At maturity, the layers look like this:

| Layer | Tools | Failure mode addressed |
| --- | --- | --- |
| Pre-commit | dbt-checkpoint, dbt-osmosis | New gaps, column drift on changed models |
| CI enforcement | dbt-project-evaluator, dbt_meta_testing | Project-wide coverage requirements |
| Coverage tracking | dbt-coverage, dbt-score | Trend erosion over time |
| Drift detection | dbt-osmosis audit, git comparison, dbt_schema_drift | Stale descriptions, source changes |
| AI remediation | dbt Copilot, Claude Code, scheduled generation | Bulk gap filling with review |

Documentation maintenance becomes a mostly automated process with humans reviewing and refining rather than writing from scratch.

Notes on coverage targets

100% coverage is not always the right target. Stale documentation causes more damage than missing documentation, so accuracy matters more than completeness. Automation enforces whatever standards are set — a documentation style guide should define those standards before automation is applied.