AI documentation tools for dbt describe what the SQL does, not what the data means to the business. Generic LLMs without grounding produce descriptions that may misrepresent business logic. Research by Chelli et al. (2024) found GPT-3.5 hallucinated 39.6% of references in medical literature reviews, with GPT-4 at 28.6%. Without grounding in business context, AI-generated dbt descriptions tend to restate column names rather than explain business meaning.
The real definitions live scattered across Slack threads, Jira tickets, PRDs, and people’s heads. What does “active” mean for your business? Is amount net or gross? Does status include draft records? A commenter on Data Engineering Weekly captured this well for SAP environments: “A field like PRCTR behaves differently across company codes. Certain document types in ACDOCA need to be excluded for specific reporting scenarios. That knowledge is tacit.”
RAG (Retrieval-Augmented Generation) is the most promising approach to bridging this gap.
The Difference Business Context Makes
Without business context:
```yaml
columns:
  - name: customer__segment
    description: "The segment of the customer"
```

With RAG pulling from internal knowledge bases:
```yaml
columns:
  - name: customer__segment
    description: >
      Categorization based on the 2024 Marketing Tier logic defined in
      PRD-102, distinguishing between self-serve and enterprise-managed
      accounts.
```

The first description tells you nothing. The second tells you where the logic comes from, what version it follows, and what the segments represent. This is the kind of documentation that helps both humans and AI agents understand your data. And the cycle matters: better documentation makes AI tools more effective when they read your schema descriptions to generate or review code. Poor descriptions lead to hallucinated column references and misunderstood business logic.
A RAG Implementation for dbt
Sravani Kakaraparthi documented a practical implementation in December 2025. The approach has three components:
- A parser that walks the `models/` directory and identifies columns with missing or placeholder descriptions
- A RAG agent scoped to internal knowledge bases (PRDs, Slack archives, wikis, data dictionaries) that retrieves relevant context for each column
- An injector that writes the generated descriptions back into `schema.yml` files
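The parser step is the simplest of the three. A minimal sketch, assuming each `schema.yml` has already been loaded into a Python dict (e.g. with PyYAML); the function name and placeholder list are illustrative, not from the original implementation:

```python
# Strings that count as "missing" -- an assumed heuristic, not from
# the original implementation. Tune this list for your project.
PLACEHOLDERS = {"", "TODO", "TBD"}

def find_undocumented_columns(schema: dict) -> list[tuple[str, str]]:
    """Return (model_name, column_name) pairs whose description
    is absent or a known placeholder."""
    missing = []
    for model in schema.get("models", []):
        for column in model.get("columns", []):
            desc = (column.get("description") or "").strip()
            if desc in PLACEHOLDERS:
                missing.append((model["name"], column["name"]))
    return missing
```

Each returned pair becomes a retrieval query for the RAG agent; the injector then writes the generated description back into the same dict and re-serializes it.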
In her case, the RAG agent used Glean (an enterprise search tool), but the principle matters more than the specific tool. Any RAG setup that can query your internal docs works: Glean, a custom vector database, Notion AI, or even a curated collection of Markdown files indexed by your preferred embedding model.
The key insight is scoping. A general-purpose LLM with access to the internet hallucinates because it’s generating plausible-sounding definitions from its training data. A RAG agent scoped to your internal knowledge bases retrieves actual definitions from the places where your business logic is documented. The LLM still does the writing, but it writes from your context instead of inventing context.
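To make scoped retrieval concrete, here is a toy sketch in which keyword overlap stands in for the embedding search a real stack (Glean, a vector database) would use; the corpus shape, scoring, and prompt wording are all illustrative assumptions:

```python
def retrieve_context(column_name: str, corpus: list[dict], top_k: int = 1) -> list[dict]:
    """Rank internal-doc snippets by naive keyword overlap with the
    column name. A real pipeline would use embeddings instead."""
    terms = set(column_name.lower().replace("__", "_").split("_"))
    scored = [
        (len(terms & set(doc["text"].lower().split())), doc)
        for doc in corpus
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:top_k] if score > 0]

def build_prompt(column_name: str, snippets: list[dict]) -> str:
    """Constrain the LLM to retrieved context instead of training data."""
    context = "\n".join(f"[{s['source']}] {s['text']}" for s in snippets)
    return (
        f"Using ONLY the context below, write a dbt description "
        f"for column `{column_name}`.\n\nContext:\n{context}"
    )
```

The "ONLY the context below" instruction is the scoping in prompt form: the model writes from retrieved definitions rather than inventing plausible-sounding ones.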
The Simpler Alternative: Business Context in CLAUDE.md
If setting up a full RAG pipeline feels premature — and for most teams under a hundred models, it is — a simpler alternative gets you part of the way.
Copy relevant PRD sections, data dictionary entries, or business glossary definitions into your CLAUDE.md file. Claude Code reads this file before every interaction, so business definitions you include there will inform the documentation it generates.
For example, adding this to your CLAUDE.md:
```markdown
## Business Definitions

- **Active customer**: Placed an order in the last 90 days. Does NOT include trial accounts.
- **LTV**: Net revenue (after refunds) over the customer's lifetime. Excludes shipping.
- **Customer segment**: Based on 2024 Marketing Tier logic (PRD-102). Self-serve vs enterprise-managed.
- **Amount fields**: Always net of tax unless column name includes `_gross`.
```

This isn't automated. You're manually curating the business context. But it's effective for teams where:
- The number of models is manageable (under 100)
- The business definitions don’t change frequently
- You don’t have enterprise search tooling like Glean in place
- You want to start getting value now, not after a RAG pipeline is built
The trade-off is clear: the CLAUDE.md approach doesn’t scale, and definitions can go stale if nobody maintains them. But “manually curated and mostly accurate” is dramatically better than “no business context at all.”
Data Profiling as Additional Context
Beyond business definitions, feeding AI tools data profiling results improves description accuracy. Column statistics — min, max, distinct counts, distributions — give the AI concrete information about what the data actually looks like.
A column with only three distinct values is probably categorical, and the description should say so. A numeric column where every value falls between 0 and 1 is probably a rate or percentage. An ID column with 50,000 distinct values out of 50,000 rows is probably a primary key.
Ground the AI in your data alongside the SQL. You can do this manually (run a profiling query and paste the results into your prompt) or automate it as part of your documentation pipeline. Either way, profiling results reduce hallucination because the AI describes what it can see rather than what it imagines.
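As a sketch of what that profiling step can look like, using an in-memory SQLite table as a stand-in for your warehouse; the stats match those named above, but the thresholds in the hint function are assumptions to tune:

```python
import sqlite3

def profile_column(conn: sqlite3.Connection, table: str, column: str) -> dict:
    """Collect the basic stats described above: row count, distinct
    count, min, and max. Identifiers are interpolated directly here,
    so validate them against your schema in real use."""
    row = conn.execute(
        f"SELECT COUNT(*), COUNT(DISTINCT {column}), "
        f"MIN({column}), MAX({column}) FROM {table}"
    ).fetchone()
    return {"rows": row[0], "distinct": row[1], "min": row[2], "max": row[3]}

def hint_from_profile(p: dict) -> str:
    """Turn stats into a hint for the documentation prompt, mirroring
    the heuristics in the text (thresholds are illustrative)."""
    if p["rows"] and p["distinct"] == p["rows"]:
        return "likely a primary key (all values distinct)"
    if p["distinct"] <= 5:
        return f"likely categorical ({p['distinct']} distinct values)"
    if isinstance(p["min"], (int, float)) and 0 <= p["min"] and p["max"] <= 1:
        return "values fall in [0, 1]; likely a rate or percentage"
    return "no strong structural signal"
```

Paste the stats and the hint into the documentation prompt alongside the model SQL, and the AI describes the column it can see rather than the one it imagines.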
When to Graduate to Full RAG
The CLAUDE.md approach is a starting point. Consider building a proper RAG pipeline when:
- Your project exceeds 100 models and manual curation can’t keep up
- Business definitions change frequently (quarterly reorgs, evolving product categories)
- Multiple teams contribute to documentation and need consistent business context
- You already have enterprise search or knowledge management tooling that can serve as a retrieval backend
- Documentation quality directly affects downstream AI workflows (code generation, data analysis) and the cost of hallucinated descriptions is high
The graduation path is incremental. Start with CLAUDE.md for the most critical definitions. Add data profiling to your documentation prompts. When the manual approach starts feeling unsustainable, build the retrieval pipeline. You don’t need to solve the full RAG problem on day one — you need business context in whatever form is sustainable for your team today.