
AI Judgment Failures in dbt Development

The category of mistakes AI makes in dbt projects that aren't syntax errors — wrong joins, rebuilt existing assets, wrong layer sourcing — and why they require business context that no prompt can fully provide.

Tags: dbt · claude code · data engineering · ai · data quality

AI judgment failures in dbt development are distinct from SQL execution failures (temporal filter inconsistencies, wrong join keys). Judgment failures are architectural: the generated code is syntactically valid and conventionally structured, but wrong for the specific project, data, or business logic. They surface after code reaches production, when numbers do not match expectations.

The Recce team documented this category in a case study of letting Claude Code build an entire dbt project end-to-end — Snowflake ingestion through base models, intermediates, and marts. The AI followed naming conventions and CTE patterns precisely; failures occurred where conventions do not reach.

The Three Failure Types

Wrong Join Type

Claude Code chose inner joins where left joins were needed. The result: records with null foreign keys were silently filtered out. The mart looked correct — it had the right columns, the right grain, the right aggregations. It was just missing rows. Revenue totals were understated. Customer counts were understated. There was no error message, no test failure, nothing to alert you to the problem unless you already knew what the correct numbers should be.

This is the most dangerous failure mode because it’s self-hiding. An inner join that drops null rows produces a clean, plausible dataset. The model compiles. The tests pass (unless you have a test that validates against expected row counts). The downstream dashboards load without errors. The numbers are wrong.

The AI chose inner joins because inner joins are the conventional choice in SQL when you’re confident both sides of the join will match. It didn’t ask: in this business, is it possible for an order to have no customer? Is it possible for a session to have no associated user? The answer to those questions requires knowing the business — what nulls mean in this specific data model — not just the SQL.
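As a sketch of why this hides so well, assume a hypothetical `orders` table whose `customer_id` can be null (say, guest checkout) joined to a `customers` table. Both queries compile and return plausible numbers; only domain knowledge tells you the first is wrong:

```sql
-- Inner join: orders with a null customer_id silently vanish,
-- so total revenue is understated. No error, no warning.
select sum(o.amount) as total_revenue
from orders o
join customers c on o.customer_id = c.customer_id;

-- Left join: every order survives; customer columns are simply
-- null for guest orders, which downstream logic can handle explicitly.
select sum(o.amount) as total_revenue
from orders o
left join customers c on o.customer_id = c.customer_id;
```

The table and column names here are illustrative, not from the case study; the point is that the two queries differ by a single keyword and only the business question "can an order lack a customer?" decides between them.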

Rebuilding Existing Assets

The same case study found that Claude Code rebuilt a date dimension that already existed in the project. It generated a complete, functional dim_date model — different from the existing one in subtle ways (different date range, different column naming, different set of fiscal calendar fields). Now the project had two date dimensions.

This is a project awareness failure, not a SQL failure. The AI knew how to build a date dimension. It didn’t know that one already existed because project inventory isn’t encoded in the schema YAML of individual models. Unless the CLAUDE.md explicitly says “the date dimension is at marts/common/dim_date — do not recreate it,” the agent has no way to know it shouldn’t generate one.

The solution is explicit CLAUDE.md documentation of existing shared assets. But this requires the practitioner to anticipate which assets are at risk of being recreated — which means understanding the AI’s blind spots before they cause problems. The CLAUDE.md pattern is the right approach; the effort is in being comprehensive enough to prevent this class of mistake.
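A sketch of what such a CLAUDE.md entry might look like — the paths match the shared assets mentioned later in this note, but the exact wording is illustrative:

```markdown
## Existing shared assets — do not recreate

- Date dimension: `marts/common/dim_date`. Use `ref('dim_date')` for any
  date scaffolding; never generate a new date spine or date dimension.
- Customer dimension: `marts/customers/dim_customers`. All customer
  attributes should come from this model.
```

The prohibition has to be this literal: the agent cannot infer "a date dimension exists somewhere" from the conventions it sees in individual model files.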

Pulling from the Wrong Layer

The third failure: Claude Code pulled from base models instead of intermediate tables. Technically correct (the data is there), but wrong for the project’s architecture. The intermediate layer exists precisely to encapsulate business logic — joins, enrichments, deduplication — so that marts don’t have to repeat it. By sourcing from base models, the AI bypassed that logic, either duplicating it (if it got the logic right) or omitting it (if it didn’t realize it was needed).

This is a layering judgment call. The AI knows that base models exist and that marts should source from something upstream. It doesn’t know whether the right upstream is a base model or an intermediate model without understanding what logic the intermediate layer is applying and whether that logic should be in the mart’s lineage.
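The difference is often a single `ref()` call, which is why it is easy for an agent to get wrong. A sketch with hypothetical model names:

```sql
-- Wrong for this architecture: sourcing the base model bypasses the
-- deduplication and enrichment the intermediate layer applies.
with orders as (
    select * from {{ ref('base_shop__orders') }}
)
select * from orders

-- Right: the intermediate model already carries the business logic,
-- so the mart inherits it through lineage instead of repeating it.
with orders as (
    select * from {{ ref('int_orders_enriched') }}
)
select * from orders
```

Both versions compile and both pull real data; nothing in the SQL itself marks the first as an architectural violation.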

Agent Skills from dbt Labs address this class of failure in their coverage of layer conventions — when to source from base vs. intermediate vs. the Semantic Layer. But even with good skills files, an agent that doesn’t know what the intermediate models in your specific project do can’t reliably choose the right one.

Why These Failures Happen

The Recce team’s summary was precise: “AI-assisted analytics engineering isn’t a prompting problem. It’s an infrastructure problem.”

The failures aren’t from inadequate prompts. They’re from the AI operating without business context that no prompt can fully provide. Inner joins vs. left joins is a judgment about nullability in your business domain — something you understand because you know what happens when a customer is deleted from your CRM or when an order is placed from an unregistered session. The AI is making a structural choice without the domain knowledge to make it correctly.

This connects to The Context Gap in AI Data Engineering: the gap between what’s in the project files and what’s in the practitioner’s head. Your dbt project contains the code. It doesn’t contain why the code was written the way it was, what business events produce nulls, or which existing models should be reused rather than rebuilt.

Three categories of context that don’t live in project files:

  • Domain semantics: What nulls mean. What “active” means. What the business considers a valid order vs. a test order.
  • Project inventory: Which shared assets exist and should not be recreated.
  • Layer rationale: Why intermediate models exist, what business logic they encapsulate, and which marts should source from them.

What Actually Helps

CLAUDE.md with explicit prohibitions. “Don’t recreate existing shared dimensions. Existing assets: dim_date at marts/common/dim_date, dim_customers at marts/customers/dim_customers.” “Always use left joins when joining to enrichment tables; our CRM data has gaps.” These don’t require the AI to understand the business — they encode the correct choice explicitly.

Project documentation on existing assets. A model description that says “Shared date dimension. Use this model rather than building your own date scaffolding” gives the agent the information it needs to avoid recreating it. Documentation can substitute for AI business understanding when it is specific and explicit enough.
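In dbt, that description lives in the model’s properties YAML. A minimal sketch, with illustrative wording:

```yaml
models:
  - name: dim_date
    description: >
      Shared date dimension covering the full fiscal calendar.
      Use this model rather than building your own date scaffolding.
```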

Structural review, not just syntax review. When reviewing AI-generated dbt models, check the join types deliberately. Check which layer the model is sourcing from. Check whether any logic being implemented already exists in an intermediate model. These aren’t things you’d normally need to check when writing the code yourself — they’re things you need to check specifically because of how AI makes decisions.

Tests that catch silent filtering. A test that validates row counts against a known baseline — or a not_null test on a column that should never be null but might be silently filtered by an inner join — catches the wrong-join failure before it reaches production. dbt Testing Strategy by Layer covers where to add these.
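Assuming the illustrative orders models above, both checks can be expressed in standard dbt test YAML — the row-count comparison here uses the `equal_rowcount` test from the dbt-utils package; the model and column names are hypothetical:

```yaml
models:
  - name: fct_orders
    tests:
      # Row count must match the upstream model — an inner join that
      # silently drops guest orders fails this test immediately.
      - dbt_utils.equal_rowcount:
          compare_model: ref('int_orders_enriched')
    columns:
      - name: order_id
        tests:
          - not_null
```

The `equal_rowcount` test only works when the mart should preserve the upstream grain one-to-one; where aggregation changes the grain, a baseline row-count or sum check against a known-good snapshot serves the same purpose.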

Starting smaller. The case study involved building an entire dbt project end-to-end. That scope amplifies judgment failures — each wrong decision compounds with others. Having AI build individual models, in the context of an existing project where the layer structure already exists and intermediate models are already in place, gives it less room to make architectural choices incorrectly.

Operating Range

AI is reliable at pattern replication, boilerplate generation, multi-file edits, and iterative debugging. It is unreliable for architectural judgment when it lacks business context. The productive division is for humans to make architectural decisions — which layer to source from, whether a shared asset exists, which join type is correct — and for AI to implement those decisions. This is narrower than “build me a dbt project end-to-end,” but the implementation work (model SQL, YAML generation, test execution, compilation debugging) is where AI saves the most time with the least risk.