
AI-Generated SQL Failure Modes

Why AI-generated SQL is dangerous — it runs without errors but returns wrong results. Research on temporal filter inconsistencies, join failures, and the confidence problem.

97% of incorrect AI-generated SQL runs without execution warnings (ErrorLLM study). The query compiles, executes, and produces a plausible-looking result. This is different from most software bugs: a broken function in application code fails visibly, while a broken SQL query returns results that require domain knowledge to identify as wrong. The feedback loop that catches bugs is severed.

The Temporal Filter Problem

Thomson Reuters Labs documented what this failure mode looks like in production. Their AI agent generated conversion rate SQL that inflated results by approximately 40%. The root cause: the agent applied temporal filters to one table but not the joined conversions table. When they audited their entire pipeline, 73% of time-based analyses had inconsistent temporal filters.

The pattern is predictable. When you write WHERE events.event_date BETWEEN '2025-01-01' AND '2025-03-31' but don’t constrain the joined conversions table to the same window, you get conversions from all time matched against a quarter of events. The conversion rate looks great. Finance loves the numbers. And they’re wrong.
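
A minimal sketch of the pattern. The table and column names (events, conversions, converted_at) are illustrative, not Thomson Reuters' actual schema:

```sql
-- Broken: the window constrains events but not conversions, so a
-- quarter of events is matched against conversions from all time.
select count(distinct c.conversion_id) * 1.0
     / count(distinct e.event_id) as conversion_rate
from events e
left join conversions c on c.user_id = e.user_id
where e.event_date between '2025-01-01' and '2025-03-31';

-- Fixed: apply the same window to both sides. Note the filter lives
-- in the ON clause, not WHERE, to preserve left-join semantics.
select count(distinct c.conversion_id) * 1.0
     / count(distinct e.event_id) as conversion_rate
from events e
left join conversions c
  on c.user_id = e.user_id
 and c.converted_at between '2025-01-01' and '2025-03-31'
where e.event_date between '2025-01-01' and '2025-03-31';
```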

Temporal filters are hard for AI because they require understanding the relationship between tables, not just the syntax of each query. The AI sees a date column, applies a filter, and moves on. It doesn’t reason about whether the joined table also needs temporal bounds. This is an architectural understanding — how time is modeled across the warehouse — that the LLM lacks.

This connects directly to data warehouse architecture decisions. Which tables use event time versus processing time? How do incremental models handle temporal boundaries? When late-arriving data gets incorporated, which downstream queries need to account for the lag? Getting time modeling wrong at the architecture level means every AI-generated query inherits the confusion.
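
One common way to handle the late-arriving-data question in dbt is an incremental model with a lookback window. A hedged sketch, with hypothetical model and source names:

```sql
-- models/fct_events.sql
{{ config(materialized='incremental', unique_key='event_id') }}

select
    event_id,
    user_id,
    event_date,   -- event time: when the event actually happened
    loaded_at     -- processing time: when the warehouse received it
from {{ source('analytics', 'raw_events') }}

{% if is_incremental() %}
  -- Reprocess a 3-day lookback so late-arriving events are picked up
  -- instead of silently dropped (interval syntax varies by warehouse).
  where loaded_at >= (select max(loaded_at) - interval '3 days' from {{ this }})
{% endif %}
```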

The Join Problem

Thomson Reuters found another failure pattern: 78% of SQL errors occurred when the agent joined tables without properly identifying primary key relationships. The AI sees two tables with a customer_id column and joins them. It doesn’t ask whether one table has one row per customer and the other has one row per order. It doesn’t check for NULLs in the join key. It doesn’t consider whether the join should be LEFT or INNER based on the business question.
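
A sketch of the grain mismatch with hypothetical orders and payments tables:

```sql
-- orders: one row per order_id. payments: possibly several rows per
-- order_id (split payments, retries). The naive join fans out, so
-- every multi-payment order is counted more than once.
select sum(o.amount) as paid_revenue    -- inflated
from orders o
join payments p on p.order_id = o.order_id;

-- Fix: collapse payments to the order grain first, and make the join
-- type a deliberate choice (INNER here: only orders with at least
-- one payment count; LEFT would keep unpaid orders too).
select sum(o.amount) as paid_revenue
from orders o
join (select distinct order_id from payments) p
  on p.order_id = o.order_id;
```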

Tiger Data corroborated this from a different angle: 42% of context-less LLM-generated queries missed critical filters or misunderstood table relationships. The “context-less” qualifier matters — these were queries generated without semantic metadata about what the tables and columns actually mean.

When Tiger Data added semantic catalogs — LLM-generated descriptions of what tables and columns actually represent — accuracy improved by 27%. The fix wasn’t a better model. The fix was more context. This is why AI-powered documentation matters even if you think “nobody reads the docs.” Better descriptions feed better AI output downstream, even when the consumer is another AI.
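
In a dbt project, this kind of semantic catalog is typically captured in a schema.yml. A sketch with hypothetical names, not Tiger Data's actual catalog format:

```yaml
version: 2

models:
  - name: fct_conversions
    description: >
      One row per conversion event. Conversions are attributed to the
      session in which they occurred, not to the user's first session.
    columns:
      - name: converted_at
        description: "Event time (UTC) of the conversion, not load time."
      - name: customer_id
        description: "FK to dim_customers; NULL for anonymous checkouts."
```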

The Confidence Problem

The ErrorLLM researchers surfaced something arguably worse than incorrect queries. When LLMs are prompted to correct a SQL query that is already correct, they comply anyway. They modify working code into broken code. They don’t say “this query is fine.” They find something to change, and the change introduces a bug.

This confidence problem compounds through complex queries. Hallucinations in early sub-queries propagate to final results. An incorrect CTE at the top of a query produces wrong intermediate results, which feed into downstream CTEs, which produce a final output that’s wrong in ways that are nearly impossible to trace without understanding what the correct output should have been.
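
A toy illustration of the propagation, with made-up names. The bug in the first CTE never raises an error, and everything downstream inherits it:

```sql
with active_users as (
    -- Bug: 'activ' matches nothing, but the engine has no way to
    -- know that; the CTE simply returns zero rows.
    select user_id from users where status = 'activ'
),

user_revenue as (
    -- Inherits the empty set: every downstream number is now wrong.
    select o.user_id, sum(o.amount) as revenue
    from orders o
    join active_users a on a.user_id = o.user_id
    group by o.user_id
)

select count(*) as buyers, sum(revenue) as total_revenue
from user_revenue;   -- runs cleanly, returns 0 buyers / NULL revenue
```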

The AI is confidently wrong, which is worse than uncertain and wrong. An uncertain system prompts human review. A confident system bypasses it.

What Actually Helps

The research points consistently in one direction: the fix is context, not model capability.

Semantic catalogs improve accuracy by giving the AI information about what data means, not just what it looks like. Tiger Data’s 27% improvement came from table and column descriptions — the kind of metadata that dbt’s testing and documentation infrastructure is designed to capture.

Structured instruction files reduce convention mismatches. Claude Code’s biggest error source was mismatched conventions — using stg_ when the project uses base__, for example. A well-maintained CLAUDE.md fixes this class of error entirely.
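
A hypothetical excerpt of the kind of conventions file that catches this class of error:

```markdown
## SQL conventions (excerpt from CLAUDE.md)
- Staging models are prefixed `base__`, never `stg_`.
- Every table in a time-filtered join gets the same date window.
- Join type (LEFT vs INNER) must match the business question; no defaults.
- Every fact model declares its grain in a comment at the top.
```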

Human review of AI output remains non-negotiable, especially for joins and temporal logic. The Thomson Reuters finding isn’t that AI can’t write SQL. It’s that AI can’t verify SQL. The verification step — does this join make sense? are the temporal filters consistent? does the grain match expectations? — requires someone who knows what the data means.

Testing catches what review misses. The dbt Testing Taxonomy provides five mechanisms for validating data and logic. Generic tests catch structural violations (duplicates, NULLs, referential integrity). Unit tests validate transformation logic. dbt-expectations tests catch value range violations. None of these were designed specifically for AI-generated code, but they’re exactly the safety net you need when the code was written by a system that can’t reason about correctness.
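
A sketch of what those layers look like in a schema.yml, with hypothetical model and column names:

```yaml
version: 2

models:
  - name: fct_conversions
    columns:
      - name: conversion_id
        tests:
          - unique          # generic test: catches duplicates
          - not_null
      - name: customer_id
        tests:
          - relationships:  # referential integrity
              to: ref('dim_customers')
              field: customer_id
      - name: conversion_rate
        tests:
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              max_value: 1  # value-range guard
```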

These failure modes are the default when AI operates without sufficient context about what the data means. And with AI adoption in analytics engineering growing from 33% to 70% in two years, the volume of AI-generated SQL in production pipelines is rising accordingly.