An ErrorLLM study found that only 3% of incorrect SQL queries raise execution warnings. The other 97% run successfully and return wrong results. AI-generated SQL that compiles and runs is therefore not evidence of correctness; the query that executes cleanly and returns wrong numbers is the primary failure mode.
The Compile-But-Wrong Problem
The traditional assumption in software development is that errors manifest as failures — exceptions, crashes, test failures. In SQL, this assumption is catastrophically wrong. A query with a missing JOIN condition doesn’t fail; it produces a cross join and inflates row counts. A query with inconsistent temporal filters doesn’t fail; it compares apples to oranges across time periods. A query that aggregates at the wrong grain doesn’t fail; it reports plausible-looking numbers that happen to be wrong.
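To see the first case concretely, here is a minimal sketch with hypothetical orders and customers tables:

```sql
-- Old-style comma join with the predicate forgotten: a full cross join.
-- Every order pairs with every customer; row counts explode silently.
SELECT o.order_id, c.customer_name
FROM orders AS o, customers AS c;
-- Intended: WHERE c.customer_id = o.customer_id
```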
AI tools make this worse for a specific reason: they’re optimized to produce code that compiles and runs. An LLM trained on code has a strong prior toward syntactically valid output. It rarely generates a query that fails to parse. It regularly generates a query that parses, executes, and returns a result set that doesn’t reflect what you asked for.
Real-World Examples
Thomson Reuters Labs documented this in production. They found that 73% of AI-generated time-based SQL analyses had inconsistent temporal filters. Every one of those queries ran without errors.
The specific failure pattern: an AI agent generated conversion rate SQL that inflated results by roughly 40% because temporal filters were applied inconsistently across joined tables. One table was filtered to the current month. The joined table wasn’t filtered at all, or was filtered to a different time range. The resulting conversion rate was mathematically valid but semantically meaningless.
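A minimal sketch of that failure mode, using hypothetical sessions and orders tables:

```sql
-- The session side is filtered to January, but the joined order side
-- has no date filter, so orders from any month count as January conversions.
SELECT
    COUNT(DISTINCT o.order_id) * 1.0
        / COUNT(DISTINCT s.session_id) AS conversion_rate
FROM sessions AS s
LEFT JOIN orders AS o
    ON o.session_id = s.session_id
    -- Missing: AND o.order_date >= '2026-01-01' AND o.order_date < '2026-02-01'
WHERE s.session_date >= '2026-01-01'
  AND s.session_date < '2026-02-01';
```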
78% of SQL errors in their analysis occurred when the agent joined tables without properly identifying primary key relationships. The classic case: joining on a non-unique key that produces a silent fan-out. If orders has one row per order but order_items has multiple rows per order, joining them without aggregating first silently multiplies your revenue numbers. The query runs. The numbers look reasonable. They’re just wrong.
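The fan-out is easy to reproduce. A sketch with hypothetical orders and order_items tables:

```sql
-- orders: one row per order. order_items: many rows per order.
-- Wrong: the join repeats order_total once per item, inflating revenue.
SELECT SUM(o.order_total) AS revenue
FROM orders AS o
JOIN order_items AS oi
    ON oi.order_id = o.order_id;

-- Right: collapse order_items to the order grain before joining.
SELECT SUM(o.order_total) AS revenue
FROM orders AS o
JOIN (
    SELECT order_id
    FROM order_items
    GROUP BY order_id
) AS items
    ON items.order_id = o.order_id;
```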
A practitioner at Recce documented a related pattern with Claude Code. The agent built dbt models from scratch on Snowflake — sources, base models, intermediates, and marts. It followed naming conventions, used CTEs properly, and even made certain intermediate models incremental without being asked. But it also silently filtered out rows with missing org_id values, making a data quality decision that should have been flagged for a human. The output was clean, well-structured, and quietly incomplete.
Why This Is Structurally Hard to Fix
The compile-but-wrong problem isn’t a bug in any particular AI tool. It’s a consequence of how SQL works. SQL is declarative — you describe what you want, and the engine figures out how to get it. There’s no type system that catches “you joined on the wrong grain.” There’s no compiler warning for “your temporal filters are inconsistent.” The language itself doesn’t distinguish between a correct result and a plausible-but-wrong one.
AI tools compound this because they lack the domain context to know what “correct” means. A model trained on millions of SQL examples can generate perfectly idiomatic code that misunderstands your specific business rules. It doesn’t know that your orders table includes cancelled orders that should be excluded from revenue calculations. It doesn’t know that your event_timestamp is in UTC but your business reports in EST. It doesn’t know that joining users to sessions should be a LEFT JOIN to preserve users with zero sessions, not an INNER JOIN that drops them.
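As an illustration, here is what those three rules look like when encoded explicitly. The table names, column names, and status values are hypothetical, and CONVERT_TIMEZONE is the Snowflake built-in; other engines spell timezone conversion differently:

```sql
-- 1. Revenue excludes cancelled orders; the schema alone can't tell you this.
SELECT SUM(order_total) AS revenue
FROM orders
WHERE status <> 'cancelled';

-- 2. Timestamps are stored in UTC but the business reports in EST.
SELECT
    CAST(CONVERT_TIMEZONE('UTC', 'America/New_York', event_timestamp) AS DATE) AS report_date,
    COUNT(*) AS events
FROM events
GROUP BY 1;

-- 3. LEFT JOIN keeps users with zero sessions; INNER JOIN would drop them.
SELECT u.user_id, COUNT(s.session_id) AS session_count
FROM users AS u
LEFT JOIN sessions AS s
    ON s.user_id = u.user_id
GROUP BY u.user_id;
```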
What Catches These Errors
Linters and formatters catch nothing here. The SQL is well-formed. The errors are semantic, not syntactic.
Review Practices
The most reliable defense is human review focused on the specific areas where AI-generated SQL fails. When reviewing AI output, check these in order:
- JOIN conditions: Is the join on the correct key? Is the key unique on at least one side? Is the join type appropriate (LEFT vs. INNER)?
- Temporal filters: Are date/time filters applied consistently across all joined tables? Are they using the same granularity?
- NULL handling: Does the query handle NULLs in filter columns, join columns, and aggregation columns?
SUM(revenue) ignores NULLs; AVG(revenue) ignores NULLs but changes the denominator; COUNT(revenue) counts only non-NULLs while COUNT(*) counts everything (see the sketch after this list).
- Aggregation grain: Does the GROUP BY match the intended output grain? Are all non-aggregated columns in the GROUP BY?
- Filter completeness: Are there filters the AI should have applied but didn’t? (Excluding test data, cancelled orders, internal users, etc.)
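A minimal sketch of the NULL-handling point above (PostgreSQL syntax for the inline rows; the aggregate semantics are standard SQL):

```sql
-- Three rows, one NULL revenue value.
WITH t(revenue) AS (
    VALUES (100), (200), (NULL)
)
SELECT
    SUM(revenue)   AS sum_revenue,    -- 300: the NULL is ignored
    AVG(revenue)   AS avg_revenue,    -- 150: denominator is 2, not 3
    COUNT(revenue) AS non_null_rows,  -- 2: counts only non-NULL values
    COUNT(*)       AS all_rows        -- 3: counts every row
FROM t;
```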
Testing
The dbt Testing Taxonomy provides the structural answer. Generic tests catch some silent errors — a unique test on a primary key will catch an accidental fan-out from a bad join. relationships tests catch orphaned foreign keys. But most semantic errors require either singular tests (custom SQL assertions) or unit tests (mocked inputs with expected outputs).
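As a sketch, those generic tests are declared in the model's schema YAML; the model and column names here are hypothetical:

```yaml
models:
  - name: mrt__finance__order_revenue
    columns:
      - name: order_id
        tests:
          - unique      # fails if a bad join fanned out the order grain
          - not_null
      - name: customer_id
        tests:
          - relationships:   # fails on orphaned foreign keys
              to: ref('dim__customers')
              field: customer_id
```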
For the temporal filter problem specifically, unit tests are the most effective defense:
```yaml
unit_tests:
  - name: test_conversion_rate_temporal_consistency
    model: mrt__marketing__conversion_rates
    given:
      - input: ref('int__sessions')
        rows:
          - {session_id: 1, session_date: "2026-01-15", converted: true}
          - {session_id: 2, session_date: "2026-02-15", converted: false}
      - input: ref('int__orders')
        rows:
          - {order_id: 1, order_date: "2026-01-15", session_id: 1}
    expect:
      rows:
        - {month: "2026-01", sessions: 1, conversions: 1, conversion_rate: 1.0}
        - {month: "2026-02", sessions: 1, conversions: 0, conversion_rate: 0.0}
```

If the model applies temporal filters inconsistently, the expected output won't match. The unit test catches the exact failure mode Thomson Reuters documented.
Anomaly Detection
Elementary for dbt catches the class of silent errors that manifest as distribution shifts rather than outright wrong values. A query that inflates conversion rates by 40% will trigger a column_anomalies test if the metric has a historical baseline. Volume anomalies catch the opposite case — a query that silently drops rows produces fewer output rows than expected.
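A sketch of that configuration, assuming the elementary dbt package is installed; the model and column names are hypothetical:

```yaml
models:
  - name: mrt__marketing__conversion_rates
    tests:
      - elementary.volume_anomalies      # flags a query that silently drops rows
    columns:
      - name: conversion_rate
        tests:
          - elementary.column_anomalies:   # flags a 40% jump vs. the baseline
              column_anomalies:
                - average
```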
Review Implications
The 3% warning rate means that standard quality gates (“does it compile?” and “does it run?”) do not catch most AI-generated SQL errors. A human developer who writes a bad join can explain their join logic; AI output provides no signal about confidence in join decisions.
Review AI-generated SQL with attention to: JOIN conditions, temporal filter consistency, NULL handling, aggregation grain, and filter completeness — the areas where semantic errors concentrate but no linter catches them.