
AI Agent Data Quality: What Works Today vs. What's Aspirational

An honest assessment of which AI agent capabilities for dbt data quality are production-ready, which require significant work but are achievable, and which are still too unreliable to depend on.

Planted
dbt · elementary · data quality · automation · ai

AI agent capabilities that make compelling demos are not always the same ones that work reliably in production. This note describes where the maturity line sits based on production use, in three tiers: what works today, what requires significant work but is achievable, and what is still too unreliable to depend on.

Works Today: The Reliable Foundation

These capabilities require configuration but behave predictably once configured.

Cron-triggered dbt test execution is the foundation. An agent runs dbt test --target prod on a schedule, parses the output, and delivers results to Slack. The basic pattern — cron to shell command to output parse to Slack message — is reliable once configured. The OpenClaw Cron Scheduler Mechanics note covers the specifics. This is the building block worth getting right before adding anything else, because everything else depends on it.
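A minimal sketch of that pattern, assuming a Slack incoming webhook; in practice the agent performs these steps itself through its shell and messaging tools, and the webhook URL, project directory, and cron schedule below are placeholders, not part of any particular setup:

```python
# Invoked from cron, e.g.: 0 7 * * * /usr/bin/python3 /opt/analytics/run_dbt_checks.py
# (the path and schedule are placeholders)
import subprocess
import requests  # pip install requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
DBT_PROJECT_DIR = "/opt/analytics/dbt"                              # placeholder

def run_dbt_tests() -> subprocess.CompletedProcess:
    # dbt exits non-zero when any test fails, so don't raise on a bad return code.
    return subprocess.run(
        ["dbt", "test", "--target", "prod"],
        cwd=DBT_PROJECT_DIR,
        capture_output=True,
        text=True,
    )

def post_to_slack(text: str) -> None:
    requests.post(SLACK_WEBHOOK_URL, json={"text": text}, timeout=10)

if __name__ == "__main__":
    result = run_dbt_tests()
    status = "passed" if result.returncode == 0 else "had failures"
    tail = "\n".join(result.stdout.splitlines()[-10:])  # last few lines as a quick summary
    post_to_slack(f"dbt test run {status}:\n{tail}")
```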

Basic output parsing and formatted Slack delivery work. The agent can distinguish passes from failures, categorize FAIL vs ERROR vs WARN, and format a readable summary. Output quality depends on how well you’ve written your skill instructions — a vague skill produces inconsistent output, a specific skill produces consistent output. This is entirely in your control.
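To make “parse the output” concrete, here is a rough categorization pass over dbt’s stdout. The exact log format varies by dbt version, so treat the keyword patterns as an assumption to verify against your own runs:

```python
import re
from collections import Counter

STATUS_PATTERN = re.compile(r"\b(PASS|FAIL|ERROR|WARN)\b")

def categorize(dbt_stdout: str) -> Counter:
    # Count result lines by status keyword. May need tuning: dbt's summary
    # lines also mention these words, which can inflate the counts.
    counts = Counter()
    for line in dbt_stdout.splitlines():
        match = STATUS_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

def format_summary(counts: Counter) -> str:
    total = sum(counts.values())
    return (
        f"{total} results: {counts['PASS']} passed, {counts['FAIL']} failed, "
        f"{counts['ERROR']} errored, {counts['WARN']} warnings"
    )
```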

Simple severity categorization based on explicit rules works well. If you define the rules precisely in your skill — “if the test name contains ‘unique’ and the model starts with ‘mrt__’, mark it as critical” — the agent follows those rules consistently. Classification tasks with clear criteria are where language models perform reliably. Agents are good at applying rules; they’re less reliable at inferring rules from examples.
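The same kind of rule, written as code rather than skill prose, looks roughly like this; the prefixes and severity labels are illustrative examples, not a recommended taxonomy:

```python
def classify_severity(test_name: str, model_name: str) -> str:
    # Explicit rules, applied in priority order.
    if "unique" in test_name and model_name.startswith("mrt__"):
        return "critical"  # duplicate keys in a mart: downstream numbers are wrong
    if model_name.startswith("mrt__"):
        return "high"      # any other failure on a customer-facing mart
    if model_name.startswith("stg__"):
        return "medium"    # staging issue: investigate, marts may still be fine
    return "low"

# classify_severity("unique_mrt__sales__customers_customer__id", "mrt__sales__customers")
# returns "critical"
```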

These three capabilities together provide structured output, categorized failures, and actionable summaries. This baseline is useful independently and does not require any of the more complex building blocks.

Requires Work but Achievable

These capabilities work but carry maintenance overhead and engineering cost.

Documentation cross-referencing through persistent memory is viable, but there’s no automated sync between your schema.yml files and the agent’s memory. You load your model and column descriptions manually, and you update them manually when models change. The cross-referencing itself works well — when the agent looks up mrt__sales__customers.customer__id and reports that “duplicates mean double-counted revenue,” that’s genuinely useful. But if you add a model, rename a column, or change a description and forget to update the memory document, the agent reports stale context confidently. Stale context is worse than no context.
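One way to keep the manual sync honest is to script it and re-run the script whenever schema files change. A sketch, assuming PyYAML and placeholder paths; the output file is the memory document the agent reads:

```python
from pathlib import Path
import yaml  # pip install pyyaml

DBT_MODELS_DIR = Path("/opt/analytics/dbt/models")  # placeholder
MEMORY_FILE = Path("memory/dbt_documentation.md")   # placeholder

lines = ["# dbt model documentation (generated, do not edit by hand)"]
for schema_file in sorted(DBT_MODELS_DIR.rglob("*.yml")):
    spec = yaml.safe_load(schema_file.read_text()) or {}
    for model in spec.get("models", []):
        lines.append(f"\n## {model['name']}")
        if model.get("description"):
            lines.append(model["description"])
        for column in model.get("columns", []):
            desc = column.get("description", "(no description)")
            lines.append(f"- {column['name']}: {desc}")

MEMORY_FILE.parent.mkdir(parents=True, exist_ok=True)
MEMORY_FILE.write_text("\n".join(lines))
```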

Downstream impact analysis using manifest.json is technically achievable. The agent can parse JSON and extract dependency information. The practical problem is size: manifest files for large projects can be several megabytes, and feeding the full manifest into a language model’s context window isn’t practical. The workaround is pre-processing — extract the dependency graph into a compact summary format and store that in memory instead. This requires a script to run at deployment time (or on PR merge) and adds a maintenance step whenever the DAG changes significantly. The result is useful; the setup cost is real.
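A sketch of that pre-processing step, using the manifest’s child_map to produce a compact model-to-downstream summary the agent can keep in memory (the paths are placeholders):

```python
import json
from pathlib import Path

MANIFEST = Path("/opt/analytics/dbt/target/manifest.json")  # placeholder
SUMMARY = Path("memory/dbt_dependency_summary.md")          # placeholder

manifest = json.loads(MANIFEST.read_text())
child_map = manifest["child_map"]  # unique_id -> list of downstream unique_ids

lines = ["# Downstream impact map (generated)"]
for unique_id, children in sorted(child_map.items()):
    if not unique_id.startswith("model."):
        continue
    downstream = sorted({c.split(".")[-1] for c in children if c.startswith("model.")})
    if downstream:
        lines.append(f"- {unique_id.split('.')[-1]} feeds: {', '.join(downstream)}")

SUMMARY.parent.mkdir(parents=True, exist_ok=True)
SUMMARY.write_text("\n".join(lines))
```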

Historical pattern tracking via persistent memory is feasible but fragile. The agent can maintain a failure history file, update it after each run, and surface historical context in reports (“this test has failed 4 of the last 7 days”). What it can’t do is query that history reliably with structured filters. “Find all failures of test X in the last 30 days” requires the history file to be organized in a way that makes that retrieval straightforward — which means careful design of the memory format and regular pruning to keep the file scannable. It works, but it’s more like a structured log the agent reads than a database the agent queries.
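A sketch of what “a structured log the agent reads” can look like: one JSON line per result, appended after each run, plus a simple scan that answers the “failed 4 of the last 7 days” question. The file path and record fields are assumptions:

```python
import json
from datetime import date, timedelta
from pathlib import Path

HISTORY_FILE = Path("memory/test_failure_history.jsonl")  # placeholder

def record_result(test_name: str, status: str) -> None:
    # Append one line per test result after each run.
    HISTORY_FILE.parent.mkdir(parents=True, exist_ok=True)
    record = {"date": date.today().isoformat(), "test": test_name, "status": status}
    with HISTORY_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")

def failures_in_last_days(test_name: str, days: int = 7) -> int:
    # A linear scan, not a query: fine while the file stays small and pruned.
    cutoff = (date.today() - timedelta(days=days)).isoformat()
    count = 0
    for line in HISTORY_FILE.read_text().splitlines():
        rec = json.loads(line)
        if rec["test"] == test_name and rec["status"] == "FAIL" and rec["date"] >= cutoff:
            count += 1
    return count
```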

Still Aspirational: Don’t Depend on These Yet

These capabilities appear in vendor demos and architecture diagrams but do not behave reliably enough for production monitoring.

Automatic test generation — the agent notices a column has no tests and suggests the right test for it — requires understanding business context that language models handle inconsistently. Suggesting not_null for a column that appears nullable is straightforward. Knowing that order__revenue_usd should have a range test with a maximum of roughly $50K (because you’ve never had an order over that threshold and anything larger is probably a data error) requires domain knowledge that isn’t in the schema. The taxonomy of test types is learnable; the business judgment about which specific thresholds to apply is not.

Self-healing pipelines — the agent detects a failure and modifies the dbt project to fix it — are in “don’t do this” territory. An agent with write access to production dbt models creates risks that outweigh the convenience by a large margin. This isn’t a maturity problem that gets better with better models; it’s a fundamental architecture question about where human review belongs in the change process. The Security Posture for AI Agents note covers why read-only is the right default posture for monitoring agents. Write access is a different category entirely and requires human review gates that negate most of the automation value.

Anomaly detection without explicit tests — the agent notices that row counts are unusually low this morning without a specific test configured for it — is interesting but unreliable. Language models are not statistical anomaly detectors. Asking them to assess whether today’s 9,847 rows is anomalous compared to the usual 10,200-12,400 range produces inconsistent results. They may correctly identify the anomaly. They may flag it as normal. They may flag it as anomalous when it isn’t. For anomaly detection, dedicated tools that use actual statistical methods — Elementary’s Z-score approach or the ML-powered methods in commercial platforms — perform more reliably than agents reasoning from raw numbers. This is one area where purpose-built tools genuinely earn their keep.
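For contrast, here is roughly what the statistical version of that question looks like: a plain Z-score over recent daily row counts. It is far simpler than what Elementary or the commercial platforms actually do, but it illustrates the difference: the answer is deterministic, whereas an LLM eyeballing the same numbers may answer differently from run to run.

```python
from statistics import mean, stdev

def is_anomalous(today: float, history: list[float], threshold: float = 3.0) -> bool:
    # Flag values more than `threshold` standard deviations from the recent mean.
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

# Recent daily row counts in the usual 10,200-12,400 range (illustrative numbers).
history = [10_200, 11_050, 12_400, 11_800, 10_900, 11_300, 12_100]
print(is_anomalous(9_847, history))  # same answer on every run, whatever the verdict
```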

The Layered Approach

The appropriate architecture for most teams is AI agents layered on top of observability tools, not replacing them.

Elementary, Monte Carlo, and dbt Cloud’s built-in alerting do the baseline job better than an agent: structured storage, consistent formatting, reliable historical tracking, statistical anomaly detection. If your primary need is “know when dbt tests fail and track the patterns,” those tools are more mature and less risky.

An agent like OpenClaw adds things those tools don’t provide: contextual summaries written in plain language with business implications, natural language follow-up via Slack (“show me the 3 duplicate customer IDs”), cross-system correlation (combining dbt failures with BigQuery cost anomalies and source system status), and the flexibility to monitor arbitrary systems without waiting for a connector to exist.

A practical layering:

| Layer | Tool | What it does |
| --- | --- | --- |
| Anomaly detection | Elementary or commercial | Statistical monitoring, no threshold configuration needed |
| Explicit test alerts | dbt native + Elementary | Structured routing of known test failures |
| Contextual summaries | OpenClaw agent | Plain-language briefings with business implications |
| Follow-up investigation | OpenClaw agent | Conversational interface for ad-hoc queries |

Use the build vs. buy framework to decide where each layer makes sense for your team size and project complexity. For a solo consultant with a few client projects, the OpenClaw layer might be the primary monitoring interface with Elementary catching the anomalies. For a larger team with SLA commitments, the commercial tool handles the alerting and the agent handles the communication.

Maintenance

AI agent monitoring is more flexible than dedicated tools but less reliable. Maintenance overhead concentrates in three areas: keeping memory documents current, calibrating skill instructions as failure patterns evolve, and periodically reviewing agent outputs for systematic misclassifications.

Start with the basics — a cron job, a simple skill, Slack delivery — and add complexity only when there is a specific problem the basics do not solve. The building blocks are independently useful: output parsing is valuable without persistent memory; severity classification is valuable without documentation cross-referencing.