
dbt Test Output Parsing for Automated Monitoring

How to extract structured, actionable information from dbt test output — distinguishing failure types, capturing sample rows, and handling partial runs so automated monitoring doesn't miss anything.

dbt · data quality · automation · ai

Raw dbt test output contains everything you need to understand a failure. It also requires real interpretation before it becomes useful. “The unique test on mrt__sales__customers.customer__id failed with 3 rows” is not the same as understanding that this is a primary key violation on a mart model with seven downstream dependents, three of which feed client-facing dashboards.

Getting from raw output to that second version requires deliberate parsing. Whether you’re feeding this output to an AI agent via a skill file or writing post-run scripts yourself, the parsing logic is the same — and it’s worth getting right before adding anything else on top.

What to Extract Per Failure

For each failing test, the information you want is:

  • Test name — the full identifier, e.g. unique_mrt__sales__customers_customer__id
  • Test type — generic built-in, custom generic, singular, or source freshness
  • Model being tested — derived from the test node path
  • Column being tested — present in the test configuration for column-level tests
  • Failure count — how many rows violated the test condition
  • Sample failing rows — concrete examples that point toward root cause
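
If you're scripting rather than prompting, these six items map directly onto a record type. A minimal sketch in Python; the class and field names are illustrative, not anything dbt provides:

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TestFailure:
    test_name: str          # e.g. unique_mrt__sales__customers_customer__id
    test_type: str          # generic built-in, custom generic, singular, freshness
    model: str              # derived from the test node path
    column: Optional[str]   # None for table-level and singular tests
    failure_count: int      # rows violating the test condition
    sample_rows: list = field(default_factory=list)  # 3-5 concrete examples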

The test name encodes the type if you parse it correctly. Generic tests like unique, not_null, accepted_values, and relationships appear as prefixes in the test node name. A test named unique_mrt__sales__customers_customer__id is a unique generic test on the customer__id column of mrt__sales__customers. A test named assert_no_orphaned_orders is a singular test with a custom name. Package-based tests like those from dbt_expectations appear with the package prefix: expect_column_values_to_be_between_mrt__sales__orders_order__amount.
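
Scripted, the type extraction is a prefix check against the test node name. A minimal sketch using the prefixes named above; note that the model and column should still come from the node path and test configuration, since splitting the remainder on underscores is ambiguous:

GENERIC_PREFIXES = ("unique", "not_null", "accepted_values", "relationships")

def test_type_from_name(name: str) -> str:
    """Classify a dbt test node name by its prefix."""
    for prefix in GENERIC_PREFIXES:
        if name.startswith(prefix + "_"):
            return prefix                 # built-in generic test
    if "expect_" in name:
        return "dbt_expectations"        # package-based generic test
    return "singular_or_custom"          # e.g. assert_no_orphaned_orders

test_type_from_name("unique_mrt__sales__customers_customer__id")  # "unique"
test_type_from_name("assert_no_orphaned_orders")                  # "singular_or_custom"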

If you’re instructing an AI agent, give it this explicitly:

## Parsing test names
For each test result, extract:
1. Test name and type (look for `unique`, `not_null`, `accepted_values`,
`relationships`, or `dbt_expectations` in the test name)
2. The model being tested (from the test node path)
3. The column being tested (from the test configuration)
4. Number of failing rows
5. If --store-failures is enabled, query the failure table to get
3-5 sample failing rows for context

Showing the agent an instruction list works better than describing parsing rules abstractly. The numbered structure gives it a protocol to follow, not principles to interpret.

Getting Sample Failing Rows

Knowing that a unique test failed with 3 rows tells you there’s a duplication problem. Seeing those 3 rows tells you where to look. There are two ways to get them.

--store-failures tells dbt to write failing rows to a table in your warehouse. Run dbt test --store-failures and dbt creates tables like dbt_test__audit.unique_mrt__sales__customers_customer__id in a dedicated schema. You can then query that table to retrieve sample rows.

dbt show is a lighter alternative for inspecting specific failures after a run. It doesn’t require --store-failures and works against any selectable node, including test nodes:

dbt show --select unique_mrt__sales__customers_customer__id --limit 5

For automated monitoring, --store-failures is the better choice because the agent can query the failure table automatically without additional setup per failure. Add this to your skill file:

## Getting sample rows
After running `dbt test --store-failures`:
- For each failing test, check if a failure table exists in dbt_test__audit
- If it does, query it: SELECT * FROM dbt_test__audit.[test_name] LIMIT 5
- Include the sample rows in the failure report
- If no failure table exists (some test types don't store failures), omit this step
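
As a post-run script, the same steps look like this. A minimal sketch assuming a DB-API 2.0 connection (conn) and dbt's default dbt_test__audit schema; the function name is a placeholder:

def sample_failures(conn, test_name: str, limit: int = 5):
    """Return up to `limit` failing rows stored by --store-failures,
    or None if this test type doesn't store failures."""
    cursor = conn.cursor()
    try:
        # test_name comes from dbt's own artifacts, not user input,
        # so string interpolation is tolerable here.
        cursor.execute(f"SELECT * FROM dbt_test__audit.{test_name} LIMIT {limit}")
        return cursor.fetchall()
    except Exception:
        return None  # no failure table for this test; omit the step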

Distinguishing Failure Types

dbt has five testing mechanisms, but whichever one a test came from, the failure types that matter most for automated monitoring come down to three result statuses:

Data failures (FAIL) — the test ran successfully and found rows that violate the condition. This means the pipeline ran; the data is wrong. A not_null failure means null rows exist in the output. A unique failure means duplicate rows exist. A relationships failure means orphaned foreign keys exist. The appropriate response is to investigate the data and its upstream sources.

Pipeline failures (ERROR) — the test couldn’t execute at all. Compilation errors, missing models, connection failures, and undefined source references all produce ERRORs. The data may be perfectly fine; something in the infrastructure or model definition is broken. The appropriate response is to look at the dbt project itself, not the data.

Soft violations (WARN) — tests configured with severity: warn report problems but don’t fail the run. These are conditions the team has decided are tolerable in production but worth knowing about. The appropriate response is to note them and revisit during normal working hours, not to wake anyone up.

These three categories require different responses, different urgency levels, and different investigations. An automated system that treats them identically will produce alert fatigue quickly. Build this distinction in from the start:

## Failure type classification
When you see FAIL:
- The data violated a test condition
- Report: test name, model, failing row count, sample rows if available
- Investigate: check upstream source freshness, deduplication logic, or business rule changes
When you see ERROR:
- The test couldn't execute
- Report: error message verbatim, which model failed to compile
- Investigate: dbt compilation errors, missing models, connection issues
When you see WARN:
- A soft threshold was exceeded
- Report: which test, which model, the count
- Flag: "worth investigating but not blocking"

Not all failure types apply equally to all test types. Source freshness failures always appear as separate entries with their own format: raw_shopify.orders with a freshness lag measured in hours. Treat these as a fourth category that indicates timing problems rather than data quality problems — a not_null failure at 6 AM might be caused by a source freshness failure that you need to see first.
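
If you're scripting the classification instead, the first three categories fall out of dbt's run_results.json artifact, which records a status and failing-row count for every executed test (source freshness results land in a separate sources.json artifact). A minimal sketch:

import json
from collections import defaultdict

def classify_results(path: str = "target/run_results.json") -> dict:
    """Bucket test results by status: pass, fail, warn, error, skipped."""
    with open(path) as f:
        results = json.load(f)["results"]
    buckets = defaultdict(list)
    for r in results:
        buckets[r["status"]].append({
            "test": r["unique_id"],             # full test node identifier
            "failing_rows": r.get("failures"),  # row count for data FAILs
            "message": r.get("message"),        # verbatim error for ERRORs
        })
    return buckets

buckets = classify_results()
# buckets["fail"], buckets["error"], and buckets["warn"] map to the three
# categories above; buckets["pass"] keeps the total honest.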

Handling Partial Runs

Most dbt test runs in production return a mix of passes, failures, warnings, and errors. Automated monitoring needs to handle all four states without getting confused by the presence of any one of them.

A compilation error on one test should not prevent reporting on the 140 tests that did run. A freshness failure should not obscure the data quality failures that may or may not be caused by it. Explicit instructions prevent the parsing logic from short-circuiting on the first problem it encounters:

## Handling mixed results
If some tests fail and others pass:
- Always report the total: "142/147 tests passed, 3 failed, 2 warnings"
- List failures and warnings individually with full detail
- Do not skip passed tests in the count
- If any test has a compilation ERROR (not a data FAIL), report it
separately as a pipeline issue, not a data quality issue
Do not stop parsing after the first failure. Process all results.

The total count matters because context shapes urgency. “3 tests failed” feels different when 144 others passed versus when 20 others also failed. Always include it.
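
Computed from the buckets returned by the classification sketch above, same assumed shape:

def summary_line(buckets: dict) -> str:
    """Build the headline, e.g. '142/147 tests passed, 3 failed, 2 warnings'."""
    total = sum(len(v) for v in buckets.values())
    passed = len(buckets.get("pass", []))
    return (f"{passed}/{total} tests passed, "
            f"{len(buckets.get('fail', []))} failed, "
            f"{len(buckets.get('warn', []))} warnings")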

The --target prod Flag

When running automated monitoring against production data, always specify the target explicitly. Without it, dbt uses the default target in your profiles.yml, which is typically a development environment.

dbt test --target prod --store-failures
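
To see why the flag matters, here is an illustrative profiles.yml skeleton (adapter details omitted); the top-level target: key is what dbt falls back to whenever --target is absent:

my_project:
  target: dev          # the fallback when --target is omitted
  outputs:
    dev:
      type: postgres   # placeholder adapter; credentials omitted
      schema: dbt_dev_alice
    prod:
      type: postgres
      schema: analytics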

This sounds obvious but is a common source of confusion when setting up automated monitoring for the first time. The agent runs the command, sees zero failures, and you spend time wondering why — until you realize it was testing an empty development schema.

What Comes After Parsing

Good output parsing is the foundation. Once you have structured failure information, the next steps are severity assessment (which failures need attention first?) and documentation cross-referencing (what does this column actually represent, and what breaks if it’s wrong?). Both depend on having clean, structured parsing as the input.

For the full pattern of assembling this into a morning summary, see the dbt Quality Morning Summary Pattern note. For the mechanics of delivering it via an AI agent on a schedule, start with OpenClaw Cron Scheduler Mechanics.