This framework assigns one of four severity tiers to each dbt test failure, based on test type, model layer, downstream dependents, and historical context. It is designed for automated application — either by an AI agent following skill instructions or by a post-processing script reading run_results.json — to produce a ranked list before a human reviews the morning summary.
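If you take the script route, the intake step is straightforward. A minimal sketch, assuming dbt's standard artifact location (`target/run_results.json`); the severity field is filled in by the rules that follow:

```python
import json

# Pull failing tests out of the artifact dbt writes after `dbt test` / `dbt build`.
with open("target/run_results.json") as f:
    run_results = json.load(f)

failures = [
    {
        "unique_id": r["unique_id"],  # e.g. "test.<project>.<test_name>"
        "status": r["status"],        # "fail", "error", or "warn"
        "message": r.get("message"),
        "severity": None,             # assigned by the tiering rules below
    }
    for r in run_results["results"]
    if r["status"] in ("fail", "error", "warn")
]
```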
## The Four-Tier System
**Critical** means act immediately, before anything else. Failures at this tier indicate data is already wrong in places that matter to stakeholders right now.
- Primary key or uniqueness failures on mart-layer models
- Source freshness more than 24 hours behind schedule
- Any test failure on a model powering a client-facing or executive-facing dashboard
- `not_null` failures on foreign keys that join to primary reporting dimensions
The critical tier is about current damage, not potential damage. If `mrt__sales__customers.customer__id` has duplicates, revenue is being double-counted in the sales dashboard right now. That is a different category of problem than a freshness warning on a dataset that feeds an internal operational report.
**High** means investigate today, within business hours. Something is wrong and will become more wrong if ignored, but it’s not currently causing visible damage at the stakeholder level.
- Referential integrity failures (`relationships` tests) on intermediate or mart models
- `not_null` failures on key dimension columns in intermediate models
- Source freshness 6-24 hours behind schedule
- Singular test failures encoding important business rules
**Medium** means investigate this week. These failures are real problems, but they’re not urgent.
- `accepted_values` violations on categorical fields
- Row count anomalies on intermediate models (significantly more or fewer rows than expected)
- Failures on intermediate models with no mart-level downstream dependents
- Recurring failures you’ve already investigated and determined to be low-risk
**Low** means track and batch. Include in weekly digests rather than daily alerts.
- Documentation warnings
- Minor schema changes that don’t break any downstream models
- Failures on development or sandbox targets that appear in monitoring due to target misconfiguration
- Warnings on models with zero downstream dependents
## Weighting by Downstream Impact
Tier assignment alone doesn’t capture urgency. A medium-tier failure on a model that feeds ten marts is more urgent than a critical-tier failure on a model that feeds nothing. Downstream impact is a multiplier on the base severity.
The practical rule: check how many mart-layer models are in the downstream dependency tree of the failing model. If the number is high, promote severity by one tier. If the model has zero dependents, demote by one tier.
```
## Severity adjustment for downstream impact

After assigning initial severity, check downstream dependents:
1. Run: dbt ls --select [failing_model_name]+ --resource-type model
2. Count how many results are mart-layer models (start with mrt__)
3. If 3 or more mart models are downstream: increase severity by one tier
4. If zero downstream models: decrease severity by one tier

A critical failure with no downstream dependents becomes high.
A medium failure with 5 downstream marts becomes critical.
```

You can pull downstream information from two places. `dbt ls --select model_name+` gives you the list at runtime (note the trailing `+`, which selects descendants; a leading `+` selects ancestors) but adds time to the monitoring job. `manifest.json` contains the full dependency graph and can be pre-processed into a compact format that the agent reads without executing dbt commands. For large projects, pre-processing the manifest is worth the setup cost.
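As a sketch of that pre-processing step: `manifest.json` includes a `child_map` from each node to its direct downstream nodes, which is enough to count mart dependents without running dbt. The `mrt__` prefix check follows the convention used above, and `count_downstream_marts` is an illustrative name, not a dbt API.

```python
import json
from collections import deque

def count_downstream_marts(manifest_path: str, failing_model: str) -> int:
    """Count mart-layer models anywhere downstream of a failing model."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    # child_map: unique_id -> list of direct downstream unique_ids.
    child_map = manifest["child_map"]

    # Model unique_ids look like "model.<project>.<model_name>".
    start = next(
        (uid for uid in child_map
         if uid.startswith("model.") and uid.split(".")[-1] == failing_model),
        None,
    )
    if start is None:
        return 0

    # Breadth-first walk of the downstream graph.
    seen, queue, marts = {start}, deque([start]), 0
    while queue:
        for child in child_map.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
                if child.startswith("model.") and child.split(".")[-1].startswith("mrt__"):
                    marts += 1
    return marts
```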
## The Role of Test Type in Severity
Not all test types warrant the same default severity. The appropriate response to a source freshness failure is different from the appropriate response to a uniqueness failure, even when both are classified as critical.
Source freshness failures are timing problems, not data quality problems. The underlying data might be perfectly correct — it just hasn’t arrived yet. The first-line response is always “wait and check again” rather than “investigate the transformation logic.”
Uniqueness failures on primary keys are data integrity problems. Duplicate customers in `mrt__sales__customers` mean double-counted revenue. The response is to investigate the deduplication logic in the upstream base models, not to wait.
`not_null` failures on mart models are typically caused by one of two things: the upstream source data genuinely contains nulls (a data quality problem) or the source data hasn’t loaded yet (a timing problem). The source freshness context tells you which. This is why running source freshness as a pre-check before the main test run is worth the extra minute: it gives you the context to interpret `not_null` failures correctly.
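One way to wire that context in, as a sketch: `dbt source freshness` writes its results to `target/sources.json`, which can be cross-referenced before interpreting `not_null` failures. Field handling here is illustrative.

```python
import json

def stale_sources(path: str = "target/sources.json") -> set[str]:
    """Unique_ids of sources whose freshness check warned or errored."""
    with open(path) as f:
        freshness = json.load(f)
    return {
        r["unique_id"]
        for r in freshness["results"]
        if r.get("status") in ("warn", "error")
    }

# If the failing model sits downstream of a stale source (walk
# manifest["parent_map"], the inverse of child_map above), treat the
# not_null failure as a timing problem; otherwise investigate the
# transformation logic.
```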
Singular test failures are the most context-dependent. A singular test encodes a specific business rule, and whether a failure is critical or low depends entirely on what that rule is. If the test is `assert_no_negative_revenue_orders`, that’s critical. If it’s `assert_all_orders_have_a_utm_source`, that might be medium or low depending on how central UTM tracking is to your reporting.
Building test type into the severity logic:
```
## Test type severity modifiers

For source freshness failures:
- Default to High rather than Critical regardless of recency
- Note: "Check upstream connector before investigating dbt models"

For uniqueness failures on mart models:
- Default to Critical
- Note: "Duplicate records likely cause double-counting in downstream reports"

For singular test failures:
- Assign severity based on test name: tests with 'revenue', 'customer', or 'order' in the name default to High; others default to Medium
- Flag for human review if unclear
```
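The same modifiers as code, a sketch under the conventions above (test metadata would come from `manifest.json`; names and keywords are illustrative):

```python
def apply_test_type_modifier(test_type: str, test_name: str, model_name: str) -> str:
    """Default severity by test type, mirroring the modifier rules above."""
    if test_type == "source_freshness":
        # Timing problem: check the upstream connector before dbt models.
        return "high"
    if test_type == "unique" and model_name.startswith("mrt__"):
        # Duplicates likely double-count in downstream reports.
        return "critical"
    if test_type == "singular":
        # Keyword heuristic from the snippet; tune to your reporting.
        if any(k in test_name for k in ("revenue", "customer", "order")):
            return "high"
        return "medium"
    return "medium"  # unclear: flag for human review
```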
## Historical Context as a Severity Modifier

A test failing for the first time warrants different urgency than a test that has failed intermittently for two weeks. First-occurrence failures are more likely to be real, novel problems requiring immediate investigation. Recurring failures might be known issues being tracked, vendor data delays that happen on specific days, or tests that should be updated or deleted.
When persistent memory is available — either through an AI agent’s memory system or a structured failure history table — annotate each failure with temporal context:
- First occurrence: highest urgency, treat as new and unknown
- Recurring failure: note the pattern (“failing 4 of last 7 days”), lower immediate urgency, flag for structural resolution
- Previously investigated: include your prior conclusion (“noted March 18: expected on weekends due to vendor batch timing”)
- New test, never passed: flag as likely configuration issue rather than data issue
This context doesn’t change the underlying tier but changes what you do about it. A critical, first-occurrence uniqueness failure gets investigated immediately. A critical, recurring uniqueness failure on a model you triaged last Tuesday gets checked against your prior notes before you start a new investigation.
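Where no agent memory system exists, a flat file is enough for this bookkeeping. A minimal sketch, assuming a hypothetical `failure_history.json` keyed by test unique_id (the file and its layout are illustrative, not a dbt artifact):

```python
import datetime
import json

def annotate_with_history(unique_id: str, history_path: str = "failure_history.json") -> dict:
    """Attach temporal context from a JSON history file.

    Expected layout (illustrative):
    {unique_id: {"dates_failed": ["2025-03-18", ...], "notes": "..."}}
    """
    try:
        with open(history_path) as f:
            history = json.load(f)
    except FileNotFoundError:
        history = {}

    record = history.get(unique_id)
    if record is None:
        return {"context": "first occurrence", "handling": "treat as new and unknown"}

    today = datetime.date.today()
    last_week = {today - datetime.timedelta(days=d) for d in range(7)}
    recent = sum(
        1 for d in record.get("dates_failed", [])
        if datetime.date.fromisoformat(d) in last_week
    )
    return {
        "context": f"failing {recent} of last 7 days",
        "handling": "lower immediate urgency, flag for structural resolution",
        "prior_notes": record.get("notes"),
    }
```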
## Applying the Framework in Skill Instructions
If you’re using an AI agent to apply this framework automatically, give it the rules explicitly. Agents follow instruction lists more reliably than they synthesize rules from descriptions:
```
## Severity classification

For each test failure, assign a tier:

CRITICAL (act immediately):
- Test type is `unique` or `not_null` on a model starting with `mrt__`
- Source freshness more than 24 hours overdue
- Test name references a dashboard or client-facing model (look in model meta)

HIGH (investigate today):
- Test type is `relationships` on any layer
- `not_null` on intermediate models with mart dependents
- Source freshness 6-24 hours overdue

MEDIUM (investigate this week):
- `accepted_values` violations on any layer
- Row count anomalies
- Failures on intermediate models with no mart dependents

LOW (weekly digest):
- Warnings
- Schema changes on models with no downstream dependents
- Tests on non-production targets
```

The specificity here is deliberate. “High-impact failures” is not a useful instruction for an AI agent. “Test type is `unique` or `not_null` on a model starting with `mrt__`” is.
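The same rules translate directly into code if a script applies the framework instead of an agent. A sketch under the conventions above (`mrt__` prefix for marts; parameter names are illustrative):

```python
def classify(test_type: str, model_name: str, layer: str,
             freshness_hours_overdue: float | None = None,
             references_dashboard: bool = False,
             has_mart_dependents: bool = False) -> str:
    """Base tier assignment mirroring the skill instructions above."""
    if references_dashboard:
        return "critical"
    if test_type in ("unique", "not_null") and model_name.startswith("mrt__"):
        return "critical"
    if freshness_hours_overdue is not None:
        if freshness_hours_overdue > 24:
            return "critical"
        return "high" if freshness_hours_overdue >= 6 else "medium"
    if test_type == "relationships":
        return "high"
    if test_type == "not_null" and layer == "intermediate" and has_mart_dependents:
        return "high"
    if test_type == "accepted_values":
        return "medium"
    if layer == "intermediate" and not has_mart_dependents:
        return "medium"
    return "low"
```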
## What This Framework Is Not
This framework is for runtime triage — ranking the failures from a specific dbt test run. It’s separate from the question of how to route those failures to the right people once ranked, which is handled by dbt Test Alert Routing and Ownership, and from the question of which tests to write in the first place, which is covered by dbt Testing Strategy by Layer.
The framework does not replace human judgment for novel failure patterns. An agent applying these rules will occasionally misclassify; the intent is to handle the straightforward triage automatically so human review focuses on genuinely ambiguous cases.