Note

dbt Test Alert Routing and Ownership

How to route dbt test failures to the right people, configure tiered alert severity, and apply the Broken Window principle to test suite health.

Planted
dbt · testing · data quality · automation

The gap between having dbt tests and having an effective test system is routing: getting failures to the people who can fix them, at the right urgency level, without generating so much volume that alerts become background noise.

The Broken Window Problem

The Broken Window Theory applied to data: when a team accepts a few perpetual test failures, they quickly accept more. Within months, the entire test suite is background noise. Real issues slip through because the signal-to-noise ratio has collapsed.

The discipline is simple but requires enforcement: never tolerate chronically failing tests. When a test fails, you have exactly four options:

  1. Fix the underlying issue in the data or the transformation logic
  2. Update test expectations if the original threshold was wrong or the business definition changed
  3. Tag as under-investigation with a specific deadline and named owner
  4. Delete the test if it fires on known-acceptable conditions and provides no actionable value

What you cannot do is leave it failing indefinitely and move on. That’s the choice that breaks the system.

Option 4 — deletion — is underused. Teams often feel that deleting a test is admitting defeat. It’s not. A test that always fails is worse than no test: it normalizes the failure state, trains the team to ignore alerts, and obscures real failures when they occur. If a test consistently fires on conditions you’ve decided are acceptable, delete it and write the test you actually want.
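
Option 3 works best when the investigation state lives in config rather than in someone's head. A minimal sketch; the meta keys (status, deadline) are conventions you define yourself rather than built-in dbt fields, and the column name and date are placeholders:

models:
  - name: mrt__finance__revenue
    columns:
      - name: invoice_id
        data_tests:
          - unique:
              config:
                severity: warn                     # downgraded while under investigation
                meta:
                  status: "under-investigation"    # convention, not a built-in dbt field
                  owner: "finance-analytics"
                  deadline: "2025-07-01"           # placeholder; set a real date and hold to it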

Conditional Severity

The error vs. warn distinction matters. An error blocks execution; a warn logs a message and continues. The pattern that works in practice: error in CI, warn in production.

data_tests:
  - unique:
      config:
        severity: "{{ 'error' if target.name == 'ci' else 'warn' }}"

This gives you blocking failures during code review — when you can actually fix the cause — without halting production pipelines for conditions that might be transient or expected.

For tests where both early warning and hard failure thresholds make sense, conditional severity thresholds create a two-stage system:

data_tests:
  - not_null:
      config:
        severity: error
        error_if: ">1000"
        warn_if: ">10"

This fires a warning when you exceed 10 nulls (worth investigating) and a hard error when you exceed 1000 (something is genuinely broken). The thresholds should reflect what’s normal for your data: a model with 10 million rows might have a different acceptable null rate than one with 1000 rows.
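
When absolute counts stop being comparable as a table grows, dbt's where test config can scope the check to a recent window so the thresholds keep meaning the same thing. A sketch, assuming a created_at column exists and using warehouse-specific date arithmetic:

data_tests:
  - not_null:
      config:
        where: "created_at >= current_date - 7"   # only test the most recent week of rows
        severity: error
        error_if: ">1000"
        warn_if: ">10"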

Ownership via Meta Tags

Routing alerts to the right team requires ownership metadata. Tag models with owner and criticality in their meta configuration:

models:
  - name: mrt__finance__revenue
    config:
      meta:
        owner: "finance-analytics"
        criticality: "high"
  - name: mrt__marketing__campaigns
    config:
      meta:
        owner: "marketing-analytics"
        criticality: "medium"

These meta fields are queryable — either directly in your warehouse via the dbt_models table (if using Elementary) or through dbt’s artifacts. They form the basis for routing logic in your alerting infrastructure.

Tiered Alert Routing

With ownership metadata in place, a tiered routing structure directs alerts to the right channel at the right urgency:

Criticality   Failure mode                         Routing
Critical      Error during pipeline run            PagerDuty / on-call rotation
High          Test failure on mart model           Slack channel + direct message to owner
Medium        Test failure on intermediate model   Team Slack channel
Low           Warning-severity test                Daily digest email

The routing logic depends on your monitoring setup. With Elementary, the meta.channel field controls which Slack channel receives alerts per model, and alert_suppression_interval prevents repeated alerts for the same persistent failure.
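
In config form, that might look like the sketch below. The routing fields sit alongside the ownership metadata from the previous section; exact key names depend on your Elementary version, so treat these as illustrative:

models:
  - name: mrt__finance__revenue
    config:
      meta:
        owner: "finance-analytics"
        criticality: "high"
        channel: "finance-data-alerts"       # Slack channel Elementary routes this model's alerts to
        alert_suppression_interval: 24       # hours to wait before re-alerting on the same failure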

Without Elementary, you can build basic routing by parsing dbt’s run_results.json artifact post-run and sending alerts based on the model metadata. It’s more engineering work but achieves the same effect.

The goal of tiered routing isn’t bureaucracy — it’s signal quality. If everyone gets every alert, no one acts on any alert. If the finance team only sees alerts about finance models, and those alerts come in at a volume they can process, they’ll actually respond to them.

Source Freshness Linked to SLAs

Source freshness checks are a specific category where severity configuration needs to reflect actual stakeholder commitments. If you’ve promised stakeholders data within 6 hours, the error threshold should be 6 hours — not 24 hours because “that’s what we set everywhere.”

The pattern: check freshness at least twice as often as your tightest SLA. If you commit to 6-hour freshness, run freshness checks every 3 hours minimum. This gives you time to diagnose and fix issues before the SLA breach becomes customer-visible.

sources:
  - name: salesforce
    freshness:
      warn_after: {count: 4, period: hour}   # warn at 4h, error at 6h SLA
      error_after: {count: 6, period: hour}
    tables:
      - name: opportunities
        loaded_at_field: systemmodstamp

Freshness thresholds that don’t correspond to actual SLAs are just arbitrary numbers. Ground them in what you’ve actually committed to, then set the warn threshold slightly before the error threshold so you have time to respond.
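
The check frequency itself lives in your orchestrator, not in dbt. As a hypothetical example using GitHub Actions (any scheduler works), running dbt source freshness every 3 hours against a 6-hour SLA:

name: source-freshness
on:
  schedule:
    - cron: "0 */3 * * *"                    # every 3 hours: twice as often as the 6-hour SLA
jobs:
  freshness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-snowflake       # adapter is an assumption; install the one for your warehouse
      - run: dbt deps && dbt source freshness
        env:
          DBT_PROFILES_DIR: .                # assumes a CI profiles.yml is checked into the repo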

Incident-Driven Test Coverage

When an observability alert fires for an anomaly not covered by an existing test, convert it into a permanent dbt test. The cycle: anomaly detection tool flags an issue → on-call investigates → root cause identified → explicit dbt test written → test added to the permanent suite.

This shifts test coverage from speculative (what might break?) to empirical (what has broken). Writing the test before closing the ticket is what makes the suite grow toward the actual failure modes of the pipeline.
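
As a hypothetical example: if the root cause behind an anomaly alert was negative amounts leaking into a revenue mart, the permanent test might be an explicit range check. This assumes the dbt_utils package is installed; the column name and ticket reference are placeholders:

models:
  - name: mrt__finance__revenue
    columns:
      - name: order_total
        data_tests:
          - dbt_utils.accepted_range:
              min_value: 0
              config:
                severity: error
                meta:
                  incident: "INC-1234"       # placeholder for the ticket this test traces back to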