
Elementary alert fatigue reduction

How to configure suppression intervals, alert grouping, and sampling controls in Elementary to keep signal-to-noise ratio high as test suites grow.

Tags: dbt, elementary, data quality, automation

This note covers Elementary-specific configuration for controlling how alerts are generated — suppression intervals, grouping, content trimming, and sampling controls. dbt Test Alert Routing and Ownership covers the organizational side: routing alerts to the right people and keeping the test suite curated.

Suppression intervals

The most common source of alert fatigue is a test that fails and keeps failing. If a data quality issue can’t be fixed immediately — maybe the source system is degraded, or the fix requires coordination with another team — edr monitor will fire the same alert on every run. In a pipeline that runs hourly, that’s 24 alerts per day for one problem.

alert_suppression_interval sets a minimum time between alerts for the same failing test:

meta:
  alert_suppression_interval: 24 # hours

With this set, a test that fails at 9am won’t alert again until 9am the next day, regardless of how many runs happen in between. The first alert is still sent; subsequent ones are held until the interval expires.

Configure at the test, model, or project level. Project-wide defaults make sense if you want a global policy:

dbt_project.yml
models:
  your_project:
    +meta:
      alert_suppression_interval: 12

Individual models can override this with a shorter or longer interval based on their sensitivity. A revenue model that needs immediate attention on every failure might have a 1-hour interval. A backfill model that’s expected to show anomalies during a migration might have a 48-hour interval.
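A sketch of what those per-model overrides might look like in a schema file — the model names here are illustrative, but the `alert_suppression_interval` key is the same one shown above:

```yaml
# schema.yml — per-model suppression overrides (model names are hypothetical)
models:
  - name: fct_revenue              # sensitive: re-alert quickly
    meta:
      alert_suppression_interval: 1   # hours
  - name: stg_legacy_backfill      # anomalies expected during migration
    meta:
      alert_suppression_interval: 48  # hours
```

The model-level meta takes precedence over the project-wide `+meta` default, so the global policy only applies where nothing more specific is set.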

Alert grouping

Cascading failures are particularly noisy. When an upstream source goes down, every model that depends on it fails. Without grouping, you get one alert per failed test — potentially dozens of messages for a single root cause.

edr monitor --group-by table

Instead of 10 separate alerts for 10 failed tests on the same table, you get one message listing all the failures. The alert still tells you everything you need to know, but your channel doesn’t fill up with near-identical messages.

You can set a threshold to control when grouping kicks in:

edr monitor --group-alerts-threshold 5

Below 5 failures, send individual alerts. Above 5, consolidate into a single grouped message. This preserves the detail of individual alerts for small numbers of failures while preventing floods during larger incidents.
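The two grouping flags compose in a single invocation. A sketch using only the flags shown above (assumes edr is installed and already configured for your project):

```shell
# Consolidate alerts by table, but only once at least 5 would otherwise fire
edr monitor --group-by table --group-alerts-threshold 5
```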

Controlling alert content

Not every field in an Elementary alert adds value. By default, alerts include a standard set of metadata. You can trim this to the fields that are actually useful for your team:

meta:
  alert_fields: ["description", "owners", "tags", "subscribers"]

Remove fields that generate noise without aiding triage. If your team never uses the “resource type” field to make decisions, taking it out reduces visual clutter and makes the fields that matter easier to spot quickly.

Handling sensitive data

By default, Elementary includes sample rows from failing tests in Slack alerts. This is useful for debugging — seeing that three rows have null customer_id values is more actionable than knowing the test failed. But for tables containing PII, sending sample data to a Slack channel creates a compliance problem.

Disable samples globally:

edr monitor --disable-samples

Or per model:

models:
  - name: mrt__customers__personal_info
    meta:
      disable_samples: true

Per-model configuration is worth the extra effort. Most tables don’t contain PII, and for those, sample data is genuinely useful; disabling samples globally throws away that debugging benefit everywhere.
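Putting the pieces from this note together for a single sensitive model, a sketch combining the configs shown above (the model name is illustrative):

```yaml
# schema.yml — one PII table with samples off, daily suppression,
# and trimmed alert fields (keys are the same ones shown earlier)
models:
  - name: mrt__customers__personal_info
    meta:
      disable_samples: true            # never post row samples to Slack
      alert_suppression_interval: 24   # at most one alert per day
      alert_fields: ["description", "owners", "tags", "subscribers"]
```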

What Elementary Cloud adds

Elementary Cloud handles one problem that OSS can’t easily solve: automatic incident grouping across runs. When a new failure is related to an open incident — same table, same test type, same time window — Cloud groups it into the existing incident rather than creating a new one. Successful runs automatically close incidents, so you don’t accumulate a backlog of stale open issues that need manual cleanup.

For teams running more than 50-100 tests and finding that their Slack channels are still noisy despite suppression intervals and grouping, this automated incident management is usually what shifts the experience from “alert system we have to manage” to “alert system that manages itself.”

The operational threshold for when Cloud’s incident management earns its cost over OSS configuration tuning is covered in the observability tool landscape and the scaling thresholds notes.