
Data team on-call strategies

How data teams structure on-call rotations, triage processes, and runbooks differently from software engineering on-call, and which metrics reveal whether the system is working.

Tags: data quality, data engineering

Data team on-call differs from software engineering on-call. A software engineer on-call is typically shielded from other responsibilities during rotation. Data teams usually are not: the same person handling a data quality incident may also attend a sprint review, answer stakeholder questions, and continue pipeline work on the same day. The alerting mechanics (covered in Elementary alert routing with filters and Elementary alert fatigue reduction) only work when the human process around those alerts is designed for this reality.

Triage: categorizing before acting

A clear severity framework prevents two opposite failure modes: treating everything as urgent (burns people out) or treating nothing as urgent (stakeholders notice problems before the team does).

Severity | Criteria                                      | Response           | Channel
---------|-----------------------------------------------|--------------------|------------------------
Critical | SLA breach, production outage, revenue impact | Immediate response | PagerDuty or equivalent
Warning  | Quality degradation, potential issues         | Next business day  | Slack
Info     | Logged for review, no action needed           | Weekly review      | None

The routing decisions here connect directly to how you configure filter-based alert routing. Critical failures map to hard failure status (statuses:fail) combined with a criticality tag. Warning-severity failures map to statuses:warn. Info-level items might not generate real-time alerts at all — just appear in a weekly report or dashboard review.
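As a sketch, that mapping can be encoded with dbt tags that the alert filters match on. The tag value, meta key, and channel name below are illustrative rather than Elementary's exact schema; check your alerting configuration for the keys it actually reads:

data_tests:
  - not_null:
      column_name: order_id
      config:
        tags: ['critical']  # matched by a statuses:fail + tags:critical filter
        meta:
          channel: data-incidents-pagerduty  # hypothetical destination for Critical alerts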

The categorization matters most when an incident arrives at an inconvenient time. Without a triage framework, every alert implicitly demands immediate attention. With one, the on-call person can look at the alert, confirm it’s a Warning, and schedule investigation for the next business day rather than context-switching mid-task.

Runbooks in test metadata

The biggest time cost during an incident isn’t fixing the problem — it’s figuring out what the problem is and where to look. Runbooks solve this. The question is where to put them so they’re actually accessible when someone needs them.

Embedding runbook links directly in test metadata means they appear in the alert:

data_tests:
  - unique:
      column_name: customer_id
      config:
        meta:
          description: |
            Duplicate customer IDs detected.
            Runbook: https://docs.company.com/data/customer-dedup
            Contact: @data-platform-team

When this test fails, the alert includes the description field with the runbook link. New team members can resolve the issue without knowing the codebase or asking who to contact. The on-call rotation stops requiring institutional memory.

Good runbooks for data incidents typically cover:

  • What the failing test is checking and why it matters
  • Common root causes for this specific failure
  • Which upstream system to check first
  • Where to find relevant logs or monitoring dashboards
  • Escalation path if the runbook doesn’t resolve it

Short and specific beats comprehensive and generic. A runbook that addresses the actual failure modes you’ve seen is more useful than a thorough document that takes 20 minutes to read.
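Applied to the metadata pattern above, those five points compress into a few lines of test description. Every URL, system name, and contact below is illustrative:

data_tests:
  - unique:
      column_name: customer_id
      config:
        meta:
          description: |
            Checks: exactly one row per customer_id; duplicates break downstream joins.
            Common causes: replayed CDC batches, backfills run twice.
            Check first: ingestion job logs for the customers source.
            Dashboards: https://grafana.company.com/d/ingestion
            Escalation: @data-platform-team if unresolved after 30 minutes.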

Metrics that reveal whether on-call is working

Four metrics tell you whether your alerting and on-call process is healthy or degrading:

Alert volume is the baseline. If it’s growing week over week without a corresponding growth in the number of models you’re monitoring, something is generating noise — probably tests that should be tuned, suppressed, or deleted.

False positive rate is the one that erodes trust. When alerts fire for conditions the team has decided are acceptable, people learn to ignore them. Track what percentage of alerts result in actual changes (fixes, tuned thresholds, or explicit “we accept this” decisions). A high percentage of “no action taken” alerts is a signal that test thresholds need work.

Mean time to acknowledge (MTTA) measures how quickly someone responds. If MTTA is measured in hours rather than minutes, alerts aren’t reaching the right people, aren’t reaching them in a channel they watch, or there are too many for anyone to feel urgency.

Mean time to resolution (MTTR) reveals whether runbooks and documentation are working. Long MTTR often means the on-call person has to investigate from scratch each time: there's no runbook, the runbook is outdated, or the issue doesn't recur often enough for institutional memory to form. When MTTR is consistently long for the same recurring issue, write the runbook.
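All four metrics can be computed from an alert log. A sketch in SQL, assuming a hypothetical alert_log table with fired_at, acknowledged_at, and resolved_at timestamps plus an outcome column recording the responder's decision (datediff follows Snowflake syntax; adjust for your warehouse):

-- Weekly on-call health metrics from a hypothetical alert_log table
select
    date_trunc('week', fired_at)                                as week,
    count(*)                                                    as alert_volume,
    avg(case when outcome = 'no_action' then 1.0 else 0.0 end)  as false_positive_rate,
    avg(datediff('minute', fired_at, acknowledged_at))          as mtta_minutes,
    avg(datediff('minute', fired_at, resolved_at))              as mttr_minutes
from alert_log
group by 1
order by 1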

The dbt Test Alert Routing and Ownership note covers the complementary practice: turning every incident that no existing test caught into a permanent test. MTTR improves when repeat incidents are prevented; MTTA improves when alerts are routed and formatted correctly.

Rotation structure for data teams

A few patterns that account for the reality of data team responsibilities:

Pairing on-call with development sprints works well. The on-call engineer handles incidents during their rotation and uses any slack time on improvements that reduce future incidents — tuning test thresholds, writing runbooks for incidents that had none, adding suppression intervals to noisy tests. This keeps the on-call work from feeling purely reactive.
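One example of that improvement work is widening the suppression window on a noisy test so it alerts once a day instead of on every run. Elementary exposes an alert_suppression_interval setting (in hours) that can be set per test via meta; treat the exact placement below as a sketch to verify against your version:

data_tests:
  - elementary.volume_anomalies:
      config:
        meta:
          alert_suppression_interval: 24  # hours; suppress repeat alerts for a day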

Weekly rotation with a handoff document prevents context loss. The outgoing on-call engineer documents open issues, recurring problems, and anything the incoming engineer should know. This is especially important for issues that are “under investigation” — where someone spent an hour understanding the problem and the next person would have to start from scratch without a handoff.

Tiered response works for larger teams. Junior engineers handle initial triage and resolution using runbooks. Complex issues or anything without a runbook escalate to senior engineers. This both distributes on-call load and gives junior engineers exposure to incident response in a structured way.

For small data teams where everyone is on-call by default, making the implicit explicit reduces friction: define what “on-call” means, what the expected response time is, and what counts as a critical issue that warrants interrupting focused work.