
Self-healing risk tiering

A framework for deciding which pipeline failures can self-heal automatically, which need human approval, and which should never be auto-remediated.

data engineering · data quality · ai

Risk tiering is a framework for deciding which pipeline failures can self-heal automatically, which need human approval before a fix is applied, and which should remain human-only. Not every failure should self-heal: auto-correcting a changed delimiter on an ingested file has a contained blast radius; a financial calculation that starts returning different results because an LLM “fixed” a schema change does not.

Three tiers

Low risk: auto-apply

These failures have predictable fixes, contained blast radius, and easy validation. The pipeline can apply the fix and continue without human involvement.

  • Delimiter corrections on ingested files
  • Character encoding issues (UTF-8 vs Latin-1, BOM handling)
  • Retries on transient errors (API timeouts, rate limits, network blips)
  • Date format adjustments on file imports
  • Header row detection (skip 0 vs skip 1)

The common thread: the fix is mechanical, the validation is straightforward (did the file parse? do the row counts match?), and getting it wrong produces an obvious error rather than subtly wrong data. A file parsed with the wrong delimiter produces garbage that fails downstream validation. A file parsed with a slightly wrong business rule produces numbers that look plausible but are incorrect. The first category is low risk. The second is not.

For low-risk fixes, the Try-Heal-Retry pattern works well. Let the LLM diagnose the issue, apply the structured fix, and validate the output. Log everything so someone can review the auto-fixes in a daily or weekly audit.
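
A minimal sketch of that loop, with a simple heuristic standing in for the LLM call (the `diagnose_delimiter` helper, the whitelist, and the validation check are all illustrative, not a prescribed implementation):

```python
import csv
import io

WHITELISTED_DELIMITERS = {",", "|", ";", "\t"}

def diagnose_delimiter(sample: str) -> dict:
    """Stand-in for the LLM diagnosis call.

    A real implementation would send the error and a structural sample to the
    model and parse a structured response; here a simple heuristic plays that role.
    """
    counts = {d: sample.count(d) for d in WHITELISTED_DELIMITERS}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        return {"action": "cannot_fix", "reason": "no known delimiter found in sample"}
    return {"action": "set_delimiter", "delimiter": best}

def try_heal_retry(raw_text: str, delimiter: str = ",", max_attempts: int = 2) -> list[list[str]]:
    for attempt in range(max_attempts):
        rows = list(csv.reader(io.StringIO(raw_text), delimiter=delimiter))
        widths = {len(r) for r in rows if r}
        # Crude validation: every row parsed to the same column count, and more than one column.
        if len(widths) == 1 and widths != {1}:
            return rows
        fix = diagnose_delimiter(raw_text.splitlines()[0] if raw_text else "")
        if fix["action"] == "cannot_fix":
            raise ValueError(f"unfixable parse failure: {fix['reason']}")
        delimiter = fix["delimiter"]  # apply the structured fix and retry
        print(f"auto-heal attempt {attempt + 1}: retrying with delimiter {delimiter!r}")  # audit log
    raise ValueError("parse failed after auto-heal attempts")
```

The shape is what matters: a bounded number of attempts, a whitelist of mechanical fixes, a hard failure when the diagnosis comes back empty, and a log entry for every heal so the weekly audit has something to review.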

Medium risk: notify and wait

These failures need a fix, and the automation can suggest one, but a human should approve it before it’s applied.

  • Schema adaptation (new columns, type changes in source tables)
  • SQL modifications to transformation logic
  • Backfill parameter changes (adjusting lookback windows, changing partition ranges)
  • Source system connection changes (endpoint URLs, authentication parameters)

The pattern: the pipeline captures the failure context, sends it to the LLM for diagnosis, formats the suggested fix as a Slack message (or a PR), and waits for explicit approval. The automation does the investigation and drafts the solution. The human confirms it makes sense.

🔧 Suggested fix for task `load_vendor_invoices`:
Source file changed delimiter from comma to pipe.
Recommended: update delimiter config to `|`
Confidence: high (4 lines parsed successfully with new delimiter)
React ✅ to apply, ❌ to skip

The Slack-approval pattern works for teams with fast response times. For teams where approval might take hours, consider having the pipeline continue with a degraded mode (skip the problematic source, process other sources normally) rather than blocking everything while waiting for a thumbs-up.
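
A sketch of that degraded-mode run, assuming a Slack incoming webhook for the notification (the webhook URL, message format, and source layout are placeholders; acting on the ✅ reaction would additionally need the Slack Events API and is not shown):

```python
import json
import urllib.request
from typing import Callable

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"  # placeholder incoming webhook

def post_suggested_fix(task: str, diagnosis: dict) -> None:
    """Post the drafted fix to Slack for human approval (incoming-webhook style message)."""
    text = (
        f":wrench: Suggested fix for task `{task}`:\n"
        f"{diagnosis['summary']}\n"
        f"Recommended: {diagnosis['recommendation']}\n"
        f"Confidence: {diagnosis['confidence']}\n"
        "React :white_check_mark: to apply, :x: to skip"
    )
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)

def run_ingestion(sources: dict[str, Callable[[], None]]) -> None:
    """Degraded-mode run: a failing source is skipped and flagged; the others still load."""
    for name, load in sources.items():
        try:
            load()
        except Exception as err:
            diagnosis = {                      # stand-in for the LLM diagnosis of `err`
                "summary": str(err),
                "recommendation": "(LLM-drafted fix goes here)",
                "confidence": "medium",
            }
            post_suggested_fix(name, diagnosis)
            # Skip this source rather than blocking the whole run on a thumbs-up.
            continue
```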

High risk: human only

These failures should never be auto-remediated, regardless of how confident the LLM is in its diagnosis.

  • Production schema migrations
  • Anything touching financial data, compliance data, or regulated reporting
  • Data deletion or correction operations
  • Changes to primary key logic or grain changes
  • Cross-system reconciliation adjustments

The reasoning: the blast radius is too large, the validation is too complex, and the consequences of a wrong fix are too severe. An LLM that “fixes” a revenue calculation by changing a join condition might produce numbers that pass row count validation and schema checks while being fundamentally wrong. The error wouldn’t surface until someone compares the output to a known-good source, which might be weeks later.

For high-risk situations, the automation’s role is diagnosis, not remediation. Send the error context to the LLM, get a structured analysis of what went wrong and possible causes, deliver that analysis to the on-call engineer, and stop. The engineer investigates, confirms the root cause, implements the fix manually, and reviews the output. See the discussion of why AI still needs humans in data engineering for the broader argument.
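
A sketch of that diagnosis-only path, with placeholders where the LLM call and the paging integration would go (the `Diagnosis` shape and names are illustrative):

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class Diagnosis:
    probable_cause: str
    evidence: list[str]
    suggested_next_steps: list[str]

def page_on_call(task: str, analysis: Diagnosis) -> None:
    """Stand-in for the alerting integration (PagerDuty, Opsgenie, Slack, ...)."""
    print(f"[PAGE] {task}\n{json.dumps(asdict(analysis), indent=2)}")

def diagnose_only(task: str, error_context: dict) -> None:
    """High-risk path: produce and deliver an analysis, then stop. No fix is applied."""
    # Stand-in for the LLM call: in practice, send `error_context` (schema-level detail,
    # no raw values) and ask for a structured analysis matching the Diagnosis shape.
    analysis = Diagnosis(
        probable_cause="(LLM analysis goes here)",
        evidence=[error_context.get("error", "")],
        suggested_next_steps=["(LLM-suggested investigation steps go here)"],
    )
    page_on_call(task, analysis)
    # Deliberately no retry and no auto-applied fix: the engineer takes it from here.
    raise RuntimeError(f"{task}: high-risk failure, manual remediation required")
```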

Dangers that cross tiers

A few specific risks deserve attention regardless of which tier a failure falls into.

Self-healing masking code bugs

If your pipeline fails every Tuesday and the catch-up pattern silently recovers every Wednesday, you might not notice the underlying issue for weeks. The self-healing worked perfectly in the narrow sense (data gaps were filled) while hiding a systematic problem (something about Tuesday’s data or Tuesday’s processing schedule is broken).

The fix: log and alert on healed failures, not just unresolved ones. A weekly summary of “failures that self-healed” surfaces patterns that deserve investigation even though they didn’t cause downstream impact. If the same failure appears five weeks in a row and self-heals each time, that’s a bug, not a transient issue.
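
A small sketch of that audit, assuming heal events are logged with a task name, a failure signature, and a timestamp (the entry shape and thresholds are illustrative):

```python
from collections import Counter
from datetime import datetime, timedelta

def recurring_self_heals(heal_log: list[dict], weeks: int = 5, threshold: int = 3) -> list[str]:
    """Flag failure signatures that keep self-healing: likely bugs, not transient issues.

    Each log entry is assumed to look like
    {"task": "load_vendor_invoices", "failure_signature": "delimiter_mismatch",
     "healed_at": datetime(...)}.
    """
    cutoff = datetime.now() - timedelta(weeks=weeks)
    recent = [e for e in heal_log if e["healed_at"] >= cutoff]
    counts = Counter((e["task"], e["failure_signature"]) for e in recent)
    return [f"{task}: {sig} self-healed {n} times in the last {weeks} weeks"
            for (task, sig), n in counts.most_common() if n >= threshold]
```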

LLM hallucination of valid data

LLMs can generate plausible-looking output from garbage input. If a file is truly corrupt, the right answer is to fail, not to generate something that looks correct. Your structured output schema should include a `cannot_fix` option that stops the retry loop rather than forcing the LLM to always suggest a fix.
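
One way to express the escape hatch, sketched here as a Pydantic-style structured-output model (the field names and action list are illustrative):

```python
from typing import Literal, Optional
from pydantic import BaseModel

class SuggestedFix(BaseModel):
    """Structured output the LLM must return. `cannot_fix` is the escape hatch."""
    action: Literal["set_delimiter", "set_encoding", "skip_header_rows", "cannot_fix"]
    delimiter: Optional[str] = None
    encoding: Optional[str] = None
    skip_rows: Optional[int] = None
    reasoning: str
    confidence: Literal["low", "medium", "high"]

def handle_fix(fix: SuggestedFix) -> None:
    if fix.action == "cannot_fix":
        # Fail loudly instead of letting the model invent a "plausible" parse.
        raise ValueError(f"LLM declined to fix: {fix.reasoning}")
    # ...otherwise apply the whitelisted fix and re-enter the retry loop...
```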

Without this escape hatch, the LLM will try. Given a binary file and asked for a delimiter fix, it might suggest tab or pipe or some other character that produces parseable but meaningless output. The downstream pipeline sees correctly shaped data and proceeds. The data is wrong, and nothing catches it until a human notices the reports look off.

PII exposure in diagnosis

Even sending four lines of a file to an external LLM may include personal data. Benjamin Nweke’s warning is worth repeating: “Do not send patient data to GPT-4 just to fix a comma error.”

The options for mitigating this:

  • Strip sample data to schema-only information. Send column names and types, not values. The LLM can often diagnose delimiter and encoding issues from the structural pattern alone.
  • Use local models. Run diagnosis through Ollama or another local inference engine for any pipeline that processes PII. The accuracy is lower, but the data never leaves your infrastructure.
  • Redact before sending. Replace values with type-appropriate placeholders (dates become “2026-01-01”, names become “REDACTED”, numbers stay as-is). The structural pattern is preserved while PII is removed.
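
A rough sketch of that redaction step, using crude regexes purely for illustration (a real implementation would key off the pipeline's own column classification rather than pattern-matching values):

```python
import re

def redact_sample(lines: list[str]) -> list[str]:
    """Replace likely PII with type-appropriate placeholders before a sample leaves the pipeline."""
    redacted = []
    for line in lines:
        line = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "2026-01-01", line)       # dates -> fixed placeholder
        line = re.sub(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", "REDACTED", line)   # naive "First Last" names
        # Plain numbers are left as-is; they carry the structural pattern the LLM needs.
        redacted.append(line)
    return redacted
```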

The right approach depends on your data classification. Marketing event data with no PII can go to an API. Healthcare data with patient records stays local. Financial data with account numbers gets redacted. Make the classification explicit in your pipeline configuration rather than leaving it to case-by-case judgment.
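
A sketch of what making that classification explicit could look like (the pipeline names, classes, and routing labels are hypothetical):

```python
from enum import Enum

class DataClass(Enum):
    PUBLIC = "public"          # no PII: samples may go to an external API
    INTERNAL = "internal"      # sensitive values: redact before sending
    RESTRICTED = "restricted"  # PII / regulated data: local models, schema-only samples

# Hypothetical per-pipeline configuration; set once, not judged case by case.
PIPELINE_CLASSIFICATION = {
    "marketing_events": DataClass.PUBLIC,
    "account_transactions": DataClass.INTERNAL,
    "patient_visits": DataClass.RESTRICTED,
}

def diagnosis_route(pipeline: str) -> str:
    """Pick the diagnosis path from the explicit classification."""
    data_class = PIPELINE_CLASSIFICATION[pipeline]
    if data_class is DataClass.RESTRICTED:
        return "local_model_schema_only"
    if data_class is DataClass.INTERNAL:
        return "external_api_redacted"
    return "external_api_full_sample"
```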

Maturity context

ServiceNow’s 2025 Enterprise AI Maturity Index found fewer than 1% of organizations scored above 50 out of 100. Gartner reported that 30% of AI initiatives are abandoned, primarily due to data quality issues. Self-healing works for specific, well-scoped failure types — predominantly the low-risk tier — not as a blanket solution.

A practical starting approach: classify pipeline failures by risk tier, implement the low-risk tier first using the Try-Heal-Retry pattern, measure results, and expand to medium-risk with human-in-the-loop approval only after the low-risk tier behavior is validated.