When a pipeline failure can’t be resolved by simple retries or schema adaptation, the next level of self-healing involves sending failure context to an LLM and getting back a structured fix. Benjamin Nweke’s architecture (published in Towards Data Science, January 2026) lays out a clean reference for this pattern.
The pattern has five steps:
- The pipeline attempts processing normally
- On failure, it captures “crime scene evidence”: the traceback, the first few lines of the input data, and relevant metadata
- It sends this context to an LLM with a structured output schema
- The LLM returns corrective parameters as JSON (not free-text suggestions, but actionable configuration changes)
- The pipeline retries with the corrections applied
The design choices in each step matter more than the pattern’s overall shape.
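Stitched together, the five steps reduce to a short driver loop. This is a sketch of the overall shape, not the article's code; `process` and `diagnose` stand for whatever callables your pipeline supplies:

```python
def run_with_self_healing(process, diagnose, source: str, max_attempts: int = 3):
    """Steps 1-5 of the pattern as a plain loop: try, capture, diagnose, retry."""
    fix = None
    for _ in range(max_attempts):
        try:
            return process(source, fix)      # step 1: attempt normally
        except Exception as error:
            evidence = {                     # step 2: crime scene evidence
                "error": repr(error),
                "sample": source[:200],
            }
            fix = diagnose(evidence)         # steps 3-4: structured fix from the LLM
            # step 5: loop around and retry with the fix applied
    raise RuntimeError(f"could not heal after {max_attempts} attempts")
```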
Structured output is non-negotiable
The most important design decision is forcing structured JSON responses from the LLM. Free-text advice like “try changing the delimiter to a pipe character” is useless for automated remediation. You need a response the pipeline can parse and apply without human interpretation.
Pydantic models define the fix schema:
```python
from pydantic import BaseModel

class PipelineFix(BaseModel):
    delimiter: str | None = None
    encoding: str | None = None
    skip_rows: int = 0
    date_format: str | None = None
    explanation: str
```

Every field in this schema is a parameter the pipeline knows how to apply. The LLM can't suggest arbitrary changes; it can only fill in values for parameters the pipeline already handles. This constraint prevents the hallucination problem where the model returns plausible-sounding but unstructured advice that nobody acts on.
The explanation field is for logging, not for automated action. It tells the on-call engineer what the LLM thought was wrong, which is useful for reviewing auto-remediated failures after the fact.
The diagnosis function
The diagnosis function captures failure context and sends it to the LLM with minimal data exposure:
```python
import anthropic

client = anthropic.Anthropic()

def diagnose_failure(error: Exception, sample_data: str) -> PipelineFix:
    """Send failure context to Claude, get structured fix."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Pipeline failed with: {error}\n\n"
                       f"First 4 lines of input:\n{sample_data}\n\n"
                       f"Return a JSON fix.",
        }],
    )
    return PipelineFix.model_validate_json(response.content[0].text)
```

Sending only the first four lines of any file is a deliberate choice. It keeps API costs down and avoids shipping entire datasets (with potential PII) to an external service. Four lines is usually enough for the LLM to identify encoding issues, delimiter mismatches, or date format problems. For sensitive data, even four lines might be too much. See Self-healing risk tiering for the PII considerations.
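The `read_first_lines` helper the retry loop relies on is not spelled out in the article; a plausible stdlib implementation that never loads the whole file:

```python
def read_first_lines(filepath: str, n: int = 4) -> str:
    """Read at most n lines, so large files (and their PII) never fully load."""
    lines: list[str] = []
    with open(filepath, "rb") as f:
        for _ in range(n):
            line = f.readline()
            if not line:
                break
            # Decode defensively: the file's encoding may be why the pipeline failed
            lines.append(line.decode("utf-8", errors="replace"))
    return "".join(lines)
```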
The retry loop
Tenacity handles clean retry logic with before_sleep callbacks for logging:
```python
from tenacity import retry, stop_after_attempt

# log_retry is a user-supplied before_sleep callback that logs each attempt

@retry(stop=stop_after_attempt(3), before_sleep=log_retry)
def process_file(filepath: str, state: dict | None = None):
    # Tenacity re-invokes process_file with the original arguments, so the
    # fix must live in a mutable container the caller passes in once,
    # e.g. process_file("data.csv", state={})
    if state is None:
        state = {}
    try:
        load_and_transform(filepath, state.get("fix"))
    except Exception as e:
        sample = read_first_lines(filepath, n=4)
        state["fix"] = diagnose_failure(e, sample)
        raise  # Tenacity retries; the new fix is waiting in state
```

The pattern separates the retry mechanism (Tenacity) from the diagnosis mechanism (the LLM call). On each retry, the pipeline has an updated fix object with the parameters the LLM suggested. If the fix works, processing succeeds. If it doesn't, the next iteration captures the new error and asks for a different fix.
Circuit breakers
This is the piece that often gets forgotten. If 100,000 files fail simultaneously (say, a source system changed its export format), you don’t want 100,000 LLM API calls. A simple counter that trips after N failures in a time window saves you from a surprise bill.
```python
import time

class CircuitBreakerTripped(Exception):
    pass

class CircuitBreaker:
    def __init__(self, max_failures=10, window_seconds=300):
        self.failures = []
        self.max_failures = max_failures
        self.window_seconds = window_seconds

    def record_failure(self):
        now = time.time()
        # Keep only failures inside the sliding window, then record this one
        self.failures = [t for t in self.failures if now - t < self.window_seconds]
        self.failures.append(now)
        if len(self.failures) >= self.max_failures:
            raise CircuitBreakerTripped(
                f"{self.max_failures} failures in {self.window_seconds}s"
            )
```

When the circuit breaker trips, the pipeline falls back to traditional failure handling: log the error, alert the team, stop processing. The LLM is not the right tool for mass failures that all share the same root cause.
For high-volume, low-stakes remediation where you’d rather trade accuracy for cost control, local models via Ollama are an alternative to API calls. The diagnosis quality is lower, but the cost per call is essentially zero.
Production implementations with Claude
Several teams have taken this pattern (or variations of it) into production specifically with Claude.
Michael Stewart’s Datadog + Claude Code integration walks through a full setup. Datadog monitors detect error patterns and trigger webhooks. A Lambda function behind an API Gateway fetches detailed error logs with stack traces, groups errors by type, and sends them to Claude Code as a batch job. Claude Code clones the repo, analyzes the errors in context, and generates fix suggestions. The team gets a Slack notification with the analysis, applies the fix through Cursor, and Claude Code reviews the resulting PR before deployment. Every step includes human checkpoints.
Monte Carlo uses Claude 3.5 for two agent types. A Monitoring Agent profiles data and creates monitoring rules (with a 60% acceptance rate from early adopters including Texas Rangers and Roche). A Troubleshooting Agent autonomously drills into root causes. The agents run entirely in Monte Carlo’s environment, so customers don’t need separate LLM subscriptions or worry about data leaving the platform.
For teams building their own integration, a good starting point is Airflow’s on_failure_callback connected to a Cloud Function that calls the Claude API. When a task fails, the callback sends the error context to the Cloud Function, which calls Claude with the task’s error log and returns a structured diagnosis to Slack. This doesn’t auto-fix anything. It gives the on-call engineer a head start on understanding what went wrong before they even open their laptop.
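A sketch of what that callback might look like. The Cloud Function URL and the payload shape are assumptions, and `notify_diagnoser` is a hypothetical name; the real Airflow context does pass `task_instance` and `exception` to `on_failure_callback`:

```python
import json
import urllib.request

CLOUD_FUNCTION_URL = "https://example.cloudfunctions.net/diagnose"  # hypothetical endpoint

def build_failure_payload(context: dict) -> dict:
    """Collect the failure context Airflow hands to on_failure_callback."""
    ti = context["task_instance"]
    return {
        "dag_id": ti.dag_id,
        "task_id": ti.task_id,
        "error": str(context.get("exception")),
        "log_url": ti.log_url,
    }

def notify_diagnoser(context: dict) -> None:
    """on_failure_callback: forward error context to the diagnosing Cloud Function."""
    payload = json.dumps(build_failure_payload(context)).encode()
    req = urllib.request.Request(
        CLOUD_FUNCTION_URL, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)

# Attach to a DAG's tasks: default_args = {"on_failure_callback": notify_diagnoser}
```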
The “cannot fix” escape hatch
Your structured output schema should include a way for the LLM to say “I can’t fix this.” If the input is truly corrupt, the right answer is to fail, not to generate plausible-looking output.
```python
class PipelineFix(BaseModel):
    can_fix: bool = True
    delimiter: str | None = None
    encoding: str | None = None
    skip_rows: int = 0
    date_format: str | None = None
    explanation: str
```

When `can_fix` is False, the retry loop stops and the pipeline falls back to human investigation. Without this escape hatch, the LLM might hallucinate valid-looking data from garbage input. A file full of binary data doesn't have a "right" delimiter. An LLM that's forced to suggest one will pick something and let the pipeline produce wrong output.
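A guard at the top of the remediation path honors the escape hatch. This is a sketch; `apply_or_bail` and `UnrecoverableInputError` are hypothetical names, not from the article:

```python
class UnrecoverableInputError(Exception):
    """Raised when the model declines to propose a fix."""

def apply_or_bail(fix) -> dict:
    # Honor the escape hatch before touching any retry machinery
    if not getattr(fix, "can_fix", True):
        raise UnrecoverableInputError(fix.explanation)
    # Everything except the flag and the log-only explanation is a loader parameter
    return {k: v for k, v in vars(fix).items()
            if k not in ("can_fix", "explanation")}
```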
When to use this pattern
The Try-Heal-Retry pattern makes sense for failure types where three conditions hold:
- The fixes are predictable. Encoding changes, delimiter switches, date format mismatches, header row variations. You can enumerate the possible fixes in your Pydantic schema.
- The risk is contained. Getting the fix wrong produces bad data in a low-stakes context, not in a financial report or compliance system. See Self-healing risk tiering for the framework.
- You can validate the fix. After remediation, you can check row counts, schema conformance, and basic data quality to confirm the fix actually worked.
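The third condition can be a handful of cheap assertions run after the retried load. A sketch with illustrative checks (row count and schema conformance, as the list above suggests):

```python
def validate_remediation(rows: list[dict], expected_columns: set[str],
                         min_rows: int = 1) -> bool:
    """Post-fix sanity checks: did the remediated load actually produce sane data?"""
    if len(rows) < min_rows:
        return False  # the "fix" may have skipped or mangled the data
    # Every row must carry exactly the columns the pipeline expects
    return all(set(row) == expected_columns for row in rows)
```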
File ingestion is the canonical use case. API response parsing is another good candidate. Both have predictable failure modes, contained risk, and easy validation.
What doesn’t fit: SQL transformation errors, business logic bugs, production schema migrations, anything touching financial data. These need human judgment, not automated remediation. The Self-healing pipeline maturity spectrum places this distinction at the boundary between level 4 (practical) and level 5 (theoretical), and the boundary exists for good reason.