Proper retry configuration eliminates a large share of production incidents without requiring anomaly detection or AI tooling. Transient network errors, API rate limits, brief upstream outages, and cloud service hiccups resolve themselves if the pipeline waits and retries.
The combination of retry logic and catch-up patterns is the foundation of pipeline resilience. It’s level 1 and level 2 on the Self-healing pipeline maturity spectrum, and it’s where most teams should invest first.
Retry configuration
A good retry strategy has four components: a retry count, an initial delay, backoff behavior, and a ceiling.
In Airflow, sensible defaults look like this:
```python
from datetime import timedelta

default_args = {
    "retries": 3,
    "retry_delay": timedelta(minutes=5),
    "retry_exponential_backoff": True,
    "max_retry_delay": timedelta(minutes=30),
    "on_failure_callback": alert_slack,  # Slack-alerting callback, sketched below
    "sla": timedelta(hours=2),
}
```

Each setting earns its place.
`retries: 3` gives the pipeline three chances after the initial failure. Most transient errors resolve within one or two retries. Going higher than 5 rarely helps and can delay failure detection if the issue is persistent.
`retry_delay` with `retry_exponential_backoff` prevents the pipeline from hammering a struggling service. The first retry waits 5 minutes, the second waits 10, the third waits 20. Exponential backoff is particularly important for API rate limits, where retrying immediately just extends the rate-limiting window.
`max_retry_delay` caps the backoff so the third or fourth retry doesn't wait hours. Thirty minutes is a reasonable ceiling for most batch pipelines. For near-real-time pipelines, you'd lower this significantly.
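For intuition, the delay sequence these three settings produce can be modeled in a few lines. This is a simplified sketch of the doubling-and-clamping behavior, not Airflow's actual implementation, which also adds jitter so synchronized tasks don't retry in lockstep:

```python
from datetime import timedelta

def backoff_delay(
    attempt: int,
    base: timedelta = timedelta(minutes=5),
    ceiling: timedelta = timedelta(minutes=30),
) -> timedelta:
    # Double the base delay on each attempt, never exceeding the ceiling.
    return min(base * (2 ** attempt), ceiling)

# Attempts 0-3 wait 5, 10, 20, then 30 minutes (clamped by max_retry_delay).
print([str(backoff_delay(n)) for n in range(4)])
```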
`on_failure_callback` fires only after all retries are exhausted. This is the right place for a Slack alert, not after every individual retry attempt. The person on call should hear about failures that the pipeline couldn't resolve on its own, not about transient blips that resolved on retry two. See Pipeline Alerting Delivery Patterns for how to structure these alerts so they're actionable.
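For reference, a minimal failure callback might look like the sketch below. The webhook URL and message format are assumptions; what Airflow guarantees is that the callback receives a context dict containing the failed task instance and the exception:

```python
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/..."  # hypothetical webhook

def alert_slack(context):
    """Fires once, after the final retry fails -- not on every attempt."""
    ti = context["task_instance"]
    message = (
        f":red_circle: {ti.dag_id}.{ti.task_id} failed after all retries\n"
        f"Exception: {context.get('exception')}\n"
        f"Log: {ti.log_url}"
    )
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
```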
`sla` sets a deadline for the entire task. If the task (including retries) hasn't completed within 2 hours of the scheduled run start, it triggers an SLA miss alert. This catches the scenario where retries are technically succeeding but the pipeline is stuck in a retry loop that's going to blow past your data freshness commitments.
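To route that alert somewhere useful, Airflow 2.x also accepts a DAG-level `sla_miss_callback`. A minimal sketch, reusing `requests` and the hypothetical `SLACK_WEBHOOK_URL` from the failure callback above (the five-argument signature is Airflow's):

```python
def on_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    # Fires when tasks blow past their SLA, even if they eventually succeed.
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f":hourglass: SLA miss in {dag.dag_id}:\n{task_list}"},
        timeout=10,
    )

# Wired in at the DAG level: DAG(..., sla_miss_callback=on_sla_miss)
```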
What retries don’t handle
Retries work for transient, external errors. They don’t work for:
- Logic errors in your transformation code. If your SQL has a bug, retrying it three times produces three identical failures.
- Schema changes. If the upstream table dropped a column, no amount of retrying will bring it back. That’s where schema drift adaptation takes over.
- Data quality issues. If the source is sending corrupt or malformed data, retries just re-read the same corrupt data. This is where the Try-Heal-Retry pattern with AI diagnosis becomes relevant, or where explicit validation layers catch the problem.
The distinction matters because it determines what happens after retries fail. A transient error that exhausts retries probably indicates a longer outage and deserves an alert. A persistent error that fails identically on every retry needs investigation, not more retries.
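One way to encode that distinction in Airflow is `AirflowFailException`, which fails the task immediately instead of burning through retries that are guaranteed to fail identically. A sketch; the helper functions and the `SchemaMismatchError` classification are assumptions, not a standard API:

```python
import requests
from airflow.exceptions import AirflowFailException

class SchemaMismatchError(Exception):
    """Hypothetical error our (assumed) extract helper raises on schema drift."""

def extract_events(**context):
    try:
        rows = fetch_source_rows()  # hypothetical extract helper
    except SchemaMismatchError as exc:
        # Persistent: every retry reproduces the same failure, so fail fast
        # and let the failure callback page a human.
        raise AirflowFailException(f"Upstream schema changed: {exc}") from exc
    except requests.exceptions.ConnectionError:
        # Transient: re-raise normally so the retry/backoff policy applies.
        raise
    load_rows(rows)  # hypothetical load helper
```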
The catch-up pattern
Catch-up goes beyond retries. Instead of just re-running a failed task, catch-up processes data from the last successful checkpoint forward. If Monday’s run failed and Tuesday’s succeeds, the catch-up pattern ensures Tuesday’s run picks up both Monday’s and Tuesday’s data.
For dbt incremental models, this is built in. The `is_incremental()` logic compares against the last materialized timestamp. If a run fails and the table isn't updated, the next successful run's `WHERE` clause captures everything since the last successful materialization.
```sql
{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
```

If Monday's run failed, `MAX(event_timestamp)` in the target table still points to Sunday. Tuesday's run picks up Monday's and Tuesday's data automatically. No manual intervention, no gap in the data.
Airflow has native catch-up support too. Setting `catchup=True` on a DAG means that if the DAG was paused and unpaused three days later, Airflow schedules runs for all three missed intervals. Combined with the dbt incremental pattern, this creates a double safety net: the orchestrator catches missed runs, and the transformation layer catches missed data.
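Wired together with the retry defaults from earlier, a catch-up-enabled DAG might look like the sketch below (Airflow 2.x TaskFlow syntax; the DAG name, schedule, and task body are placeholders):

```python
import pendulum
from airflow.decorators import dag, task

@dag(
    schedule="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=True,  # schedule a run for every missed interval after a pause
    default_args=default_args,  # the retry settings shown earlier
)
def daily_events():
    @task
    def load_window(data_interval_start=None, data_interval_end=None):
        # Airflow injects this run's interval bounds, so each catch-up run
        # processes exactly its own window and never overlaps its neighbors.
        print(f"loading {data_interval_start} -> {data_interval_end}")

    load_window()

daily_events()
```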
Idempotency as a prerequisite
Catch-up only works if your pipeline is idempotent: running the same operation twice produces the same result as running it once. If reprocessing a time window creates duplicates or double-counts values, catch-up makes things worse rather than better.
For dbt, idempotency depends on your incremental strategy. A `merge` strategy with a proper `unique_key` is naturally idempotent because duplicate records get updated rather than inserted. An `insert_overwrite` strategy is idempotent because it replaces entire partitions. An `append` strategy is not idempotent, and using it with catch-up logic creates duplicates.
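As a reference point, an idempotent incremental model header might look like this. The model and column names are placeholders, and `merge` support depends on the warehouse (Snowflake, BigQuery, and Databricks all offer it):

```sql
{{ config(
    materialized = "incremental",
    incremental_strategy = "merge",
    unique_key = "event_id"
) }}

SELECT event_id, event_timestamp, payload
FROM {{ source("app", "events") }}  -- hypothetical source

{% if is_incremental() %}
WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM {{ this }})
{% endif %}
```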
The explicit deduplication pattern in your SELECT handles the edge cases:
```sql
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY event_id
    ORDER BY updated_at DESC
) = 1
```

This ensures that even if source data contains duplicates or the pipeline reprocesses overlapping windows, the output stays clean. (`QUALIFY` is supported on warehouses like Snowflake, BigQuery, and DuckDB; elsewhere, the same filter goes in an outer query over the `ROW_NUMBER()` result.)
Combining retries with alerting
The goal is a pipeline that handles common failures silently and alerts on uncommon ones. The configuration pattern:
- Retries handle transient errors. Three retries with exponential backoff.
- `on_failure_callback` alerts after exhausted retries. The team sees only failures that need human attention.
- SLA monitoring catches slow degradation. Even if every individual retry succeeds, breaching the SLA deadline signals a problem.
- Catch-up fills gaps automatically. The next successful run processes missed data without manual backfill.
This four-layer approach handles the majority of production incidents for most data teams. The on-call process only activates for failures that survive all four layers, which means the on-call engineer deals with genuinely novel problems rather than routine transient errors.
Where this setup falls short is in failure types that retries can’t resolve and catch-up can’t compensate for: changed file formats, encoding issues, new data shapes that the transformation doesn’t expect. That’s the territory for the higher levels of self-healing, from schema drift adaptation through AI-powered remediation.