Agent Architecture Patterns

Error Recovery

Error recovery patterns determine how an agent detects failures during execution, responds to them, and keeps working afterward. Failures range from tool call errors and malformed outputs to reasoning dead-ends and infinite loops. Common recovery strategies include retry with backoff for transient failures, fallback to another model when a provider is unavailable, context truncation when the conversation hits the context window limit, and graceful degradation that returns partial work rather than failing entirely. The central architectural decision is whether to let the agent self-recover, by including the error in its context and asking it to reason about alternatives, or to handle failures programmatically in the host application. In production systems this choice can mean the difference between task completion rates of roughly 60% and 99%.
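
The two sides of that decision can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: TransientError, retry_with_backoff, call_with_fallback, and tool_result_message are hypothetical names, providers are assumed to be plain callables that take a prompt, and tools are assumed to be plain callables that take keyword arguments. The first two functions handle failures programmatically in the host application; the last one hands the error back to the agent so it can reason about alternatives.

```python
import time
from typing import Any, Callable, Dict, Sequence


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, rate limits)."""


def retry_with_backoff(
    call: Callable[[], Any],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Host-side recovery: retry transient failures with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, let the caller decide what to do
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


def call_with_fallback(
    providers: Sequence[Callable[[str], str]],
    prompt: str,
    **retry_kwargs: Any,
) -> str:
    """Host-side recovery: try each provider in order, retrying each one."""
    last_error = None
    for provider in providers:
        try:
            return retry_with_backoff(lambda p=provider: p(prompt), **retry_kwargs)
        except TransientError as exc:
            last_error = exc  # this provider is exhausted, move to the next
    raise RuntimeError("all providers failed") from last_error


def tool_result_message(tool: Callable[..., Any], args: Dict[str, Any]) -> Dict[str, str]:
    """Self-recovery: turn a tool failure into context the agent can read."""
    try:
        return {"role": "tool", "content": str(tool(**args))}
    except Exception as exc:
        # The error text goes back to the model instead of crashing the loop.
        return {
            "role": "tool",
            "content": f"ERROR: {type(exc).__name__}: {exc}. Consider a different approach.",
        }
```

In a sketch like this, transient infrastructure failures (timeouts, rate limits, provider outages) are a natural fit for the host-side functions, since the model has nothing useful to reason about. Errors that carry information the agent could act on, such as a missing file or an invalid argument, are better candidates for tool_result_message. A real system would usually also cap the number of self-recovery turns to avoid the infinite loops mentioned above.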