#16239: fix: retry on transient API errors (overloaded, rate-limit, timeout)
agents
stale
size: M
trusted-contributor
experienced-contributor
Cluster:
Error Resilience and Retry Logic
## Summary
Fixes #16106
Sub-agent sessions previously terminated on transient API errors (`overloaded_error`, 5xx, `rate_limit`, timeout) with little or no retry opportunity: only transient HTTP errors were retried, and only once. This was especially painful for long-running batch tasks, which lost all progress on a single transient failure.
## Changes
- Add `isRetryableApiError()` helper combining transient HTTP (500/502/503/529), overloaded, rate-limit, and timeout detection
- Expand agent runner retry from a single retry (transient HTTP only) to **up to 3 retries** covering all retryable error types
- Use escalating backoff: **2.5s → 5s → 10s**
- Non-transient errors (auth 401/403, billing 402) continue to fail immediately
- Sanitize user-facing messages for all retryable error types
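
As a rough illustration of the classification described above, here is a minimal sketch. It is not the code from this PR: the `ApiErrorLike` shape, the `*Sketch` function names, and the regular expressions are all assumptions made for the example, and the choice to check non-retryable statuses first is also an assumption. The real helper composes the existing detectors in `errors.ts`.

```typescript
// Hypothetical error shape, for illustration only.
interface ApiErrorLike {
  status?: number;
  message?: string;
}

// Status codes listed in the PR description as transient HTTP errors.
const TRANSIENT_HTTP_STATUSES = new Set<number>([500, 502, 503, 529]);
// Status codes the PR says must keep failing immediately (auth, billing).
const NON_RETRYABLE_STATUSES = new Set<number>([401, 402, 403]);

// Assumed message patterns; the real detectors may match differently.
function isOverloadedSketch(message: string): boolean {
  return /overloaded/i.test(message);
}

function isRateLimitSketch(message: string): boolean {
  return /rate[_ ]limit/i.test(message);
}

function isTimeoutSketch(message: string): boolean {
  return /timed?\s?out/i.test(message);
}

function isRetryableApiErrorSketch(err: ApiErrorLike): boolean {
  // Assumption: an explicit auth/billing status wins over any message match.
  if (err.status !== undefined && NON_RETRYABLE_STATUSES.has(err.status)) {
    return false;
  }
  if (err.status !== undefined && TRANSIENT_HTTP_STATUSES.has(err.status)) {
    return true;
  }
  const message = err.message ?? "";
  return (
    isOverloadedSketch(message) ||
    isRateLimitSketch(message) ||
    isTimeoutSketch(message)
  );
}
```

Keeping each detector separate makes it possible to unit-test every error category on its own, which matches the shape of the test plan below.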
## Test Plan
- [x] Unit tests for `isRetryableApiError()` covering all error categories
- [x] Integration test: recovers after `overloaded_error`
- [x] Integration test: retries up to 3 times with increasing backoff then fails gracefully
- [x] Integration test: retries on `rate_limit` errors
- [x] Integration test: does NOT retry on auth errors (401)
- [x] All 53 existing agent-runner tests pass (no regressions)
---
🤖 Generated with Claude Code (issue-hunter-pro)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Expands transient error retry logic in the agent runner from a single retry (transient HTTP only) to up to 3 retries with escalating backoff (2.5s → 5s → 10s) covering overloaded, rate-limit, and timeout errors in addition to HTTP 5xx. Non-transient errors (auth, billing) continue to fail immediately.
- Adds `isRetryableApiError()` helper in `errors.ts` composing existing `isTransientHttpError`, `isOverloadedErrorMessage`, `isRateLimitErrorMessage`, and `isTimeoutErrorMessage` detectors
- Replaces boolean `didRetryTransientHttpError` flag with counter-based `apiRetryAttempt` supporting 3 retry attempts in `agent-runner-execution.ts`
- Extends error sanitization to cover all retryable error types in the post-retry failure path
- Comprehensive test coverage: unit tests for the new helper plus integration tests for overloaded recovery, retry exhaustion with backoff, rate-limit retry, and auth error non-retry
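
The move from a boolean flag to a bounded counter can be sketched roughly as follows. This is an illustrative sketch, not the code in `agent-runner-execution.ts`: `runWithApiRetry`, the `Sleep` type, and the injectable `sleep` parameter are hypothetical names introduced here, and it assumes one initial attempt followed by up to three retries, one per backoff step.

```typescript
// Backoff schedule from the PR description: 2.5s, then 5s, then 10s.
const BACKOFF_MS: number[] = [2500, 5000, 10000];

type Sleep = (ms: number) => Promise<void>;

const realSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Hypothetical wrapper: retries `run` while `isRetryable` accepts the error
// and the backoff schedule still has an unused step.
async function runWithApiRetry<T>(
  run: () => Promise<T>,
  isRetryable: (err: unknown) => boolean,
  sleep: Sleep = realSleep, // injectable so tests need not wait in real time
): Promise<T> {
  let apiRetryAttempt = 0; // counter replacing the old boolean flag
  for (;;) {
    try {
      return await run();
    } catch (err) {
      const delayMs = BACKOFF_MS[apiRetryAttempt];
      // Give up on non-retryable errors or once the schedule is exhausted.
      if (!isRetryable(err) || delayMs === undefined) {
        throw err;
      }
      await sleep(delayMs);
      apiRetryAttempt += 1;
    }
  }
}
```

Because the schedule is a fixed array, the loop is bounded by construction: the worst case under these assumptions is four calls and 17.5 seconds of total waiting before the error is surfaced.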
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge — it broadens existing retry behavior with well-bounded retries, correct backoff logic, and thorough test coverage.
- The changes are well-scoped and low-risk: the new `isRetryableApiError` is a straightforward composition of existing, well-tested helpers. The retry logic in `agent-runner-execution.ts` is a clean upgrade from a boolean flag to a bounded counter with escalating backoff. Non-retryable errors (auth, billing) are correctly excluded. All edge cases are covered by tests — unit tests for the helper function and integration tests for the retry behavior including exhaustion, backoff timing, and non-retry paths. The maximum retry count (3) and backoff ceiling (10s) prevent runaway retry loops.
- No files require special attention.
<sub>Last reviewed commit: cdc8daf</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #16913: fix(agent): increase transient HTTP retry from 1 to 3 with escalati... (by hou-rong · 2026-02-15 · 85.7%)
- #8677: fix: add retry logic to OAuth token refresh (by skyblue-will · 2026-02-04 · 81.7%)
- #9232: Fix: Add automatic retry for network errors in message runs (by vishaltandale00 · 2026-02-05 · 81.2%)
- #23497: feat(retry): add retryHttpAsync utility with comprehensive coverage (by thinstripe · 2026-02-22 · 78.5%)
- #17001: fix: retry sub-agent announcements with backoff instead of silently... (by luisecab · 2026-02-15 · 78.5%)
- #9025: Fix/automatic exponential backoff for LLM rate limits (by fotorpics · 2026-02-04 · 78.2%)
- #18205: fix (agents): add periodic retry timer for failed subagent announces (by MegaPhoenix92 · 2026-02-16 · 77.9%)
- #9085: fix: improve stability for terminated responses and telegram retries (by vladdick88 · 2026-02-04 · 77.7%)
- #16307: fix: surface billing/auth FailoverErrors as user-friendly messages (by petter-b · 2026-02-14 · 77.3%)
- #12995: feat(infra): Add retry with exponential backoff for transient failures (by trevorgordon981 · 2026-02-10 · 77.3%)