#16239: fix: retry on transient API errors (overloaded, rate-limit, timeout)

by zerone0x open 2026-02-14 14:23 View on GitHub →

agents stale size: M trusted-contributor experienced-contributor

Cluster: Error Resilience and Retry Logic

## Summary Fixes #16106 Sub-agent sessions previously terminated immediately on transient API errors (`overloaded_error`, 5xx, `rate_limit`, timeout) with no retry opportunity. This was especially painful for long-running batch tasks that lost all progress on a single transient failure. ## Changes - Add `isRetryableApiError()` helper combining transient HTTP (500/502/503/529), overloaded, rate-limit, and timeout detection - Expand agent runner retry from single attempt (transient HTTP only) to **3 attempts** covering all retryable error types - Use escalating backoff: **2.5s → 5s → 10s** - Non-transient errors (auth 401/403, billing 402) continue to fail immediately - Sanitize user-facing messages for all retryable error types ## Test Plan - [x] Unit tests for `isRetryableApiError()` covering all error categories - [x] Integration test: recovers after `overloaded_error` - [x] Integration test: retries up to 3 times with increasing backoff then fails gracefully - [x] Integration test: retries on `rate_limit` errors - [x] Integration test: does NOT retry on auth errors (401) - [x] All 53 existing agent-runner tests pass (no regressions) --- 🤖 Generated with Claude Code (issue-hunter-pro)  <h3>Greptile Summary</h3> Expands transient error retry logic in the agent runner from a single retry (transient HTTP only) to up to 3 retries with escalating backoff (2.5s → 5s → 10s) covering overloaded, rate-limit, and timeout errors in addition to HTTP 5xx. Non-transient errors (auth, billing) continue to fail immediately. - Adds `isRetryableApiError()` helper in `errors.ts` composing existing `isTransientHttpError`, `isOverloadedErrorMessage`, `isRateLimitErrorMessage`, and `isTimeoutErrorMessage` detectors - Replaces boolean `didRetryTransientHttpError` flag with counter-based `apiRetryAttempt` supporting 3 retry attempts in `agent-runner-execution.ts` - Extends error sanitization to cover all retryable error types in the post-retry failure path - Comprehensive test coverage: unit tests for the new helper plus integration tests for overloaded recovery, retry exhaustion with backoff, rate-limit retry, and auth error non-retry <h3>Confidence Score: 5/5</h3> - This PR is safe to merge — it broadens existing retry behavior with well-bounded retries, correct backoff logic, and thorough test coverage. - The changes are well-scoped and low-risk: the new `isRetryableApiError` is a straightforward composition of existing, well-tested helpers. The retry logic in `agent-runner-execution.ts` is a clean upgrade from a boolean flag to a bounded counter with escalating backoff. Non-retryable errors (auth, billing) are correctly excluded. All edge cases are covered by tests — unit tests for the helper function and integration tests for the retry behavior including exhaustion, backoff timing, and non-retry paths. The maximum retry count (3) and backoff ceiling (10s) prevent runaway retry loops. - No files require special attention. <sub>Last reviewed commit: cdc8daf</sub>