#16913: fix(agent): increase transient HTTP retry from 1 to 3 with escalating delays
size: XS
Cluster: Error Resilience and Retry Logic
Fixes #16911
## Problem
The single retry with a fixed 2.5s delay is insufficient when the upstream LLM proxy experiences brief outages (e.g. `503 All endpoints failed`). Both the original request and the single retry fail within ~3 seconds, and the raw error is surfaced to the user.
## Solution
Replace the boolean `didRetryTransientHttpError` flag with a counter `transientHttpRetryCount`, allowing up to 3 retries with linearly increasing delays:
| Retry | Delay |
|-------|-------|
| 1st | 2.5s |
| 2nd | 5.0s |
| 3rd | 7.5s |
Total tolerance window increases from ~2.5s to ~15s, giving the proxy enough time to recover from brief disruptions.
## Changes
- `src/auto-reply/reply/agent-runner-execution.ts`
- `TRANSIENT_HTTP_RETRY_DELAY_MS` → `TRANSIENT_HTTP_RETRY_BASE_DELAY_MS` + `TRANSIENT_HTTP_MAX_RETRIES`
- `didRetryTransientHttpError` (boolean) → `transientHttpRetryCount` (number)
- Delay formula: `baseDelay * retryCount` (linear escalation)
- Log message now includes retry progress: `Retry 1/3 in 2500ms`
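The counter-based loop described above can be sketched as follows. This is a hypothetical reconstruction, not the actual diff: the constant and counter names (`TRANSIENT_HTTP_MAX_RETRIES`, `TRANSIENT_HTTP_RETRY_BASE_DELAY_MS`, `transientHttpRetryCount`) come from the PR, while the `request` and `sleep` helpers are stand-ins, and for brevity this sketch retries every error rather than only transient HTTP errors such as `503`.

```typescript
const TRANSIENT_HTTP_MAX_RETRIES = 3;
const TRANSIENT_HTTP_RETRY_BASE_DELAY_MS = 2500;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay for the nth retry: 2500ms, 5000ms, 7500ms (linear escalation).
function transientRetryDelayMs(retryCount: number): number {
  return TRANSIENT_HTTP_RETRY_BASE_DELAY_MS * retryCount;
}

// Stand-in wrapper; the real code only re-enters the loop for transient
// HTTP errors and falls through to normal error handling after max retries.
async function runWithTransientRetries<T>(request: () => Promise<T>): Promise<T> {
  let transientHttpRetryCount = 0;
  for (;;) {
    try {
      return await request();
    } catch (err) {
      if (transientHttpRetryCount >= TRANSIENT_HTTP_MAX_RETRIES) {
        throw err; // exhausted: surface the error as before
      }
      transientHttpRetryCount += 1;
      const delay = transientRetryDelayMs(transientHttpRetryCount);
      console.warn(`Retry ${transientHttpRetryCount}/${TRANSIENT_HTTP_MAX_RETRIES} in ${delay}ms`);
      await sleep(delay);
    }
  }
}
```

With this shape the total worst-case wait is 2500 + 5000 + 7500 = 15000ms, matching the ~15s window in the table above.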
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Increases transient HTTP error retry tolerance from a single retry with a fixed 2.5s delay to up to 3 retries with linearly escalating delays (2.5s, 5s, 7.5s), extending the total recovery window from ~2.5s to ~15s. This addresses issue #16911 where brief upstream LLM proxy outages (e.g. `503 All endpoints failed`) would surface raw errors to users because the single retry was insufficient.
- Replaces boolean `didRetryTransientHttpError` flag with a `transientHttpRetryCount` counter
- Introduces `TRANSIENT_HTTP_MAX_RETRIES` (3) and renames the delay constant to `TRANSIENT_HTTP_RETRY_BASE_DELAY_MS`
- Delay formula uses linear escalation: `baseDelay * retryCount`
- Log message now includes retry progress (e.g. `Retry 1/3 in 2500ms`)
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge — it's a small, well-scoped change to retry behavior with no risk to existing functionality.
- The change is minimal (single file, ~15 lines changed), the retry logic is mathematically correct, the counter-based approach is a straightforward extension of the previous boolean flag pattern, and the fallthrough to error handling after max retries is preserved. No new edge cases or failure modes are introduced.
- No files require special attention.
<sub>Last reviewed commit: fc97bb1</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #16239: fix: retry on transient API errors (overloaded, rate-limit, timeout) (zerone0x, 2026-02-14, 85.7%)
- #9025: Fix/automatic exponential backoff for LLM rate limits (fotorpics, 2026-02-04, 77.7%)
- #9232: Fix: Add automatic retry for network errors in message runs (vishaltandale00, 2026-02-05, 77.0%)
- #8677: fix: add retry logic to OAuth token refresh (skyblue-will, 2026-02-04, 77.0%)
- #17001: fix: retry sub-agent announcements with backoff instead of silently... (luisecab, 2026-02-15, 76.4%)
- #11472: fix: retry media fetch on transient network errors (openclaw-quenio, 2026-02-07, 76.3%)
- #23497: feat(retry): add retryHttpAsync utility with comprehensive coverage (thinstripe, 2026-02-22, 76.2%)
- #12995: feat(infra): Add retry with exponential backoff for transient failures (trevorgordon981, 2026-02-10, 75.7%)
- #9085: fix: improve stability for terminated responses and telegram retries (vladdick88, 2026-02-04, 74.9%)
- #17028: fix(subagent): retry announce on timeout (Limitless2023, 2026-02-15, 74.3%)