#16913: fix(agent): increase transient HTTP retry from 1 to 3 with escalating delays
size: XS
Cluster: Error Resilience and Retry Logic
Fixes #16911
## Problem
The single retry with a fixed 2.5s delay is insufficient when the upstream LLM proxy experiences brief outages (e.g. `503 All endpoints failed`). Both the original request and the single retry fail within ~3 seconds, and the raw error is surfaced to the user.
## Solution
Replace the boolean `didRetryTransientHttpError` flag with a counter `transientHttpRetryCount`, allowing up to 3 retries with linearly increasing delays:
| Retry | Delay |
|-------|-------|
| 1st | 2.5s |
| 2nd | 5.0s |
| 3rd | 7.5s |
Total tolerance window increases from ~2.5s to ~15s, giving the proxy enough time to recover from brief disruptions.
## Changes
- `src/auto-reply/reply/agent-runner-execution.ts`
- `TRANSIENT_HTTP_RETRY_DELAY_MS` → `TRANSIENT_HTTP_RETRY_BASE_DELAY_MS` + `TRANSIENT_HTTP_MAX_RETRIES`
- `didRetryTransientHttpError` (boolean) → `transientHttpRetryCount` (number)
- Delay formula: `baseDelay * retryCount` (linear escalation)
- Log message now includes retry progress: `Retry 1/3 in 2500ms`
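The counter-based loop described above can be sketched as follows. This is a hypothetical reconstruction, not the actual diff: the constant and counter names (`TRANSIENT_HTTP_MAX_RETRIES`, `TRANSIENT_HTTP_RETRY_BASE_DELAY_MS`, `transientHttpRetryCount`) come from the PR, while the `request` and `sleep` helpers are stand-ins, and for brevity this sketch retries every error rather than only transient HTTP errors such as `503`.

```typescript
const TRANSIENT_HTTP_MAX_RETRIES = 3;
const TRANSIENT_HTTP_RETRY_BASE_DELAY_MS = 2500;

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Delay for the nth retry: 2500ms, 5000ms, 7500ms (linear escalation).
function transientRetryDelayMs(retryCount: number): number {
  return TRANSIENT_HTTP_RETRY_BASE_DELAY_MS * retryCount;
}

// Stand-in wrapper; the real code only re-enters the loop for transient
// HTTP errors and falls through to normal error handling after max retries.
async function runWithTransientRetries<T>(request: () => Promise<T>): Promise<T> {
  let transientHttpRetryCount = 0;
  for (;;) {
    try {
      return await request();
    } catch (err) {
      if (transientHttpRetryCount >= TRANSIENT_HTTP_MAX_RETRIES) {
        throw err; // exhausted: surface the error as before
      }
      transientHttpRetryCount += 1;
      const delay = transientRetryDelayMs(transientHttpRetryCount);
      console.warn(`Retry ${transientHttpRetryCount}/${TRANSIENT_HTTP_MAX_RETRIES} in ${delay}ms`);
      await sleep(delay);
    }
  }
}
```

With this shape the total worst-case wait is 2500 + 5000 + 7500 = 15000ms, matching the ~15s window in the table above.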
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Increases transient HTTP error retry tolerance from a single retry with a fixed 2.5s delay to up to 3 retries with linearly escalating delays (2.5s, 5s, 7.5s), extending the total recovery window from ~2.5s to ~15s. This addresses issue #16911 where brief upstream LLM proxy outages (e.g. `503 All endpoints failed`) would surface raw errors to users because the single retry was insufficient.
- Replaces boolean `didRetryTransientHttpError` flag with a `transientHttpRetryCount` counter
- Introduces `TRANSIENT_HTTP_MAX_RETRIES` (3) and renames the delay constant to `TRANSIENT_HTTP_RETRY_BASE_DELAY_MS`
- Delay formula uses linear escalation: `baseDelay * retryCount`
- Log message now includes retry progress (e.g. `Retry 1/3 in 2500ms`)
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge — it's a small, well-scoped change to retry behavior with no risk to existing functionality.
- The change is minimal (single file, ~15 lines changed), the retry logic is mathematically correct, the counter-based approach is a straightforward extension of the previous boolean flag pattern, and the fallthrough to error handling after max retries is preserved. No new edge cases or failure modes are introduced.
- No files require special attention.
<sub>Last reviewed commit: fc97bb1</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #16239: fix: retry on transient API errors (overloaded, rate-limit, timeout) (zerone0x, 2026-02-14, 85.7%)
- #9025: Fix/automatic exponential backoff for LLM rate limits (fotorpics, 2026-02-04, 77.7%)
- #9232: Fix: Add automatic retry for network errors in message runs (vishaltandale00, 2026-02-05, 77.0%)
- #8677: fix: add retry logic to OAuth token refresh (skyblue-will, 2026-02-04, 77.0%)
- #17001: fix: retry sub-agent announcements with backoff instead of silently... (luisecab, 2026-02-15, 76.4%)
- #11472: fix: retry media fetch on transient network errors (openclaw-quenio, 2026-02-07, 76.3%)
- #23497: feat(retry): add retryHttpAsync utility with comprehensive coverage (thinstripe, 2026-02-22, 76.2%)
- #12995: feat(infra): Add retry with exponential backoff for transient failures (trevorgordon981, 2026-02-10, 75.7%)
- #9085: fix: improve stability for terminated responses and telegram retries (vladdick88, 2026-02-04, 74.9%)
- #17028: fix(subagent): retry announce on timeout (Limitless2023, 2026-02-15, 74.3%)