#21017: fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)
agents
size: XS
## Problem
When the primary model's API returns **502**, **503**, or **504**, `resolveFailoverReasonFromError()` in `failover-error.ts` doesn't match any status-code branch (only 402/429/401/403/408/400 are handled). The error falls through to message-based classification via `classifyFailoverReason()`, which relies on `extractLeadingHttpStatus()` — this only works if the error *message* starts with the numeric status code (e.g. `"503 Service Unavailable ..."`).
Many API SDKs (Google, Anthropic, OpenAI) set `err.status = 503` as a property without prefixing the message string with `503`, so the message-based classifier never matches and **model failover never triggers**. The run retries the same unavailable model indefinitely.
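The mismatch described above can be illustrated with a minimal sketch. `leadingHttpStatus` is a hypothetical stand-in for `extractLeadingHttpStatus()`, whose real implementation is not shown here and may differ. The error shapes are assumptions based on the description of SDK behavior.

```typescript
// Hypothetical stand-in for extractLeadingHttpStatus(); the real helper may differ.
function leadingHttpStatus(message: string): number | null {
  const match = /^\s*(\d{3})\b/.exec(message);
  return match ? Number(match[1]) : null;
}

// Shape described in the PR: status carried on a property, not in the message.
const sdkError = Object.assign(new Error("Service Unavailable"), { status: 503 });

// An error whose message happens to start with the numeric status code.
const prefixedError = new Error("503 Service Unavailable: upstream overloaded");

// The message-based path finds nothing for the SDK-style error...
const fromSdkMessage = leadingHttpStatus(sdkError.message);
// ...but does find the code when the message is prefixed with it.
const fromPrefixedMessage = leadingHttpStatus(prefixedError.message);
```

Only the second error is recognizable from its message alone, so the first depends entirely on a status-code branch.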
## Fix
Add checks for 502, 503, and 504 to the status-code branch in `resolveFailoverReasonFromError()`, returning `"timeout"`. This is consistent with the existing behavior of `isTransientHttpError()` in the message-based classifier, which checks `TRANSIENT_HTTP_ERROR_CODES = new Set([500, 502, 503, 521, 522, 523, 524, 529])`. 504 is not in that set; it is added here because it represents a gateway timeout.
Test assertions added for all three status codes.
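A simplified sketch of the new branch, under stated assumptions: `getStatus` and `resolveTransientStatusReason` are hypothetical names, and the existing branches for the other status codes are omitted because their return values are not given here. Only the 502/503/504 to `"timeout"` mapping comes from this PR.

```typescript
// Hypothetical, simplified sketch; not the actual code in failover-error.ts.
type TransientReason = "timeout";

// Read a numeric `status` property from an unknown error value, if present.
function getStatus(err: unknown): number | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const status = (err as { status?: unknown }).status;
  return typeof status === "number" ? status : undefined;
}

// Return "timeout" for the three gateway/server statuses added by this PR.
function resolveTransientStatusReason(err: unknown): TransientReason | null {
  const status = getStatus(err);
  if (status === 502 || status === 503 || status === 504) {
    return "timeout";
  }
  // Other statuses would fall through to the existing branches
  // and then to message-based classification.
  return null;
}
```

Assertions in the same spirit as the added tests would check that objects with `status` 502, 503, and 504 each resolve to `"timeout"`.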
## Why "timeout" and not "rate_limit"?
The message-based classifier (`classifyFailoverReason`) already maps transient HTTP errors to `"timeout"` via `isTransientHttpError()` → `return "timeout"`. Using the same reason ensures consistent behavior regardless of whether the error is classified by status code or message text.
Fixes #20999
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Fixed model failover not triggering for HTTP 502/503/504 errors by adding explicit status-code branches in `resolveFailoverReasonFromError()`. These transient server errors now return `"timeout"` as the failover reason, consistent with the existing message-based classification for other transient errors.
- Added status checks for 502, 503, and 504 in `src/agents/failover-error.ts:164-166`
- Added test coverage for all three status codes
- Fix prevents runs from retrying the same unavailable model indefinitely when API SDKs set `err.status` without prefixing the message string
Minor inconsistency: 504 is now treated as transient in the status-code branch but `TRANSIENT_HTTP_ERROR_CODES` (used by message-based classification) excludes it. This means classification could differ based on error format, though the practical impact is limited since most SDKs set `err.status`.
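The 504 gap can be made concrete with a short sketch. The contents of `TRANSIENT_HTTP_ERROR_CODES` are copied from the PR description; `STATUS_BRANCH_CODES` is a hypothetical name for the three codes this PR adds, introduced only for comparison.

```typescript
// Set contents as quoted in the PR description.
const TRANSIENT_HTTP_ERROR_CODES = new Set([500, 502, 503, 521, 522, 523, 524, 529]);

// Hypothetical name: the codes handled by the new status-code branch.
const STATUS_BRANCH_CODES = new Set([502, 503, 504]);

// Codes the status-code path treats as transient but the message path does not.
const onlyInStatusBranch = Array.from(STATUS_BRANCH_CODES).filter(
  (code) => !TRANSIENT_HTTP_ERROR_CODES.has(code),
);

// Codes the message path treats as transient but the new branch does not cover.
const onlyInMessagePath = Array.from(TRANSIENT_HTTP_ERROR_CODES).filter(
  (code) => !STATUS_BRANCH_CODES.has(code),
);
```

Under these assumptions, 504 is the only code in `onlyInStatusBranch`, which is the case where the two paths could classify the same failure differently.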
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge with low risk
- The fix is narrowly scoped, well-tested, and addresses a clear bug where failover wasn't triggering for common transient HTTP errors. The logic is straightforward and consistent with existing patterns. Score is 4 (not 5) due to a minor inconsistency between status-code and message-based classification paths for 504 errors, though this is unlikely to cause issues in practice.
- No files require special attention
<sub>Last reviewed commit: c932dac</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #21049: fix(failover): treat HTTP 5xx as rate-limit for model fallback (by maximalmargin · 2026-02-19 · 88.0%)
- #21491: fix: classify Google 503 UNAVAILABLE as transient failover [AI-assi... (by ZPTDclaw · 2026-02-20 · 84.2%)
- #12314: fix: treat HTTP 5xx server errors as failover-worthy (by hsssgdtc · 2026-02-09 · 83.8%)
- #9427: fix: trigger model fallback on all 4xx HTTP errors (by dbottme · 2026-02-05 · 82.7%)
- #22359: fix(agents): classify overloaded service errors as timeout (by AIflow-Labs · 2026-02-21 · 81.2%)
- #21516: fix: classify connection errors as timeout for model failover (#20931) (by echoVic · 2026-02-20 · 80.5%)
- #5031: fix: add network connection error codes to failover classifier (by shayan919293 · 2026-01-30 · 80.0%)
- #21152: fix(agents): throw FailoverError for unknown model so fallback chai... (by Mellowambience · 2026-02-19 · 80.0%)
- #11821: fix(auth): trigger failover on 401 status code from expired OAuth t... (by AnonO6 · 2026-02-08 · 79.8%)
- #15815: Fallback LLM doesn't trigger if primary model is local (by shihanqu · 2026-02-13 · 79.8%)