#21017: fix: treat HTTP 502/503/504 as failover-eligible (timeout reason)

by taw0002 open 2026-02-19 15:11

agents size: XS

Cluster: Connection Error Handling Improvements

## Problem

When the primary model's API returns **502**, **503**, or **504**, `resolveFailoverReasonFromError()` in `failover-error.ts` doesn't match any status-code branch (only 402/429/401/403/408/400 are handled). The error falls through to message-based classification via `classifyFailoverReason()`, which relies on `extractLeadingHttpStatus()`. That helper only works if the error *message* starts with the numeric status code (e.g. `"503 Service Unavailable ..."`).

Many API SDKs (Google, Anthropic, OpenAI) set `err.status = 503` as a property without prefixing the message string with `503`, so the message-based classifier never matches and **model failover never triggers**. The run retries the same unavailable model indefinitely.

## Fix

Add `502 || 503 || 504` to the status-code branch in `resolveFailoverReasonFromError()`, returning `"timeout"`. This is consistent with the existing behavior of `isTransientHttpError()` in the message-based classifier (which already includes `TRANSIENT_HTTP_ERROR_CODES = new Set([500, 502, 503, 521, 522, 523, 524, 529])`). 504 is also added since it represents a gateway timeout.

Test assertions added for all three status codes.

## Why "timeout" and not "rate_limit"?

The message-based classifier (`classifyFailoverReason`) already maps transient HTTP errors to `"timeout"` via `isTransientHttpError()` → `return "timeout"`. Using the same reason ensures consistent behavior regardless of whether the error is classified by status code or message text.

Fixes #20999

<h3>Greptile Summary</h3>

Fixed model failover not triggering for HTTP 502/503/504 errors by adding explicit status-code branches in `resolveFailoverReasonFromError()`. These transient server errors now return `"timeout"` as the failover reason, consistent with the existing message-based classification for other transient errors.

- Added status checks for 502, 503, and 504 in `src/agents/failover-error.ts:164-166`
- Added test coverage for all three status codes
- Fix prevents runs from retrying the same unavailable model indefinitely when API SDKs set `err.status` without prefixing the message string

Minor inconsistency: 504 is now treated as transient in the status-code branch but `TRANSIENT_HTTP_ERROR_CODES` (used by message-based classification) excludes it. This means classification could differ based on error format, though the practical impact is limited since most SDKs set `err.status`.

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge with low risk
- The fix is narrowly scoped, well-tested, and addresses a clear bug where failover wasn't triggering for common transient HTTP errors. The logic is straightforward and consistent with existing patterns. Score is 4 (not 5) due to a minor inconsistency between status-code and message-based classification paths for 504 errors, though this is unlikely to cause issues in practice.
- No files require special attention

<sub>Last reviewed commit: c932dac</sub>
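
For readers who want to see the two classification paths side by side, here is a minimal sketch of the behavior described in this PR. It is not the code in `failover-error.ts`: the function names ending in `Sketch`, the error shape, and the regular expression are assumptions made for illustration, and the other status-code branches (402/429/401/403/408/400) are left out because their mappings are not stated here. Only the contents of `TRANSIENT_HTTP_ERROR_CODES` are taken from the description above.

```typescript
// Hypothetical sketch: names and shapes are assumptions, not the repository's code.
type FailoverReason = "timeout";

// Set contents as quoted in the PR description (note that 504 is absent).
const TRANSIENT_HTTP_ERROR_CODES = new Set([500, 502, 503, 521, 522, 523, 524, 529]);

// Status-property path: the branch this PR adds, as described.
function reasonFromStatusSketch(err: { status?: number }): FailoverReason | null {
  const status = err.status;
  if (status === 502 || status === 503 || status === 504) return "timeout";
  return null; // other status codes are omitted from this sketch
}

// Message path: only matches when the message starts with a numeric status code.
// The regular expression is an assumed stand-in for extractLeadingHttpStatus().
function leadingStatusSketch(message: string): number | null {
  const match = /^\s*(\d{3})\b/.exec(message);
  return match ? Number(match[1]) : null;
}

function reasonFromMessageSketch(message: string): FailoverReason | null {
  const status = leadingStatusSketch(message);
  return status !== null && TRANSIENT_HTTP_ERROR_CODES.has(status) ? "timeout" : null;
}

// SDK-style error: status is a property, the message has no numeric prefix.
const sdkError = { status: 503, message: "Service Unavailable" };

console.log(reasonFromMessageSketch(sdkError.message)); // null: message path misses it
console.log(reasonFromStatusSketch(sdkError)); // "timeout": status path catches it

// The 504 inconsistency noted in the review: the two paths disagree.
console.log(reasonFromStatusSketch({ status: 504 })); // "timeout"
console.log(reasonFromMessageSketch("504 Gateway Timeout")); // null
```

Under these assumptions, adding 504 to the transient set would be one way to make the two paths agree, though this PR does not make that change.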