← Back to PRs

#12314: fix: treat HTTP 5xx server errors as failover-worthy

by hsssgdtc open 2026-02-09 04:00 View on GitHub →
agents stale
## Summary When a provider returns an HTTP 5xx server error (e.g., Anthropic returning `503 No capacity available`), `classifyFailoverReason()` returns `null`, so the error is **not treated as failover-worthy**. Configured model fallbacks are never attempted — users see raw error messages or silent failures instead of automatic failover. This PR adds `"server_error"` as a new `FailoverReason` and detects 5xx errors through both: - **HTTP status code**: any `status >= 500` → `"server_error"` - **Error message patterns**: `internal server error`, `bad gateway`, `service unavailable`, `no capacity available`, and status code patterns (`500`, `502`, `503`, `529`) ## Changes | File | Change | |------|--------| | `pi-embedded-helpers/types.ts` | Add `"server_error"` to `FailoverReason` union | | `pi-embedded-helpers/errors.ts` | Add `serverError` patterns + `isServerErrorMessage()` + update `classifyFailoverReason()` | | `pi-embedded-helpers.ts` | Export `isServerErrorMessage` | | `failover-error.ts` | Handle `status >= 500` in `resolveFailoverReasonFromError()` + map `server_error` → 503 in `resolveFailoverStatus()` | | `failover-error.test.ts` | Add tests for 5xx status codes, error messages, and coercion | ## Test plan - [x] All existing tests pass (`vitest run src/agents/failover-error.test.ts` — 8 tests) - [x] New tests cover: HTTP 500/502/503/529 status codes, "no capacity available" message, "service unavailable" message, coercion with provider metadata - [x] TypeScript compiles without errors in changed files - [ ] Manual: configure a primary model + fallback, simulate 503 → verify fallback triggers Fixes #8112 <!-- greptile_comment --> <h2>Greptile Overview</h2> <h3>Greptile Summary</h3> This PR extends failover classification so HTTP 5xx responses are treated as failover-worthy via a new `server_error` reason, detected by status codes (`status >= 500`) and message patterns (e.g. “service unavailable”, “no capacity available”). It updates the failover error coercion/status mapping and adds unit tests to cover these cases. The PR also adjusts embedded Pi subscription handling to detect native thinking blocks and relax `<final>` tag enforcement when native thinking is present. Separately, `src/gateway/server-methods/chat.ts` was refactored to use the response-prefix template context plumbing and to register agent run context for routing; this refactor currently removes a few fields/behaviors that appear relied upon by gateway clients (see comments). <h3>Confidence Score: 3/5</h3> - This PR is close, but gateway chat API regressions should be fixed before merging. - Failover classification changes look straightforward and are covered by tests, but the unrelated `chat.ts` refactor removes/changes response fields and tool-event routing behavior in ways that can break gateway/webchat clients. - src/gateway/server-methods/chat.ts <!-- greptile_other_comments_section --> <sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub> <!-- /greptile_comment -->

Most Similar PRs