#12314: fix: treat HTTP 5xx server errors as failover-worthy
agents
stale
## Summary
When a provider returns an HTTP 5xx server error (e.g., Anthropic returning `503 No capacity available`), `classifyFailoverReason()` returns `null`, so the error is **not treated as failover-worthy**. As a result, configured model fallbacks are never attempted, and users see raw error messages or silent failures instead of automatic failover.
This PR adds `"server_error"` as a new `FailoverReason` and detects 5xx errors through both:
- **HTTP status code**: any `status >= 500` → `"server_error"`
- **Error message patterns**: `internal server error`, `bad gateway`, `service unavailable`, `no capacity available`, and status code patterns (`500`, `502`, `503`, `529`)
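The two detection paths above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `SketchFailoverReason`, `looksLikeServerErrorMessage`, and `classifySketch` are hypothetical names, and the regexes are an assumption about how the listed patterns might be matched.

```typescript
// Hypothetical sketch; names and regexes are illustrative, not the PR's code.
type SketchFailoverReason = "server_error" | null;

// Message fragments listed in the summary, matched case-insensitively.
const SERVER_ERROR_PATTERNS: RegExp[] = [
  /internal server error/i,
  /bad gateway/i,
  /service unavailable/i,
  /no capacity available/i,
  // Assumption: status codes are matched as standalone numbers in the message.
  /\b(?:500|502|503|529)\b/,
];

function looksLikeServerErrorMessage(message: string): boolean {
  return SERVER_ERROR_PATTERNS.some((pattern) => pattern.test(message));
}

function classifySketch(
  status: number | undefined,
  message: string,
): SketchFailoverReason {
  // Status path: any status >= 500 counts as a server error.
  if (typeof status === "number" && status >= 500) return "server_error";
  // Message path: covers errors that arrive without a usable status code.
  if (looksLikeServerErrorMessage(message)) return "server_error";
  return null;
}
```

One design point worth checking in review: a bare-number pattern such as `500` can also match unrelated text in an error message, so the real patterns may need to be stricter than this sketch.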
## Changes
| File | Change |
|------|--------|
| `pi-embedded-helpers/types.ts` | Add `"server_error"` to `FailoverReason` union |
| `pi-embedded-helpers/errors.ts` | Add `serverError` patterns + `isServerErrorMessage()` + update `classifyFailoverReason()` |
| `pi-embedded-helpers.ts` | Export `isServerErrorMessage` |
| `failover-error.ts` | Handle `status >= 500` in `resolveFailoverReasonFromError()` + map `server_error` → 503 in `resolveFailoverStatus()` |
| `failover-error.test.ts` | Add tests for 5xx status codes, error messages, and coercion |
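For the `failover-error.ts` row, the status handling runs in two directions: from an error's HTTP status to a reason, and from a reason back to a representative status. The sketch below illustrates that shape under stated assumptions. `reasonFromStatus`, `statusFromReason`, and `ErrorWithStatus` are hypothetical stand-ins for the real `resolveFailoverReasonFromError()` and `resolveFailoverStatus()`, whose actual signatures are not shown in this description.

```typescript
// Hypothetical shapes; the real FailoverReason union presumably has other members.
type SketchReason = "server_error";

interface ErrorWithStatus {
  status?: number;
  message?: string;
}

// Direction 1: derive a reason from an error object's HTTP status.
function reasonFromStatus(err: ErrorWithStatus): SketchReason | null {
  return typeof err.status === "number" && err.status >= 500
    ? "server_error"
    : null;
}

// Direction 2: pick a representative status when only the reason is known.
function statusFromReason(reason: SketchReason | null): number | undefined {
  return reason === "server_error" ? 503 : undefined;
}
```

In this sketch the round trip is lossy: a 529 maps to `server_error`, which maps back to 503. Whether the original status is preserved elsewhere in the real code is not stated here.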
## Test plan
- [x] All existing tests pass (`vitest run src/agents/failover-error.test.ts` — 8 tests)
- [x] New tests cover: HTTP 500/502/503/529 status codes, "no capacity available" message, "service unavailable" message, coercion with provider metadata
- [x] TypeScript compiles without errors in changed files
- [ ] Manual: configure a primary model + fallback, simulate 503 → verify fallback triggers
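The unchecked manual step could be dry-run with a small stand-in harness before testing against real providers. The sketch below is a simplified, synchronous illustration only: `Candidate`, `runWithFallback`, `statusOf`, and the model names are all hypothetical and do not reflect the project's actual fallback runner.

```typescript
// Hypothetical harness; model names and the runner shape are made up.
interface Candidate {
  model: string;
  run: () => string;
}

// Read a numeric `status` off an unknown thrown value, if present.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    return typeof status === "number" ? status : undefined;
  }
  return undefined;
}

function isFailoverWorthy(err: unknown): boolean {
  const status = statusOf(err);
  return status !== undefined && status >= 500;
}

function runWithFallback(
  candidates: Candidate[],
): { model: string; output: string } {
  let lastError: unknown = new Error("no candidates configured");
  for (const candidate of candidates) {
    try {
      return { model: candidate.model, output: candidate.run() };
    } catch (err) {
      // Errors that are not failover-worthy are rethrown immediately.
      if (!isFailoverWorthy(err)) throw err;
      lastError = err;
    }
  }
  // Every candidate failed with a failover-worthy error.
  throw lastError;
}

// Simulate a primary that returns 503 and a fallback that succeeds.
const result = runWithFallback([
  {
    model: "primary",
    run: () => {
      throw Object.assign(new Error("No capacity available"), { status: 503 });
    },
  },
  { model: "fallback", run: () => "ok" },
]);
```

Here `result.model` identifies which candidate produced the output, which is the observable the manual check is after.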
Fixes #8112
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR extends failover classification so HTTP 5xx responses are treated as failover-worthy via a new `server_error` reason, detected by status codes (`status >= 500`) and message patterns (e.g. “service unavailable”, “no capacity available”). It updates the failover error coercion/status mapping and adds unit tests to cover these cases.
The PR also adjusts embedded Pi subscription handling to detect native thinking blocks and relax `<final>` tag enforcement when native thinking is present.
Separately, `src/gateway/server-methods/chat.ts` was refactored to use the response-prefix template context plumbing and to register agent run context for routing; this refactor currently removes a few fields/behaviors that appear relied upon by gateway clients (see comments).
<h3>Confidence Score: 3/5</h3>
- This PR is close, but gateway chat API regressions should be fixed before merging.
- Failover classification changes look straightforward and are covered by tests, but the unrelated `chat.ts` refactor removes/changes response fields and tool-event routing behavior in ways that can break gateway/webchat clients.
- File needing attention: `src/gateway/server-methods/chat.ts`
<!-- /greptile_comment -->
## Most Similar PRs
| PR | Author | Date | Similarity |
|----|--------|------|------------|
| #21017: fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) | taw0002 | 2026-02-19 | 83.8% |
| #21049: fix(failover): treat HTTP 5xx as rate-limit for model fallback | maximalmargin | 2026-02-19 | 83.6% |
| #15815: Fallback LLM doesn't trigger if primary model is local | shihanqu | 2026-02-13 | 83.2% |
| #10178: fix: trigger fallback when model resolution fails with unknown model | Yida-Dev | 2026-02-06 | 82.6% |
| #9427: fix: trigger model fallback on all 4xx HTTP errors | dbottme | 2026-02-05 | 82.5% |
| #7229: fix: add network error resilience to agentic loop failover | ai-fanatic | 2026-02-02 | 82.1% |
| #11174: Fix/fried chicken error | jfgrissom | 2026-02-07 | 81.8% |
| #5031: fix: add network connection error codes to failover classifier | shayan919293 | 2026-01-30 | 81.6% |
| #21491: fix: classify Google 503 UNAVAILABLE as transient failover [AI-assi... | ZPTDclaw | 2026-02-20 | 81.4% |
| #21033: fix(failover): classify connection errors as timeout for model fail... | zerone0x | 2026-02-19 | 80.8% |