#21491: fix: classify Google 503 UNAVAILABLE as transient failover [AI-assisted]
Labels: `agents` · size: S
## Summary
Fixes a failover classification gap where Google AI SDK (and Vertex AI) responses can return JSON-wrapped 503 errors like:
```json
{"error":{"message":"The model is overloaded.","code":503,"status":"UNAVAILABLE"}}
```
These payloads were not detected by `isTransientHttpError` (which only matched `HTTP/1.1 503`-style leading status lines), so failover cascades stalled instead of advancing to the next candidate.
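To illustrate the gap, a minimal sketch (the regex is an assumption about the prior behavior, not the actual source): a check that only matches a leading `HTTP/1.1 503`-style status line never sees the embedded `code` in Google's JSON wrapper.

```typescript
// Assumed shape of the old leading-status-line check (illustrative only).
const leadingStatus = /^HTTP\/\d(?:\.\d)?\s+(\d{3})\b/;

const googlePayload =
  '{"error":{"message":"The model is overloaded.","code":503,"status":"UNAVAILABLE"}}';

leadingStatus.test("HTTP/1.1 503 Service Unavailable"); // detected → failover advances
leadingStatus.test(googlePayload); // missed → cascade stalls on the same candidate
```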
### Changes
- **`pi-embedded-helpers/errors.ts`**: Add `extractEmbeddedHttpCode(raw)` that parses JSON-wrapped API error payloads and returns the numeric `error.code`. Update `isTransientHttpError` to fall through to this helper when no leading HTTP status is found.
- **`failover-error.ts`**: Add clarifying comment to the existing 503→`"timeout"` mapping.
- Tests updated to reflect correct behavior: JSON-wrapped 503 now classifies as `"timeout"` (not `"rate_limit"`) because `isTransientHttpError` fires before `isOverloadedErrorMessage`.
> **Note:** Per reviewer feedback on prior PR #20805 — the error struct is shared with Vertex AI; both surfaces are covered by the same `extractEmbeddedHttpCode` path and documented in comments/tests.
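A sketch of the helper and fallthrough described above. The names `extractEmbeddedHttpCode` and `isTransientHttpError` come from this PR, but the bodies below are an assumption about the implementation, not the actual source:

```typescript
/**
 * Parse a JSON-wrapped API error payload like
 *   {"error":{"message":"The model is overloaded.","code":503,"status":"UNAVAILABLE"}}
 * and return the numeric `error.code`, or undefined when the payload
 * is not such a wrapper. (Hypothetical body.)
 */
function extractEmbeddedHttpCode(raw: string): number | undefined {
  const start = raw.indexOf("{");
  if (start < 0) return undefined;
  try {
    const parsed = JSON.parse(raw.slice(start));
    const code = parsed?.error?.code;
    return typeof code === "number" ? code : undefined;
  } catch {
    return undefined; // not JSON, or not the expected error struct
  }
}

// Transient set assumed for illustration.
const TRANSIENT_CODES = new Set([502, 503, 504]);

function isTransientHttpError(raw: string): boolean {
  // Existing path: leading "HTTP/1.1 503"-style status line.
  const m = /^HTTP\/\d(?:\.\d)?\s+(\d{3})\b/.exec(raw.trim());
  if (m) return TRANSIENT_CODES.has(Number(m[1]));
  // New fallthrough: JSON-wrapped Google AI SDK / Vertex AI payloads.
  const embedded = extractEmbeddedHttpCode(raw);
  return embedded !== undefined && TRANSIENT_CODES.has(embedded);
}
```

Because this check fires before `isOverloadedErrorMessage` in the classification chain, a JSON-wrapped 503 maps to `"timeout"` rather than `"rate_limit"`, while a 429 payload (not in the transient set) is left for the rate-limit path.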
### Why
The issue surfaced from live council-run artifacts where the Gemini seat repeatedly returned `code: 503 / status: UNAVAILABLE` without cascading to fallback candidates. A script-level workaround exists, but this is the proper gateway-level fix.
### Testing
AI-assisted PR: yes (ZPTDclaw / Claude Code + manual conflict resolution).
Degree of testing: targeted + regression.
- `pnpm vitest run --config vitest.e2e.config.ts src/agents/failover-error.e2e.test.ts src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts` ✅ (44/44 passed)
- `pnpm build` ✅
### Prior art
Supersedes closed PR #20805 (same fix, rebased on current main + conflict resolution).
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Fixes failover classification gap where Google AI SDK and Vertex AI JSON-wrapped 503 errors were not being detected as transient, causing failover cascades to stall. The PR adds `extractEmbeddedHttpCode()` helper that parses `error.code` from JSON payloads and updates `isTransientHttpError()` to use it as a fallback when no leading HTTP status is found.
**Key behavior change**: JSON-wrapped 503 errors now classify as `"timeout"` (not `"rate_limit"`) because `isTransientHttpError` fires before `isOverloadedErrorMessage` in the classification chain.
- Comprehensive test coverage for both Google AI SDK and Vertex AI error formats
- Tests verify 503 classifies as timeout while 429 does not (rate limit vs transient distinction preserved)
- Code follows existing patterns and properly validates payloads before extraction
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk.
- Score reflects well-tested targeted fix with clear intent, thorough test coverage including edge cases, and alignment with existing error handling patterns. The behavior change is intentional and documented.
- No files require special attention.
<sub>Last reviewed commit: 019405b</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
### Most Similar PRs
- #21017: fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) — taw0002, 2026-02-19 (84.2%)
- #12314: fix: treat HTTP 5xx server errors as failover-worthy — hsssgdtc, 2026-02-09 (81.4%)
- #22359: fix(agents): classify overloaded service errors as timeout — AIflow-Labs, 2026-02-21 (80.5%)
- #7229: fix: add network error resilience to agentic loop failover — ai-fanatic, 2026-02-02 (78.9%)
- #11821: fix(auth): trigger failover on 401 status code from expired OAuth t... — AnonO6, 2026-02-08 (78.0%)
- #23520: fix: trigger failover on Anthropic insufficient_quota (HTTP 400) (#... — dissaozw, 2026-02-22 (77.3%)
- #6464: fix: trigger model failover on malformed tool-call JSON — ai-fanatic, 2026-02-01 (77.2%)
- #15815: Fallback LLM doesn't trigger if primary model is local — shihanqu, 2026-02-13 (77.1%)
- #12687: fix: handle empty LLM stream response with failover — janckerchen, 2026-02-09 (76.8%)
- #21049: fix(failover): treat HTTP 5xx as rate-limit for model fallback — maximalmargin, 2026-02-19 (76.8%)