#21491: fix: classify Google 503 UNAVAILABLE as transient failover [AI-assisted]
Labels: `agents` · size: S
## Summary
Fixes a failover classification gap where Google AI SDK (and Vertex AI) responses can return JSON-wrapped 503 errors like:
```json
{"error":{"message":"The model is overloaded.","code":503,"status":"UNAVAILABLE"}}
```
These payloads were not detected by `isTransientHttpError` (which only matched `HTTP/1.1 503`-style leading status lines), so failover cascades stalled instead of advancing to the next candidate.
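To illustrate the gap, a minimal sketch (the regex is an assumption about the prior behavior, not the actual source): a check that only matches a leading `HTTP/1.1 503`-style status line never sees the embedded `code` in Google's JSON wrapper.

```typescript
// Assumed shape of the old leading-status-line check (illustrative only).
const leadingStatus = /^HTTP\/\d(?:\.\d)?\s+(\d{3})\b/;

const googlePayload =
  '{"error":{"message":"The model is overloaded.","code":503,"status":"UNAVAILABLE"}}';

leadingStatus.test("HTTP/1.1 503 Service Unavailable"); // detected → failover advances
leadingStatus.test(googlePayload); // missed → cascade stalls on the same candidate
```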
### Changes
- **`pi-embedded-helpers/errors.ts`**: Add `extractEmbeddedHttpCode(raw)` that parses JSON-wrapped API error payloads and returns the numeric `error.code`. Update `isTransientHttpError` to fall through to this helper when no leading HTTP status is found.
- **`failover-error.ts`**: Add clarifying comment to the existing 503→`"timeout"` mapping.
- Tests updated to reflect correct behavior: JSON-wrapped 503 now classifies as `"timeout"` (not `"rate_limit"`) because `isTransientHttpError` fires before `isOverloadedErrorMessage`.
> **Note:** Per reviewer feedback on prior PR #20805 — the error struct is shared with Vertex AI; both surfaces are covered by the same `extractEmbeddedHttpCode` path and documented in comments/tests.
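A sketch of the helper and fallthrough described above. The names `extractEmbeddedHttpCode` and `isTransientHttpError` come from this PR, but the bodies below are an assumption about the implementation, not the actual source:

```typescript
/**
 * Parse a JSON-wrapped API error payload like
 *   {"error":{"message":"The model is overloaded.","code":503,"status":"UNAVAILABLE"}}
 * and return the numeric `error.code`, or undefined when the payload
 * is not such a wrapper. (Hypothetical body.)
 */
function extractEmbeddedHttpCode(raw: string): number | undefined {
  const start = raw.indexOf("{");
  if (start < 0) return undefined;
  try {
    const parsed = JSON.parse(raw.slice(start));
    const code = parsed?.error?.code;
    return typeof code === "number" ? code : undefined;
  } catch {
    return undefined; // not JSON, or not the expected error struct
  }
}

// Transient set assumed for illustration.
const TRANSIENT_CODES = new Set([502, 503, 504]);

function isTransientHttpError(raw: string): boolean {
  // Existing path: leading "HTTP/1.1 503"-style status line.
  const m = /^HTTP\/\d(?:\.\d)?\s+(\d{3})\b/.exec(raw.trim());
  if (m) return TRANSIENT_CODES.has(Number(m[1]));
  // New fallthrough: JSON-wrapped Google AI SDK / Vertex AI payloads.
  const embedded = extractEmbeddedHttpCode(raw);
  return embedded !== undefined && TRANSIENT_CODES.has(embedded);
}
```

Because this check fires before `isOverloadedErrorMessage` in the classification chain, a JSON-wrapped 503 maps to `"timeout"` rather than `"rate_limit"`, while a 429 payload (not in the transient set) is left for the rate-limit path.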
### Why
The issue surfaced from live council-run artifacts where the Gemini seat repeatedly returned `code: 503 / status: UNAVAILABLE` without cascading to fallback candidates. A script-level workaround exists, but this is the proper gateway-level fix.
### Testing
AI-assisted PR: yes (ZPTDclaw / Claude Code + manual conflict resolution).
Degree of testing: targeted + regression.
- `pnpm vitest run --config vitest.e2e.config.ts src/agents/failover-error.e2e.test.ts src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts` ✅ (44/44 passed)
- `pnpm build` ✅
### Prior art
Supersedes closed PR #20805 (same fix, rebased on current main + conflict resolution).
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Fixes failover classification gap where Google AI SDK and Vertex AI JSON-wrapped 503 errors were not being detected as transient, causing failover cascades to stall. The PR adds `extractEmbeddedHttpCode()` helper that parses `error.code` from JSON payloads and updates `isTransientHttpError()` to use it as a fallback when no leading HTTP status is found.
**Key behavior change**: JSON-wrapped 503 errors now classify as `"timeout"` (not `"rate_limit"`) because `isTransientHttpError` fires before `isOverloadedErrorMessage` in the classification chain.
- Comprehensive test coverage for both Google AI SDK and Vertex AI error formats
- Tests verify 503 classifies as timeout while 429 does not (rate limit vs transient distinction preserved)
- Code follows existing patterns and properly validates payloads before extraction
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk.
- Score reflects well-tested targeted fix with clear intent, thorough test coverage including edge cases, and alignment with existing error handling patterns. The behavior change is intentional and documented.
- No files require special attention.
<sub>Last reviewed commit: 019405b</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
### Most Similar PRs
- #21017: fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) — taw0002, 2026-02-19 (84.2%)
- #12314: fix: treat HTTP 5xx server errors as failover-worthy — hsssgdtc, 2026-02-09 (81.4%)
- #22359: fix(agents): classify overloaded service errors as timeout — AIflow-Labs, 2026-02-21 (80.5%)
- #7229: fix: add network error resilience to agentic loop failover — ai-fanatic, 2026-02-02 (78.9%)
- #11821: fix(auth): trigger failover on 401 status code from expired OAuth t... — AnonO6, 2026-02-08 (78.0%)
- #23520: fix: trigger failover on Anthropic insufficient_quota (HTTP 400) (#... — dissaozw, 2026-02-22 (77.3%)
- #6464: fix: trigger model failover on malformed tool-call JSON — ai-fanatic, 2026-02-01 (77.2%)
- #15815: Fallback LLM doesn't trigger if primary model is local — shihanqu, 2026-02-13 (77.1%)
- #12687: fix: handle empty LLM stream response with failover — janckerchen, 2026-02-09 (76.8%)
- #21049: fix(failover): treat HTTP 5xx as rate-limit for model fallback — maximalmargin, 2026-02-19 (76.8%)