#22359: fix(agents): classify overloaded service errors as timeout
agents
size: XS
Cluster: Quota Management Improvements
## What changed
- Updated failover classification so overloaded/service-unavailable text is mapped to `timeout` instead of `rate_limit` in `src/agents/pi-embedded-helpers/errors.ts`.
- Updated `classifyFailoverReason` e2e coverage in `src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts` to lock this behavior.
## Why this fixes the issue
Issue #22294 occurs because high-demand and service-unavailable messages are classified as `rate_limit`, which triggers cooldown escalation. This change classifies those messages as transient `timeout` failures, so retry behavior is used instead of quota-based cooldown.
## Tests run
- `pnpm vitest run --config vitest.e2e.config.ts src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts src/agents/failover-error.e2e.test.ts`
## Edge cases
- Covers both plain text and payload-based phrases (`high demand`, `service unavailable`, `overloaded_error`) and verifies they now classify as `timeout`.
- Failures prefixed with an HTTP `500` or `503` status are still classified as `timeout` by the existing transient-HTTP logic.
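
For reviewers who want the intended behavior at a glance, here is a minimal sketch of the new classification order. It is not the code in `src/agents/pi-embedded-helpers/errors.ts`: the function names carry a `Sketch` suffix to mark them as hypothetical, the overload patterns are only the phrases listed in this PR, and the rate-limit patterns are assumed placeholders.

```typescript
// Illustrative only; the real classifier may use different patterns and more reasons.
type FailoverReasonSketch = "timeout" | "rate_limit";

// Phrases named in this PR description. The actual pattern list may differ.
const OVERLOADED_PATTERNS: RegExp[] = [
  /overloaded_error/i,
  /overloaded/i,
  /high demand/i,
  /service unavailable/i,
];

// Assumed stand-ins for quota-style errors; not taken from the source.
const RATE_LIMIT_PATTERNS: RegExp[] = [/rate limit/i, /too many requests/i];

function isOverloadedSketch(message: string): boolean {
  return OVERLOADED_PATTERNS.some((pattern) => pattern.test(message));
}

function classifyFailoverReasonSketch(
  message: string,
): FailoverReasonSketch | null {
  // Overload text now maps to "timeout" (previously "rate_limit").
  if (isOverloadedSketch(message)) return "timeout";
  if (RATE_LIMIT_PATTERNS.some((pattern) => pattern.test(message))) {
    return "rate_limit";
  }
  // Anything else is left unclassified in this sketch.
  return null;
}
```

In this sketch, both a plain-text message such as `Service unavailable, please retry` and a JSON payload containing `overloaded_error` return `timeout`, while a message containing `rate limit` still returns `rate_limit`.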
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Updated failover classification to treat overloaded/service-unavailable errors as `timeout` instead of `rate_limit`. This prevents these transient service capacity issues from triggering quota-based cooldown escalation, allowing proper retry behavior for temporary provider overload situations.
- Changed `isOverloadedErrorMessage()` classification from `rate_limit` to `timeout` in `src/agents/pi-embedded-helpers/errors.ts:816-818`
- Updated test expectations in `src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts` to verify the new classification
- Covers "overloaded_error" JSON payloads, "high demand", "service unavailable", and "overloaded" text patterns
- Both `timeout` and `rate_limit` use the same exponential backoff formula (5^errorCount, max 1hr), but the semantic difference helps downstream retry strategies distinguish between quota exhaustion and temporary capacity issues
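
The backoff formula mentioned in the last bullet can be sketched as follows. The review gives the formula (5^errorCount, capped at one hour) but not its base unit, so the one-minute default below is an assumption, and `cooldownMsSketch` is a hypothetical name rather than a function from the codebase.

```typescript
const ONE_HOUR_MS = 60 * 60 * 1000;

// baseUnitMs is an assumption: the review states the formula but not its unit.
function cooldownMsSketch(
  errorCount: number,
  baseUnitMs: number = 60 * 1000,
): number {
  // Guard against negative or fractional counts.
  const count = Math.max(0, Math.floor(errorCount));
  // Exponential growth in powers of 5, capped at one hour.
  return Math.min(ONE_HOUR_MS, baseUnitMs * Math.pow(5, count));
}
```

Under the one-minute assumption, one error yields a 5-minute cooldown, two errors yield 25 minutes, and three or more errors hit the one-hour cap.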
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The change is a simple, focused reclassification that correctly maps service overload errors to transient timeout behavior. Test coverage comprehensively validates the new classification for all overload patterns. The logic change is minimal (1 line) with clear intent, and both timeout and rate_limit use identical cooldown mechanics, so there's no risk of breaking retry behavior.
- No files require special attention
<sub>Last reviewed commit: 07dd6ec</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #11170: fix: classify subscription quota limit errors as rate_limit for fai... (by Yida-Dev · 2026-02-07 · 81.5%)
- #21017: fix: treat HTTP 502/503/504 as failover-eligible (timeout reason) (by taw0002 · 2026-02-19 · 81.2%)
- #21516: fix: classify connection errors as timeout for model failover (#20931) (by echoVic · 2026-02-20 · 80.8%)
- #21491: fix: classify Google 503 UNAVAILABLE as transient failover [AI-assi... (by ZPTDclaw · 2026-02-20 · 80.5%)
- #5031: fix: add network connection error codes to failover classifier (by shayan919293 · 2026-01-30 · 80.4%)
- #21033: fix(failover): classify connection errors as timeout for model fail... (by zerone0x · 2026-02-19 · 80.0%)
- #6014: Agents: improve quota exhaustion fallback detection (by erain · 2026-02-01 · 79.4%)
- #12314: fix: treat HTTP 5xx server errors as failover-worthy (by hsssgdtc · 2026-02-09 · 79.4%)
- #23520: fix: trigger failover on Anthropic insufficient_quota (HTTP 400) (#... (by dissaozw · 2026-02-22 · 79.4%)
- #15815: Fallback LLM doesn't trigger if primary model is local (by shihanqu · 2026-02-13 · 78.8%)