
#22359: fix(agents): classify overloaded service errors as timeout

by AIflow-Labs open 2026-02-21 02:26
agents size: XS
## What changed

- Updated failover classification so overloaded/service-unavailable text is mapped to `timeout` instead of `rate_limit` in `src/agents/pi-embedded-helpers/errors.ts`.
- Updated `classifyFailoverReason` e2e coverage in `src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts` to lock this behavior.

## Why this fixes the issue

Issue #22294 is caused by high-demand/service-unavailable messages being treated as `rate_limit`, which triggers cooldown escalation. This change treats those as transient `timeout` failures, so retry behavior is used instead of quota-based cooldown.

## Tests run

- `pnpm vitest run --config vitest.e2e.config.ts src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts src/agents/failover-error.e2e.test.ts`

## Edge cases

- Covers both plain text and payload-based phrases (`high demand`, `service unavailable`, `overloaded_error`) and verifies they now classify as `timeout`.
- HTTP `500/503` status-prefixed failures remain handled by transient HTTP logic as `timeout`.

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Updated failover classification to treat overloaded/service-unavailable errors as `timeout` instead of `rate_limit`. This prevents these transient service capacity issues from triggering quota-based cooldown escalation, allowing proper retry behavior for temporary provider overload situations.

- Changed `isOverloadedErrorMessage()` classification from `rate_limit` to `timeout` in `src/agents/pi-embedded-helpers/errors.ts:816-818`
- Updated test expectations in `src/agents/pi-embedded-helpers.isbillingerrormessage.e2e.test.ts` to verify the new classification
- Covers "overloaded_error" JSON payloads, "high demand", "service unavailable", and "overloaded" text patterns
- Both `timeout` and `rate_limit` use the same exponential backoff formula (5^errorCount, max 1hr), but the semantic difference helps downstream retry strategies distinguish between quota exhaustion and temporary capacity issues

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with minimal risk
- The change is a simple, focused reclassification that correctly maps service overload errors to transient timeout behavior. Test coverage comprehensively validates the new classification for all overload patterns. The logic change is minimal (1 line) with clear intent, and both timeout and rate_limit use identical cooldown mechanics, so there's no risk of breaking retry behavior.
- No files require special attention

<sub>Last reviewed commit: 07dd6ec</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
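For readers without the diff open, the reclassification described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/agents/pi-embedded-helpers/errors.ts`: the names `isOverloadedSketch` and `classifyReasonSketch`, the regex lists, and the two rate-limit phrases are assumptions made for the example. Only the overload phrases (`overloaded`, `high demand`, `service unavailable`) and the `timeout` / `rate_limit` labels come from the PR text.

```typescript
// Hypothetical sketch of the behavior this PR describes; the real helper may differ.
type FailoverReasonSketch = "timeout" | "rate_limit";

// Overload phrases listed in the PR description.
// /overloaded/ also matches "overloaded_error" JSON payloads.
const OVERLOAD_PATTERNS: RegExp[] = [
  /overloaded/i,
  /high demand/i,
  /service unavailable/i,
];

// Assumed quota phrases, included only to show the contrast with overload errors.
const RATE_LIMIT_PATTERNS: RegExp[] = [/rate limit/i, /too many requests/i];

function isOverloadedSketch(message: string): boolean {
  return OVERLOAD_PATTERNS.some((pattern) => pattern.test(message));
}

function classifyReasonSketch(message: string): FailoverReasonSketch | null {
  // Overload is checked first, so capacity problems are reported as
  // transient "timeout" failures rather than as quota exhaustion.
  if (isOverloadedSketch(message)) return "timeout";
  if (RATE_LIMIT_PATTERNS.some((pattern) => pattern.test(message))) {
    return "rate_limit";
  }
  // Anything else is left unclassified in this sketch.
  return null;
}
```

In this sketch, a payload such as `{"type":"overloaded_error"}` or a message containing "high demand" yields `timeout`, while "Rate limit exceeded" still yields `rate_limit`.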

Most Similar PRs