
#23564: feat(auth): add timeout retry before auth profile rotation

by echoVic open 2026-02-22 12:55
agents size: M trusted-contributor
## Problem

Timeouts are currently treated as strong rate-limit signals, causing immediate auth profile cooldown and rotation. This is too aggressive for transient network issues, slow responses, or temporary provider latency.

**Current behavior:**

1. Request times out
2. Profile immediately marked as failed with exponential cooldown (1m → 5m → 25m → 1h)
3. Rotate to next auth profile
4. If all profiles exhausted → trigger model/provider failover

**Impact:** Temporary connectivity issues can exhaust all auth profiles and cascade into unnecessary model failover.

## Solution

Add configurable retry-with-backoff on the **same auth profile** before applying cooldown or rotating.

### New Configuration

```json5
agents: {
  defaults: {
    modelFailover: {
      retrySameProfileOnTimeout: 1,  // default: 1 retry
      retryBackoffMs: [300, 1200]    // default: 300ms, 1200ms
    }
  }
}
```

### Behavior

- **First timeout:** Retry same profile after 300ms ±30% jitter
- **Second timeout:** Retry after 1200ms ±30% jitter
- **Third timeout:** Apply cooldown + rotate (existing behavior)
- **Non-timeout failures:** Immediate rotation (unchanged)
- **Success:** Reset timeout counter

### Implementation Details

**Files changed:**

- `src/config/types.agent-defaults.ts` - Add `modelFailover` config type
- `src/config/zod-schema.agent-defaults.ts` - Add Zod schema validation
- `src/agents/pi-embedded-runner/run.ts` - Implement retry logic

**Key logic:**

- Track `consecutiveTimeouts` per profile
- Check retry limit before calling `markAuthProfileFailure`
- Apply jittered backoff delay
- Only rotate after retries exhausted

## Testing

- [x] Lint passes
- [ ] Manual test: timeout → retry → success
- [ ] Manual test: timeout → timeout → cooldown + rotate
- [ ] Unit test: retry counter resets on success
- [ ] Unit test: non-timeout failures skip retry

## Backward Compatibility

✅ Fully backward compatible:

- Default config (`retrySameProfileOnTimeout: 1`) adds 1 retry before existing behavior
- Existing configs without `modelFailover` use defaults
- Non-timeout failures unchanged

## Related

Fixes #23317

## Acceptance Criteria

- [x] Single timeout retries same profile with jittered delay
- [x] Cooldown only applied after retries exhausted
- [x] Explicit rate-limit failures (429) still trigger immediate cooldown
- [x] Logs show retry attempt number and delay
- [ ] Tests verify retry behavior

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR adds retry-with-backoff for timeout failures before rotating auth profiles, reducing unnecessary profile exhaustion from transient network issues.

**Implementation:**

- Tracks `consecutiveTimeouts` per profile within the run loop
- Retries same profile with jittered backoff delays before applying cooldown
- Resets counter on non-timeout failures
- Config: `modelFailover.retrySameProfileOnTimeout` (default: 1) and `retryBackoffMs` (default: [300, 1200])

**Issues found:**

- Jitter calculation only adds positive delay (0-30%) instead of ±30% as documented, potentially causing thundering herd
- Duplicate timeout log messages when retries are exhausted (line 994 and line 1020)
- PR description mentions 2 retries before rotation but default config is 1 retry (mismatch between description and code)

<h3>Confidence Score: 3/5</h3>

- Safe to merge with one critical jitter fix needed
- The retry logic is sound and properly integrated into the existing failover flow. However, the jitter calculation bug could cause synchronized retries across multiple clients (thundering herd problem). The duplicate logging is a minor quality issue. After fixing the jitter calculation, this would be a 4.
- src/agents/pi-embedded-runner/run.ts needs jitter calculation fix before merging

<sub>Last reviewed commit: 4352bd0</sub>

<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
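For reference, here is a minimal sketch of how the documented ±30% jitter and the retry-or-rotate decision might fit together. It is not the code in `run.ts`: `jitteredDelayMs`, `decideOnTimeout`, `TimeoutRetryConfig`, and the injectable `random` parameter are hypothetical names introduced for illustration. It also assumes `retrySameProfileOnTimeout` counts retries on the same profile, so with the default of 1 the second consecutive timeout rotates (the reading the review attributes to the code, not the three-step list in the description).

```typescript
// Hypothetical sketch; names and structure are assumptions, not the PR's run.ts.

/** Source of randomness returning a value in [0, 1); injectable for tests. */
type RandomSource = () => number;

/**
 * Symmetric jitter: scales baseMs by a factor drawn uniformly from
 * [1 - ratio, 1 + ratio), so delays land both below and above the base.
 */
function jitteredDelayMs(
  baseMs: number,
  ratio: number = 0.3,
  random: RandomSource = Math.random,
): number {
  const factor = 1 + (random() * 2 - 1) * ratio;
  return Math.max(0, Math.round(baseMs * factor));
}

/** Mirrors the two config keys described in this PR. */
interface TimeoutRetryConfig {
  retrySameProfileOnTimeout: number;
  retryBackoffMs: number[];
}

type TimeoutDecision =
  | { action: "retry"; delayMs: number }
  | { action: "rotate" };

/**
 * Decide what to do after a timeout on the current profile.
 * consecutiveTimeouts includes the timeout that was just observed.
 */
function decideOnTimeout(
  consecutiveTimeouts: number,
  config: TimeoutRetryConfig,
  random: RandomSource = Math.random,
): TimeoutDecision {
  // Retries exhausted: the caller applies cooldown and rotates profiles.
  if (consecutiveTimeouts > config.retrySameProfileOnTimeout) {
    return { action: "rotate" };
  }
  // Reuse the last backoff step if there are more retries than steps;
  // fall back to no delay if the backoff list is empty.
  const steps = config.retryBackoffMs;
  const baseMs = steps[Math.min(consecutiveTimeouts - 1, steps.length - 1)] ?? 0;
  return { action: "retry", delayMs: jitteredDelayMs(baseMs, 0.3, random) };
}
```

Passing `random` explicitly keeps the jitter bounds unit-testable, which could help with the unchecked "Tests verify retry behavior" item. Non-timeout failures would bypass `decideOnTimeout` entirely and rotate immediately, as described under Behavior.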

Most Similar PRs