#20911: fix: auto-reset session when all models time out
size: XS
Cluster:
Rate Limit Management Enhancements
Closes #20910
## What
Adds a circuit breaker that auto-resets the session when all fallback models time out, breaking the death spiral on first occurrence.
## Why
When a session grows very large, image sanitization and serialization consume the entire 120s timeout budget. All fallback models fail. The error does not match isLikelyContextOverflowError, so no self-healing triggers and the gateway retries indefinitely.
## How
- Detect all-timed-out pattern via regex on FailoverError message
- Exclude rate-limit timeouts via negative lookahead
- Call existing resetSessionAfterCompactionFailure to clear the session
- Return user-facing message explaining the reset
- Uses existing didResetAfterCompactionFailure guard to prevent double-reset
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
adds circuit breaker to auto-reset sessions when all fallback models time out, preventing infinite retry loops. also introduces per-model cooldown tracking to prevent one slow model from blocking other models on the same auth profile.
**Key changes:**
- new timeout detection regex in `agent-runner-execution.ts` triggers session reset when all models time out
- per-model cooldown infrastructure added (`isProfileInCooldownForModel`, `modelCooldowns` field)
- improved failover reason classification to distinguish timeouts from unknown errors
- copilot SDK auth handling for env-based token exchange
- queue management utility `removeFollowupRunByMessageId` added
- unrelated: peon-sounds plugin extension (361 lines)
**Critical issue found:**
- per-model cooldown feature is incomplete - `computeNextProfileUsageStats` accepts `modelId` parameter but never sets `updatedStats.modelCooldowns[modelId]`, causing tests to fail and the feature to be non-functional
<h3>Confidence Score: 1/5</h3>
- critical bug blocks main feature - per-model cooldowns not implemented
- the core per-model cooldown feature has a logic bug where `modelCooldowns` is never populated, making it non-functional. tests expect this behavior but the implementation is incomplete in `computeNextProfileUsageStats`. the timeout detection logic appears sound but cannot work correctly without the cooldown mechanism
- `src/agents/auth-profiles/usage.ts` requires immediate fix to implement per-model cooldown logic in `computeNextProfileUsageStats` function
<sub>Last reviewed commit: bca661f</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#19267: fix: derive failover reason from timedOut flag to prevent unknown c...
by austenstone · 2026-02-17
79.8%
#14574: fix: gentler rate-limit cooldown backoff + clear stale cooldowns on...
by JamesEBall · 2026-02-12
78.8%
#4462: fix: prevent gateway crash when all auth profiles are in cooldown
by garnetlyx · 2026-01-30
78.2%
#18902: fix: exempt format errors from auth profile cooldown
by tag-assistant · 2026-02-17
78.0%
#14824: fix: do not trigger provider cooldown on LLM request timeouts
by CyberSinister · 2026-02-12
77.3%
#20388: fix(failover): don't skip same-provider fallback models when cooldo...
by Limitless2023 · 2026-02-18
76.8%
#13077: fix: prevent cooldown pollution across different models on the same...
by magendary · 2026-02-10
76.3%
#16797: fix(auth-profiles): implement per-model rate limit cooldown tracking
by mulhamna · 2026-02-15
76.1%
#21731: Fix: clear stale session runtime model on /new, /reset, and session...
by AIflow-Labs · 2026-02-20
75.8%
#23564: feat(auth): add timeout retry before auth profile rotation
by echoVic · 2026-02-22
75.3%