#14914: fix: resolve actual failure reason for cooldown-skipped providers
agents
size: S
trusted-contributor
experienced-contributor
Cluster: Rate Limit Management Enhancements
Fixes #13909
## Summary
- Previously, when all auth profiles for a provider were in cooldown, the fallback loop hardcoded `reason: "rate_limit"` regardless of the failure that actually caused the cooldown (e.g. OAuth 403 → `"auth"`, billing 402 → `"billing"`)
- Added `resolveDominantCooldownReason()`, which inspects the `failureCounts` stored in profile usage stats and returns the failure reason with the highest count
- Falls back to `"rate_limit"` when no failure data is recorded (backward-compatible default)
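
As a rough illustration of the approach in the bullets above, here is one way such a resolver could work. The type shapes (`ProfileUsageStats`, the members of `AuthProfileFailureReason`), the function signature, and the tie-breaking rule are assumptions made for this sketch, not the code in this PR.

```typescript
// Hypothetical shapes; the real unions and stats types in the codebase may differ.
type AuthProfileFailureReason = "auth" | "billing" | "rate_limit" | "timeout" | "unknown";

interface ProfileUsageStats {
  failureCounts?: Partial<Record<AuthProfileFailureReason, number>>;
}

// Sum failure counts across the given profiles and return the reason with the
// highest total. Falls back to "rate_limit" when no usable counts are recorded.
function resolveDominantCooldownReason(
  profileIds: string[],
  usageStats: Record<string, ProfileUsageStats | undefined>,
): AuthProfileFailureReason {
  const totals = new Map<AuthProfileFailureReason, number>();
  for (const id of profileIds) {
    const counts = usageStats[id]?.failureCounts;
    if (!counts) continue;
    for (const [reason, count] of Object.entries(counts)) {
      // Ignore missing, non-numeric, or non-positive counts.
      if (typeof count !== "number" || !Number.isFinite(count) || count <= 0) continue;
      const key = reason as AuthProfileFailureReason;
      totals.set(key, (totals.get(key) ?? 0) + count);
    }
  }

  let best: AuthProfileFailureReason = "rate_limit";
  let bestCount = 0;
  for (const [reason, count] of totals) {
    // Strict ">" keeps the first-seen reason on ties (an arbitrary choice in this sketch).
    if (count > bestCount) {
      best = reason;
      bestCount = count;
    }
  }
  return best;
}
```

With this sketch, a single profile carrying `failureCounts: { auth: 1 }` resolves to `"auth"`, while a profile with no `failureCounts` resolves to the `"rate_limit"` default.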
## Test plan
- [x] New test: profile in cooldown with `failureCounts: { auth: 1 }` → attempt reason is `"auth"`
- [x] New test: profile in cooldown with `failureCounts: { billing: 1 }` → attempt reason is `"billing"`
- [x] Existing test: profile in cooldown with no `failureCounts` → reason remains `"rate_limit"` (backward compat)
- [x] All 22 tests pass; the 2 new tests fail before the fix and pass after it
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR changes model fallback behavior when a provider is skipped because *all* of its auth profiles are in cooldown or disabled. Previously, the skip attempt hardcoded `reason: "rate_limit"`. It now calls `resolveDominantCooldownReason()`, which aggregates `usageStats[profileId].failureCounts` across the provider’s profiles and returns the highest-count `AuthProfileFailureReason`, defaulting to `"rate_limit"` when no usable failure data exists.
Two new tests cover the new behavior for cooldowns driven by `auth` and `billing` failures, and the existing backward-compatible case (no `failureCounts`) remains `"rate_limit"`.
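
The skip branch described above might look roughly like the following sketch. Every name here (`Candidate`, `FallbackAttempt`, `recordCooldownSkips`, the `skipped` field, and the injected `isInCooldown` and `resolveReason` callbacks) is hypothetical and chosen for illustration; the actual fallback loop in the codebase is not shown in this PR description.

```typescript
// Hypothetical types for illustration only.
type FailoverReason = "auth" | "billing" | "rate_limit" | "timeout" | "unknown";

interface Candidate {
  provider: string;
  model: string;
  profileIds: string[];
}

interface FallbackAttempt {
  provider: string;
  model: string;
  skipped: boolean;
  reason: FailoverReason;
}

// Record a skipped attempt for every candidate whose profiles are all in
// cooldown, using the resolved reason instead of a hardcoded "rate_limit".
function recordCooldownSkips(
  candidates: Candidate[],
  isInCooldown: (profileId: string) => boolean,
  resolveReason: (profileIds: string[]) => FailoverReason,
): FallbackAttempt[] {
  const attempts: FallbackAttempt[] = [];
  for (const c of candidates) {
    const allCoolingDown = c.profileIds.length > 0 && c.profileIds.every(isInCooldown);
    if (!allCoolingDown) continue; // provider is usable; a real loop would try it here
    attempts.push({
      provider: c.provider,
      model: c.model,
      skipped: true,
      reason: resolveReason(c.profileIds),
    });
  }
  return attempts;
}
```

Passing the resolver in as a callback keeps the sketch self-contained; in practice it would be whatever function derives the reason from the stored failure counts.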
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk.
- The change is narrow (it only affects the cooldown-skip branch), reasons remain within the existing `FailoverReason`/`AuthProfileFailureReason` unions, and new tests cover the new behavior while preserving the prior default when `failureCounts` is absent or invalid.
- No files require special attention.
<sub>Last reviewed commit: 05811a8</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #14574: fix: gentler rate-limit cooldown backoff + clear stale cooldowns on... (by JamesEBall · 2026-02-12 · 85.7%)
- #20946: fix: skip auth cooldown on timeout (not an auth failure) (by austenstone · 2026-02-19 · 84.5%)
- #14824: fix: do not trigger provider cooldown on LLM request timeouts (by CyberSinister · 2026-02-12 · 83.2%)
- #23210: fix: avoid cooldown on timeout/unknown failovers (by nydamon · 2026-02-22 · 83.1%)
- #19267: fix: derive failover reason from timedOut flag to prevent unknown c... (by austenstone · 2026-02-17 · 81.5%)
- #20388: fix(failover): don't skip same-provider fallback models when cooldo... (by Limitless2023 · 2026-02-18 · 81.4%)
- #13077: fix: prevent cooldown pollution across different models on the same... (by magendary · 2026-02-10 · 80.6%)
- #18902: fix: exempt format errors from auth profile cooldown (by tag-assistant · 2026-02-17 · 80.2%)
- #15881: fix(models): probe-safe cooldown handling and compatible fallback a... (by wboudy · 2026-02-14 · 79.1%)
- #11371: Auth: cap rate-limit cooldown at 5 minutes; add maxCooldownMinutes ... (by lailoo · 2026-02-07 · 77.8%)