#18902: fix: exempt format errors from auth profile cooldown
agents
size: XS
Cluster: Rate Limit Management Enhancements
## Summary
- **Problem:** A single format error (e.g. an orphaned `tool_use_id` after session corruption) puts the auth profile into cooldown, blocking all models on the same provider.
- **Why it matters:** Cascading false rate-limits kill the entire agent run when all models share one provider/profile.
- **What changed:** `computeNextProfileUsageStats` now exempts `format` failures from cooldown (same as `timeout`). Added 4 tests.
- **What did NOT change:** Billing disable and rate_limit cooldown paths are untouched.
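
A minimal sketch of the exemption described in this summary, assuming a simplified stats shape. Only `computeNextProfileUsageStats`, `errorCount`, and `cooldownUntil` are names that appear in this PR; the signature, the `FailureReason` type, the `disabledUntil` field, and both durations are hypothetical and not taken from the actual diff.

```typescript
// Hypothetical types: the real definitions may differ.
type FailureReason = "format" | "timeout" | "rate_limit" | "billing";

interface ProfileUsageStats {
  errorCount: number;
  cooldownUntil?: number; // epoch ms; unset means the profile is usable
  disabledUntil?: number; // epoch ms; assumed field for the billing path
}

// Assumed durations, for illustration only.
const COOLDOWN_MS = 60_000;
const BILLING_DISABLE_MS = 5 * 60 * 60 * 1000;

function computeNextProfileUsageStats(
  prev: ProfileUsageStats,
  reason: FailureReason,
  now: number,
): ProfileUsageStats {
  // Every failure is counted, including the exempt ones.
  const next: ProfileUsageStats = { ...prev, errorCount: prev.errorCount + 1 };

  // Request-level failures say nothing about the health of the profile.
  const exemptFromCooldown = reason === "format" || reason === "timeout";

  if (reason === "billing") {
    next.disabledUntil = now + BILLING_DISABLE_MS;
  } else if (!exemptFromCooldown) {
    next.cooldownUntil = now + COOLDOWN_MS;
  }
  return next;
}
```

The point of the design is that `format` and `timeout` describe a single request rather than the credential, so they are recorded in `errorCount` but leave `cooldownUntil` untouched.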
## Change Type (select all)
- [x] Bug fix
## Scope (select all touched areas)
- [x] Auth / tokens
## User-visible / Behavior Changes
Format errors no longer cascade into blocking all models on a provider. The failing model still falls back normally; other models remain available.
## Security Impact (required)
- New permissions/capabilities? No
- Secrets/tokens handling changed? No
- New/changed network calls? No
- Command/tool execution surface changed? No
- Data access scope changed? No
## Repro + Verification
### Steps
1. Start from a corrupted session that contains an orphaned `tool_use_id`.
2. Run the agent. The first model gets a 400 (format error).
3. Before fix: all remaining models report `rate_limit` because the profile is in cooldown.
4. After fix: fallback proceeds to the next model normally.
### Expected
Other models on the same provider remain available after a format error.
### Actual (before fix)
All 5 models fail with rate_limit because the single auth profile enters cooldown.
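
The cascade in the steps above can be illustrated with a toy model of the fallback loop. This is a sketch under stated assumptions, not the project's actual loop: `runFallback`, the `Outcome` values, and the model names are hypothetical, and it assumes that only the first model hits the format error while the others would succeed if reached.

```typescript
// Hypothetical model of a fallback loop over models that share one auth profile.
type Outcome = "ok" | "format" | "rate_limit";

function runFallback(
  models: string[],
  brokenModel: string,
  formatTriggersCooldown: boolean,
): Record<string, Outcome> {
  let profileInCooldown = false; // single profile shared by every model
  const results: Record<string, Outcome> = {};

  for (const model of models) {
    if (profileInCooldown) {
      // The profile is skipped, which surfaces as a rate limit for this model.
      results[model] = "rate_limit";
      continue;
    }
    if (model === brokenModel) {
      results[model] = "format"; // provider answered 400
      profileInCooldown = formatTriggersCooldown;
      continue;
    }
    results[model] = "ok";
    break; // first success ends the run
  }
  return results;
}

const exampleModels = ["m1", "m2", "m3", "m4", "m5"];
// Before the fix: m1 fails with "format", m2..m5 are reported as "rate_limit".
console.log(runFallback(exampleModels, "m1", true));
// After the fix: m1 fails with "format", m2 is tried and succeeds.
console.log(runFallback(exampleModels, "m1", false));
```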
## Evidence
- [x] Failing test/log before + passing after
- 21/21 tests pass, including 4 new tests for `markAuthProfileFailure`
## Human Verification (required)
- Verified scenarios: format error does not set cooldown, timeout does not set cooldown, rate_limit does set cooldown, billing sets disable
- Edge cases checked: single-profile provider with multiple models
- What you did not verify: live agent run with corrupted session
## Compatibility / Migration
- Backward compatible? Yes
- Config/env changes? No
- Migration needed? No
## Failure Recovery (if this breaks)
- Revert this commit to restore previous behavior where format errors trigger cooldown.
## Risks and Mitigations
- Risk: Repeated format errors no longer trigger cooldown, so a persistently broken transcript could cause rapid retries.
- Mitigation: Format errors still count toward `errorCount`, and the model fallback loop moves on to the next model. The transcript repair code prevents most recurring format errors.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Exempts `format` and `timeout` errors from auth profile cooldown to prevent cascading false rate-limits when multiple models share the same provider.
- Adds conditional check in `computeNextProfileUsageStats` to skip cooldown for `format` and `timeout` reasons
- `errorCount` still increments for tracking, but `cooldownUntil` remains unset
- Billing and rate_limit errors continue to trigger cooldown/disable as expected
- Test coverage validates all four failure types: format (no cooldown), timeout (no cooldown), rate_limit (cooldown), billing (disable)
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The change is surgical and well-tested. It adds a single conditional check that preserves existing behavior for billing and rate_limit while fixing the cascading cooldown issue for format and timeout errors. All four error types have test coverage, and the logic correctly maintains error counting while preventing inappropriate cooldowns.
- No files require special attention
<sub>Last reviewed commit: 0a3ec2b</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #14824: fix: do not trigger provider cooldown on LLM request timeouts (by CyberSinister · 2026-02-12 · 83.0%)
- #14574: fix: gentler rate-limit cooldown backoff + clear stale cooldowns on... (by JamesEBall · 2026-02-12 · 82.2%)
- #14368: fix: skip auth profile cooldown on format errors to prevent provide... (by koatora20 · 2026-02-12 · 81.2%)
- #19267: fix: derive failover reason from timedOut flag to prevent unknown c... (by austenstone · 2026-02-17 · 80.2%)
- #14914: fix: resolve actual failure reason for cooldown-skipped providers (by mcaxtr · 2026-02-12 · 80.2%)
- #20388: fix(failover): don't skip same-provider fallback models when cooldo... (by Limitless2023 · 2026-02-18 · 80.2%)
- #20946: fix: skip auth cooldown on timeout (not an auth failure) (by austenstone · 2026-02-19 · 79.7%)
- #16797: fix(auth-profiles): implement per-model rate limit cooldown tracking (by mulhamna · 2026-02-15 · 79.2%)
- #13077: fix: prevent cooldown pollution across different models on the same... (by magendary · 2026-02-10 · 78.9%)
- #23210: fix: avoid cooldown on timeout/unknown failovers (by nydamon · 2026-02-22 · 78.7%)