#18902: fix: exempt format errors from auth profile cooldown

by tag-assistant open 2026-02-17 05:27

agents size: XS

Cluster: Rate Limit Management Enhancements

## Summary

- **Problem:** A single format error (e.g. orphaned `tool_use_id` after session corruption) puts the auth profile into cooldown, blocking all models on the same provider.
- **Why it matters:** Cascading false rate-limits kill the entire agent run when all models share one provider/profile.
- **What changed:** `computeNextProfileUsageStats` now exempts `format` from cooldown (same as `timeout`). Added 4 tests.
- **What did NOT change:** Billing disable and `rate_limit` cooldown paths are untouched.

## Change Type (select all)

- [x] Bug fix

## Scope (select all touched areas)

- [x] Auth / tokens

## User-visible / Behavior Changes

Format errors no longer cascade into blocking all models on a provider. The failing model still falls back normally; other models remain available.

## Security Impact (required)

- New permissions/capabilities? No
- Secrets/tokens handling changed? No
- New/changed network calls? No
- Command/tool execution surface changed? No
- Data access scope changed? No

## Repro + Verification

### Steps

1. Have a corrupted session with orphaned `tool_use_id`
2. Run agent -- first model gets 400 (format error)
3. Before fix: all remaining models report `rate_limit` (profile in cooldown)
4. After fix: fallback proceeds to next model normally

### Expected

Other models on the same provider remain available after a format error.

### Actual (before fix)

All 5 models fail with `rate_limit` because the single auth profile enters cooldown.

## Evidence

- [x] Failing test/log before + passing after
- 21/21 tests pass including 4 new tests for `markAuthProfileFailure`

## Human Verification (required)

- Verified scenarios: format error does not set cooldown, timeout does not set cooldown, rate_limit does set cooldown, billing sets disable
- Edge cases checked: single-profile provider with multiple models
- What you did not verify: live agent run with corrupted session

## Compatibility / Migration

- Backward compatible? Yes
- Config/env changes? No
- Migration needed? No

## Failure Recovery (if this breaks)

- Revert this commit to restore previous behavior where format errors trigger cooldown.

## Risks and Mitigations

- Risk: Repeated format errors no longer trigger cooldown, so a persistently broken transcript could cause rapid retries.
- Mitigation: Format errors still count toward `errorCount` and the model fallback loop moves on; the transcript repair code prevents most recurring format errors.

<h3>Greptile Summary</h3>

Exempts `format` and `timeout` errors from auth profile cooldown to prevent cascading false rate-limits when multiple models share the same provider.

- Adds conditional check in `computeNextProfileUsageStats` to skip cooldown for `format` and `timeout` reasons
- `errorCount` still increments for tracking, but `cooldownUntil` remains unset
- Billing and rate_limit errors continue to trigger cooldown/disable as expected
- Test coverage validates all four failure types: format (no cooldown), timeout (no cooldown), rate_limit (cooldown), billing (disable)

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with minimal risk
- The change is surgical and well-tested. It adds a single conditional check that preserves existing behavior for billing and rate_limit while fixing the cascading cooldown issue for format and timeout errors. All four error types have test coverage, and the logic correctly maintains error counting while preventing inappropriate cooldowns.
- No files require special attention

<sub>Last reviewed commit: 0a3ec2b</sub>
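
For readers without the diff at hand, the behavior described in this PR can be sketched roughly as follows. This is an illustration only, not the code in the commit: the function name `sketchNextProfileUsageStats`, the `ProfileUsageStats` shape, the `disabledUntil` field, and both duration constants are assumptions made for the example. Only the reason names, `errorCount`, and `cooldownUntil` come from the description above.

```typescript
// Hypothetical types; the real definitions in the repository may differ.
type FailureReason = "format" | "timeout" | "rate_limit" | "billing";

interface ProfileUsageStats {
  errorCount: number;
  cooldownUntil?: number; // epoch milliseconds
  disabledUntil?: number; // epoch milliseconds (assumed field name)
}

// Assumed durations, chosen only to make the sketch concrete.
const COOLDOWN_MS = 60_000;
const BILLING_DISABLE_MS = 5 * 60 * 60 * 1000;

// Reasons that are counted as errors but never put the profile into cooldown.
const COOLDOWN_EXEMPT: ReadonlySet<FailureReason> = new Set<FailureReason>([
  "format",
  "timeout",
]);

function sketchNextProfileUsageStats(
  existing: ProfileUsageStats,
  reason: FailureReason,
  now: number,
): ProfileUsageStats {
  // Every failure is counted, regardless of reason.
  const next: ProfileUsageStats = {
    ...existing,
    errorCount: existing.errorCount + 1,
  };

  if (reason === "billing") {
    // Billing failures disable the profile rather than cooling it down.
    next.disabledUntil = now + BILLING_DISABLE_MS;
    return next;
  }

  if (COOLDOWN_EXEMPT.has(reason)) {
    // format / timeout: leave cooldownUntil untouched so other models
    // on the same provider stay available.
    return next;
  }

  // rate_limit: put the whole profile into cooldown.
  next.cooldownUntil = now + COOLDOWN_MS;
  return next;
}
```

Under these assumptions, a `format` failure on a fresh profile yields `errorCount: 1` with no `cooldownUntil`, so the fallback loop can try the next model on the same profile, while a `rate_limit` failure still sets `cooldownUntil`.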