#23816: fix(agents): model fallback skipped during session overrides and provider cooldowns

by ramezgaberiel open 2026-02-22 18:23

agents size: M

Cluster: Model Configuration and Fallback Fixes

## Summary

Describe the problem and fix in 2–5 bullets:

- **Problem:** Model fallback system skips all configured fallbacks when session model differs from config primary, and provider cooldowns block same-provider fallback attempts
- **Why it matters:** Users hitting quota limits lose fallback protection when doing normal model switches, affecting any users who expect configured fallbacks to work
- **What changed:** Modified comparison logic from exact model string matching to provider-only comparison, and cooldown logic to allow fallback attempts even during provider cooldown
- **What did NOT change (scope boundary):** All existing fallback behavior for other error scenarios, cross-provider blocking logic, auth profile management, probe timing for primary models

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #19249
- Related #

## User-visible / Behavior Changes

- Model fallbacks now work correctly when user switches models in session (e.g., from Opus to Sonnet for quota management)
- Same-provider fallbacks now attempt even during provider cooldown periods
- No configuration changes required - existing fallback configs will now work as expected
- Error messages remain the same when all fallbacks are exhausted

## Security Impact (required)

- New permissions/capabilities? (`No`)
- Secrets/tokens handling changed? (`No`)
- New/changed network calls? (`No`)
- Command/tool execution surface changed? (`No`)
- Data access scope changed?
(`No`)
- If any `Yes`, explain risk + mitigation:

## Repro + Verification

### Environment

- OS: macOS Darwin 25.2.0
- Runtime/container: Node.js with OpenClaw 2026.2.21-2
- Model/provider: Anthropic Claude (primary: opus-4-6, session: sonnet-4-20250514)
- Integration/channel (if any): Discord
- Relevant config (redacted):

```json
{
  "agents": {
    "defaults": {
      "model": {
        "primary": "anthropic/claude-opus-4-6",
        "fallbacks": ["anthropic/claude-sonnet-4-5", "groq/llama-3.3-70b-versatile"]
      }
    }
  }
}
```

### Steps

1. Configure the primary model as `anthropic/claude-opus-4-6` (or whichever model you use) with fallbacks
2. Switch the session model: `openclaw models set anthropic/claude-sonnet-4-20250514`
3. Hit the quota limit on the account
4. Observe fallback behavior

### Expected

- System should attempt configured fallback models (sonnet-4-5, then groq)
- Fallbacks should work despite session model being different from config primary

### Actual

- **Before fix:** All fallbacks skipped, system fails with "quota exceeded"
- **After fix:** Fallbacks attempted correctly, system falls back to groq when anthropic quota exhausted

## Evidence

Attach at least one:

- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

**Test Results:**

- All 32 tests passing (0 skipped)
- Added comprehensive test coverage for both session override and provider cooldown scenarios
- Includes edge cases like cross-provider requests, auth profile mocking, and cooldown timing

## Human Verification (required)

What you personally verified (not just CI), and how:

- **Verified scenarios:**
  - Session model override with same provider (sonnet session, opus config) → fallbacks work
  - Provider cooldown with multiple same-provider fallbacks → attempts made despite cooldown
  - Cross-provider override behavior → still blocks fallbacks as intended
- **Edge cases checked:**
  - Model version differences within same provider
  - Auth profile availability for
fallback providers
  - Backwards compatibility with deprecated function usage
- **What you did not verify:**
  - Actual quota limit scenarios (would require burning through real API quotas)

## Compatibility / Migration

- Backward compatible? (`Yes`)
- Config/env changes? (`No`)
- Migration needed? (`No`)
- If yes, exact upgrade steps:

## Failure Recovery (if this breaks)

- **How to disable/revert this change quickly:** Revert the changes to `src/agents/model-fallback.ts` lines ~220 and ~345-365
- **Files/config to restore:** Only code changes, no config files affected
- **Known bad symptoms reviewers should watch for:** Fallbacks not working in scenarios that previously worked, infinite fallback loops, auth profile selection issues

## Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write `None`.

- **Risk:** Could break existing fallback behavior for edge cases not covered by tests
  - **Mitigation:** Comprehensive test suite covering all existing scenarios, backwards compatibility preservation, deprecated function maintained
- **Risk:** Provider cooldown logic changes might cause authentication issues
  - **Mitigation:** Preserved existing probe logic for primary models, only modified fallback candidate attempts

<h3>Greptile Summary</h3>

This PR fixes two related bugs in the model fallback system that prevented configured fallbacks from working correctly. The changes allow same-provider model switches (e.g., opus to sonnet) to use configured fallbacks, and enable fallback attempts even during provider cooldowns since rate limits are often model-specific.
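
To make the first change concrete, here is a minimal sketch of the difference between exact model matching and provider-only matching. It is not the code from `src/agents/model-fallback.ts`: the helper names `parseModelRef`, `fallbacksApplyExact`, and `fallbacksApplyByProvider` are hypothetical, and the sketch assumes model references always take the `provider/model` form used in the config above.

```typescript
// Hypothetical sketch: these names are illustrative and do not come from
// src/agents/model-fallback.ts.
type ModelRef = { provider: string; model: string };

// Split a "provider/model" reference at the first slash.
function parseModelRef(ref: string): ModelRef {
  const slash = ref.indexOf("/");
  if (slash <= 0 || slash === ref.length - 1) {
    throw new Error(`expected "provider/model", got "${ref}"`);
  }
  return { provider: ref.slice(0, slash), model: ref.slice(slash + 1) };
}

// Behavior described as "before": configured fallbacks apply only when the
// session model is exactly the config primary.
function fallbacksApplyExact(sessionModel: string, configPrimary: string): boolean {
  return sessionModel === configPrimary;
}

// Behavior described as "after": configured fallbacks apply whenever the
// session model and the config primary share a provider.
function fallbacksApplyByProvider(sessionModel: string, configPrimary: string): boolean {
  return parseModelRef(sessionModel).provider === parseModelRef(configPrimary).provider;
}

const configPrimary = "anthropic/claude-opus-4-6";
const sessionModel = "anthropic/claude-sonnet-4-20250514";

console.log(fallbacksApplyExact(sessionModel, configPrimary));      // false
console.log(fallbacksApplyByProvider(sessionModel, configPrimary)); // true
```

With the repro config, the Sonnet session override fails the exact comparison but passes the provider comparison, which matches the before/after behavior reported in the PR. A session model from a different provider still fails both checks, consistent with the statement that cross-provider blocking is unchanged.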
**Key Changes:**

- Modified `resolveFallbackCandidates` in src/agents/model-fallback.ts:222-228 to compare provider strings instead of exact model matches, allowing version differences within the same provider
- Updated cooldown logic in src/agents/model-fallback.ts:352-373 to always attempt fallback models during provider cooldowns (only primary models use probe throttling)
- Deprecated `sameModelCandidate` function with backwards compatibility preservation (src/agents/model-fallback.ts:100-106)
- Added `.ark/` to `.gitignore` (unrelated cleanup)

**Test Coverage:**

Added 222 lines of comprehensive test coverage including:

- Session model overrides with same provider
- Cross-provider override behavior
- Provider cooldown scenarios with same-provider fallbacks
- Auth profile mocking and cooldown timing edge cases

The logic changes align with the PR description and maintain backwards compatibility. The fix is well-scoped and addresses the specific issue without changing other fallback behaviors.

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge with minimal risk - it fixes a clear bug with thorough test coverage
- Score reflects well-tested logic changes with comprehensive coverage (222 new test lines covering edge cases), clear backwards compatibility preservation (deprecated function maintained), and minimal surface area (two focused logic changes). Deducted one point for a minor code duplication issue (`isPrimary` computed twice) that doesn't affect correctness but slightly impacts code quality.
- No files require special attention - the changes are well-contained and thoroughly tested

<sub>Last reviewed commit: defb040</sub>
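
For reviewers who want a mental model of the cooldown change, the following is a rough sketch of the gating rule as described in this PR: a primary model on a cooling-down provider is attempted only when probe throttling allows it, while fallback candidates are always attempted. The `Candidate` and `CooldownState` types and the `shouldAttempt` and `planAttempts` functions are assumptions made for illustration and are not the actual implementation. The sketch also computes `isPrimary` once, which is one way the duplication noted in the review could be avoided.

```typescript
// Hypothetical sketch: types and function names are illustrative and are not
// taken from src/agents/model-fallback.ts.
type Candidate = { provider: string; model: string };

interface CooldownState {
  // Providers currently in cooldown (assumed representation).
  readonly coolingDown: readonly string[];
  // Whether the probe throttle currently permits probing the primary model.
  readonly probeAllowed: boolean;
}

// Decide whether a candidate should be attempted. Index 0 is treated as the
// primary model; every later index is a fallback.
function shouldAttempt(candidate: Candidate, index: number, state: CooldownState): boolean {
  const isPrimary = index === 0; // computed once and reused below
  const inCooldown = state.coolingDown.indexOf(candidate.provider) !== -1;
  if (!inCooldown) {
    return true;
  }
  // Provider is cooling down: the primary is subject to probe throttling,
  // while fallbacks are attempted regardless.
  return isPrimary ? state.probeAllowed : true;
}

// Keep only the candidates that would be attempted, preserving order.
function planAttempts(candidates: readonly Candidate[], state: CooldownState): Candidate[] {
  return candidates.filter((candidate, index) => shouldAttempt(candidate, index, state));
}

const candidates: Candidate[] = [
  { provider: "anthropic", model: "claude-opus-4-6" },
  { provider: "anthropic", model: "claude-sonnet-4-5" },
  { provider: "groq", model: "llama-3.3-70b-versatile" },
];

const plan = planAttempts(candidates, { coolingDown: ["anthropic"], probeAllowed: false });
console.log(plan.map((c) => `${c.provider}/${c.model}`));
```

In this example the Anthropic primary is skipped because probing is throttled, while the same-provider Sonnet fallback and the Groq fallback are both still attempted.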