#21797: feat: provider-level circuit breaker for model fallback chain
agents
size: M
experienced-contributor
Cluster: Model Cooldown Management
## Summary
When a provider fails with an **auth** (401/403) or **billing** error, the fallback chain currently keeps trying other models from the same provider. Those models share the same credentials and billing account, so each attempt fails the same way and only adds latency.
This PR adds a provider-level circuit breaker: once a provider fails with a provider-scoped error, all remaining candidates from that provider are skipped immediately.
## How it works
A local `Set<string>` (`failedProviders`) tracks providers that failed during the current invocation:
```
candidates: [anthropic/opus, anthropic/sonnet, openrouter/kimi]
1. Try anthropic/opus → 401 (auth failure)
→ failedProviders.add("anthropic")
2. Next: anthropic/sonnet
→ failedProviders.has("anthropic") → SKIP (no request made)
3. Next: openrouter/kimi
→ failedProviders.has("openrouter") → false
→ Try → succeeds ✓
```
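The trace above can be sketched as a small loop. This is a minimal illustration, not the code in `src/agents/model-fallback.ts`: the `Candidate` shape, the `ProviderError` class, and the `runWithBreaker` name are assumptions made for the example, and only the `failedProviders` set and the auth/billing trigger come from this description.

```typescript
// Hypothetical types for illustration; the real ones in model-fallback.ts may differ.
type FailureReason = "auth" | "billing" | "rate_limit" | "timeout" | "format";

interface Candidate {
  provider: string;
  model: string;
}

// Hypothetical error carrying an already-classified failure reason.
class ProviderError extends Error {
  readonly reason: FailureReason;
  constructor(reason: FailureReason) {
    super(`provider failure: ${reason}`);
    this.reason = reason;
  }
}

// Reasons treated as provider-wide (shared credentials / billing account).
const PROVIDER_SCOPED = new Set<FailureReason>(["auth", "billing"]);

async function runWithBreaker<T>(
  candidates: Candidate[],
  run: (c: Candidate) => Promise<T>,
): Promise<{ result: T; attempted: string[]; skipped: string[] }> {
  // Local to this call: no persistence across invocations.
  const failedProviders = new Set<string>();
  const attempted: string[] = [];
  const skipped: string[] = [];
  let lastError: unknown;

  for (const c of candidates) {
    const id = `${c.provider}/${c.model}`;
    if (failedProviders.has(c.provider)) {
      skipped.push(id); // breaker open: no request is made
      continue;
    }
    attempted.push(id);
    try {
      return { result: await run(c), attempted, skipped };
    } catch (err) {
      lastError = err;
      // Only provider-scoped failures open the breaker for this provider.
      if (err instanceof ProviderError && PROVIDER_SCOPED.has(err.reason)) {
        failedProviders.add(c.provider);
      }
    }
  }
  throw lastError ?? new Error("no candidates to try");
}
```

With the three candidates from the trace and a `run` callback that throws an `auth` error for `anthropic`, this sketch attempts `anthropic/opus`, skips `anthropic/sonnet`, and returns the `openrouter/kimi` result.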
**What triggers the breaker:** `auth` (401/403) and `billing` — these are provider-wide failures (shared credentials / billing account across all models).
**What does NOT trigger it:** `rate_limit` (profile-specific), `timeout` (model-specific), `format` (model-specific). These still try the next model from the same provider normally.
**Scope:** The Set is a local variable that lives only for a single invocation. Nothing is persisted and there are no side effects.
**Provider-agnostic:** Works with any provider string (anthropic, openrouter, openai, custom providers via `models.providers`). No providers are hardcoded.
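The trigger rules above reduce to a single predicate over the failure reason. The sketch below assumes the reason is already available as a string union; the actual signature of `isProviderScopedFailure()` is not shown in this description, and `reasonFromStatus` is a hypothetical helper that covers only the status codes mentioned in this PR.

```typescript
// Assumed set of failure reasons, taken from the categories named in this PR.
type FailoverReason = "auth" | "billing" | "rate_limit" | "timeout" | "format";

// Assumed signature: auth and billing are provider-wide, everything else is not.
function isProviderScopedFailure(reason: FailoverReason): boolean {
  return reason === "auth" || reason === "billing";
}

// Hypothetical mapping, limited to the HTTP statuses this PR mentions.
function reasonFromStatus(status: number): FailoverReason | undefined {
  switch (status) {
    case 401:
    case 403:
      return "auth";
    case 402:
      return "billing";
    case 429:
      return "rate_limit";
    default:
      return undefined; // other statuses are out of scope for this sketch
  }
}
```

Keeping the predicate separate from the loop means both `runWithModelFallback()` and `runWithImageModelFallback()` can share one definition of what counts as provider-scoped.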
## Changes
- **`src/agents/model-fallback.ts`**
  - Added `isProviderScopedFailure()` helper (auth or billing = provider-scoped)
  - Added circuit breaker to `runWithModelFallback()`: skip + track logic
  - Added circuit breaker to `runWithImageModelFallback()`: same pattern
  - Upgraded `runWithImageModelFallback()` error classification to use `coerceToFailoverError` + `describeFailoverError` (previously used raw error message only, missing status-code-aware classification)
- **`src/agents/model-fallback.e2e.test.ts`**
  - 4 new tests: auth skip, billing skip, rate_limit no-skip, timeout no-skip
## Test plan
- [x] `pnpm build` — clean compilation
- [x] `vitest run src/agents/model-fallback.e2e.test.ts` — 28 tests pass
- [x] Circuit breaker skips same-provider after auth 401
- [x] Circuit breaker skips same-provider after billing failure
- [x] Circuit breaker does NOT skip on rate_limit (429)
- [x] Circuit breaker does NOT skip on timeout
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds a provider-level circuit breaker to prevent wasting time retrying models from a provider that has already failed with auth (401/403) or billing (402) errors. The implementation is clean and well-tested:
- Introduces `isProviderScopedFailure()` helper to identify auth/billing failures as provider-wide
- Tracks failed providers in a local `Set<string>` that lives only for the current invocation
- Skips remaining candidates from failed providers without making requests
- Applies the same circuit breaker pattern to both `runWithModelFallback()` and `runWithImageModelFallback()`
- Upgrades `runWithImageModelFallback()` error classification to use status-code-aware detection (previously relied only on error messages)
- Includes comprehensive test coverage for auth, billing, rate_limit, and timeout scenarios
The circuit breaker correctly distinguishes between provider-scoped failures (auth/billing) and model-specific failures (rate_limit, timeout, format), ensuring fallback logic remains intact for transient or model-specific issues.
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with no identified issues
- The implementation is straightforward, well-tested, and solves a real performance problem. The circuit breaker logic is simple and scoped locally to each invocation (no persistent state or side effects). The changes upgrade error classification in `runWithImageModelFallback()` which improves consistency. All 4 new tests pass and validate the correct behavior for auth/billing skip and rate_limit/timeout no-skip scenarios.
- No files require special attention
<sub>Last reviewed commit: 4237c6b</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #13188: fix: add cross-provider fallback when primary provider is rate-limited (by 1bcMax · 2026-02-10 · 77.6%)
- #20388: fix(failover): don't skip same-provider fallback models when cooldo... (by Limitless2023 · 2026-02-18 · 77.3%)
- #22368: fix: first-token timeout + provider-level skip for model fallback (by 88plug · 2026-02-21 · 77.3%)
- #19252: fix(agents): continue model fallback on failover text payloads (by mahsumaktas · 2026-02-17 · 77.1%)
- #13658: fix: silent model failover with fallback notification (by taw0002 · 2026-02-10 · 76.2%)
- #23816: fix(agents): model fallback skipped during session overrides and pr... (by ramezgaberiel · 2026-02-22 · 76.1%)
- #13077: fix: prevent cooldown pollution across different models on the same... (by magendary · 2026-02-10 · 75.6%)
- #14914: fix: resolve actual failure reason for cooldown-skipped providers (by mcaxtr · 2026-02-12 · 75.5%)
- #22064: fix(failover): bypass models allowlist for configured fallback models (by winston-bepresent · 2026-02-20 · 75.2%)
- #16307: fix: surface billing/auth FailoverErrors as user-friendly messages (by petter-b · 2026-02-14 · 75.1%)