
#21797: feat: provider-level circuit breaker for model fallback chain

by Glucksberg open 2026-02-20 11:32
agents size: M experienced-contributor
## Summary

When a provider fails with an **auth** (401/403) or **billing** error, the fallback chain currently keeps trying other models from the same broken provider, wasting time on requests that will inevitably fail (same credentials, same billing account).

This PR adds a provider-level circuit breaker: once a provider fails with a provider-scoped error, all remaining candidates from that provider are skipped immediately.

## How it works

A local `Set<string>` (`failedProviders`) tracks providers that failed during the current invocation:

```
candidates: [anthropic/opus, anthropic/sonnet, openrouter/kimi]

1. Try anthropic/opus → 401 (auth failure) → failedProviders.add("anthropic")
2. Next: anthropic/sonnet → failedProviders.has("anthropic") → SKIP (no request made)
3. Next: openrouter/kimi → failedProviders.has("openrouter") → false → Try → succeeds ✓
```

**What triggers the breaker:** `auth` (401/403) and `billing`. These are provider-wide failures (shared credentials / billing account across all models).

**What does NOT trigger it:** `rate_limit` (profile-specific), `timeout` (model-specific), `format` (model-specific). These still try the next model from the same provider normally.

**Scope:** The Set is a local variable. It lives only for that single invocation, with no persistence and no side effects.

**Provider-agnostic:** Works with any provider string (anthropic, openrouter, openai, custom providers via `models.providers`). No providers are hardcoded.

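The skip-and-track flow described above can be sketched roughly as follows. This is an illustrative, self-contained approximation and not the code in `src/agents/model-fallback.ts`: the `Candidate` shape, the `ProviderError` class, `failureReasonOf()`, and `runWithFallbackSketch()` are hypothetical names introduced here. Only `failedProviders`, `isProviderScopedFailure()`, and the failure reasons come from the PR description, and the helper's real signature is an assumption.

```typescript
// Hypothetical types for illustration; the PR's real types are not shown in the description.
type FailureReason = "auth" | "billing" | "rate_limit" | "timeout" | "format";

interface Candidate {
  provider: string;
  model: string;
}

// Hypothetical error carrying an already-classified failure reason.
class ProviderError extends Error {
  readonly reason: FailureReason;
  constructor(reason: FailureReason, message: string) {
    super(message);
    this.reason = reason;
  }
}

// Read the classified reason off an unknown error, if it has one.
function failureReasonOf(err: unknown): FailureReason | undefined {
  if (typeof err === "object" && err !== null && "reason" in err) {
    return (err as { reason: FailureReason }).reason;
  }
  return undefined;
}

// Auth and billing failures are shared by every model behind the same provider.
// (Signature assumed; the PR only names the helper.)
function isProviderScopedFailure(reason: FailureReason | undefined): boolean {
  return reason === "auth" || reason === "billing";
}

async function runWithFallbackSketch<T>(
  candidates: Candidate[],
  run: (candidate: Candidate) => Promise<T>,
): Promise<{ result: T; attempted: string[]; skipped: string[] }> {
  // Local to this invocation: nothing persists between calls.
  const failedProviders = new Set<string>();
  const attempted: string[] = [];
  const skipped: string[] = [];
  let lastError: unknown;

  for (const candidate of candidates) {
    const id = `${candidate.provider}/${candidate.model}`;
    if (failedProviders.has(candidate.provider)) {
      skipped.push(id); // breaker is open for this provider: no request made
      continue;
    }
    attempted.push(id);
    try {
      const result = await run(candidate);
      return { result, attempted, skipped };
    } catch (err) {
      lastError = err;
      if (isProviderScopedFailure(failureReasonOf(err))) {
        failedProviders.add(candidate.provider);
      }
      // Other failures (rate_limit, timeout, format) fall through to the next candidate.
    }
  }
  throw lastError ?? new Error("no candidates to try");
}
```

With the three candidates from the walkthrough and a `run` callback that throws `new ProviderError("auth", "401")` for every `anthropic` model, this sketch attempts `anthropic/opus`, skips `anthropic/sonnet`, and then attempts `openrouter/kimi`. Keeping the Set inside the function body is what gives the per-invocation scope the description calls out.
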
## Changes

- **`src/agents/model-fallback.ts`**
  - Added `isProviderScopedFailure()` helper (auth or billing = provider-scoped)
  - Added circuit breaker to `runWithModelFallback()`: skip + track logic
  - Added circuit breaker to `runWithImageModelFallback()`: same pattern
  - Upgraded `runWithImageModelFallback()` error classification to use `coerceToFailoverError` + `describeFailoverError` (previously used raw error message only, missing status-code-aware classification)
- **`src/agents/model-fallback.e2e.test.ts`**
  - 4 new tests: auth skip, billing skip, rate_limit no-skip, timeout no-skip

## Test plan

- [x] `pnpm build`: clean compilation
- [x] `vitest run src/agents/model-fallback.e2e.test.ts`: 28 tests pass
- [x] Circuit breaker skips same-provider after auth 401
- [x] Circuit breaker skips same-provider after billing failure
- [x] Circuit breaker does NOT skip on rate_limit (429)
- [x] Circuit breaker does NOT skip on timeout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

### Greptile Summary

This PR adds a provider-level circuit breaker to prevent wasting time retrying models from a provider that has already failed with auth (401/403) or billing (402) errors.

The implementation is clean and well-tested:

- Introduces `isProviderScopedFailure()` helper to identify auth/billing failures as provider-wide
- Tracks failed providers in a local `Set<string>` that lives only for the current invocation
- Skips remaining candidates from failed providers without making requests
- Applies the same circuit breaker pattern to both `runWithModelFallback()` and `runWithImageModelFallback()`
- Upgrades `runWithImageModelFallback()` error classification to use status-code-aware detection (previously relied only on error messages)
- Includes comprehensive test coverage for auth, billing, rate_limit, and timeout scenarios

The circuit breaker correctly distinguishes between provider-scoped failures (auth/billing) and model-specific failures (rate_limit, timeout, format), ensuring fallback logic remains intact for transient or model-specific issues.

### Confidence Score: 5/5

- This PR is safe to merge with no identified issues
- The implementation is straightforward, well-tested, and solves a real performance problem. The circuit breaker logic is simple and scoped locally to each invocation (no persistent state or side effects). The changes upgrade error classification in `runWithImageModelFallback()`, which improves consistency. All 4 new tests pass and validate the correct behavior for auth/billing skip and rate_limit/timeout no-skip scenarios.
- No files require special attention

<sub>Last reviewed commit: 4237c6b</sub>
