#4462: fix: prevent gateway crash when all auth profiles are in cooldown
agents
Cluster: Model Fallbacks and Rate Limiting
## Summary
- Fix gateway crash when all auth profiles are in cooldown
- Replace generic `Error` with typed `AllModelsFailedError` class
- Add graceful handling in unhandled rejection handler
## Changes
- Add `AllModelsFailedError` class with cooldown detection (`src/agents/model-fallback-error.ts`)
- Modify `runWithModelFallback()` to throw `AllModelsFailedError` with retry timing
- Add handler in `src/infra/unhandled-rejections.ts` that logs a warning instead of calling `process.exit(1)`
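
A minimal sketch of what a typed error with cooldown detection could look like. The field names `attempts`, `allInCooldown`, and `retryAfterMs` follow the wording of this PR, but the `FallbackAttempt` shape, the constructor signature, and the rule that "all in cooldown" means every attempt was recorded as `rate_limit` are assumptions for illustration, not the code in `src/agents/model-fallback-error.ts`.

```typescript
// Hypothetical attempt record; the real type in the PR may carry more fields.
type FallbackAttempt = {
  provider: string;
  model: string;
  reason: "rate_limit" | "auth" | "timeout" | "unknown";
};

class AllModelsFailedError extends Error {
  readonly attempts: readonly FallbackAttempt[];
  readonly allInCooldown: boolean;
  readonly retryAfterMs?: number;

  constructor(attempts: readonly FallbackAttempt[], retryAfterMs?: number) {
    super(`All models failed (${attempts.length} attempted)`);
    this.name = "AllModelsFailedError";
    // Keep `instanceof` working when the code is compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.attempts = attempts;
    // Assumption: cooldown-skipped providers are recorded as "rate_limit",
    // so "all in cooldown" means every attempt has that reason.
    this.allInCooldown =
      attempts.length > 0 && attempts.every((a) => a.reason === "rate_limit");
    // Retry timing is only meaningful when cooldown is the sole cause.
    this.retryAfterMs = this.allInCooldown ? retryAfterMs : undefined;
  }
}
```

Callers can then branch on `error instanceof AllModelsFailedError` and read `allInCooldown` rather than parsing a message string from a generic `Error`.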
## Test plan
- [x] Unit tests for `AllModelsFailedError` class (6 tests)
- [x] Existing `model-fallback.test.ts` tests pass (19 tests)
- [x] Existing `unhandled-rejections.test.ts` tests pass (19 tests)
- [x] `pnpm lint` passes
- [x] `pnpm build` passes
## AI Disclosure 🤖
- **AI-assisted**: Yes (Claude Opus 4.5)
- **Testing level**: Fully tested
- **Understanding**: Confirmed - error class captures cooldown state, handler checks `allInCooldown` flag to decide whether to crash or continue
Fixes #2811
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR introduces a typed `AllModelsFailedError` to represent model-fallback exhaustion, including whether failures were due to cooldown and an optional computed `retryAfterMs`. `runWithModelFallback()` now records cooldown-skipped providers as `rate_limit` attempts, computes an earliest retry time from auth profile stats when everything is in cooldown, and throws `AllModelsFailedError` instead of a generic `Error`. The unhandled rejection handler is updated to detect `AllModelsFailedError` and avoid exiting the gateway (logging a warning instead), preventing crashes when all auth profiles are in cooldown.
Overall, the change fits the existing resiliency approach in `src/infra/unhandled-rejections.ts`, which already suppresses some non-fatal classes such as `AbortError` and transient network errors: it carves out model-cooldown exhaustion as another non-fatal condition.
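
The "earliest retry time" computation described above could be sketched roughly as follows. The `ProfileStats` shape, its `cooldownUntil` field, and the function name `computeRetryAfterMs` are hypothetical. The PR only states that the value is derived from auth profile stats when everything is in cooldown.

```typescript
// Hypothetical stats shape: epoch milliseconds until which a profile is cooling down.
type ProfileStats = { cooldownUntil?: number };

// Returns the shortest remaining cooldown in milliseconds,
// or undefined when no profile is currently cooling down.
function computeRetryAfterMs(
  stats: readonly ProfileStats[],
  now: number = Date.now(),
): number | undefined {
  const remaining = stats
    .map((s) => s.cooldownUntil)
    // Ignore profiles without a cooldown or with one that has already expired.
    .filter((t): t is number => typeof t === "number" && t > now)
    .map((t) => t - now);
  return remaining.length > 0 ? Math.min(...remaining) : undefined;
}
```

Taking the minimum means the caller learns when the first profile becomes usable again, which is the earliest point at which a retry could succeed.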
<h3>Confidence Score: 4/5</h3>
- This PR is likely safe to merge, but it slightly broadens when the gateway suppresses crashes on model-fallback failures.
- The core change (typed error + cooldown detection) is straightforward and covered by tests. However, `installUnhandledRejectionHandler` currently suppresses *all* `AllModelsFailedError` cases, including mixed and auth failures, which may hide serious configuration or credential issues if they propagate as this error type.
- File flagged for closer review: `src/infra/unhandled-rejections.ts`
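
One way to address the concern above is to suppress the crash only when the error reports that every profile is in cooldown. The sketch below is illustrative and is not the handler in this PR: `decideRejectionAction` and `isCooldownExhaustion` are hypothetical helpers, and the check is structural (by `name` and flag) so that the block stands alone without importing the error class.

```typescript
// Hypothetical outcome of the unhandled-rejection decision.
type RejectionAction = "warn" | "exit";

// True only for an AllModelsFailedError whose allInCooldown flag is set.
function isCooldownExhaustion(reason: unknown): boolean {
  return (
    reason instanceof Error &&
    reason.name === "AllModelsFailedError" &&
    (reason as Error & { allInCooldown?: unknown }).allInCooldown === true
  );
}

// Mixed or auth-related exhaustion still counts as fatal in this sketch,
// so credential and configuration problems are not silently swallowed.
function decideRejectionAction(reason: unknown): RejectionAction {
  return isCooldownExhaustion(reason) ? "warn" : "exit";
}
```

A real handler would call something like this from its `unhandledRejection` listener, logging a warning for `"warn"` and keeping the existing exit path for `"exit"`.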
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #13658: fix: silent model failover with fallback notification (by taw0002 · 2026-02-10 · 83.2%)
- #14574: fix: gentler rate-limit cooldown backoff + clear stale cooldowns on... (by JamesEBall · 2026-02-12 · 82.2%)
- #9427: fix: trigger model fallback on all 4xx HTTP errors (by dbottme · 2026-02-05 · 81.9%)
- #21152: fix(agents): throw FailoverError for unknown model so fallback chai... (by Mellowambience · 2026-02-19 · 81.6%)
- #11349: fix(agents): do not filter fallback models by models allowlist (by liuxiaopai-ai · 2026-02-07 · 81.0%)
- #10178: fix: trigger fallback when model resolution fails with unknown model (by Yida-Dev · 2026-02-06 · 80.9%)
- #9163: Fix: Save Anthropic setup token to config file (by vishaltandale00 · 2026-02-04 · 80.8%)
- #19267: fix: derive failover reason from timedOut flag to prevent unknown c... (by austenstone · 2026-02-17 · 80.5%)
- #12314: fix: treat HTTP 5xx server errors as failover-worthy (by hsssgdtc · 2026-02-09 · 80.4%)
- #19020: bugfix(gateway): Handle invalid model provider API config gracefully\… (by funkyjonx · 2026-02-17 · 80.3%)