
#4462: fix: prevent gateway crash when all auth profiles are in cooldown

by garnetlyx open 2026-01-30 07:33
agents
## Summary

- Fix gateway crash when all auth profiles are in cooldown
- Replace generic `Error` with typed `AllModelsFailedError` class
- Add graceful handling in unhandled rejection handler

## Changes

- Add `AllModelsFailedError` class with cooldown detection (`src/agents/model-fallback-error.ts`)
- Modify `runWithModelFallback()` to throw `AllModelsFailedError` with retry timing
- Add handler in `unhandled-rejections.ts` to log warning instead of `process.exit(1)`

## Test plan

- [x] Unit tests for `AllModelsFailedError` class (6 tests)
- [x] Existing `model-fallback.test.ts` tests pass (19 tests)
- [x] Existing `unhandled-rejections.test.ts` tests pass (19 tests)
- [x] `pnpm lint` passes
- [x] `pnpm build` passes

## AI Disclosure 🤖

- **AI-assisted**: Yes (Claude Opus 4.5)
- **Testing level**: Fully tested
- **Understanding**: Confirmed - error class captures cooldown state, handler checks `allInCooldown` flag to decide whether to crash or continue

Fixes #2811

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR introduces a typed `AllModelsFailedError` to represent model-fallback exhaustion, including whether failures were due to cooldown and an optional computed `retryAfterMs`.

`runWithModelFallback()` now records cooldown-skipped providers as `rate_limit` attempts, computes an earliest retry time from auth profile stats when everything is in cooldown, and throws `AllModelsFailedError` instead of a generic `Error`.

The unhandled rejection handler is updated to detect `AllModelsFailedError` and avoid exiting the gateway (logging a warning instead), preventing crashes when all auth profiles are in cooldown.

Overall the change fits the existing resiliency approach in `src/infra/unhandled-rejections.ts` (which already suppresses some non-fatal classes like AbortError/transient network errors), by carving out model-cooldown exhaustion as a non-fatal condition.
<h3>Confidence Score: 4/5</h3>

- This PR is likely safe to merge, but it slightly broadens when the gateway suppresses crashes on model-fallback failures.
- The core change (typed error + cooldown detection) is straightforward and covered by tests, but `installUnhandledRejectionHandler` currently suppresses *all* `AllModelsFailedError` cases (including mixed/auth failures) which may hide serious configuration/credential issues if they propagate as this error type.
- src/infra/unhandled-rejections.ts

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
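For readers without the diff open, here is a minimal sketch of the pattern described above. It is not the code from this PR. Only the names `AllModelsFailedError`, `allInCooldown`, `retryAfterMs`, and the `rate_limit` reason come from the description. The `FallbackAttempt` shape, the `cooldownUntil` field, the constructor signature, and the `shouldSuppressRejection` helper are assumptions made for illustration. The helper suppresses only the pure-cooldown case, which is one way to address the review concern about mixed/auth failures being hidden.

```typescript
// Hypothetical attempt record; the real shape in the PR may differ.
type FailureReason = "rate_limit" | "auth" | "unknown";

interface FallbackAttempt {
  provider: string;
  reason: FailureReason;
  // Assumed field: epoch milliseconds at which this provider's cooldown ends.
  cooldownUntil?: number;
}

class AllModelsFailedError extends Error {
  readonly attempts: readonly FallbackAttempt[];
  readonly allInCooldown: boolean;
  readonly retryAfterMs?: number;

  constructor(attempts: readonly FallbackAttempt[], now: number = Date.now()) {
    super(`All models failed (${attempts.length} attempts)`);
    // Keep instanceof working even when compiled to older targets.
    Object.setPrototypeOf(this, new.target.prototype);
    this.name = "AllModelsFailedError";
    this.attempts = attempts;

    // Cooldown exhaustion: every attempt was skipped as rate-limited.
    this.allInCooldown =
      attempts.length > 0 && attempts.every((a) => a.reason === "rate_limit");

    if (this.allInCooldown) {
      const expiries = attempts
        .map((a) => a.cooldownUntil)
        .filter((t): t is number => typeof t === "number");
      if (expiries.length > 0) {
        // Earliest cooldown expiry, expressed as a delay from `now`.
        this.retryAfterMs = Math.max(0, Math.min(...expiries) - now);
      }
    }
  }
}

// Hypothetical helper for an unhandled-rejection handler: returns true only
// when the rejection is a pure cooldown exhaustion. Mixed or auth failures
// return false and fall through to whatever handling already exists.
function shouldSuppressRejection(reason: unknown): boolean {
  return reason instanceof AllModelsFailedError && reason.allInCooldown;
}
```

With this shape, a handler could log a warning that includes `retryAfterMs` when `shouldSuppressRejection` returns true, and keep its existing behavior otherwise.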
