
#19077: fix(agents): trigger model failover on connection-refused and network-unreachable errors

by ayanesakura open 2026-02-17 10:09
agents size: S
## Summary

Closes #18868

Network connection errors (`ECONNREFUSED`, `ENETUNREACH`, `EHOSTUNREACH`, `ENETRESET`, `EAI_AGAIN`) were not recognized as failover-worthy errors, so the fallback chain was never advanced when the primary provider was unreachable due to a network outage or a stopped local server.

This is especially impactful when a **local fallback model** (e.g. Ollama on localhost) is configured alongside a remote primary: if the network goes down, the gateway should seamlessly fall back to the local model instead of surfacing an error to the user.

### Changes

- **`src/agents/failover-error.ts`**: Added `ECONNREFUSED`, `ENETUNREACH`, `EHOSTUNREACH`, `ENETRESET`, and `EAI_AGAIN` to the error-code list in `resolveFailoverReasonFromError()` that triggers failover (categorized as `"timeout"`, consistent with the existing `ETIMEDOUT` / `ECONNRESET` / `ECONNABORTED` handling).
- **`src/agents/model-fallback.e2e.test.ts`**: Added 4 e2e tests covering each new error code scenario.

### Why these specific codes

| Code | When it fires |
|------|--------------|
| `ECONNREFUSED` | Target port not listening (e.g. Ollama/vLLM stopped, or remote server down) |
| `ENETUNREACH` | No route to network (e.g. Wi-Fi/Ethernet disconnected) |
| `EHOSTUNREACH` | Specific host unreachable (e.g. VPN down, firewall block) |
| `ENETRESET` | Connection reset by network (e.g. NAT timeout, ISP reset) |
| `EAI_AGAIN` | DNS resolution temporarily failed (e.g. DNS server unreachable) |

## Test plan

- [x] All 4 new e2e tests pass (`pnpm test:e2e -- src/agents/model-fallback.e2e.test.ts`)
- [x] Existing failover tests unaffected (the one pre-existing flaky test `skips providers when all profiles are in cooldown` fails on `main` as well)
- [x] `oxfmt` and `oxlint` pass on changed files

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Adds network connection error codes to the failover mechanism so the agent can gracefully fall back to alternative models when the primary provider is unreachable due to network issues.

- Extended the error code list in `resolveFailoverReasonFromError()` to include `ECONNREFUSED`, `ENETUNREACH`, `EHOSTUNREACH`, `ENETRESET`, and `EAI_AGAIN`, categorized as timeout errors
- Added 4 e2e tests covering the new error scenarios (missing test for `ENETRESET`)
- Consistent with existing network error handling in `src/telegram/network-errors.ts`
- Enables seamless failover to local models (e.g., Ollama) when remote providers are down

<h3>Confidence Score: 4/5</h3>

- Safe to merge with one test case addition recommended
- The change is well-implemented and consistent with existing patterns in the codebase. The new error codes are already recognized elsewhere (telegram network errors), and the categorization as timeout errors makes sense. Test coverage is good but missing one case for `ENETRESET`. The impact is localized to failover logic with clear benefits for resilience.
- Add test case for `ENETRESET` in `model-fallback.e2e.test.ts` for complete coverage

<sub>Last reviewed commit: 43c1c9a</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
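### Illustrative sketch

For readers without the diff at hand, the classification described under Changes can be approximated as below. This is a minimal sketch, not the contents of `src/agents/failover-error.ts`: the names `TIMEOUT_LIKE_CODES`, `readErrorCode`, and `classifyNetworkError` are hypothetical, and the real `resolveFailoverReasonFromError()` may inspect other fields and return other reasons as well. The only thing taken from the PR is the set of codes and their `"timeout"` categorization.

```typescript
// Hypothetical sketch of the code-to-reason mapping this PR describes.
// It does NOT reproduce the real resolveFailoverReasonFromError().
type FailoverReason = "timeout";

// Codes named in the PR: the three pre-existing ones plus the five new ones.
const TIMEOUT_LIKE_CODES: ReadonlySet<string> = new Set([
  "ETIMEDOUT",
  "ECONNRESET",
  "ECONNABORTED",
  "ECONNREFUSED",
  "ENETUNREACH",
  "EHOSTUNREACH",
  "ENETRESET",
  "EAI_AGAIN",
]);

// Assumed helper: returns the error's string `code` property, if it has one.
function readErrorCode(err: unknown): string | undefined {
  if (typeof err !== "object" || err === null) return undefined;
  const code = (err as { code?: unknown }).code;
  return typeof code === "string" ? code.toUpperCase() : undefined;
}

// Returns "timeout" for failover-worthy network codes, otherwise null.
function classifyNetworkError(err: unknown): FailoverReason | null {
  const code = readErrorCode(err);
  return code !== undefined && TIMEOUT_LIKE_CODES.has(code) ? "timeout" : null;
}

// Usage: a refused connection is failover-worthy, an unrelated error is not.
const refused = Object.assign(new Error("connect ECONNREFUSED"), {
  code: "ECONNREFUSED",
});
console.log(classifyNetworkError(refused)); // "timeout"
console.log(classifyNetworkError(new Error("bad request"))); // null
```

A set lookup keeps the list easy to extend and makes it straightforward to test each code individually, including `ENETRESET`, which the review flags as untested.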
