#21516: fix: classify connection errors as timeout for model failover (#20931)
agents
size: S
trusted-contributor
## Problem
When MiniMax (or any provider) returns `"Connection error."`, `classifyFailoverReason()` returns `null` because the string doesn't match any `ERROR_PATTERNS`. The error is rethrown instead of triggering the configured fallback model chain.
This affects any transient network failure — ECONNREFUSED, ECONNRESET, ENOTFOUND, EPIPE, generic "network error", and Node.js "fetch failed" — none of which were classified as failover-worthy.
## Fix
Add connection-related patterns to `ERROR_PATTERNS.timeout`:
```
"connection error"
"connect error"
/\beconnrefused\b/
/\beconnreset\b/
/\benotfound\b/
/\bepipe\b/
"network error"
"fetch failed"
```
These are classified as `timeout` (transient transport failures), consistent with how `isTransientHttpError` already maps 5xx responses to `"timeout"`.
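As a rough illustration of how these patterns could be applied, here is a minimal sketch. It assumes that string patterns are matched as substrings and regex patterns are tested against a lowercased copy of the message. The names `TIMEOUT_PATTERNS`, `matchesAny`, and `classifySketch` are hypothetical, and the real matcher in `errors.ts` may work differently.

```typescript
// Hypothetical sketch: the actual ERROR_PATTERNS structure and matcher in
// errors.ts are not shown in this PR description and may differ.
type ErrorPattern = string | RegExp;

// The eight patterns listed above, as they would sit in a timeout bucket.
const TIMEOUT_PATTERNS: ErrorPattern[] = [
  "connection error",
  "connect error",
  /\beconnrefused\b/,
  /\beconnreset\b/,
  /\benotfound\b/,
  /\bepipe\b/,
  "network error",
  "fetch failed",
];

// Assumption: the message is lowercased first, which is why the regexes
// above can be written in lowercase without the `i` flag.
function matchesAny(message: string, patterns: ErrorPattern[]): boolean {
  const lowered = message.toLowerCase();
  return patterns.some((p) =>
    typeof p === "string" ? lowered.includes(p) : p.test(lowered),
  );
}

// Returns "timeout" for transient transport failures, null otherwise.
function classifySketch(message: string): "timeout" | null {
  return matchesAny(message, TIMEOUT_PATTERNS) ? "timeout" : null;
}
```

Under these assumptions, `classifySketch("Connection error.")` and `classifySketch("ECONNREFUSED 127.0.0.1:443")` both yield `"timeout"`, while an unrelated message such as `"invalid api key"` yields `null`. The word boundaries on the errno-style codes keep them from matching inside longer identifiers.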
## Changes
- **`src/agents/pi-embedded-helpers/errors.ts`** — Add 8 new patterns to `ERROR_PATTERNS.timeout`
- **`src/agents/pi-embedded-helpers/errors.connection-failover.test.ts`** — 22 test cases covering:
  - `classifyFailoverReason()` returns `"timeout"` for all connection error variants
  - `isTimeoutErrorMessage()` matches all new patterns
  - `isFailoverErrorMessage()` returns `true` for connection errors
  - Negative cases: unrelated errors are not misclassified
## Testing
```
✓ classifyFailoverReason("Connection error.") returns "timeout"
✓ classifyFailoverReason("connect error: ECONNREFUSED") returns "timeout"
✓ classifyFailoverReason("ECONNREFUSED 127.0.0.1:443") returns "timeout"
✓ classifyFailoverReason("ECONNRESET by peer") returns "timeout"
✓ classifyFailoverReason("ENOTFOUND api.minimax.io") returns "timeout"
✓ classifyFailoverReason("EPIPE: broken pipe") returns "timeout"
✓ classifyFailoverReason("network error") returns "timeout"
✓ classifyFailoverReason("TypeError: fetch failed") returns "timeout"
... (22 tests total, all passing)
```
## Note
`ECONNRESET` and `ECONNREFUSED` were already handled in `resolveFailoverReasonFromError()` via the `error.code` path (see `failover-error.ts:147`). However, when these codes appeared only in the error *message* string, with no `error.code` set (as in the errors MiniMax returns), they were not matched. This PR closes that gap.
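
The difference between the two lookup paths can be sketched as follows. This is an illustrative approximation only: `resolveReasonSketch`, `TRANSIENT_CODES`, and `MESSAGE_PATTERNS` are hypothetical names, and the actual logic in `failover-error.ts` is not reproduced here.

```typescript
// Hypothetical sketch of the two lookup paths described in the note above.
const TRANSIENT_CODES = new Set(["ECONNRESET", "ECONNREFUSED"]);
const MESSAGE_PATTERNS: RegExp[] = [/\beconnrefused\b/, /\beconnreset\b/];

function resolveReasonSketch(err: unknown): "timeout" | null {
  if (typeof err !== "object" || err === null) return null;
  const { code, message } = err as { code?: unknown; message?: unknown };

  // Path 1: structured errno-style code on the error object.
  if (typeof code === "string" && TRANSIENT_CODES.has(code.toUpperCase())) {
    return "timeout";
  }

  // Path 2: the code appears only inside the message text, so it has to be
  // found by pattern matching on the (lowercased) message.
  if (typeof message === "string") {
    const lowered = message.toLowerCase();
    if (MESSAGE_PATTERNS.some((re) => re.test(lowered))) return "timeout";
  }

  return null;
}
```

In this sketch, an error carrying `code: "ECONNRESET"` is caught by the first path, while `new Error("connect error: ECONNREFUSED")` has no `code` property and is caught only by the message path.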
Closes #20931
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds 8 new connection error patterns to `ERROR_PATTERNS.timeout` to enable model failover for transient network failures. Previously, connection errors like "Connection error." from MiniMax were not classified, causing the error to be rethrown instead of triggering the fallback model chain. The fix correctly classifies these as `timeout` errors (transient transport failures), consistent with how `isTransientHttpError` maps 5xx responses. Comprehensive test coverage with 22 test cases validates all new patterns and includes negative cases to prevent misclassification.
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The changes are well-contained, thoroughly tested, and follow existing patterns in the codebase. The new error patterns use proper word boundaries to prevent false positives, and the classification logic is consistent with how other transient errors are handled. All 22 tests pass and cover both positive and negative cases.
- No files require special attention
<sub>Last reviewed commit: 6957517</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->