#7229: fix: add network error resilience to agentic loop failover
agents
Cluster: Error Resilience and Retry Logic
## Summary
Addresses issue #7185, where network errors such as `fetch failed` caused the agentic loop to die silently with unhandled promise rejections. This is a critical reliability issue for autonomous agents that must keep operating through network instability.
### Changes
- **New `network` FailoverReason type**: Added a dedicated failover reason for transient network errors, distinct from `timeout`
- **Comprehensive network error detection**: Detects fetch failures, DNS errors (ENOTFOUND), connection errors (ECONNRESET, ECONNREFUSED, ECONNABORTED), socket errors, TLS/SSL failures, and gateway errors (502, 503)
- **Transient error handling**: Network errors don't mark auth profiles as failed (unlike auth/billing errors) since they're infrastructure issues, not credential issues
- **Automatic retry on network errors**: Retries immediately with the same profile before attempting model failover
- **Production visibility**: Logs network errors for debugging without polluting session transcripts
- **HTTP 503 status mapping**: `FailoverError` reports HTTP status 503 for network errors
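The transient-versus-credential distinction in the list above can be illustrated with a minimal sketch. It is not the PR's actual code: `SketchFailoverReason`, `shouldMarkProfileFailed`, and `statusForReason` are hypothetical names, and the `"auth"` and `"billing"` literals are assumed spellings for the auth/billing errors mentioned above. Only the network-to-503 mapping is taken from this description.

```typescript
// Hypothetical reason union. `network` is the reason added by this PR;
// "auth" and "billing" are assumed spellings, not confirmed identifiers.
type SketchFailoverReason = "network" | "auth" | "billing";

// Network errors are infrastructure problems, so they should not count
// against the auth profile that happened to be in use.
function shouldMarkProfileFailed(reason: SketchFailoverReason): boolean {
  return reason === "auth" || reason === "billing";
}

// Only the network -> 503 mapping comes from this PR description;
// other reasons are deliberately left unmapped in this sketch.
function statusForReason(reason: SketchFailoverReason): number | undefined {
  return reason === "network" ? 503 : undefined;
}
```

Keeping the policy in small pure functions like these makes it straightforward to unit test separately from the run loop.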
### Enterprise Reliability Pattern
This fix implements the "transient fault handling" pattern essential for autonomous AI agents:
1. **Detection**: Classify network errors distinctly from auth/billing failures
2. **Retry**: Automatic retry with exponential backoff (via the existing retry loop)
3. **Visibility**: Log errors without corrupting session state
4. **Graceful degradation**: Fall back to alternate models only after retry exhaustion
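The four steps above can be sketched as a small retry helper. This is an illustration under stated assumptions, not the existing retry loop: `Attempt`, `retryTransient`, and the injected `sleep` parameter are hypothetical, and the doubling delay is just one common way to realize exponential backoff.

```typescript
// Outcome of one attempt; `transient` marks failures worth retrying (hypothetical shape).
type Attempt<T> = { ok: true; value: T } | { ok: false; transient: boolean };

// Retries `run` on transient failures with exponential backoff. When the
// budget is exhausted it returns the last failure so the caller can fall
// back to an alternate model.
async function retryTransient<T>(
  run: () => Promise<Attempt<T>>,
  maxRetries: number,
  baseDelayMs: number,
  sleep: (ms: number) => Promise<void>,
): Promise<{ result: Attempt<T>; retries: number }> {
  let retries = 0;
  for (;;) {
    const result = await run();
    // Success or a non-transient failure: stop retrying immediately.
    if (result.ok || !result.transient) return { result, retries };
    // Retry budget exhausted: hand control back for graceful degradation.
    if (retries >= maxRetries) return { result, retries };
    // Delay doubles on each retry: base, 2x base, 4x base, ...
    await sleep(baseDelayMs * 2 ** retries);
    retries += 1;
  }
}
```

Injecting `sleep` keeps the helper testable without real delays; a caller would typically pass a function that wraps a timer.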
### Error Patterns Now Handled
| Error Type | Example | Previous Behavior | New Behavior |
|------------|---------|-------------------|--------------|
| Fetch failure | `TypeError: fetch failed` | Silent death | Retry → Failover |
| DNS errors | `ENOTFOUND api.anthropic.com` | Silent death | Retry → Failover |
| Connection reset | `ECONNRESET` | Treated as timeout | Retry as network error |
| Gateway errors | `502 Bad Gateway` | Not handled | Retry → Failover |
| Socket errors | `socket hang up` | Silent death | Retry → Failover |
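As a rough illustration of how the patterns in the table might be recognized, the sketch below checks error codes, HTTP status, and message text. It is not the PR's `classifyFailoverReason`. `ErrorLike`, `NETWORK_CODES`, `NETWORK_MESSAGE`, and `looksLikeNetworkError` are hypothetical names, and the TLS/SSL pattern in particular is an assumption about what such messages contain.

```typescript
// Shape of the error-like values this sketch inspects (hypothetical).
interface ErrorLike {
  message?: string;
  code?: string;
  status?: number;
}

// Node.js error codes named in the PR description.
const NETWORK_CODES = ["ENOTFOUND", "ECONNRESET", "ECONNREFUSED", "ECONNABORTED"];

// Message fragments from the table above, plus an assumed TLS/SSL pattern.
const NETWORK_MESSAGE = /fetch failed|socket hang up|bad gateway|\b(?:tls|ssl)\b/i;

// Returns true when the error looks like a transient network failure.
function looksLikeNetworkError(err: ErrorLike): boolean {
  const code = err.code ?? "";
  const message = err.message ?? "";
  // A code may arrive as a property or only inside the message text.
  if (NETWORK_CODES.some((c) => c === code || message.indexOf(c) !== -1)) return true;
  // Gateway-style statuses listed in the description.
  if (err.status === 502 || err.status === 503) return true;
  return NETWORK_MESSAGE.test(message);
}
```

Checking the structured `code` and `status` fields before falling back to message matching keeps the classifier less sensitive to wording changes in error messages.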
## Test plan
- [x] Unit tests for `classifyFailoverReason` with network error patterns
- [x] Unit tests for `resolveFailoverReasonFromError` with error codes
- [x] Unit tests for `coerceToFailoverError` with 503 status mapping
- [x] Run full pi-embedded-runner test suite (129 tests pass)
## Related
Fixes #7185
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR extends the embedded agent failover system with a new `network` failover reason and broad network-error classification (message patterns + error codes). It updates `FailoverError` status mapping (network → 503) and teaches `runEmbeddedPiAgent` to treat network failures as transient (don’t mark auth profiles as failed; retry before attempting profile rotation/model fallback). Unit tests were added to cover the new classification paths.
The main risk is in `src/agents/pi-embedded-runner/run.ts`: network errors currently short-circuit the failover/rotation logic by `continue`-ing immediately, which can lead to unbounded retries during persistent outages and can also log “Profile undefined …” in aws-sdk/no-profile scenarios.
<h3>Confidence Score: 2/5</h3>
- Not safe to merge as-is because network failures are likely to cause infinite retries.
- While the error classification changes look coherent and are covered by tests, the new control-flow in `runEmbeddedPiAgent` can `continue` indefinitely on network errors without consuming a retry budget or progressing to auth-profile rotation/model fallback, which is a serious reliability regression under sustained network outages. There’s also a smaller logging correctness issue when `lastProfileId` is undefined.
- src/agents/pi-embedded-runner/run.ts
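One way to address both concerns is sketched below, assuming the loop tracks how many network retries it has used. This is not a patch for `run.ts`. `NextStep`, `decideNextStep`, `profileLabel`, and the `maxNetworkRetries` parameter are hypothetical names introduced for illustration.

```typescript
type NextStep = "retry-same-profile" | "rotate-or-fallback";

// Decides what the loop should do after a failure. Network errors may be
// retried, but only while a finite budget remains, so a persistent outage
// eventually progresses to profile rotation or model fallback.
function decideNextStep(
  isNetworkError: boolean,
  networkRetriesUsed: number,
  maxNetworkRetries: number,
): NextStep {
  if (isNetworkError && networkRetriesUsed < maxNetworkRetries) {
    return "retry-same-profile";
  }
  return "rotate-or-fallback";
}

// Builds a log label that stays readable when no profile is in use,
// instead of interpolating `undefined` into the message.
function profileLabel(lastProfileId: string | undefined): string {
  return lastProfileId === undefined ? "no auth profile" : `profile ${lastProfileId}`;
}
```

With a budget of 2, for example, the third consecutive network failure would return `"rotate-or-fallback"` rather than looping again.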
<!-- greptile_other_comments_section -->
**Context used:**
- Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8))
- Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=0d0c8278-ef8e-4d6c-ab21-f5527e322f13))
<!-- /greptile_comment -->
Most Similar PRs
| PR | Title | Author | Date | Similarity |
|----|-------|--------|------|------------|
| #9232 | Fix: Add automatic retry for network errors in message runs | vishaltandale00 | 2026-02-05 | 84.1% |
| #19077 | fix(agents): trigger model failover on connection-refused and netwo... | ayanesakura | 2026-02-17 | 83.0% |
| #12314 | fix: treat HTTP 5xx server errors as failover-worthy | hsssgdtc | 2026-02-09 | 82.1% |
| #10178 | fix: trigger fallback when model resolution fails with unknown model | Yida-Dev | 2026-02-06 | 81.9% |
| #15815 | Fallback LLM doesn't trigger if primary model is local | shihanqu | 2026-02-13 | 80.9% |
| #5031 | fix: add network connection error codes to failover classifier | shayan919293 | 2026-01-30 | 80.6% |
| #11821 | fix(auth): trigger failover on 401 status code from expired OAuth t... | AnonO6 | 2026-02-08 | 80.1% |
| #21033 | fix(failover): classify connection errors as timeout for model fail... | zerone0x | 2026-02-19 | 79.5% |
| #21152 | fix(agents): throw FailoverError for unknown model so fallback chai... | Mellowambience | 2026-02-19 | 79.3% |
| #4036 | fix: include cause detail in agent connection error diagnostic | anajuliabit | 2026-01-29 | 79.1% |