#17758: Fix crash on transient Discord gateway zombie connection errors
cli
stale
size: S
Cluster: Signal and Discord Fixes
## Summary
- **Problem:** `@buape/carbon`'s heartbeat timer throws `"Attempted to reconnect zombie connection"` inside a `setTimeout` callback when the WebSocket drops (e.g. network proxy switching nodes). This surfaces as an uncaught exception → `process.exit(1)`.
- **Why it matters:** Any transient network interruption kills the gateway; the built-in reconnect logic (50 attempts, exponential backoff) never gets a chance to recover.
- **What changed:** The three `uncaughtException` handlers now detect this specific error by matching its message and log a warning instead of crashing.
- **What did NOT change:** No changes to `@buape/carbon`, gateway reconnection logic, or any other error handling paths.
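The detection described above could look roughly like the following. This is a minimal sketch, not the PR's actual code: the function name `isTransientGatewayError` comes from this description, but the body and the `TRANSIENT_GATEWAY_PATTERNS` constant are assumptions. Only one of the two matched substrings is quoted in this PR, so the list is deliberately incomplete.

```typescript
// Hypothetical sketch of the detection helper; the real implementation
// is not reproduced in this description.
// Assumption: matching is done by substring on Error.message.
// Only one of the two carbon message variants is quoted in this PR,
// so the second pattern is intentionally left out here.
const TRANSIENT_GATEWAY_PATTERNS: readonly string[] = [
  "Attempted to reconnect zombie connection",
];

function isTransientGatewayError(err: unknown): boolean {
  // Non-Error values (strings, null, plain objects) are never treated as transient.
  if (!(err instanceof Error)) {
    return false;
  }
  return TRANSIENT_GATEWAY_PATTERNS.some((pattern) =>
    err.message.includes(pattern),
  );
}
```

Matching on a short list of exact substrings keeps the check narrow: anything that does not contain one of the listed messages still takes the existing fatal path.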
## Change Type (select all)
- [x] Bug fix
## User-visible / Behavior Changes
Gateway no longer crashes when network proxy tools (Clash, etc.) switch nodes mid-connection. A warning is logged instead:
```
[openclaw] Transient gateway error (non-fatal): Attempted to reconnect zombie connection...
```
## Security Impact (required)
- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
## Repro + Verification
### Environment
- OS: Any
- Runtime/container: Node 22+
- Integration/channel: Discord
### Steps
1. Run gateway with Discord connected
2. Switch network proxy node (or simulate WebSocket drop during heartbeat interval)
3. Observe gateway behavior
### Expected
- Warning logged, gateway reconnects automatically
### Actual (before fix)
- `process.exit(1)` on uncaught exception
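The reason this surfaced as an uncaught exception is that a throw inside a timer callback runs on a later call stack, so no `try`/`catch` around the scheduling call can intercept it. The sketch below illustrates that under stated assumptions: `startHeartbeat` and the injected `Schedule` type are hypothetical names used only for illustration, not part of `@buape/carbon`.

```typescript
// Illustrative only: a scheduler is injected so the sketch runs without
// real timers. In production this role is played by setTimeout.
type Schedule = (callback: () => void) => void;

// Hypothetical stand-in for a heartbeat that fails when its timer fires.
function startHeartbeat(schedule: Schedule): boolean {
  try {
    schedule(() => {
      // Stand-in for the library's heartbeat check failing.
      throw new Error("Attempted to reconnect zombie connection");
    });
    // Scheduling itself succeeds; nothing has thrown yet.
    return true;
  } catch {
    // Not reached when the scheduler defers the callback: the throw
    // happens later, outside this try block's call stack.
    return false;
  }
}
```

With a real timer, for example `startHeartbeat((cb) => setTimeout(cb, 0))`, the error escapes to Node's `uncaughtException` event, which is why the fix lives in those handlers rather than around the gateway startup code.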
## Evidence
- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)
New test file `src/infra/errors.test.ts`: 4 tests covering `isTransientGatewayError` for both carbon error variants, unrelated errors, and non-Error values.
## Compatibility / Migration
- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`
## Risks and Mitigations
- **Risk:** Swallowing an error that _should_ be fatal if the gateway is truly unrecoverable.
- **Mitigation:** Detection is narrowly scoped to two exact substrings from carbon's source. The gateway's own reconnect logic (50 attempts + backoff) handles recovery; if that also fails, `"Max reconnect attempts"` still triggers a fatal exit.
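To make the mitigation concrete, here is a hedged sketch of how an `uncaughtException` handler might branch. It is not the PR's code: `installHandler`, `ProcessLike`, and the inline `isTransientGatewayError` stand-in are assumptions, and only the one substring quoted in this description is matched. Anything else, including `"Max reconnect attempts"`, falls through to the fatal path.

```typescript
// Minimal structural type so the sketch can be exercised with a fake process.
interface ProcessLike {
  on(event: "uncaughtException", listener: (err: unknown) => void): unknown;
  exit(code: number): void;
}

// Assumed stand-in for the PR's helper; matches only the substring
// that is quoted in this description.
function isTransientGatewayError(err: unknown): boolean {
  return (
    err instanceof Error &&
    err.message.includes("Attempted to reconnect zombie connection")
  );
}

// Hypothetical wiring: warn on transient gateway errors, exit on everything else.
function installHandler(proc: ProcessLike, warn: (msg: string) => void): void {
  proc.on("uncaughtException", (err) => {
    if (isTransientGatewayError(err)) {
      const message = err instanceof Error ? err.message : String(err);
      warn(`[openclaw] Transient gateway error (non-fatal): ${message}`);
      return; // leave recovery to the gateway's own reconnect logic
    }
    // Unmatched errors (e.g. "Max reconnect attempts") remain fatal.
    proc.exit(1);
  });
}
```

In a Node entry point this would presumably be wired as something like `installHandler(process, console.warn)`; injecting the process and logger here is only a convenience for testing the branch logic.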
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds targeted error handling for transient Discord gateway zombie connection errors thrown by `@buape/carbon` during heartbeat reconnection attempts. The fix prevents unnecessary process crashes by catching these specific non-fatal errors in three entry points (`src/index.ts`, `src/cli/run-main.ts`, `src/macos/relay.ts`), logging a warning instead, and allowing the gateway's built-in reconnection logic (50 attempts with exponential backoff) to recover automatically.
**Key changes:**
- Added `isTransientGatewayError()` function that detects two specific error message patterns from the carbon library
- Integrated detection into existing `uncaughtException` handlers before the fatal `process.exit(1)` path
- Comprehensive test coverage with 4 test cases validating both matching and non-matching scenarios
- Narrowly scoped to avoid swallowing truly fatal errors like "Max reconnect attempts"
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The implementation is narrowly scoped with string-based detection of specific transient errors, well-tested with comprehensive coverage, and properly integrated into existing error handling paths without disrupting fatal error detection like "Max reconnect attempts"
- No files require special attention
<sub>Last reviewed commit: 1cfe8e9</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #15762: fix(discord): add circuit breaker for WebSocket resume loop (by funmerlin, 2026-02-13, 79.7% similar)
- #11101: fix: handle AbortError and WebSocket 1006 in unhandled rejection ha... (by Nipurn123, 2026-02-07, 78.9% similar)
- #23787: Handle transient Slack request errors without crashing the gateway (by graysurf, 2026-02-22, 78.8% similar)
- #7558: fix: Handle Grammy/Telegram network errors to prevent gateway crashes (by kaigritun, 2026-02-03, 78.2% similar)
- #4653: fix(gateway): improve crash resilience for mDNS and network errors (by AyedAlmudarra, 2026-01-30, 77.8% similar)
- #21163: Prevent Slack DNS errors from crashing the gateway (by graysurf, 2026-02-19, 77.8% similar)
- #10034: Don't crash gateway on transient unhandled fetch failures (by gigq, 2026-02-06, 76.6% similar)
- #21944: feat(gateway): crash-loop protection with escalating backoff (by Protocol-zero-0, 2026-02-20, 76.0% similar)
- #7563: fix: expand transient network error detection (by kaigritun, 2026-02-03, 75.9% similar)
- #12234: gateway: incident tracking, recover command, and ciao ERR_SERVER_CL... (by levineam, 2026-02-09, 75.8% similar)