
#17758: Fix crash on transient Discord gateway zombie connection errors

by DoyoDia open 2026-02-16 05:01
cli stale size: S
## Summary

- **Problem:** `@buape/carbon`'s heartbeat timer throws `"Attempted to reconnect zombie connection"` inside a `setTimeout` callback when the WebSocket drops (e.g. network proxy switching nodes). This surfaces as an uncaught exception → `process.exit(1)`.
- **Why it matters:** Any transient network interruption kills the gateway; the built-in reconnect logic (50 attempts, exponential backoff) never gets a chance to recover.
- **What changed:** The three `uncaughtException` handlers now detect this specific error class and log a warning instead of crashing.
- **What did NOT change:** No changes to `@buape/carbon`, gateway reconnection logic, or any other error handling paths.

## Change Type (select all)

- [x] Bug fix

## User-visible / Behavior Changes

Gateway no longer crashes when network proxy tools (Clash, etc.) switch nodes mid-connection. A warning is logged instead:

```
[openclaw] Transient gateway error (non-fatal): Attempted to reconnect zombie connection...
```

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`

## Repro + Verification

### Environment

- OS: Any
- Runtime/container: Node 22+
- Integration/channel: Discord

### Steps

1. Run gateway with Discord connected
2. Switch network proxy node (or simulate WebSocket drop during heartbeat interval)
3. Observe gateway behavior

### Expected

- Warning logged, gateway reconnects automatically

### Actual (before fix)

- `process.exit(1)` on uncaught exception

## Evidence

- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

New test file `src/infra/errors.test.ts`: 4 tests covering `isTransientGatewayError` for both carbon error variants, unrelated errors, and non-Error values.

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`

## Risks and Mitigations

- **Risk:** Swallowing an error that _should_ be fatal if the gateway is truly unrecoverable.
- **Mitigation:** Detection is narrowly scoped to two exact substrings from carbon's source. The gateway's own reconnect logic (50 attempts + backoff) handles recovery; if that also fails, `"Max reconnect attempts"` still triggers a fatal exit.

### Greptile Summary

Adds targeted error handling for transient Discord gateway zombie connection errors thrown by `@buape/carbon` during heartbeat reconnection attempts. The fix prevents unnecessary process crashes by catching these specific non-fatal errors in three entry points (`src/index.ts`, `src/cli/run-main.ts`, `src/macos/relay.ts`), logging a warning instead, and allowing the gateway's built-in reconnection logic (50 attempts with exponential backoff) to recover automatically.

**Key changes:**

- Added `isTransientGatewayError()` function that detects two specific error message patterns from the carbon library
- Integrated detection into existing `uncaughtException` handlers before the fatal `process.exit(1)` path
- Comprehensive test coverage with 4 test cases validating both matching and non-matching scenarios
- Narrowly scoped to avoid swallowing truly fatal errors like "Max reconnect attempts"

### Confidence Score: 5/5

- This PR is safe to merge with minimal risk
- The implementation is narrowly scoped with string-based detection of specific transient errors, well-tested with comprehensive coverage, and properly integrated into existing error handling paths without disrupting fatal error detection like "Max reconnect attempts"
- No files require special attention

Last reviewed commit: 1cfe8e9
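For reviewers who want a concrete picture of the detection described above, here is a minimal sketch of how a substring-based `isTransientGatewayError` check and the handler branch might look. It is not the code from this PR: the function body, the `TRANSIENT_GATEWAY_PATTERNS` list, and the `handleUncaughtException` helper with its injected `warn`/`exit` callbacks are assumptions for illustration. Only one of the two matched substrings is quoted in the description, so the second is deliberately left out rather than guessed.

```typescript
// Hypothetical sketch; the actual implementation in the PR may differ.
// Only the substring quoted in the PR description is listed. The PR says
// two substrings are matched, but the second is not shown in the text.
const TRANSIENT_GATEWAY_PATTERNS: readonly string[] = [
  "Attempted to reconnect zombie connection",
];

// Returns true only for Error instances whose message contains a known
// transient pattern; non-Error values and unrelated errors return false.
function isTransientGatewayError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  return TRANSIENT_GATEWAY_PATTERNS.some((p) => err.message.includes(p));
}

type Outcome = "warned" | "fatal";

// Illustrative handler body (name and signature are assumptions).
// `warn` and `exit` are injected so the branch can be exercised in tests
// without terminating the process.
function handleUncaughtException(
  err: unknown,
  warn: (msg: string) => void,
  exit: (code: number) => void,
): Outcome {
  if (err instanceof Error && isTransientGatewayError(err)) {
    // Non-fatal: log and let the gateway's own reconnect logic recover.
    warn(`[openclaw] Transient gateway error (non-fatal): ${err.message}`);
    return "warned";
  }
  // Everything else, including "Max reconnect attempts", stays fatal.
  exit(1);
  return "fatal";
}

// Wiring in a Node entry point would look roughly like:
// process.on("uncaughtException", (err) =>
//   handleUncaughtException(err, console.warn, (code) => process.exit(code)));
```

Injecting `warn` and `exit` is a design choice made for this sketch so the fatal and non-fatal paths can be asserted on directly; the three real handlers may simply call the logger and `process.exit(1)` inline.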
