#15762: fix(discord): add circuit breaker for WebSocket resume loop
channel: discord
size: S
Cluster: Signal and Discord Fixes
## Problem
The Discord WebSocket connection can enter an unrecoverable resume loop where it endlessly retries with a stale session token. Observed in production: **1,400+ reconnect attempts over 12+ hours** before manual intervention.
### Root cause
When the WS opens but never receives a HELLO (Discord gateway stall), OpenClaw's zombie timeout handler calls `gateway.disconnect()` + `gateway.connect(false)` to force a reconnect. However, the underlying library (`@buape/carbon`) resets `reconnectAttempts = 0` on every WebSocket open event, so the library's own circuit breaker (`maxAttempts`) is never reached.
The zombie timeout therefore creates an infinite loop:
1. Connect → WS opens → counter resets to 0
2. No HELLO arrives within 30s → zombie timeout fires
3. Disconnect + reconnect (resume) with the stale session token → resume fails silently
4. Go to 1
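The loop above can be reproduced with a toy model. This is a minimal sketch, not `@buape/carbon`'s actual code: `ModelGateway`, `simulateStalls`, and the `maxAttempts` value of 5 are hypothetical names and numbers chosen for illustration. It only shows why a counter that resets on every open event can never reach its limit.

```typescript
// Hypothetical model of the failure mode; NOT the library's real implementation.
class ModelGateway {
  reconnectAttempts = 0;
  readonly maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  // Models the library resetting its counter on every WebSocket open event.
  onOpen(): void {
    this.reconnectAttempts = 0;
  }

  // Returns false once the library's own breaker would give up.
  reconnect(): boolean {
    this.reconnectAttempts += 1;
    return this.reconnectAttempts <= this.maxAttempts;
  }
}

// Runs `stalls` iterations of the loop and reports the highest counter value seen.
function simulateStalls(gateway: ModelGateway, stalls: number): number {
  let peak = 0;
  for (let i = 0; i < stalls; i++) {
    gateway.onOpen(); // step 1: WS opens, counter resets to 0
    // step 2: no HELLO arrives, so the zombie timeout fires
    if (!gateway.reconnect()) break; // step 3: forced reconnect
    peak = Math.max(peak, gateway.reconnectAttempts);
  }
  return peak;
}

// maxAttempts = 5 is an arbitrary illustrative limit.
const peakAttempts = simulateStalls(new ModelGateway(5), 1000);
console.log(peakAttempts); // 1
```

Even after 1,000 simulated stalls the counter never exceeds 1, so any `maxAttempts` above 1 is unreachable in this model.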
### Timeline from production logs (Feb 13, 2026)
| Time | Event | Stable period / attempts |
|------|-------|----------|
| 07:30 | Gateway boot, Discord login | 25 min stable |
| 07:55 | First stall → resume loop begins | 341 attempts |
| 10:26 | Briefly self-recovers | 6 min stable |
| 10:32 | Loop resumes | 222 attempts |
| 13:54 | Self-recovers | 3 min stable |
| 13:57 | Loop resumes | 73 attempts |
| 15:45 | Self-recovers | 2 min stable |
| 15:48 | Loop resumes | 79 attempts |
| 17:09 | Full gateway restart | 2.5+ hours stable |
Total: **717 resume attempts**, **36 connection stalls**, **708 WS closes with code 1005**.
## Fix
Add an application-level circuit breaker to the zombie timeout handler:
- Track consecutive stalls (WS opens but no HELLO within 30s)
- After 5 consecutive stalls, **invalidate the session state** (`sessionId`, `resumeGatewayUrl`) and force a fresh `IDENTIFY` instead of trying to resume with a dead session token
- Log stall count on each attempt for observability
- Reset counter on successful HELLO receipt
This breaks the loop because a fresh IDENTIFY creates a new session rather than trying to resume a stale one.
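The breaker logic described above might look roughly like the following. This is a sketch under stated assumptions rather than the code in the PR: the `GatewayLike` interface and the `StallCircuitBreaker` class are hypothetical, and resetting the counter after a trip is an assumption the description does not confirm. `MAX_STALL_RETRIES`, `consecutiveStalls`, `sessionId`, and `resumeGatewayUrl` are taken from the description.

```typescript
// Minimal stand-in for the gateway. The field and method names follow the PR
// description, but this interface is an assumption, not the library's API.
interface GatewayLike {
  state: { sessionId: string | null; resumeGatewayUrl: string | null };
  disconnect(): void;
  connect(resume: boolean): void;
}

const MAX_STALL_RETRIES = 5;

class StallCircuitBreaker {
  private consecutiveStalls = 0;
  private readonly gateway: GatewayLike;
  private readonly log: (msg: string) => void;

  constructor(gateway: GatewayLike, log: (msg: string) => void = () => {}) {
    this.gateway = gateway;
    this.log = log;
  }

  // Call when HELLO arrives: the connection is healthy again.
  onHello(): void {
    this.consecutiveStalls = 0;
  }

  // Call when the HELLO timeout fires. Returns true if the breaker tripped.
  onStallTimeout(): boolean {
    this.consecutiveStalls += 1;
    this.log(`stall ${this.consecutiveStalls}/${MAX_STALL_RETRIES}`);
    const tripped = this.consecutiveStalls >= MAX_STALL_RETRIES;
    if (tripped) {
      // Drop the stale session so the next connect has to IDENTIFY afresh.
      this.gateway.state.sessionId = null;
      this.gateway.state.resumeGatewayUrl = null;
      this.consecutiveStalls = 0; // assumption: start a new window after a trip
    }
    this.gateway.disconnect();
    this.gateway.connect(false);
    return tripped;
  }
}

// Usage with a fake gateway that only counts connect() calls.
const fake: GatewayLike & { connects: number } = {
  state: { sessionId: "stale", resumeGatewayUrl: "wss://example.invalid" },
  connects: 0,
  disconnect() {},
  connect() {
    this.connects += 1;
  },
};
const breaker = new StallCircuitBreaker(fake);
const results: boolean[] = [];
for (let i = 0; i < 5; i++) results.push(breaker.onStallTimeout());
console.log(results.join(","), fake.state.sessionId);
```

In this sketch the first four stalls reconnect with the session intact, and the fifth clears both session fields before reconnecting.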
## Changes
- `src/discord/monitor/provider.ts`: Added `MAX_STALL_RETRIES` (5) and `consecutiveStalls` counter to the zombie timeout handler. On circuit breaker trip, nullifies `gateway.state.sessionId` and `gateway.state.resumeGatewayUrl` before reconnecting.
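Because the session fields live on the library's internal `gateway.state`, one defensive option is to check the shape before mutating it. The helper below is a hypothetical sketch, not the code in this PR: `invalidateSession` and the `sequence` field in the usage example are illustrative names only.

```typescript
type SessionFields = { sessionId?: unknown; resumeGatewayUrl?: unknown };

// Clears the resume-related fields if `gateway.state` has the expected shape.
// Returns false, and changes nothing, when the shape is unexpected.
function invalidateSession(gateway: unknown): boolean {
  if (typeof gateway !== "object" || gateway === null) return false;
  const state = (gateway as { state?: unknown }).state;
  if (typeof state !== "object" || state === null) return false;
  if (!("sessionId" in state) || !("resumeGatewayUrl" in state)) return false;
  const fields = state as SessionFields;
  fields.sessionId = null;
  fields.resumeGatewayUrl = null;
  return true;
}

// Usage: `sequence` is a made-up extra field to show that other state is untouched.
const gw: {
  state: { sessionId: string | null; resumeGatewayUrl: string | null; sequence: number };
} = {
  state: { sessionId: "abc", resumeGatewayUrl: "wss://example.invalid", sequence: 42 },
};
const cleared = invalidateSession(gw);
const rejected = invalidateSession({ state: {} });
console.log(cleared, rejected, gw.state.sequence);
```

A `false` return gives the caller a chance to log that the library's internals have changed instead of silently writing to fields that no longer exist.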
Fixes #13180
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This change adds an application-level circuit breaker to Discord gateway zombie-connection handling in `src/discord/monitor/provider.ts`. It tracks consecutive stalls where the WebSocket opens but no HELLO is observed within 30s; after 5 stalls it clears the gateway session identifiers before reconnecting, forcing a fresh IDENTIFY instead of endlessly resuming a stale session.
The logic is implemented by listening to gateway `debug` messages, resetting the stall counter on HELLO-related markers, and incrementing/reconnecting when the HELLO timeout expires after a connection-open event.
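As a rough illustration of the debug-message approach, the sketch below classifies messages and tracks whether a HELLO followed the last open event. The marker strings, `classifyDebugMessage`, and `HelloWatch` are all hypothetical; the real debug text emitted by the library is not shown here. Timestamps are passed in explicitly so the logic can be exercised without real timers.

```typescript
type DebugEvent = "open" | "hello" | "other";

// Hypothetical marker strings, chosen for illustration only.
const OPEN_MARKER = "WebSocket connection opened";
const HELLO_MARKER = "Received HELLO";

function classifyDebugMessage(message: string): DebugEvent {
  if (message.includes(HELLO_MARKER)) return "hello";
  if (message.includes(OPEN_MARKER)) return "open";
  return "other";
}

class HelloWatch {
  private openedAt: number | null = null;
  private readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    this.timeoutMs = timeoutMs;
  }

  // Feed every debug message together with the time it was seen.
  observe(message: string, nowMs: number): void {
    const kind = classifyDebugMessage(message);
    if (kind === "open") this.openedAt = nowMs;
    if (kind === "hello") this.openedAt = null;
  }

  // True when a socket opened and no HELLO followed within the timeout.
  isStalled(nowMs: number): boolean {
    return this.openedAt !== null && nowMs - this.openedAt >= this.timeoutMs;
  }
}

const watch = new HelloWatch(30000);
watch.observe("WebSocket connection opened", 0);
const stalledEarly = watch.isStalled(10000);
const stalledLate = watch.isStalled(30000);
watch.observe("Received HELLO", 31000);
const stalledAfterHello = watch.isStalled(60000);
console.log(stalledEarly, stalledLate, stalledAfterHello);
```

Matching on substrings is fragile by nature: if the library rewords its debug output, HELLO detection would stop working without any error being raised.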
<h3>Confidence Score: 3/5</h3>
- This PR is likely safe, but depends on @buape/carbon gateway internals and debug message formats.
- The change is small and localized, but it relies on parsing gateway debug strings for HELLO detection and directly mutating `gateway.state` fields via casts. In this environment the carbon implementation is not available to verify that these markers always appear and that `state.sessionId`/`state.resumeGatewayUrl` exist and are intended to be mutated, so there is residual integration risk.
- src/discord/monitor/provider.ts
<sub>Last reviewed commit: 3cc5974</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#10731: fix(discord): add outer retry loop for gateway reconnect exhaustion
by Milofax · 2026-02-06
83.0%
#17758: Fix crash on transient Discord gateway zombie connection errors
by DoyoDia · 2026-02-16
79.7%
#13910: fix(discord): harden gateway reconnect recovery
by BYWallace · 2026-02-11
79.2%
#20967: fix(discord): report connected state so health-monitor can restart ...
by who96 · 2026-02-19
76.9%
#16736: fix: stagger multi-account channel startup to avoid Discord rate li...
by rm289 · 2026-02-15
74.9%
#13084: fix(daemon): multi-layer defense against zombie gateway processes
by openperf · 2026-02-10
73.9%
#12234: gateway: incident tracking, recover command, and ciao ERR_SERVER_CL...
by levineam · 2026-02-09
73.9%
#16125: feat(gateway): add stuck session detection
by CyberSinister · 2026-02-14
73.9%
#14993: fix(webchat): add heartbeat detection to prevent zombie WebSocket c...
by BenediktSchackenberg · 2026-02-12
73.7%
#6302: fix: Add timeouts to prevent indefinite hangs (issues #4954, #4956,...
by batumilove · 2026-02-01
73.3%