#15762: fix(discord): add circuit breaker for WebSocket resume loop

by funmerlin open 2026-02-13 21:11

channel: discord size: S

## Problem

The Discord WebSocket connection can enter an unrecoverable resume loop where it endlessly retries with a stale session token. Observed in production: **1,400+ reconnect attempts over 12+ hours** before manual intervention.

### Root cause

When the WS opens but never receives a HELLO (Discord gateway stall), OpenClaw's zombie timeout handler calls `gateway.disconnect()` + `gateway.connect(false)` to force a reconnect. However, the underlying library (`@buape/carbon`) resets `reconnectAttempts = 0` on every WebSocket open event, so the library's own circuit breaker (`maxAttempts`) is never reached. The zombie timeout effectively creates an infinite loop:

1. Connect → WS opens → counter resets to 0
2. No HELLO arrives within 30s → zombie timeout fires
3. Disconnect + reconnect (resume) → go to 1
4. Session token is stale → resume always fails silently

### Timeline from production logs (Feb 13, 2026)

| Time | Event | Duration / attempts |
|------|-------|---------------------|
| 07:30 | Gateway boot, Discord login | 25 min stable |
| 07:55 | First stall → resume loop begins | 341 attempts |
| 10:26 | Briefly self-recovers | 6 min stable |
| 10:32 | Loop resumes | 222 attempts |
| 13:54 | Self-recovers | 3 min stable |
| 13:57 | Loop resumes | 73 attempts |
| 15:45 | Self-recovers | 2 min stable |
| 15:48 | Loop resumes | 79 attempts |
| 17:09 | Full gateway restart | 2.5+ hours stable |

Total: **717 resume attempts**, **36 connection stalls**, **708 WS close code 1005**.
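The counter-reset behaviour described under "Root cause" can be illustrated with a small simulation. This is a hypothetical model written for this description, not code from `@buape/carbon` or OpenClaw: `ModelGateway`, `reconnect`, `onOpen`, and `simulateZombieTimeouts` are invented names, and the only behaviour taken from the report is that the attempt counter resets to 0 on every WebSocket open.

```typescript
// Hypothetical model of the loop; NOT @buape/carbon's actual implementation.
class ModelGateway {
  reconnectAttempts = 0;
  maxAttempts: number;

  constructor(maxAttempts: number) {
    this.maxAttempts = maxAttempts;
  }

  // Counts one attempt, then "opens" the socket. Returns false only if the
  // library-level breaker (maxAttempts) would trip.
  reconnect(): boolean {
    this.reconnectAttempts += 1;
    if (this.reconnectAttempts > this.maxAttempts) {
      return false;
    }
    this.onOpen();
    return true;
  }

  // The reported behaviour: every WebSocket open resets the counter.
  private onOpen(): void {
    this.reconnectAttempts = 0;
  }
}

// Each zombie timeout (no HELLO within the window) forces one reconnect.
function simulateZombieTimeouts(gateway: ModelGateway, timeouts: number): number {
  let reconnects = 0;
  for (let i = 0; i < timeouts; i++) {
    if (!gateway.reconnect()) break;
    reconnects += 1;
  }
  return reconnects;
}

const model = new ModelGateway(5);
const reconnects = simulateZombieTimeouts(model, 1000);
console.log(reconnects, model.reconnectAttempts);
```

In this model the counter never exceeds 1 before being reset, so `maxAttempts` is never reached regardless of how many timeouts fire. That is why the breaker has to live in the application layer, where a stall is counted only when no HELLO follows the open.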
## Fix

Add an application-level circuit breaker to the zombie timeout handler:

- Track consecutive stalls (WS opens but no HELLO within 30s)
- After 5 consecutive stalls, **invalidate the session state** (`sessionId`, `resumeGatewayUrl`) and force a fresh `IDENTIFY` instead of trying to resume with a dead session token
- Log stall count on each attempt for observability
- Reset counter on successful HELLO receipt

This breaks the loop because a fresh IDENTIFY creates a new session rather than trying to resume a stale one.

## Changes

- `src/discord/monitor/provider.ts`: Added `MAX_STALL_RETRIES` (5) and `consecutiveStalls` counter to the zombie timeout handler. On circuit breaker trip, nullifies `gateway.state.sessionId` and `gateway.state.resumeGatewayUrl` before reconnecting.

Fixes #13180

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This change adds an application-level circuit breaker to Discord gateway zombie-connection handling in `src/discord/monitor/provider.ts`. It tracks consecutive stalls where the WebSocket opens but no HELLO is observed within 30s; after 5 stalls it clears the gateway session identifiers before reconnecting, forcing a fresh IDENTIFY instead of endlessly resuming a stale session.

The logic is implemented by listening to gateway `debug` messages, resetting the stall counter on HELLO-related markers, and incrementing/reconnecting when the HELLO timeout expires after a connection-open event.

<h3>Confidence Score: 3/5</h3>

- This PR is likely safe, but depends on @buape/carbon gateway internals and debug message formats.
- The change is small and localized, but it relies on parsing gateway debug strings for HELLO detection and directly mutating `gateway.state` fields via casts. In this environment the carbon implementation is not available to verify that these markers always appear and that `state.sessionId`/`state.resumeGatewayUrl` exist and are intended to be mutated, so there is residual integration risk.
- src/discord/monitor/provider.ts

<sub>Last reviewed commit: 3cc5974</sub>
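For readers without the diff open, the circuit breaker described under "Fix" might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the code in `provider.ts`: only `MAX_STALL_RETRIES`, `consecutiveStalls`, `sessionId`, and `resumeGatewayUrl` come from the PR description, while `SessionState`, `StallBreaker`, `onHello`, `onStall`, and `ReconnectMode` are hypothetical names. Tripping on the fifth stall and resetting the counter after a trip are also assumptions.

```typescript
// Hypothetical sketch of the stall breaker; not the actual provider.ts change.
const MAX_STALL_RETRIES = 5;

interface SessionState {
  sessionId: string | null;
  resumeGatewayUrl: string | null;
}

type ReconnectMode = "resume" | "identify";

class StallBreaker {
  consecutiveStalls = 0;
  private state: SessionState;

  constructor(state: SessionState) {
    this.state = state;
  }

  // Call when a HELLO is observed: the connection is healthy again.
  onHello(): void {
    this.consecutiveStalls = 0;
  }

  // Call when the socket opened but no HELLO arrived before the timeout.
  onStall(): ReconnectMode {
    this.consecutiveStalls += 1;
    console.warn(`gateway stall ${this.consecutiveStalls}/${MAX_STALL_RETRIES}`);
    if (this.consecutiveStalls >= MAX_STALL_RETRIES) {
      // Drop the stale session so the next connect cannot resume it.
      this.state.sessionId = null;
      this.state.resumeGatewayUrl = null;
      this.consecutiveStalls = 0; // assumption: start counting afresh after a trip
      return "identify";
    }
    return "resume";
  }
}

const state: SessionState = {
  sessionId: "stale-session",
  resumeGatewayUrl: "wss://example.invalid",
};
const breaker = new StallBreaker(state);
const modes: ReconnectMode[] = [];
for (let i = 0; i < MAX_STALL_RETRIES; i++) {
  modes.push(breaker.onStall());
}
console.log(modes, state);
```

The sketch keeps the breaker separate from the gateway object so that the session fields can be cleared without touching the library's reconnect counter. Whether clearing these two fields is enough to make `@buape/carbon` send a fresh IDENTIFY is the integration risk the review above calls out.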