#13910: fix(discord): harden gateway reconnect recovery

by BYWallace open 2026-02-11 06:17

channel: discord channel: mattermost size: L

## Summary

Paired upstream fix in the Carbon library as well: https://github.com/buape/carbon/pull/353

I was seeing OC hang repeatedly when running a bot in one of my Discord servers. Traced it down using dumped logs:

```
# DNS/network trigger
41 2026-02-10T04:52:04.753Z [discord] gateway error: Error: getaddrinfo ENOTFOUND gateway-us-east1-d.discord.gg

# Repeated resume-loop pattern (1000ms + 1005 close)
184 2026-02-09T17:53:19.421Z [discord] gateway: Attempting resume with backoff: 1000ms after code 1006
185 2026-02-09T17:53:59.115Z [discord] gateway: Attempting resume with backoff: 1000ms
186 2026-02-09T17:53:59.203Z [discord] gateway: WebSocket connection closed with code 1005
187 2026-02-09T17:54:22.462Z [discord] gateway: Attempting resume with backoff: 1000ms
188 2026-02-09T17:54:22.545Z [discord] gateway: WebSocket connection closed with code 1005
189 2026-02-09T17:54:41.568Z [discord] gateway: Attempting resume with backoff: 1000ms
190 2026-02-09T17:54:41.654Z [discord] gateway: WebSocket connection closed with code 1005

# HELLO stall during same failure window
261 2026-02-09T18:06:56.095Z [discord] gateway: Attempting resume with backoff: 1000ms
262 2026-02-09T18:06:56.097Z [discord] connection stalled: no HELLO received within 30000ms, forcing reconnect
263 2026-02-09T18:06:56.184Z [discord] gateway: WebSocket connection closed with code 1005
```

- add a Discord gateway recovery helper to detect repeated failed resume cycles and force a fresh IDENTIFY after consecutive failures
- add an outer retry supervisor with bounded retries/backoff and deterministic exhaustion behavior
- keep abort/shutdown safe by ignoring recovery events during teardown
- promote `connection stable` gateway debug markers to info logs for operational visibility

## Why

Issue [#13180](https://github.com/openclaw/openclaw/issues/13180) showed a long-running resume-loop failure mode where the bot could appear online but stop processing messages.
This change makes recovery bounded and explicit instead of looping indefinitely.

## Key implementation details

- new recovery module: `src/discord/monitor/gateway-recovery.ts`
  - tracks resume attempts/failures from gateway debug stream
  - trips after 3 consecutive resume failures (default)
  - clears stale session state and reconnects with fresh identify
  - separates stop-vs-retry error predicates
- provider integration in `src/discord/monitor/provider.ts`
  - wraps gateway run in outer supervisor loop (defaults: max retries 5, 10s initial backoff, 1.8 factor, 120s cap, jitter 0.2)
  - preserves existing HELLO watchdog and emits `connection stable after 30s`
  - guards against reconnect races during abort/shutdown
- logging update in `src/discord/gateway-logging.ts` (+ test)

## Tests

Fully tested for touched behavior.

Ran:

- `pnpm test src/discord/monitor/gateway-recovery.test.ts src/discord/gateway-logging.test.ts src/discord/monitor.gateway.test.ts src/discord/monitor.test.ts src/discord/monitor.slash.test.ts`

Results:

- 5 files passed
- 57 tests passed

Additional in-progress validation:

- Currently testing this branch against my local OpenClaw assistant setup for a day or two to validate the issue doesn't happen anymore.

## AI-assisted

- AI-assisted: yes (Codex)
- Degree of testing: fully tested (targeted suite above)
- I understand and reviewed the final code paths and failure modes
- Session logs/prompts: available on request

Closes https://github.com/openclaw/openclaw/issues/13688
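
## Illustrative sketch

For reviewers who want a quick mental model of the two mechanisms described under "Key implementation details", here is a minimal sketch. It is not the code in this PR: every name (`BackoffPolicy`, `computeBackoffMs`, `ResumeFailureTracker`, `supervise`) is hypothetical, the jitter is assumed to be a symmetric fraction of the delay, and `maxRetries` is assumed to count retries after the first run. Only the default numbers come from the description above.

```typescript
// Hypothetical names throughout; only the default values come from the PR text.
interface BackoffPolicy {
  initialMs: number;  // first retry delay
  factor: number;     // multiplier applied per retry
  maxMs: number;      // upper bound on any delay
  jitter: number;     // assumed: fraction of the delay, 0.2 = +/-20%
  maxRetries: number; // assumed: retries allowed after the first run
}

const DEFAULT_POLICY: BackoffPolicy = {
  initialMs: 10_000,
  factor: 1.8,
  maxMs: 120_000,
  jitter: 0.2,
  maxRetries: 5,
};

// Delay before retry `attempt` (1-based). `random` is injectable for tests.
function computeBackoffMs(
  attempt: number,
  policy: BackoffPolicy = DEFAULT_POLICY,
  random: () => number = Math.random,
): number {
  const base = Math.min(policy.maxMs, policy.initialMs * Math.pow(policy.factor, attempt - 1));
  const offset = base * policy.jitter * (random() * 2 - 1);
  return Math.round(Math.min(policy.maxMs, Math.max(0, base + offset)));
}

// Counts consecutive failed resumes and signals when to force a fresh IDENTIFY.
class ResumeFailureTracker {
  private consecutiveFailures = 0;
  private readonly threshold: number;

  constructor(threshold: number = 3) {
    this.threshold = threshold;
  }

  // Returns true when the threshold is reached; the counter then starts over.
  recordFailure(): boolean {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.threshold) {
      this.consecutiveFailures = 0;
      return true;
    }
    return false;
  }

  // A successful resume clears the streak.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }
}

type SuperviseResult = "completed" | "stopped" | "exhausted";

// Outer loop: run, back off on failure, give up after maxRetries retries.
async function supervise(
  runOnce: () => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  policy: BackoffPolicy = DEFAULT_POLICY,
  shouldStop: () => boolean = () => false,
): Promise<SuperviseResult> {
  for (let retriesUsed = 0; ; retriesUsed++) {
    if (shouldStop()) return "stopped";
    try {
      await runOnce();
      return "completed";
    } catch (_err) {
      // Abort/shutdown wins over retrying, so teardown never triggers a reconnect.
      if (shouldStop()) return "stopped";
      if (retriesUsed >= policy.maxRetries) return "exhausted";
      await sleep(computeBackoffMs(retriesUsed + 1, policy));
    }
  }
}
```

Under these assumptions, the un-jittered delays grow from 10s to about 105s across the five retries, and any further retry would be clamped to the 120s cap. Injecting `sleep`, `random`, and `shouldStop` keeps the loop testable without real timers, which is one plausible way to get the deterministic exhaustion behavior the summary mentions.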