#13910: fix(discord): harden gateway reconnect recovery
channel: discord
channel: mattermost
size: L
Cluster: Signal and Discord Fixes
## Summary
Paired upstream fix in the Carbon library as well:
https://github.com/buape/carbon/pull/353
I was seeing OpenClaw hang repeatedly when running a bot in one of my Discord servers. I tracked it down using dumped logs:
```
# DNS/network trigger
41 2026-02-10T04:52:04.753Z [discord] gateway error: Error: getaddrinfo ENOTFOUND gateway-us-east1-d.discord.gg
# Repeated resume-loop pattern (1000ms + 1005 close)
184 2026-02-09T17:53:19.421Z [discord] gateway: Attempting resume with backoff: 1000ms after code 1006
185 2026-02-09T17:53:59.115Z [discord] gateway: Attempting resume with backoff: 1000ms
186 2026-02-09T17:53:59.203Z [discord] gateway: WebSocket connection closed with code 1005
187 2026-02-09T17:54:22.462Z [discord] gateway: Attempting resume with backoff: 1000ms
188 2026-02-09T17:54:22.545Z [discord] gateway: WebSocket connection closed with code 1005
189 2026-02-09T17:54:41.568Z [discord] gateway: Attempting resume with backoff: 1000ms
190 2026-02-09T17:54:41.654Z [discord] gateway: WebSocket connection closed with code 1005
# HELLO stall during same failure window
261 2026-02-09T18:06:56.095Z [discord] gateway: Attempting resume with backoff: 1000ms
262 2026-02-09T18:06:56.097Z [discord] connection stalled: no HELLO received within 30000ms, forcing reconnect
263 2026-02-09T18:06:56.184Z [discord] gateway: WebSocket connection closed with code 1005
```
- add a Discord gateway recovery helper to detect repeated failed resume cycles and force a fresh IDENTIFY after consecutive failures
- add an outer retry supervisor with bounded retries/backoff and deterministic exhaustion behavior
- keep abort/shutdown safe by ignoring recovery events during teardown
- promote `connection stable` gateway debug markers to info logs for operational visibility
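As a rough illustration of the first bullet, here is a minimal sketch of a consecutive-failure detector driven by gateway debug lines. The class name `ResumeFailureTracker`, its `observe` method, and the substring matching are assumptions made for this example and are not taken from the actual recovery helper. The matched strings are the ones visible in the log excerpt above.

```typescript
// Hypothetical sketch: counts resume attempts that end in a socket close
// and signals when a fresh IDENTIFY should be forced.
class ResumeFailureTracker {
  private consecutiveFailures = 0;
  private resumePending = false;

  constructor(private readonly threshold: number = 3) {}

  // Feed one gateway debug line. Returns true when the threshold of
  // consecutive failed resumes has just been reached.
  observe(message: string): boolean {
    if (message.includes("Attempting resume")) {
      this.resumePending = true;
      return false;
    }
    if (message.includes("WebSocket connection closed") && this.resumePending) {
      // A close right after a resume attempt counts as one failed resume.
      this.resumePending = false;
      this.consecutiveFailures += 1;
      if (this.consecutiveFailures >= this.threshold) {
        this.consecutiveFailures = 0;
        return true;
      }
      return false;
    }
    if (message.includes("connection stable")) {
      // A stable connection ends the failure streak.
      this.resumePending = false;
      this.consecutiveFailures = 0;
    }
    return false;
  }
}
```

A caller would subscribe this to the gateway debug stream and, when `observe` returns true, clear the stored session state before reconnecting. Lines that do not match any of the three patterns are ignored.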
## Why
Issue [#13180](https://github.com/openclaw/openclaw/issues/13180) showed a long-running resume-loop failure mode where the bot could appear online but stop processing messages. This change makes recovery bounded and explicit instead of looping indefinitely.
## Key implementation details
- new recovery module: `src/discord/monitor/gateway-recovery.ts`
  - tracks resume attempts/failures from gateway debug stream
  - trips after 3 consecutive resume failures (default)
  - clears stale session state and reconnects with fresh identify
  - separates stop-vs-retry error predicates
- provider integration in `src/discord/monitor/provider.ts`
  - wraps gateway run in outer supervisor loop (defaults: max retries 5, 10s initial backoff, 1.8 factor, 120s cap, jitter 0.2)
  - preserves existing HELLO watchdog and emits `connection stable after 30s`
  - guards against reconnect races during abort/shutdown
- logging update in `src/discord/gateway-logging.ts` (+ test)
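To make the supervisor defaults concrete, the following sketch computes a retry delay from the numbers listed above (10s initial backoff, 1.8 factor, 120s cap, jitter 0.2, max retries 5). The names `BackoffPolicy`, `DEFAULT_POLICY`, and `backoffDelayMs` are hypothetical, and applying the jitter symmetrically around the base delay is an assumption; the code in `provider.ts` may apply it differently.

```typescript
// Hypothetical policy shape; field names are illustrative only.
interface BackoffPolicy {
  initialMs: number;
  factor: number;
  maxMs: number;
  jitter: number;
  maxRetries: number;
}

// Values taken from the defaults listed in this PR description.
const DEFAULT_POLICY: BackoffPolicy = {
  initialMs: 10_000,
  factor: 1.8,
  maxMs: 120_000,
  jitter: 0.2,
  maxRetries: 5,
};

// Delay before retry number `attempt` (1-based).
// Assumption: jitter is a symmetric +/- fraction of the capped base delay.
function backoffDelayMs(
  attempt: number,
  policy: BackoffPolicy = DEFAULT_POLICY,
  random: () => number = Math.random,
): number {
  const exponential = policy.initialMs * Math.pow(policy.factor, attempt - 1);
  const base = Math.min(policy.maxMs, exponential);
  const offset = (random() * 2 - 1) * base * policy.jitter;
  const jittered = Math.max(0, base + offset);
  return Math.round(Math.min(policy.maxMs, jittered));
}
```

Without jitter, the five allowed retries wait roughly 10s, 18s, 32.4s, 58.3s, and 105s, so the 120s cap is not reached within the default retry budget. Passing `random` as a parameter keeps the function deterministic in tests.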
## Tests
The touched behavior is covered by the targeted suite below.
Ran:
- `pnpm test src/discord/monitor/gateway-recovery.test.ts src/discord/gateway-logging.test.ts src/discord/monitor.gateway.test.ts src/discord/monitor.test.ts src/discord/monitor.slash.test.ts`
Results:
- 5 files passed
- 57 tests passed
Additional in-progress validation:
- I am running this branch against my local OpenClaw assistant setup for a day or two to confirm the issue no longer occurs.
## AI-assisted
- AI-assisted: yes (Codex)
- Degree of testing: fully tested (targeted suite above)
- I understand and reviewed the final code paths and failure modes
- Session logs/prompts: available on request
Closes https://github.com/openclaw/openclaw/issues/13688