#10731: fix(discord): add outer retry loop for gateway reconnect exhaustion
channel: discord
stale
Cluster: Signal and Discord Fixes
## Summary
When a Discord gateway connection receives error code 1010 (Cloud Load Balancer) and the built-in resume/reconnect attempts are exhausted, the bot enters a permanent reconnect loop with no recovery path.
This PR adds:
- An **outer retry loop** that catches gateway exhaustion and creates a completely fresh gateway connection (new identify, new session)
- A **"connection stable" log marker** emitted after 60s of healthy connection, useful for monitoring
Fixes the infinite reconnect loop that follows a WebSocket 1010 disconnect during Discord infrastructure maintenance windows.
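For reviewers, here is a minimal sketch of how a "connection stable" marker like the one described above could be driven by a timer. The class name `StableConnectionMarker`, its methods, and the callback are hypothetical and not taken from this PR; only the 60s threshold comes from the description.

```typescript
// Hypothetical helper, not the PR's actual code: fires `onStable` once
// if the connection stays up for the whole stability window.
const STABLE_AFTER_MS = 60_000; // 60s, per the PR description

class StableConnectionMarker {
  private timer: ReturnType<typeof setTimeout> | undefined = undefined;
  private readonly onStable: () => void;
  private readonly stableAfterMs: number;

  constructor(onStable: () => void, stableAfterMs: number = STABLE_AFTER_MS) {
    this.onStable = onStable;
    this.stableAfterMs = stableAfterMs;
  }

  /** Call when the gateway reports a successful connect. */
  connected(): void {
    this.disconnected(); // restart the window on every new connect
    this.timer = setTimeout(() => {
      this.timer = undefined;
      this.onStable();
    }, this.stableAfterMs);
  }

  /** Call on disconnect or shutdown so a flapping connection never emits the marker. */
  disconnected(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
  }

  /** True while a stability window is still counting down. */
  isPending(): boolean {
    return this.timer !== undefined;
  }
}
```

In this sketch the caller would pass a callback that writes the log line, call `connected()` after a successful connect, and call `disconnected()` on any close event, so the marker only appears for connections that survived the full window.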
## Changes
- `src/discord/monitor/provider.ts` — wrap gateway lifecycle in outer retry with exponential backoff (5s → 30min cap), fresh gateway on each attempt
- `src/discord/gateway-logging.ts` — add `connection_stable` event type
- `src/discord/gateway-logging.test.ts` — update test for new event
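The backoff schedule and outer loop described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `provider.ts`: the names `outerBackoffDelayMs`, `runWithOuterRetry`, `runGateway`, `sleep`, and `maxAttempts` are hypothetical, and the doubling factor is assumed (the PR only states the 5s start and 30min cap).

```typescript
// Hypothetical names throughout; the real provider.ts may differ.
const BASE_DELAY_MS = 5_000; // first outer retry waits 5s
const MAX_DELAY_MS = 30 * 60 * 1_000; // delays are capped at 30 minutes

/** Delay before outer retry `attempt` (0-based): 5s, 10s, 20s, ... capped at 30min. */
function outerBackoffDelayMs(attempt: number): number {
  // Clamp the exponent so very large attempt counts cannot overflow.
  const exponent = Math.min(Math.max(attempt, 0), 30);
  return Math.min(BASE_DELAY_MS * 2 ** exponent, MAX_DELAY_MS);
}

/**
 * Calls `runGateway` until it resolves. Each call is expected to build a
 * fresh gateway (new identify, new session) and to reject once the
 * built-in reconnect attempts are exhausted.
 */
async function runWithOuterRetry(
  runGateway: () => Promise<void>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts: number,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await runGateway(); // resolves on clean shutdown
      return;
    } catch (err) {
      // Out of outer retries: rethrow so the caller can mark the channel dead.
      if (attempt === maxAttempts - 1) throw err;
      await sleep(outerBackoffDelayMs(attempt));
    }
  }
}
```

Injecting `sleep` keeps the loop testable without real timers. With doubling, the cap is reached on the tenth attempt (index 9), since 5s × 2⁸ is still below 30 minutes and 5s × 2⁹ is above it.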
## Test plan
- [x] Deployed and running on 11 Discord bot agents in production
- [x] All bots successfully connected after deploy
- [ ] Verify no reconnect loops during next Discord maintenance window
- [ ] Monitor for "connection stable" log marker appearing after 60s
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
- Adds an outer retry loop around the Discord gateway lifecycle to recover from Carbon reconnect exhaustion by recreating a fresh `Client`/`GatewayPlugin`.
- Introduces exponential backoff between outer retries and limits the number of outer retries before throwing to mark the channel as dead.
- Extends gateway logging to promote a new `connection stable` debug marker to info-level logs and updates the corresponding unit test.
<h3>Confidence Score: 3/5</h3>
- This PR is close to mergeable but has a shutdown-path bug that can cause unexpected throws during abort.
- The core retry-loop approach and logging changes are straightforward, but `sleepWithAbort()` throws on abort and that exception is not handled in the new outer backoff path, so clean shutdown can turn into an error exit when abort happens during backoff sleep.
- File needing attention: `src/discord/monitor/provider.ts`
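One possible way to address the shutdown-path concern above is to treat an abort during the backoff sleep as a clean exit rather than an error. The sketch below is only an illustration: `sleepOrReject` is a stand-in written for this example (the real `sleepWithAbort()` signature is not shown in this PR), and `backoffUnlessAborted` is a hypothetical wrapper name.

```typescript
/** Stand-in for the PR's sleepWithAbort(): rejects if the signal aborts first. */
function sleepOrReject(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    if (signal.aborted) {
      reject(new Error("aborted"));
      return;
    }
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve();
    }, ms);
    function onAbort(): void {
      clearTimeout(timer);
      reject(new Error("aborted"));
    }
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

/** Resolves true if the full delay elapsed, false if shutdown was requested meanwhile. */
async function backoffUnlessAborted(ms: number, signal: AbortSignal): Promise<boolean> {
  try {
    await sleepOrReject(ms, signal);
    return true;
  } catch (err) {
    if (signal.aborted) return false; // abort means clean shutdown, not a failure
    throw err; // anything else is unexpected and should surface
  }
}
```

The outer loop could then return quietly when the wrapper resolves `false`, for example `if (!(await backoffUnlessAborted(delay, signal))) return;`, so an abort during backoff does not become an error exit.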
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #15762: fix(discord): add circuit breaker for WebSocket resume loop (by funmerlin · 2026-02-13 · 83.0%)
- #16736: fix: stagger multi-account channel startup to avoid Discord rate li... (by rm289 · 2026-02-15 · 81.4%)
- #13910: fix(discord): harden gateway reconnect recovery (by BYWallace · 2026-02-11 · 79.8%)
- #20967: fix(discord): report connected state so health-monitor can restart ... (by who96 · 2026-02-19 · 78.7%)
- #23158: discord: harden preflight/reply path against slow lookup latency (by danielstarman · 2026-02-22 · 76.8%)
- #19615: fix(discord): include default account when sub-accounts are configured (by prue-starfield · 2026-02-18 · 76.1%)
- #16801: fix: Register Discord listeners before gateway connects (by MisterGuy420 · 2026-02-15 · 75.8%)
- #17758: Fix crash on transient Discord gateway zombie connection errors (by DoyoDia · 2026-02-16 · 75.6%)
- #6302: fix: Add timeouts to prevent indefinite hangs (issues #4954, #4956,... (by batumilove · 2026-02-01 · 75.2%)
- #14259: fix(discord): add timeout to restart to prevent duplicate responses (by George5562 · 2026-02-11 · 74.6%)