
#10731: fix(discord): add outer retry loop for gateway reconnect exhaustion

by Milofax open 2026-02-06 22:20
channel: discord stale
## Summary

When a Discord gateway connection receives error code 1010 (Cloud Load Balancer) and the built-in resume/reconnect attempts are exhausted, the bot enters a permanent reconnect loop with no recovery path. This PR adds:

- An **outer retry loop** that catches gateway exhaustion and creates a completely fresh gateway connection (new identify, new session)
- A **"connection stable" log marker** emitted after 60s of healthy connection, useful for monitoring

Fixes the WebSocket 1010 disconnect → infinite reconnect loop that occurs during Discord infrastructure maintenance windows.

## Changes

- `src/discord/monitor/provider.ts`: wrap gateway lifecycle in outer retry with exponential backoff (5s → 30min cap), fresh gateway on each attempt
- `src/discord/gateway-logging.ts`: add `connection_stable` event type
- `src/discord/gateway-logging.test.ts`: update test for new event

## Test plan

- [x] Deployed and running on 11 Discord bot agents in production
- [x] All bots successfully connected after deploy
- [ ] Verify no reconnect loops during next Discord maintenance window
- [ ] Monitor for "connection stable" log marker appearing after 60s

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

- Adds an outer retry loop around the Discord gateway lifecycle to recover from Carbon reconnect exhaustion by recreating a fresh `Client`/`GatewayPlugin`.
- Introduces exponential backoff between outer retries and limits the number of outer retries before throwing to mark the channel as dead.
- Extends gateway logging to promote a new `connection stable` debug marker to info-level logs and updates the corresponding unit test.

<h3>Confidence Score: 3/5</h3>

- This PR is close to mergeable but has a shutdown-path bug that can cause unexpected throws during abort.
- The core retry-loop approach and logging changes are straightforward, but `sleepWithAbort()` throws on abort and that exception is not handled in the new outer backoff path, so clean shutdown can turn into an error exit when abort happens during backoff sleep.
- src/discord/monitor/provider.ts

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
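
For illustration, the outer retry loop and the shutdown-path concern raised in the review could be sketched roughly as below. This is not the code from `src/discord/monitor/provider.ts`. Every name here (`backoffDelayMs`, `sleepUnlessAborted`, `runWithOuterRetry`, `runGateway`, `maxAttempts`) is hypothetical, and the doubling factor is an assumption: the PR only states exponential backoff from 5s up to a 30min cap. The sketch treats an abort during the backoff sleep as a clean stop, not an error, which is one possible way to address the `sleepWithAbort()` issue.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the PR's code.
const BASE_DELAY_MS = 5_000;          // 5s first delay (from the PR description)
const MAX_DELAY_MS = 30 * 60 * 1_000; // 30min cap (from the PR description)

/** Delay before the next attempt (0-based): 5s, 10s, 20s, ... capped at 30min.
 *  Doubling is assumed; the PR only says "exponential". */
function backoffDelayMs(attempt: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS);
}

/** Sleeps for `ms`. Resolves to true after the full delay, or to false
 *  (without throwing) if the signal aborts first. */
function sleepUnlessAborted(ms: number, signal: AbortSignal): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    if (signal.aborted) {
      resolve(false);
      return;
    }
    const timer = setTimeout(() => {
      signal.removeEventListener("abort", onAbort);
      resolve(true);
    }, ms);
    function onAbort(): void {
      clearTimeout(timer);
      resolve(false);
    }
    signal.addEventListener("abort", onAbort, { once: true });
  });
}

/**
 * Runs `runGateway` (one fresh gateway lifecycle per call) until it ends
 * cleanly, the signal aborts, or `maxAttempts` lifecycles have failed.
 * Resolves on a clean stop or abort; throws only when attempts run out.
 */
async function runWithOuterRetry(
  runGateway: (signal: AbortSignal) => Promise<void>,
  signal: AbortSignal,
  maxAttempts: number,
  sleep: (ms: number, signal: AbortSignal) => Promise<boolean> = sleepUnlessAborted,
): Promise<void> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (signal.aborted) return;
    try {
      await runGateway(signal);
      return; // lifecycle ended without error
    } catch {
      // A failure caused by shutdown is not an error condition.
      if (signal.aborted) return;
    }
    if (attempt === maxAttempts - 1) break; // no point sleeping after the last try
    const sleptFully = await sleep(backoffDelayMs(attempt), signal);
    if (!sleptFully) return; // aborted during backoff: exit cleanly
  }
  throw new Error(`gateway still failing after ${maxAttempts} attempts`);
}
```

The `sleep` parameter is injectable so the loop can be exercised without real delays. A caller would pass a `runGateway` that builds a new client on each call, so that every attempt starts with a new identify and a new session, as the PR describes.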
