
#16944: fix: retry transient WebSocket 1006 closures in callGateway + announce flow

by sudobot99 open 2026-02-15 08:12
gateway agents stale size: XS
## Summary

Adds automatic retry with exponential backoff for transient WebSocket 1006 (abnormal closure) errors in `callGateway()`. Previously, a single 1006 failure permanently lost subagent announce messages with no recovery path.

## Problem

Long-running subagent sessions (10-30 min coding CLIs) outlive the gateway's WebSocket connection, which gets silently reaped. When the subagent completes and tries to announce results back to the parent session, the dead connection causes an unrecoverable failure:

```
Subagent announce failed: Error: gateway closed (1006 abnormal closure (no close frame)): no close reason
```

This was observed 5+ times in a single day across subagent announces and tool calls. See #16937 for full details.

## Changes

### `src/gateway/call.ts`

- `callGateway()` now wraps `callGatewayOnce()` with retry logic
- New `retries` option (default: 2), configurable per call site
- Exponential backoff: 1s, 2s, 4s between retries
- Only retries on code 1006 (abnormal closure); auth failures, timeouts, and normal closures throw immediately

### `src/agents/subagent-announce.ts`

- `sendAnnounce()`: `retries: 3` (announce queue delivery)
- `runSubagentAnnounceFlow()` direct announce: `retries: 3` (critical path)

## Why These Defaults

- `callGateway` default retries=2: safe for general use, adds ~3s worst case
- Announce paths retries=3: announce delivery is user-visible; losing a completion notification is worse than a brief retry delay
- The `idempotencyKey` on announce calls ensures retried deliveries are deduplicated by the gateway

Fixes #16937

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Added automatic retry with exponential backoff for WebSocket 1006 (abnormal closure) errors in `callGateway()`. The implementation wraps the existing `callGatewayOnce()` function with retry logic that defaults to 2 retries with 1s, 2s, 4s delays. Two critical announce paths in `subagent-announce.ts` use `retries: 3` to ensure completion notifications are delivered reliably.

Key changes:

- New retry wrapper around `callGateway()` with configurable `retries` option
- Only retries on code 1006; other errors (auth, timeout, normal close) fail immediately
- Parses close codes from error messages via regex on `formatCloseError` output
- Subagent announce delivery uses `retries: 3` for critical paths
- Idempotency keys on announce calls ensure deduplicated delivery

**Note**: Two existing tests in `call.test.ts` are likely affected by this change:

1. Line 297: `"does not overflow very large timeout values"` uses fake timers and triggers `onClose(1006, "")`. It will hang unless the test advances timers for the retry delays or passes `retries: 0`.
2. Line 251: `"includes connection details when the gateway closes"` with `closeCode = 1006`. It will take ~3 seconds of real time to exhaust retries unless `retries: 0` is passed.

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge with minimal risk once test compatibility is addressed
- The retry logic is well-scoped to transient 1006 errors only. Idempotency keys prevent duplicate delivery. The main concerns are test compatibility (fake timers will hang, real timers add 3s delay) and the fragile regex parsing of close codes from error messages. The exponential backoff could theoretically overflow but only at very high retry counts. Default values are reasonable for production use.
- The test file `call.test.ts` needs updates to handle retry behavior (pass `retries: 0` or advance fake timers appropriately)

<sub>Last reviewed commit: d744fc1</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
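For readers without the diff at hand, the retry behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code from `src/gateway/call.ts`. The names `withRetry`, `parseCloseCode`, `isTransientClose`, `baseDelayMs`, and the injectable `sleep` are all hypothetical. The regex is an assumption based only on the error string quoted in the Problem section.

```typescript
// Hypothetical sketch of "retry only on close code 1006, with exponential backoff".
// None of these names are taken from the PR; they exist only for illustration.

type Sleep = (ms: number) => Promise<void>;

interface RetryOptions {
  retries?: number;     // retries after the first attempt (the PR's stated default is 2)
  baseDelayMs?: number; // first backoff delay (the PR describes 1s, then 2s, then 4s)
  sleep?: Sleep;        // injectable delay (an assumption) so tests need not wait in real time
}

// Assumed message shape: "gateway closed (1006 abnormal closure (no close frame)): ..."
const CLOSE_CODE_RE = /gateway closed \((\d{3,4})\b/;

function parseCloseCode(err: unknown): number | undefined {
  const message = err instanceof Error ? err.message : String(err);
  const match = CLOSE_CODE_RE.exec(message);
  return match ? Number(match[1]) : undefined;
}

function isTransientClose(err: unknown): boolean {
  // Only abnormal closure (1006) counts as transient; everything else fails fast.
  return parseCloseCode(err) === 1006;
}

async function withRetry<T>(
  callOnce: () => Promise<T>,
  opts: RetryOptions = {},
): Promise<T> {
  const retries = opts.retries ?? 2;
  const baseDelayMs = opts.baseDelayMs ?? 1000;
  const sleep: Sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; ; attempt++) {
    try {
      return await callOnce();
    } catch (err) {
      // Rethrow when retries are exhausted or the error is not a 1006 closure.
      if (attempt >= retries || !isTransientClose(err)) throw err;
      // Backoff doubles per attempt: baseDelayMs, 2x, 4x, ...
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

Under these assumptions, `retries: 2` waits 1s and then 2s before giving up, which matches the "~3s worst case" figure. `retries: 3` adds one more 4s wait. Passing `retries: 0` disables the wrapper entirely, which is the simpler of the two fixes suggested for the affected tests. Retrying is only safe for calls the gateway can deduplicate, which is why the announce paths rely on `idempotencyKey`.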
