#16944: fix: retry transient WebSocket 1006 closures in callGateway + announce flow
gateway
agents
stale
size: XS
Cluster:
Subagent Enhancements and Features
## Summary
Adds automatic retry with exponential backoff for transient WebSocket 1006 (abnormal closure) errors in `callGateway()`. Previously, a single 1006 failure permanently lost subagent announce messages with no recovery path.
## Problem
Long-running subagent sessions (10-30 min coding CLIs) outlive the gateway's WebSocket connection, which gets silently reaped. When the subagent completes and tries to announce results back to the parent session, the dead connection causes an unrecoverable failure:
```
Subagent announce failed: Error: gateway closed (1006 abnormal closure (no close frame)): no close reason
```
This was observed 5+ times in a single day across subagent announces and tool calls. See #16937 for full details.
## Changes
### `src/gateway/call.ts`
- `callGateway()` now wraps `callGatewayOnce()` with retry logic
- New `retries` option (default: 2) — configurable per call site
- Exponential backoff: 1s, 2s, 4s between retries
- Only retries on code 1006 (abnormal closure); auth failures, timeouts, and normal closures throw immediately
### `src/agents/subagent-announce.ts`
- `sendAnnounce()`: `retries: 3` (announce queue delivery)
- `runSubagentAnnounceFlow()` direct announce: `retries: 3` (critical path)
## Why These Defaults
- `callGateway` default retries=2: safe for general use, adds ~3s worst case
- Announce paths retries=3: announce delivery is user-visible; losing a completion notification is worse than a brief retry delay
- The `idempotencyKey` on announce calls ensures retried deliveries are deduplicated by the gateway
Fixes #16937
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Added automatic retry with exponential backoff for WebSocket 1006 (abnormal closure) errors in `callGateway()`. The implementation wraps the existing `callGatewayOnce()` function with retry logic that defaults to 2 retries with 1s, 2s, 4s delays. Two critical announce paths in `subagent-announce.ts` use `retries: 3` to ensure completion notifications are delivered reliably.
Key changes:
- New retry wrapper around `callGateway()` with configurable `retries` option
- Only retries on code 1006; other errors (auth, timeout, normal close) fail immediately
- Parses close codes from error messages via regex on `formatCloseError` output
- Subagent announce delivery uses `retries: 3` for critical paths
- Idempotency keys on announce calls ensure deduplicated delivery
**Note**: Two existing tests in `call.test.ts` are likely affected by this change:
1. Line 297: `"does not overflow very large timeout values"` uses fake timers and triggers `onClose(1006, "")` - will hang without advancing timers for retry delays or passing `retries: 0`
2. Line 251: `"includes connection details when the gateway closes"` with `closeCode = 1006` - will take ~3 seconds real time to exhaust retries unless `retries: 0` is passed
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge with minimal risk once test compatibility is addressed
- The retry logic is well-scoped to transient 1006 errors only. Idempotency keys prevent duplicate delivery. The main concerns are test compatibility (fake timers will hang, real timers add 3s delay) and the fragile regex parsing of close codes from error messages. The exponential backoff could theoretically overflow but only at very high retry counts. Default values are reasonable for production use.
- The test file `call.test.ts` needs updates to handle retry behavior (pass `retries: 0` or advance fake timers appropriately)
<sub>Last reviewed commit: d744fc1</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#17001: fix: retry sub-agent announcements with backoff instead of silently...
by luisecab · 2026-02-15
87.0%
#17028: fix(subagent): retry announce on timeout
by Limitless2023 · 2026-02-15
83.3%
#20328: fix(agents): Add retry with exponential backoff for subagent announ...
by tiny-ship-it · 2026-02-18
82.6%
#18205: fix (agents): add periodic retry timer for failed subagent announces
by MegaPhoenix92 · 2026-02-16
82.1%
#13105: fix: debounce subagent lifecycle events to prevent premature announ...
by mcaxtr · 2026-02-10
77.5%
#17721: fix: abort child run on subagent timeout + retry with backoff + sta...
by IrriVisionTechnologies · 2026-02-16
77.1%
#22719: fix(agents): make subagent announce timeout configurable (restore 6...
by Valadon · 2026-02-21
76.8%
#19243: fix(announce-queue): cap per-item send retries to prevent infinite ...
by taw0002 · 2026-02-17
75.2%
#16239: fix: retry on transient API errors (overloaded, rate-limit, timeout)
by zerone0x · 2026-02-14
75.1%
#6466: fix(gateway): add handshake timeout and connection error handling
by jarvis-raven · 2026-02-01
75.1%