#19243: fix(announce-queue): cap per-item send retries to prevent infinite loop
agents
size: XS
Cluster: Subagent Enhancements and Features
## Summary
Adds a per-item send retry cap (5 attempts) to the announce queue drain loop, preventing infinite retry loops when `sendAnnounce` consistently fails.
## Problem
A `sessions_spawn` subagent completion notification kept firing every ~5-10 seconds **indefinitely** (8+ hours in the reported case), wasting significant tokens on the parent session. See #19197.
## Root Cause
The announce queue drain loop in `subagent-announce-queue.ts` has **no per-item retry limit**. The flow:
1. Subagent completes → `runSubagentAnnounceFlow` runs
2. `maybeQueueSubagentAnnounce` returns `"queued"` → the registry considers the announce **delivered** (`didAnnounce = true`)
3. `finalizeSubagentCleanup` marks cleanup complete
4. The queue's `scheduleAnnounceDrain` tries to send via `sendAnnounce` → `callGateway({method: "agent"})`
5. If `callGateway` throws (timeout, connection error, etc.), the `catch` block keeps the item in the queue and reschedules the drain
6. **Loop repeats indefinitely**: the registry-level retry cap (3 attempts) never applies, because the registry already finalized at step 3
The existing deterministic idempotency keys (`announceId`) prevent duplicate agent turns at the gateway level, but each timed-out retry still costs a `callGateway` round-trip and debounce cycle.
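For illustration only, the failure mode in steps 4 to 6 can be modeled with a small synchronous simulation. This is a sketch, not the code in `subagent-announce-queue.ts`: `QueuedAnnounce`, `simulateUncappedDrain`, and the `cycles` parameter are hypothetical names, and the real drain is asynchronous and debounced. At one attempt every 5-10 seconds, the 8 hours reported in #19197 works out to roughly 2,880 to 5,760 failed sends.

```typescript
// Hypothetical, simplified item shape; the real queue item carries more fields.
interface QueuedAnnounce {
  announceId: string;
}

// Runs up to `cycles` drain passes with no per-item cap and returns the
// number of failed sends. `send` stands in for sendAnnounce/callGateway.
function simulateUncappedDrain(
  items: QueuedAnnounce[],
  send: (item: QueuedAnnounce) => void,
  cycles: number,
): number {
  let failedSends = 0;
  for (let i = 0; i < cycles; i++) {
    const head = items[0];
    if (head === undefined) break; // queue drained
    try {
      send(head);
      items.shift(); // success: the item leaves the queue
    } catch {
      // Failure: the item stays at the head and the drain is rescheduled.
      // Nothing here ever gives up, so a permanent failure never terminates.
      failedSends++;
    }
  }
  return failedSends;
}
```

With a sender that always throws, the item is still at the head of the queue after any number of passes, which is the behavior the per-item cap is meant to stop.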
## Fix
Add a `_sendAttempts` counter to `AnnounceQueueItem`. In the catch block of `scheduleAnnounceDrain`:
- Increment the counter on each failure
- When `_sendAttempts >= MAX_SEND_ATTEMPTS_PER_ITEM` (5), drop the item with an error log
- Otherwise, retry with debounce (existing behavior) but now with attempt tracking in the log
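The bullets above can be sketched as a small helper that models the catch-block bookkeeping. This is a minimal sketch under stated assumptions, not the actual diff: the `AnnounceQueueItem` shape is reduced to two fields, and `recordSendFailure`, `FailureOutcome`, and the log wording are hypothetical. Only `_sendAttempts` and `MAX_SEND_ATTEMPTS_PER_ITEM = 5` come from the PR description.

```typescript
// Simplified, assumed shape; the real AnnounceQueueItem has more fields.
interface AnnounceQueueItem {
  announceId: string;
  _sendAttempts?: number; // per-item failure counter added by this PR
}

const MAX_SEND_ATTEMPTS_PER_ITEM = 5;

type FailureOutcome = "retry" | "dropped" | "empty";

// Call after a failed send of the head item. Increments the head item's
// counter and either leaves it queued for a debounced retry or drops it.
function recordSendFailure(
  items: AnnounceQueueItem[],
  log: (message: string) => void = console.error,
): FailureOutcome {
  const head = items[0];
  if (head === undefined) return "empty";

  const attempts = (head._sendAttempts ?? 0) + 1;
  head._sendAttempts = attempts;

  if (attempts >= MAX_SEND_ATTEMPTS_PER_ITEM) {
    items.shift(); // give up on this item so the rest of the queue can drain
    log(`announce ${head.announceId} dropped after ${attempts} failed send attempts`);
    return "dropped";
  }

  log(
    `announce ${head.announceId} send failed ` +
      `(attempt ${attempts}/${MAX_SEND_ATTEMPTS_PER_ITEM}); retrying after debounce`,
  );
  return "retry";
}
```

Keeping the counter on the item rather than on the queue means a fresh item starts from zero even if the item before it was dropped.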
## Test Results
All 3 existing announce queue tests pass. Error messages now include attempt counts for observability.
Fixes #19197
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds a per-item send retry cap (`MAX_SEND_ATTEMPTS_PER_ITEM = 5`) to the announce queue drain loop in `subagent-announce-queue.ts`, preventing the infinite retry loop described in #19197 where `sendAnnounce` failures caused 8+ hours of continuous retries.
- Adds a `_sendAttempts` counter to `AnnounceQueueItem`, incremented in the catch block on each failure
- Items are dropped with an error log once the retry limit is reached; otherwise retry continues with debounce (existing behavior)
- Error messages now include attempt counts for improved observability
- Complements the existing registry-level retry cap (3 attempts) which was bypassed once the queue marked announce as delivered
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge: it adds a bounded retry cap to an existing unbounded retry loop, with correct logic and minimal blast radius.
- The change is small, focused, and addresses a well-documented production issue (#19197). The retry counter correctly tracks `queue.items[0]` across all send paths (followup, collect, summary modes). The `finally` block correctly re-schedules the drain for remaining items after dropping one. The score is 4 rather than 5 for two reasons: the existing tests don't cover the new cap behavior (they only verify single-failure retry), and in collect mode N items could result in up to N*5 total attempts before all are dropped, which is bounded but potentially larger than expected.
- No files require special attention — the single changed file has straightforward logic.
<sub>Last reviewed commit: e74ddb1</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->