#19243: fix(announce-queue): cap per-item send retries to prevent infinite loop

by taw0002 open 2026-02-17 15:18
agents size: XS
## Summary

Adds a per-item send retry cap (5 attempts) to the announce queue drain loop, preventing infinite retry loops when `sendAnnounce` consistently fails.

## Problem

A `sessions_spawn` subagent completion notification kept firing every ~5-10 seconds **indefinitely** (8+ hours in the reported case), wasting significant tokens on the parent session. See #19197.

## Root Cause

The announce queue drain loop in `subagent-announce-queue.ts` has **no per-item retry limit**. The flow:

1. Subagent completes → `runSubagentAnnounceFlow` runs
2. `maybeQueueSubagentAnnounce` returns `"queued"` → the registry considers announce **delivered** (`didAnnounce = true`)
3. `finalizeSubagentCleanup` marks cleanup complete
4. The queue's `scheduleAnnounceDrain` tries to send via `sendAnnounce` → `callGateway({method: "agent"})`
5. If `callGateway` throws (timeout, connection error, etc.), the `catch` block keeps the item in queue and reschedules the drain
6. **Loop repeats indefinitely**: the registry-level retry cap (3 attempts) is bypassed because the registry already finalized at step 3

The existing deterministic idempotency keys (`announceId`) prevent duplicate agent turns at the gateway level, but each timed-out retry still costs a `callGateway` round-trip and debounce cycle.

## Fix

Add a `_sendAttempts` counter to `AnnounceQueueItem`. In the catch block of `scheduleAnnounceDrain`:

- Increment the counter on each failure
- When `_sendAttempts >= MAX_SEND_ATTEMPTS_PER_ITEM` (5), drop the item with an error log
- Otherwise, retry with debounce (existing behavior) but now with attempt tracking in the log

## Test Results

All 3 existing announce queue tests pass. Error messages now include attempt counts for observability.
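The Fix section above can be sketched roughly as follows. This is a minimal illustration, not the code in `subagent-announce-queue.ts`: the names `AnnounceQueueItem`, `_sendAttempts`, and `MAX_SEND_ATTEMPTS_PER_ITEM` come from the PR description, while `recordSendFailure`, `drainSync`, the item shape, and the log messages are hypothetical. The real drain is presumably asynchronous and debounced; a synchronous loop stands in for it here so the cap is easy to follow.

```typescript
// Hypothetical, simplified item shape; the real AnnounceQueueItem has more fields.
interface AnnounceQueueItem {
  announceId: string;
  _sendAttempts?: number; // failed send attempts so far for this item
}

const MAX_SEND_ATTEMPTS_PER_ITEM = 5;

type FailureOutcome = "retry" | "dropped";

// Catch-block logic: count the failure against the head item and drop the
// item once it has reached the cap. (Hypothetical helper name.)
function recordSendFailure(
  items: AnnounceQueueItem[],
  err: unknown,
  log: (msg: string) => void,
): FailureOutcome {
  const head = items[0];
  if (head === undefined) return "dropped"; // nothing left to retry

  head._sendAttempts = (head._sendAttempts ?? 0) + 1;

  if (head._sendAttempts >= MAX_SEND_ATTEMPTS_PER_ITEM) {
    items.shift(); // give up on this item so the rest of the queue can drain
    log(
      `announce ${head.announceId} dropped after ${head._sendAttempts} ` +
        `failed sends: ${String(err)}`,
    );
    return "dropped";
  }

  log(
    `announce ${head.announceId} send failed ` +
      `(attempt ${head._sendAttempts}/${MAX_SEND_ATTEMPTS_PER_ITEM}); will retry`,
  );
  return "retry";
}

// Synchronous stand-in for the drain loop (assumption: no debounce, no timers).
// Returns the number of send calls made before the queue is empty.
function drainSync(
  items: AnnounceQueueItem[],
  send: (item: AnnounceQueueItem) => void,
  log: (msg: string) => void,
): number {
  let calls = 0;
  while (items.length > 0) {
    const head = items[0];
    if (head === undefined) break;
    try {
      calls += 1;
      send(head);
      items.shift(); // delivered; move on to the next item
    } catch (err) {
      recordSendFailure(items, err, log);
    }
  }
  return calls;
}
```

In this sketch the loop always terminates: every failure either leaves the head item one step closer to the cap or removes it, so a queue of N items that never send successfully costs at most N × 5 send calls.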
Fixes #19197

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Adds a per-item send retry cap (`MAX_SEND_ATTEMPTS_PER_ITEM = 5`) to the announce queue drain loop in `subagent-announce-queue.ts`, preventing the infinite retry loop described in #19197 where `sendAnnounce` failures caused 8+ hours of continuous retries.

- Adds a `_sendAttempts` counter to `AnnounceQueueItem`, incremented in the catch block on each failure
- Items are dropped with an error log once the retry limit is reached; otherwise retry continues with debounce (existing behavior)
- Error messages now include attempt counts for improved observability
- Complements the existing registry-level retry cap (3 attempts) which was bypassed once the queue marked announce as delivered

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge: it adds a bounded retry cap to an existing unbounded retry loop, with correct logic and minimal blast radius.
- The change is small, focused, and addresses a well-documented production issue (#19197). The retry counter correctly tracks `queue.items[0]` across all send paths (followup, collect, summary modes). The `finally` block correctly re-schedules drain for remaining items after dropping one. Score is 4 rather than 5 because the existing tests don't cover the new cap behavior (they only verify single-failure retry), and in collect mode, N items could result in up to N*5 total attempts before all are dropped, which is bounded but potentially larger than expected.
- No files require special attention; the single changed file has straightforward logic.

<sub>Last reviewed commit: e74ddb1</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
