
#18205: fix (agents): add periodic retry timer for failed subagent announces

by MegaPhoenix92 · open · 2026-02-16 16:12
agents stale size: S
## Summary

When `runSubagentAnnounceFlow` fails or defers (child still active, empty reply, gateway call failure), the announce is silently lost until the next gateway restart. There is currently no mechanism to retry these failed announces at runtime.

This adds a periodic retry timer with bounded exponential backoff that catches all silent announce failure modes:

- **Backoff schedule:** 30s → 60s → 120s → 240s → 480s cap, max 10 retries
- **Bypasses `resumedRuns` dedupe** by calling `beginSubagentCleanup` directly (the `resumedRuns` Set prevents `resumeSubagentRun` from re-entering for the same runId)
- **Auto-lifecycle:** timer starts when a failed announce is detected (in `finalizeSubagentCleanup`) or at boot when retryable entries exist (in `restoreSubagentRunsOnce`), stops when no retryable entries remain
- **New fields on `SubagentRunRecord`:** `announceRetryCount`, `lastAnnounceRetryAt` (persisted to disk)

### Failure modes this addresses

1. `beginSubagentCleanup` race between lifecycle listener and `agent.wait` paths: both can fire, second one is gated, but neither retries on failure
2. `runSubagentAnnounceFlow` defers when child is still active or reply is empty, with no follow-up scheduled
3. Direct `callGateway("agent")` failures caught silently in the announce flow
4. `ACTIVE_EMBEDDED_RUNS` is a process-local Map: if the run completed while gateway was down, no event fires

## Test plan

- [x] Existing 6 persistence E2E tests still pass
- [x] New test: "periodic retry picks up failed announces without gateway restart": seeds a failed run on disk, verifies boot-time resume fails, then verifies a second init cycle succeeds
- [x] Announce queue unit tests pass
- [x] TypeScript type-check clean (no new errors)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Adds a periodic retry timer with bounded exponential backoff (30s → 480s cap, max 10 retries) to catch silent subagent announce failures at runtime, rather than requiring a full gateway restart. The timer integrates with the existing cleanup lifecycle: it starts when `finalizeSubagentCleanup` detects a failed announce or at boot when retryable entries exist, and stops when no retryable entries remain. Two new fields (`announceRetryCount`, `lastAnnounceRetryAt`) are persisted to the `SubagentRunRecord`.

- `isRetryableEntry` is missing a check for `suppressAnnounceReason`, which could cause the retry timer to announce steer-restart-suppressed entries, bypassing the existing suppression guards in `resumeSubagentRun` and the lifecycle listener.
- The new test titled "periodic retry picks up failed announces without gateway restart" actually tests a full registry restart cycle (the same pattern as the existing retry test), not the timer-driven retry path. Using `vi.useFakeTimers()` would validate the actual `setInterval` mechanism.

<h3>Confidence Score: 3/5</h3>

- The retry mechanism is sound overall but has a logic gap that could announce suppressed entries during steer-restart flows.
- The core implementation is well-structured with proper backoff, lifecycle management, and integration with existing cleanup flows. However, the missing `suppressAnnounceReason` check in `isRetryableEntry` is a logic bug that could cause incorrect announces during steer-restart scenarios. The new test also doesn't exercise the actual timer-driven path it claims to test. These issues are fixable but should be addressed before merge.
- Pay close attention to `src/agents/subagent-registry.ts`, specifically the `isRetryableEntry` function which gates the retry logic.

<sub>Last reviewed commit: 56fba24</sub>

<!-- /greptile_comment -->
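For reference, the backoff schedule and the retry gate discussed in the review can be sketched as follows. This is a minimal illustration, not the code in this PR: the constant names, the `RunRecordSketch` shape, and the helpers `retryDelayMs` and `isDueForRetry` are hypothetical, and the real `isRetryableEntry` in `src/agents/subagent-registry.ts` may check additional state. The sketch assumes the delay doubles per recorded retry and that an entry with no recorded retry is due on the next timer tick.

```typescript
// Hypothetical sketch: names and record shape are assumptions, not the PR's code.
const BASE_DELAY_MS = 30_000; // first backoff step: 30s
const MAX_DELAY_MS = 480_000; // backoff is capped at 480s
const MAX_RETRIES = 10; // give up after 10 retries

// Only the fields relevant to the retry decision; the real
// SubagentRunRecord carries more than this.
interface RunRecordSketch {
  announceRetryCount?: number;
  lastAnnounceRetryAt?: number; // epoch milliseconds
  suppressAnnounceReason?: string;
}

// 30s -> 60s -> 120s -> 240s -> 480s, then stays at the cap.
function retryDelayMs(retryCount: number): number {
  return Math.min(BASE_DELAY_MS * 2 ** retryCount, MAX_DELAY_MS);
}

// Decides whether the periodic timer should retry this entry now.
function isDueForRetry(entry: RunRecordSketch, now: number): boolean {
  // The check the review flags as missing: suppressed entries must
  // never be announced by the retry timer.
  if (entry.suppressAnnounceReason) return false;

  const count = entry.announceRetryCount ?? 0;
  if (count >= MAX_RETRIES) return false;

  // Assumption: with no recorded retry, the entry is due on the next tick.
  if (entry.lastAnnounceRetryAt === undefined) return true;

  return now - entry.lastAnnounceRetryAt >= retryDelayMs(count);
}
```

Keeping the decision in a pure function of the record and the current time means it can be unit-tested without timers, while the `setInterval` wiring itself would still need a fake-timer test such as the one the review suggests.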
