
#17001: fix: retry sub-agent announcements with backoff instead of silently dropping on timeout

by luisecab open 2026-02-15 09:29
agents size: S
## Summary

Sub-agent announcement delivery could be dropped on transient gateway failures (timeouts / closed connection). This PR preserves the retry-with-backoff behavior while rebasing onto latest `main`.

Closes #17000

## What this PR adds (unique value)

- Retry announce delivery with exponential backoff (2s → 4s → 8s)
- Retry only retryable errors (timeout / gateway closed)
- Keep non-retryable failures immediate
- Keep configurable announce timeout via `agents.defaults.subagents.announceTimeoutMs` (5s–300s, default 30s)

## Rebase alignment with main

- Reused `src/agents/announce-idempotency.ts` (no duplicate idempotency implementation)
- Kept deterministic announce idempotency keys for both queue and direct paths
- Clarified `expectFinal` handling comment in direct announce path (left unset so retries confirm accept/dedupe instead of waiting for terminal run completion)

## Files changed

- `src/agents/subagent-announce.ts`
- `src/config/zod-schema.agent-defaults.ts`
- `src/config/types.agent-defaults.ts`
- `src/agents/subagent-announce-queue.ts`

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR adds retry-with-exponential-backoff for sub-agent announcement delivery to handle transient gateway failures (timeouts, abnormal WebSocket closures, connection resets) instead of silently dropping announcements.

- Introduces `callGatewayWithRetry` wrapper with up to 3 retries and exponential backoff (2s → 4s → 8s), applied to both queued and direct announce delivery paths
- Adds configurable `announceTimeoutMs` setting (5s–300s, default 30s) via `agents.defaults.subagents.announceTimeoutMs`, replacing the previous hardcoded 15s timeout
- Narrows retry classification to exclude normal WebSocket closures (code 1000) using negative lookahead regex
- Removes `expectFinal: true` from the direct announce path so retries only confirm accept/dedupe rather than waiting for terminal run completion
- Correctly reuses deterministic idempotency keys across retries, ensuring gateway-level deduplication works as intended

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge: the retry logic is well-bounded, idempotent, and only targets transient failures.
- Score of 4 reflects: clean retry implementation with proper bounds and exponential backoff, correct idempotency key reuse across retries, appropriate error classification with the narrowed regex, consistent type/schema additions. The only minor concern is the lack of unit tests for the new retry logic, though the existing integration test coverage and the defensive coding style mitigate risk. The timeout increase from 15s to 30s default is intentional and documented.
- No files require special attention. The core logic in `src/agents/subagent-announce.ts` was reviewed thoroughly and the retry wrapper is straightforward.

<sub>Last reviewed commit: 69897a0</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
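For readers who want a concrete picture of the behavior described above, here is a minimal TypeScript sketch of bounded retry with exponential backoff and a narrowed retryable-error check. It is not the code from `src/agents/subagent-announce.ts`: the names `isRetryableAnnounceError`, `backoffDelayMs`, and `callWithRetrySketch`, the error-message formats, and the regex itself are all assumptions made for illustration. Only the 2s → 4s → 8s schedule, the limit of 3 retries, and the exclusion of close code 1000 come from the PR description.

```typescript
// Hypothetical pattern: retry on timeouts, connection resets, and gateway
// closures, except a normal WebSocket close (code 1000). The message format
// "gateway closed (<code>)" is an assumption, not the project's real format.
const RETRYABLE_PATTERN = /timeout|timed out|econnreset|gateway closed(?! \(1000\))/i;

function isRetryableAnnounceError(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return RETRYABLE_PATTERN.test(message);
}

// Delay before retry number `attempt` (0-based): 2000, 4000, 8000 ms by default.
function backoffDelayMs(attempt: number, baseDelayMs = 2000): number {
  return baseDelayMs * 2 ** attempt;
}

interface RetrySketchOptions {
  maxRetries?: number; // retries after the first attempt (default 3)
  baseDelayMs?: number; // first backoff delay (default 2000 ms)
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

async function callWithRetrySketch<T>(
  call: () => Promise<T>,
  opts: RetrySketchOptions = {},
): Promise<T> {
  const maxRetries = opts.maxRetries ?? 3;
  const baseDelayMs = opts.baseDelayMs ?? 2000;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));

  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Non-retryable errors and exhausted retries surface immediately.
      if (attempt >= maxRetries || !isRetryableAnnounceError(err)) {
        throw err;
      }
      await sleep(backoffDelayMs(attempt, baseDelayMs));
    }
  }
}
```

In this sketch the caller would build the request, including its deterministic idempotency key, once and outside the `call` closure, so every retry resends the same key and the gateway can deduplicate an announcement that was accepted before the timeout fired.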
