#20328: fix(agents): Add retry with exponential backoff for subagent announce delivery

by tiny-ship-it open 2026-02-18 20:03

cli agents size: XL

Cluster: Subagent Enhancements and Features

# fix(agents): Add retry with exponential backoff for subagent announce delivery

Fixes #17000

## Problem

Subagent completion announcements are silently dropped when the gateway lane times out. The current implementation uses a hardcoded timeout with no retry mechanism, causing:

- **Silent data loss**: Subagent results are lost when the main session is busy
- **No visibility**: Users have no indication that delivery failed
- **No recovery path**: Failed announcements are permanently lost

### Evidence from Issue #17000

```
[ERROR] gateway timeout after 60000ms for method: agent
Subagent completion direct announce failed for run abc123: gateway timeout after 60000ms
```

The subagent completes successfully, but the user never receives the result.

## Solution

### 1. Configurable Timeout

Added `agents.defaults.subagents.announceTimeoutMs` configuration with sensible defaults:

```yaml
agents:
  defaults:
    subagents:
      announceTimeoutMs: 120000  # 2 minutes (default)
```

Per-agent override supported:

```yaml
agents:
  list:
    - id: worker
      subagents:
        announceTimeoutMs: 60000  # 1 minute for this agent
```

### 2. Retry with Exponential Backoff

Three retries with increasing timeouts:

| Attempt | Timeout | Notes |
|---------|---------|-------|
| 1 | 60s | Base timeout (configurable) |
| 2 | 120s | 2× base |
| 3 | 240s | 4× base |

Each attempt is logged for observability:

```
[announce retry 1/3] for subagent abc123 (timeout: 60000ms)
[announce retry 2/3] for subagent abc123 (timeout: 120000ms)
[announce retry 3/3] for subagent abc123 (timeout: 240000ms)
```

### 3. Persist Failed Announcements

After all retries exhausted, failed announcements are persisted to disk for recovery:

**Storage location:** `~/.openclaw/announce-failed/<sessionId>.json`

**Payload includes:**

- `sessionId`: Unique identifier for the subagent session
- `timestamp`: When the failure occurred
- `task`: Original task description
- `result`: Subagent output (preserved for recovery)
- `attempts`: Number of delivery attempts
- `lastError`: Final error message

### 4. Surface Failures Visibly

**Error logging with recovery instructions:**

```
[announce FAILED] Subagent completion announcement failed after 3 attempts
Session ID: abc123
Session Key: agent:worker:subagent:xyz
Last Error: gateway timeout after 240000ms
Recovery: Run "openclaw subagents recover abc123" to retry delivery
Or use "/subagents log abc123" to view the results
```

**System notification:** On final failure, a system message is injected:

```
⚠️ Sub-agent completed but delivery failed — use /subagents log abc123 to view results
```

**CLI commands:**

```bash
# List all failed announcements
openclaw subagents list-failed

# Retry delivery for a specific session
openclaw subagents recover abc123

# Delete a failed record without retrying
openclaw subagents recover abc123 --delete
```

## Testing

### Unit Tests (`subagent-announce-retry.test.ts`)

- ✅ `calculateRetryTimeout`: Verifies exponential backoff calculation
- ✅ `persistFailedAnnounce`: Tests file creation and payload serialization
- ✅ `loadFailedAnnounce`: Tests loading persisted payloads
- ✅ `listFailedAnnounces`: Tests listing and sorting by timestamp
- ✅ `removeFailedAnnounce`: Tests cleanup after recovery
- ✅ `withAnnounceRetry`: Tests retry logic with mocked failures

### Integration Test Plan

1. **Mock gateway timeout**: Configure `callGateway` to timeout on first 2 attempts
2. **Spawn subagent**: Start a subagent task during simulated high load
3. **Verify retry**: Check logs show 3 retry attempts with correct timeouts
4. **Verify persistence**: After failure, check `.openclaw/announce-failed/` contains payload
5. **Test recovery**: Run `openclaw subagents recover <sessionId>` and verify delivery

### Manual Testing

```bash
# 1. Spawn a subagent (in chat)
/spawn task="sleep 5 && echo done"

# 2. Block the main session lane (simulated)
# ... (trigger high load or network issues)

# 3. Wait for retries (check logs)
tail -f ~/.openclaw/openclaw-*.log | grep "announce retry"

# 4. After failure, verify persistence
ls ~/.openclaw/announce-failed/

# 5. Recover
openclaw subagents recover <sessionId>
```

## Migration Notes

### Configuration Changes

No breaking changes. New optional config keys:

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `agents.defaults.subagents.announceTimeoutMs` | number | 120000 | Base timeout (ms) |
| `agents.list[].subagents.announceTimeoutMs` | number | (inherit) | Per-agent override |

### Upgrade Path

1. Existing users: No action required. Default behavior improves reliability automatically.
2. Custom timeouts: Add config if you need shorter/longer timeouts.
3. Recovery: After upgrade, any previously-lost announcements cannot be recovered (only future failures are persisted).
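
For reference, the retry schedule from Solution section 2 can be sketched as below. This is a minimal illustration, not the PR's implementation: `retryTimeoutMs`, `retryWithBackoff`, `worstCaseTotalMs`, and the `RetryOutcome` type are hypothetical names, and the sketch assumes the send operation rejects once its timeout elapses.

```typescript
// Hypothetical sketch: names and signatures are illustrative, not the PR's actual API.

/** Timeout for a 1-based attempt number: base, 2x base, 4x base, ... */
function retryTimeoutMs(baseMs: number, attempt: number): number {
  return baseMs * 2 ** (attempt - 1);
}

/** Outcome of a retry run: the value on success, or the last error on failure. */
type RetryOutcome<T> =
  | { ok: true; value: T; attempts: number }
  | { ok: false; lastError: string; attempts: number };

/**
 * Calls `send` up to `maxAttempts` times, doubling the timeout on each attempt.
 * Assumption: `send` rejects when the timeout it is given elapses.
 */
async function retryWithBackoff<T>(
  send: (timeoutMs: number) => Promise<T>,
  baseMs: number,
  maxAttempts: number,
): Promise<RetryOutcome<T>> {
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const timeoutMs = retryTimeoutMs(baseMs, attempt);
    try {
      const value = await send(timeoutMs);
      return { ok: true, value, attempts: attempt };
    } catch (err) {
      // Remember the failure and fall through to the next, longer attempt.
      lastError = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, lastError, attempts: maxAttempts };
}

/** Worst-case total wait: every attempt runs to its full timeout. */
function worstCaseTotalMs(baseMs: number, maxAttempts: number): number {
  let total = 0;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    total += retryTimeoutMs(baseMs, attempt);
  }
  return total;
}
```

With a 60000 ms base and three attempts, `worstCaseTotalMs(60000, 3)` is 60000 + 120000 + 240000 = 420000 ms, or 7 minutes. Returning the attempt count from the wrapper, rather than hardcoding it at the call site, would let failure logging and the persisted payload report the number actually used.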
## Files Changed

| File | Changes |
|------|---------|
| `src/agents/subagent-announce-retry.ts` | **NEW** - Retry logic, persistence, recovery helpers |
| `src/agents/subagent-announce-retry.test.ts` | **NEW** - Unit tests |
| `src/agents/subagent-announce.ts` | Updated `sendSubagentAnnounceDirectly` with retry wrapper |
| `src/config/types.agent-defaults.ts` | Added `announceTimeoutMs` to subagents type |
| `src/config/types.agents.ts` | Added per-agent `announceTimeoutMs` |
| `src/config/zod-schema.agent-defaults.ts` | Added schema validation |
| `src/config/zod-schema.agent-runtime.ts` | Added per-agent schema validation |
| `src/cli/subagents-cli.ts` | **NEW** - CLI entry point |
| `src/cli/subagents-cli/register.ts` | **NEW** - CLI registration |
| `src/cli/subagents-cli/recover.ts` | **NEW** - `list-failed` and `recover` commands |
| `src/cli/program/command-registry.ts` | Registered subagents CLI |

## Related Issues

- #17000 - Subagent completion announcements silently dropped on lane timeout
- #17122 - Gateway dedup cache for announce idempotency (related)

---

**AI-Assisted:** Yes (Claude) - fully tested patterns, reviewed implementation.

<h3>Greptile Summary</h3>

Adds retry with exponential backoff for subagent completion announcement delivery, addressing silent data loss when the gateway lane times out (#17000). Failed announcements are persisted to disk for later CLI-based recovery (`openclaw subagents recover`).

- **Bug: Per-agent config lookup broken**: `resolveAnnounceTimeoutMs` in `subagent-announce-retry.ts:56` indexes `cfg.agents.list` (an `AgentConfig[]` array) with a string `agentId`. This always returns `undefined`, so per-agent `announceTimeoutMs` overrides never take effect. Should use `.find((a) => a?.id === agentId)` to match the rest of the codebase.
- **Hardcoded retry count**: `subagent-announce.ts` hardcodes `attempts: 3` in failure logging and persistence instead of reading from the retry config, creating a maintenance risk.
- **Unused imports**: `resolveAgentIdFromSessionKey` and `resolveStorePath` are imported in the retry module but unused there.
- **Timeout behavior change**: The original code used a hardcoded `15_000ms` timeout per gateway call; the retry wrapper starts at `60_000ms` (or `120_000ms` default from config) with 2x exponential backoff, which significantly increases worst-case latency for a single announce flow (up to ~7 minutes total across 3 retries). This is an intentional tradeoff documented in the PR but worth noting for reviewers.

<h3>Confidence Score: 2/5</h3>

- Contains a logic bug that silently disables per-agent config overrides; safe at the global config level but the per-agent feature is broken.
- The per-agent `announceTimeoutMs` config lookup is broken due to array-vs-record indexing, meaning a documented feature will not work. The global defaults path and retry logic are correct, so the core improvement (retry + persistence) does function. Hardcoded `attempts: 3` is a maintenance concern. Overall the PR improves reliability but ships a broken per-agent override.
- `src/agents/subagent-announce-retry.ts` (broken per-agent config lookup at line 56), `src/agents/subagent-announce.ts` (hardcoded retry count)

<sub>Last reviewed commit: 131ac4a</sub>
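
The per-agent lookup issue flagged in the review can be illustrated with a small sketch. The types below are minimal stand-ins rather than the project's real config types, the signature of `resolveAnnounceTimeoutMs` is assumed, and `FALLBACK_TIMEOUT_MS` is a hypothetical constant that mirrors the documented 120000 ms default.

```typescript
// Minimal stand-in types (assumptions); the real config types are richer.
interface SubagentsConfig {
  announceTimeoutMs?: number;
}
interface AgentConfig {
  id: string;
  subagents?: SubagentsConfig;
}
interface Config {
  agents?: {
    defaults?: { subagents?: SubagentsConfig };
    list?: AgentConfig[];
  };
}

// Hypothetical constant mirroring the default stated in the PR description.
const FALLBACK_TIMEOUT_MS = 120_000;

/**
 * Resolves the announce timeout: per-agent override first, then the global
 * default, then the built-in fallback. The signature is assumed.
 */
function resolveAnnounceTimeoutMs(cfg: Config, agentId?: string): number {
  // `list` is an array, so the agent must be found by id. Indexing the array
  // with a string key would always yield undefined.
  const agent = agentId
    ? cfg.agents?.list?.find((a) => a?.id === agentId)
    : undefined;
  return (
    agent?.subagents?.announceTimeoutMs ??
    cfg.agents?.defaults?.subagents?.announceTimeoutMs ??
    FALLBACK_TIMEOUT_MS
  );
}
```

Using `??` rather than `||` keeps the fallback chain from skipping an explicitly configured value that happens to be falsy.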