#20328: fix(agents): Add retry with exponential backoff for subagent announce delivery
cli
agents
size: XL
Cluster:
Subagent Enhancements and Features
# fix(agents): Add retry with exponential backoff for subagent announce delivery
Fixes #17000
## Problem
Subagent completion announcements are silently dropped when the gateway lane times out. The current implementation uses a hardcoded timeout with no retry mechanism, causing:
- **Silent data loss**: Subagent results are lost when the main session is busy
- **No visibility**: Users have no indication that delivery failed
- **No recovery path**: Failed announcements are permanently lost
### Evidence from Issue #17000
```
[ERROR] gateway timeout after 60000ms for method: agent
Subagent completion direct announce failed for run abc123: gateway timeout after 60000ms
```
The subagent completes successfully, but the user never receives the result.
## Solution
### 1. Configurable Timeout
Added `agents.defaults.subagents.announceTimeoutMs` configuration with sensible defaults:
```yaml
agents:
defaults:
subagents:
announceTimeoutMs: 120000 # 2 minutes (default)
```
Per-agent override supported:
```yaml
agents:
list:
- id: worker
subagents:
announceTimeoutMs: 60000 # 1 minute for this agent
```
### 2. Retry with Exponential Backoff
Three retries with increasing timeouts:
| Attempt | Timeout | Notes |
|---------|---------|-------|
| 1 | 60s | Base timeout (configurable) |
| 2 | 120s | 2× base |
| 3 | 240s | 4× base |
Each attempt is logged for observability:
```
[announce retry 1/3] for subagent abc123 (timeout: 60000ms)
[announce retry 2/3] for subagent abc123 (timeout: 120000ms)
[announce retry 3/3] for subagent abc123 (timeout: 240000ms)
```
### 3. Persist Failed Announcements
After all retries exhausted, failed announcements are persisted to disk for recovery:
**Storage location:** `~/.openclaw/announce-failed/<sessionId>.json`
**Payload includes:**
- `sessionId`: Unique identifier for the subagent session
- `timestamp`: When the failure occurred
- `task`: Original task description
- `result`: Subagent output (preserved for recovery)
- `attempts`: Number of delivery attempts
- `lastError`: Final error message
### 4. Surface Failures Visibly
**Error logging with recovery instructions:**
```
[announce FAILED] Subagent completion announcement failed after 3 attempts
Session ID: abc123
Session Key: agent:worker:subagent:xyz
Last Error: gateway timeout after 240000ms
Recovery: Run "openclaw subagents recover abc123" to retry delivery
Or use "/subagents log abc123" to view the results
```
**System notification:** On final failure, a system message is injected:
```
⚠️ Sub-agent completed but delivery failed — use /subagents log abc123 to view results
```
**CLI commands:**
```bash
# List all failed announcements
openclaw subagents list-failed
# Retry delivery for a specific session
openclaw subagents recover abc123
# Delete a failed record without retrying
openclaw subagents recover abc123 --delete
```
## Testing
### Unit Tests (`subagent-announce-retry.test.ts`)
- ✅ `calculateRetryTimeout`: Verifies exponential backoff calculation
- ✅ `persistFailedAnnounce`: Tests file creation and payload serialization
- ✅ `loadFailedAnnounce`: Tests loading persisted payloads
- ✅ `listFailedAnnounces`: Tests listing and sorting by timestamp
- ✅ `removeFailedAnnounce`: Tests cleanup after recovery
- ✅ `withAnnounceRetry`: Tests retry logic with mocked failures
### Integration Test Plan
1. **Mock gateway timeout**: Configure `callGateway` to timeout on first 2 attempts
2. **Spawn subagent**: Start a subagent task during simulated high load
3. **Verify retry**: Check logs show 3 retry attempts with correct timeouts
4. **Verify persistence**: After failure, check `.openclaw/announce-failed/` contains payload
5. **Test recovery**: Run `openclaw subagents recover <sessionId>` and verify delivery
### Manual Testing
```bash
# 1. Spawn a subagent (in chat)
/spawn task="sleep 5 && echo done"
# 2. Block the main session lane (simulated)
# ... (trigger high load or network issues)
# 3. Wait for retries (check logs)
tail -f ~/.openclaw/openclaw-*.log | grep "announce retry"
# 4. After failure, verify persistence
ls ~/.openclaw/announce-failed/
# 5. Recover
openclaw subagents recover <sessionId>
```
## Migration Notes
### Configuration Changes
No breaking changes. New optional config keys:
| Key | Type | Default | Description |
|-----|------|---------|-------------|
| `agents.defaults.subagents.announceTimeoutMs` | number | 120000 | Base timeout (ms) |
| `agents.list[].subagents.announceTimeoutMs` | number | (inherit) | Per-agent override |
### Upgrade Path
1. Existing users: No action required. Default behavior improves reliability automatically.
2. Custom timeouts: Add config if you need shorter/longer timeouts.
3. Recovery: After upgrade, any previously-lost announcements cannot be recovered (only future failures are persisted).
## Files Changed
| File | Changes |
|------|---------|
| `src/agents/subagent-announce-retry.ts` | **NEW** - Retry logic, persistence, recovery helpers |
| `src/agents/subagent-announce-retry.test.ts` | **NEW** - Unit tests |
| `src/agents/subagent-announce.ts` | Updated `sendSubagentAnnounceDirectly` with retry wrapper |
| `src/config/types.agent-defaults.ts` | Added `announceTimeoutMs` to subagents type |
| `src/config/types.agents.ts` | Added per-agent `announceTimeoutMs` |
| `src/config/zod-schema.agent-defaults.ts` | Added schema validation |
| `src/config/zod-schema.agent-runtime.ts` | Added per-agent schema validation |
| `src/cli/subagents-cli.ts` | **NEW** - CLI entry point |
| `src/cli/subagents-cli/register.ts` | **NEW** - CLI registration |
| `src/cli/subagents-cli/recover.ts` | **NEW** - `list-failed` and `recover` commands |
| `src/cli/program/command-registry.ts` | Registered subagents CLI |
## Related Issues
- #17000 - Subagent completion announcements silently dropped on lane timeout
- #17122 - Gateway dedup cache for announce idempotency (related)
---
**AI-Assisted:** Yes (Claude) — fully tested patterns, reviewed implementation.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds retry with exponential backoff for subagent completion announcement delivery, addressing silent data loss when the gateway lane times out (#17000). Failed announcements are persisted to disk for later CLI-based recovery (`openclaw subagents recover`).
- **Bug: Per-agent config lookup broken** — `resolveAnnounceTimeoutMs` in `subagent-announce-retry.ts:56` indexes `cfg.agents.list` (an `AgentConfig[]` array) with a string `agentId`. This always returns `undefined`, so per-agent `announceTimeoutMs` overrides never take effect. Should use `.find((a) => a?.id === agentId)` to match the rest of the codebase.
- **Hardcoded retry count** — `subagent-announce.ts` hardcodes `attempts: 3` in failure logging and persistence instead of reading from the retry config, creating a maintenance risk.
- **Unused imports** — `resolveAgentIdFromSessionKey` and `resolveStorePath` are imported in the retry module but unused there.
- **Timeout behavior change** — The original code used a hardcoded `15_000ms` timeout per gateway call; the retry wrapper starts at `60_000ms` (or `120_000ms` default from config) with 2x exponential backoff, which significantly increases worst-case latency for a single announce flow (up to ~7 minutes total across 3 retries). This is an intentional tradeoff documented in the PR but worth noting for reviewers.
<h3>Confidence Score: 2/5</h3>
- Contains a logic bug that silently disables per-agent config overrides; safe at the global config level but the per-agent feature is broken.
- The per-agent `announceTimeoutMs` config lookup is broken due to array-vs-record indexing, meaning a documented feature will not work. The global defaults path and retry logic are correct, so the core improvement (retry + persistence) does function. Hardcoded `attempts: 3` is a maintenance concern. Overall the PR improves reliability but ships a broken per-agent override.
- `src/agents/subagent-announce-retry.ts` (broken per-agent config lookup at line 56), `src/agents/subagent-announce.ts` (hardcoded retry count)
<sub>Last reviewed commit: 131ac4a</sub>
<!-- greptile_other_comments_section -->
<sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub>
<!-- /greptile_comment -->
Most Similar PRs
#18205: fix (agents): add periodic retry timer for failed subagent announces
by MegaPhoenix92 · 2026-02-16
86.5%
#17001: fix: retry sub-agent announcements with backoff instead of silently...
by luisecab · 2026-02-15
86.4%
#17028: fix(subagent): retry announce on timeout
by Limitless2023 · 2026-02-15
86.1%
#22719: fix(agents): make subagent announce timeout configurable (restore 6...
by Valadon · 2026-02-21
84.0%
#16944: fix: retry transient WebSocket 1006 closures in callGateway + annou...
by sudobot99 · 2026-02-15
82.6%
#9049: fix: prevent subagent stuck loops and ensure user feedback
by maxtongwang · 2026-02-04
81.1%
#13105: fix: debounce subagent lifecycle events to prevent premature announ...
by mcaxtr · 2026-02-10
80.1%
#7584: Tests: align subagent announce wait expectations
by justinhuangcode · 2026-02-03
79.6%
#18468: fix(agents): prevent infinite retry loops in sub-agent completion a...
by BinHPdev · 2026-02-16
79.4%
#17721: fix: abort child run on subagent timeout + retry with backoff + sta...
by IrriVisionTechnologies · 2026-02-16
78.5%