#17721: fix: abort child run on subagent timeout + retry with backoff + stale detection
agents
stale
size: L
Cluster: Subagent Enhancements and Features
## Problem
When a sub-agent's `agent.wait` times out, the child run continues executing indefinitely in the background, consuming resources. The parent records a timeout outcome but never sends an abort signal to stop the child.
Additionally, if the gateway RPC itself fails (e.g., during a gateway restart), the error is silently swallowed with an empty `catch {}`, leaving the run in limbo until the 60-minute archive sweeper cleans it up.
## Changes
### 1. Abort child on timeout
When `waitForSubagentCompletion` receives a timeout response, it now calls `abortEmbeddedPiRun()` to actively stop the child run. This prevents zombie sub-agents from burning tokens after the parent has given up.
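A minimal sketch of the timeout branch, assuming the wait result carries a `status` field. `handleWaitOutcome`, `WaitOutcome`, and the injected `abortChild` callback are hypothetical names used only for illustration; in the PR the abort is issued via `abortEmbeddedPiRun()` from within `waitForSubagentCompletion`.

```typescript
// Hypothetical shapes for illustration; the real types in the codebase may differ.
type WaitStatus = "ok" | "error" | "timeout";

interface WaitOutcome {
  status: WaitStatus;
}

// Stand-in for the real abort call, injected so the sketch is self-contained.
type AbortChildFn = (runId: string) => void;

function handleWaitOutcome(
  runId: string,
  wait: WaitOutcome,
  abortChild: AbortChildFn,
): WaitStatus {
  if (wait.status === "timeout") {
    // The parent has stopped waiting, so actively stop the child run too.
    abortChild(runId);
  }
  // Non-timeout outcomes pass through without triggering an abort.
  return wait.status;
}
```

Passing the abort function as a parameter mirrors what the new test checks: the abort is called on timeout and not on success.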
### 2. Retry with exponential backoff
RPC failures in `waitForSubagentCompletion` now trigger up to 3 retries with exponential backoff (5s → 10s → 20s). If all retries fail, the run is marked as failed and the announce flow notifies the requester. This handles transient gateway restarts gracefully.
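The backoff schedule can be sketched as a pure function. `nextWaitRetryDelayMs` and both constants are hypothetical names, not taken from the diff; only the 3-retry limit and the 5s → 10s → 20s delays come from this description.

```typescript
// Assumed constants matching the schedule described in this PR.
const MAX_WAIT_RETRIES = 3;
const BASE_RETRY_DELAY_MS = 5_000;

/**
 * Returns the delay before the next retry, or null once the retry budget
 * is exhausted and the run should be marked as failed.
 * `retryCount` is the number of retries already attempted (0 for the first).
 */
function nextWaitRetryDelayMs(retryCount: number): number | null {
  if (retryCount >= MAX_WAIT_RETRIES) {
    return null;
  }
  // Double the delay on each attempt: 5s, 10s, 20s.
  return BASE_RETRY_DELAY_MS * 2 ** retryCount;
}
```

A caller would schedule the next wait attempt after the returned delay and fall through to the failure and announce path when the result is `null`.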
### 3. Stale run detection
The sweeper now detects runs that have been active for more than 2 hours without completing. Each such run is marked as failed, aborted, and announced to the requester. This catches edge cases where both the wait RPC and the lifecycle listener miss the completion event.
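A rough sketch of the staleness check, assuming each run record stores a start timestamp and an optional end timestamp in epoch milliseconds. `StaleCheckRecord`, `findStaleRuns`, and the field names are hypothetical; only the 2-hour threshold comes from this description.

```typescript
// Threshold from the PR description: active for more than 2 hours.
const STALE_RUN_THRESHOLD_MS = 2 * 60 * 60 * 1000;

// Simplified, hypothetical record shape; the real SubagentRunRecord has more fields.
interface StaleCheckRecord {
  runId: string;
  startedAt: number; // epoch milliseconds
  endedAt?: number; // unset while the run is still active
}

function findStaleRuns(runs: StaleCheckRecord[], now: number): StaleCheckRecord[] {
  return runs.filter(
    (run) =>
      // Only runs that never completed can be stale.
      run.endedAt === undefined &&
      now - run.startedAt > STALE_RUN_THRESHOLD_MS,
  );
}
```

The sweeper would then mark each returned run as failed, abort it, and announce the outcome to the requester.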
### 4. Tests
- New `abort-on-timeout.test.ts` — verifies abort is called on timeout and not on success
- Updated `wait-retry.test.ts` — added mocks for new imports
## Impact
- No more zombie sub-agents running after timeout
- Resilient to gateway restarts during sub-agent execution
- Orphaned runs are detected and cleaned up once they have been active for more than 2 hours
- All changes are backward-compatible (new fields on SubagentRunRecord are optional)
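To illustrate the backward-compatibility point, here is a small sketch of reading an optional retry field with a default. `waitRetryCount` and `lastWaitRetryAt` are the field names given in the review summary; their types, the trimmed-down record shape, and the `getWaitRetryCount` helper are assumptions.

```typescript
// Simplified stand-in for SubagentRunRecord; only the two new fields are named in this PR.
interface RunRecordWithRetryState {
  runId: string;
  waitRetryCount?: number; // assumed to be a retry counter
  lastWaitRetryAt?: number; // assumed to be an epoch-millisecond timestamp
}

function getWaitRetryCount(record: RunRecordWithRetryState): number {
  // Records persisted before this change carry no counter; treat them as zero retries.
  return record.waitRetryCount ?? 0;
}
```

Because both fields are optional, records written by an older version still satisfy the type and behave as if no retries had occurred.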
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds three critical reliability improvements to subagent run management: abort on timeout, retry with exponential backoff, and stale run detection. The implementation prevents zombie subagents from running indefinitely when the parent times out by calling `abortChildRun()` when `wait.status === "timeout"` (line 601-602). RPC failures now trigger up to 3 retries with exponential backoff (5s → 10s → 20s) via `scheduleWaitRetry()`, and the sweeper detects runs active for >2 hours without completion to mark them as failed. All new optional fields (`waitRetryCount`, `lastWaitRetryAt`) are backward-compatible, and comprehensive tests verify abort behavior and retry logic.
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The implementation is well-designed with proper error handling, backward-compatible changes, and comprehensive test coverage for all three features (abort-on-timeout, retry logic, stale detection). The logic correctly handles edge cases like clearing retry state on success and preventing double-announces.
- No files require special attention
<sub>Last reviewed commit: 8c6600c</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #21115: fix(agent): immediately abort active run on stop/abort message during… · by anillBhoi · 2026-02-19 · 81.1%
- #18205: fix (agents): add periodic retry timer for failed subagent announces · by MegaPhoenix92 · 2026-02-16 · 80.7%
- #17001: fix: retry sub-agent announcements with backoff instead of silently... · by luisecab · 2026-02-15 · 80.3%
- #13105: fix: debounce subagent lifecycle events to prevent premature announ... · by mcaxtr · 2026-02-10 · 80.3%
- #17028: fix(subagent): retry announce on timeout · by Limitless2023 · 2026-02-15 · 79.7%
- #22719: fix(agents): make subagent announce timeout configurable (restore 6... · by Valadon · 2026-02-21 · 78.9%
- #18468: fix(agents): prevent infinite retry loops in sub-agent completion a... · by BinHPdev · 2026-02-16 · 78.6%
- #20328: fix(agents): Add retry with exponential backoff for subagent announ... · by tiny-ship-it · 2026-02-18 · 78.5%
- #6143: fix(agents): handle AbortError from activeSession.abort() on timeout · by Glucksberg · 2026-02-01 · 78.4%
- #12477: fix(agents): prevent TimeoutOverflowWarning when timeout is disabled · by skylarkoo7 · 2026-02-09 · 78.3%