← Back to PRs

#5593: fix(pi-embedded-runner): force-clear stuck embedded runs after timeout

by grassX1998 open 2026-01-31 17:53 View on GitHub →
agents
## Problem When an embedded run (subagent session) times out, the main session can become permanently stuck. This happens when: 1. The timeout handler calls `abortRun(true)` 2. The normal cleanup in `finally` block calls `clearActiveEmbeddedRun()` 3. But `clearActiveEmbeddedRun()` uses handle matching: ```js if (ACTIVE_EMBEDDED_RUNS.get(sessionId) === handle) { ... } ``` 4. If the handle doesn't match (race condition, replaced run, etc.), the entry is never removed 5. `isEmbeddedPiRunActive()` returns `true` forever 6. Main session blocks indefinitely on `waitForEmbeddedPiRunEnd()` ## Solution Add a force-cleanup mechanism: 1. New `forceClearActiveEmbeddedRun()` function that unconditionally removes the session entry (no handle check) 2. Schedule this cleanup 30 seconds after timeout fires 3. Log with `timeout_force_cleared` reason for debugging 4. Notify all waiters so blocked code can proceed The 30-second grace period ensures normal cleanup has ample time to complete, while providing a guaranteed escape hatch. ## Testing - Verified fix prevents session deadlock after subagent timeout - Normal cleanup path unchanged (grace period allows it to work) - Added logging for debugging force-clear scenarios ## Risk Assessment **Low risk:** - Does not modify normal execution path - Only activates after timeout + 30s grace period - Defensive measure that should rarely trigger <!-- greptile_comment --> <h2>Greptile Overview</h2> <h3>Greptile Summary</h3> This PR adds a defensive cleanup path for embedded Pi runs to prevent the main session from getting stuck after a timeout. Specifically: - `runEmbeddedAttempt` now schedules a delayed “force clear” of the active embedded-run entry after a timeout. - `runs.ts` adds `forceClearActiveEmbeddedRun(sessionId)`, which unconditionally deletes the session’s active-handle entry, logs `timeout_force_cleared`, and notifies any `waitForEmbeddedPiRunEnd` waiters. This fits into the existing embedded-run lifecycle management in `src/agents/pi-embedded-runner/runs.ts`, where `ACTIVE_EMBEDDED_RUNS` is used by other parts of the system (queueing, aborting, compaction, session deletion) to determine whether an embedded run is active/streaming and to coordinate waiting for completion. Main concern: the new force-clear mechanism is keyed only by `sessionId`, so a delayed timer from a timed-out run can clear a *new* run started later under the same sessionId, and `notifyEmbeddedRunEnded` will resolve waiters even if the newer run is still in progress. <h3>Confidence Score: 3/5</h3> - This PR is likely safe but has a real concurrency edge case that could clear the wrong active run. - Change is small and targeted, but the delayed, sessionId-only force-clear can race with run replacement and clear a newer run, which can break queue/steer semantics and incorrectly unblock waiters. - src/agents/pi-embedded-runner/run/attempt.ts and src/agents/pi-embedded-runner/runs.ts (timer/force-clear semantics) <!-- greptile_other_comments_section --> <sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub> **Context used:** - Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8)) - Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=0d0c8278-ef8e-4d6c-ab21-f5527e322f13)) <!-- /greptile_comment -->

Most Similar PRs