#5593: fix(pi-embedded-runner): force-clear stuck embedded runs after timeout
agents
## Problem
When an embedded run (subagent session) times out, the main session can become permanently stuck. This happens when:
1. The timeout handler calls `abortRun(true)`
2. The normal cleanup in `finally` block calls `clearActiveEmbeddedRun()`
3. But `clearActiveEmbeddedRun()` uses handle matching:
```js
if (ACTIVE_EMBEDDED_RUNS.get(sessionId) === handle) { ... }
```
4. If the handle doesn't match (race condition, replaced run, etc.), the entry is never removed
5. `isEmbeddedPiRunActive()` returns `true` forever
6. Main session blocks indefinitely on `waitForEmbeddedPiRunEnd()`
## Solution
Add a force-cleanup mechanism:
1. New `forceClearActiveEmbeddedRun()` function that unconditionally removes the session entry (no handle check)
2. Schedule this cleanup 30 seconds after timeout fires
3. Log with `timeout_force_cleared` reason for debugging
4. Notify all waiters so blocked code can proceed
The 30-second grace period ensures normal cleanup has ample time to complete, while providing a guaranteed escape hatch.
## Testing
- Verified fix prevents session deadlock after subagent timeout
- Normal cleanup path unchanged (grace period allows it to work)
- Added logging for debugging force-clear scenarios
## Risk Assessment
**Low risk:**
- Does not modify normal execution path
- Only activates after timeout + 30s grace period
- Defensive measure that should rarely trigger
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adds a defensive cleanup path for embedded Pi runs to prevent the main session from getting stuck after a timeout. Specifically:
- `runEmbeddedAttempt` now schedules a delayed “force clear” of the active embedded-run entry after a timeout.
- `runs.ts` adds `forceClearActiveEmbeddedRun(sessionId)`, which unconditionally deletes the session’s active-handle entry, logs `timeout_force_cleared`, and notifies any `waitForEmbeddedPiRunEnd` waiters.
This fits into the existing embedded-run lifecycle management in `src/agents/pi-embedded-runner/runs.ts`, where `ACTIVE_EMBEDDED_RUNS` is used by other parts of the system (queueing, aborting, compaction, session deletion) to determine whether an embedded run is active/streaming and to coordinate waiting for completion.
Main concern: the new force-clear mechanism is keyed only by `sessionId`, so a delayed timer from a timed-out run can clear a *new* run started later under the same sessionId, and `notifyEmbeddedRunEnded` will resolve waiters even if the newer run is still in progress.
<h3>Confidence Score: 3/5</h3>
- This PR is likely safe but has a real concurrency edge case that could clear the wrong active run.
- Change is small and targeted, but the delayed, sessionId-only force-clear can race with run replacement and clear a newer run, which can break queue/steer semantics and incorrectly unblock waiters.
- src/agents/pi-embedded-runner/run/attempt.ts and src/agents/pi-embedded-runner/runs.ts (timer/force-clear semantics)
<!-- greptile_other_comments_section -->
<sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub>
**Context used:**
- Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8))
- Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=0d0c8278-ef8e-4d6c-ab21-f5527e322f13))
<!-- /greptile_comment -->
Most Similar PRs
#18432: fix(agents): clear active run state immediately on embedded timeout
by BinHPdev · 2026-02-16
82.5%
#7760: fix(agents): resolve message ordering conflict during tool execution
by aryan877 · 2026-02-03
78.5%
#5467: Fix: make embedded compaction wait abortable
by yoyooyooo · 2026-01-31
78.5%
#6143: fix(agents): handle AbortError from activeSession.abort() on timeout
by Glucksberg · 2026-02-01
76.9%
#6268: fix: add timeout to compaction retry to prevent session lockout
by batumilove · 2026-02-01
76.7%
#17445: fix(pi-embedded): add aggregate timeout to compaction retry + harde...
by joeykrug · 2026-02-15
76.2%
#11999: fix: add session-growth guard to prevent unbounded session store gr...
by reverendrewind · 2026-02-08
74.4%
#2541: fix(agents): add error handling to orphaned message cleanup
by Episkey-G · 2026-01-27
74.1%
#10281: fix(infra): clear seqByRun entry when agent run context is cleared
by programming-pupil · 2026-02-06
73.5%
#17721: fix: abort child run on subagent timeout + retry with backoff + sta...
by IrriVisionTechnologies · 2026-02-16
73.1%