#17265: fix: abort streaming runs after 90s of inactivity
agents
size: S
Cluster:
Session Management and Fixes
## Summary
- Adds a streaming inactivity watchdog (90s threshold) that detects when the upstream LLM API accepts a connection but sends zero SSE chunks, and aborts + retries the stalled run
- Notifies the user via block reply ("Response stalled — retrying...") when a run is retried due to inactivity timeout
- Refreshes the typing indicator TTL across prompt cycles so users continue to see "typing..." instead of silence during tool→prompt transitions
Closes #17258
## Details
When the upstream API hangs mid-stream (zero tokens for extended periods), the only existing safety net is the per-attempt absolute timeout (default 300s). This leaves users staring at "typing..." for 2 minutes (TTL expiry), then silence for another 3 minutes before the run finally aborts.
The new inactivity timer fires an `onStreamActivity` callback on every streaming event. If no events arrive for 90 seconds while the session is streaming, the watchdog aborts the run and the retry loop picks it up — cutting worst-case wait from 5 minutes to ~90 seconds.
**Files changed (7):**
- `pi-embedded-subscribe.types.ts` / `.handlers.ts` — `onStreamActivity` callback
- `run/attempt.ts` — inactivity watchdog timer + `onPromptCycleStart` hook
- `run/types.ts` / `run/params.ts` — new callback types
- `run.ts` — stall notification on retry
- `agent-runner-execution.ts` — typing refresh wiring
## Test plan
- [x] `pnpm test src/agents/pi-embedded-subscribe` — 20 files, 67 tests passed
- [x] `pnpm test src/agents/pi-embedded-runner/` — 8 files, 81 tests passed
- [x] `pnpm test src/auto-reply/reply/agent-runner` — 19 files, 48 tests passed
- [x] `pnpm check` — format, types, lint all clean
- [ ] Deploy and verify: typing indicator stays alive across tool→prompt transitions
- [ ] Deploy and verify: stalled runs abort within ~90s instead of 5 min
- [ ] Deploy and verify: user sees "Response stalled — retrying..." notification
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds a 90-second streaming inactivity watchdog to detect stalled LLM API connections that accept the connection but send zero SSE chunks. When triggered, the watchdog aborts the run, notifies the user with "Response stalled — retrying...", and the existing retry loop picks up the retried attempt. Also wires an `onPromptCycleStart` callback to refresh typing indicator TTLs across tool→prompt transitions.
- **Inactivity watchdog** (`attempt.ts`): A `setTimeout`-based watchdog starts at 90s and reschedules every 30s to poll for activity. Guards against false positives by checking `activeSession.isStreaming` before aborting. Activity is reset on every SSE event via `onStreamActivity` callback and before each `prompt()` call. A `streamWatchdogDone` flag prevents dangling timers after the attempt completes.
- **Stall notification** (`run.ts`): Fires a best-effort block reply ("Response stalled — retrying..." / "Still waiting for response (retry N)...") when a timeout triggers profile rotation. Correctly placed before the `if (rotated)` check so single-profile setups also receive the notification.
- **Typing indicator refresh** (`agent-runner-execution.ts`): Calls `signalToolStart()` via the new `onPromptCycleStart` hook to keep typing indicators alive during prompt transitions.
- **Event plumbing** (`pi-embedded-subscribe.handlers.ts`, `.types.ts`): New `onStreamActivity` callback fires at the top of the event handler for every SSE event type.
- **Tests**: New e2e test verifies `onStreamActivity` fires on every event type, handles omission gracefully, and covers agent lifecycle events.
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — well-scoped defensive feature with proper guards against false positives and dangling state.
- The implementation is clean and well-reasoned: the watchdog correctly guards against false positives via the `isStreaming` check, prevents dangling timers with the `streamWatchdogDone` flag, and the stall notification fires independently of profile rotation. The fire-and-forget pattern for notifications appropriately prevents channel errors from blocking retries. Tests cover the core callback wiring. Score is 4 rather than 5 because the changes touch critical retry/abort paths in production and the deploy-and-verify test plan items are still unchecked.
- `src/agents/pi-embedded-runner/run/attempt.ts` — contains the watchdog timer logic and abort integration, which is the most critical path in this PR.
<sub>Last reviewed commit: 0a0da00</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#21555: fix: abort streaming runs after 90s of inactivity
by jg-noncelogic · 2026-02-20
90.5%
#23720: Feat/cli backend runtime tuning
by wanmorebot · 2026-02-22
75.2%
#11688: feat(telegram): add health check watchdog for long-polling
by rmfalco89 · 2026-02-08
74.2%
#12651: fix: prevent stale timeout from triggering duplicate message sends
by janckerchen · 2026-02-09
74.1%
#9092: fix: skip retry when block content already streamed to user
by benleavett · 2026-02-04
73.1%
#17445: fix(pi-embedded): add aggregate timeout to compaction retry + harde...
by joeykrug · 2026-02-15
73.0%
#22454: fix(macos): add re-subscribe loop to gateway stream subscribers
by mandofever78 · 2026-02-21
72.7%
#19648: fix: suppress silent-reply partial tokens during streaming
by bradleypriest · 2026-02-18
72.6%
#17721: fix: abort child run on subagent timeout + retry with backoff + sta...
by IrriVisionTechnologies · 2026-02-16
72.2%
#10273: fix(agents): detect and auto-compact mid-run context overflow
by terryops · 2026-02-06
71.4%