#21555: fix: abort streaming runs after 90s of inactivity
agents
size: S
Cluster: Session Management and Fixes
When the upstream LLM API accepts a connection but produces zero SSE chunks, the session stays stuck in "processing" with no user feedback. This adds a streaming inactivity watchdog (90s threshold) that aborts and retries stalled runs, notifies the user via block reply, and refreshes the typing indicator TTL across prompt cycles.
Closes #17258
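
The abort, retry and notify flow described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `runWithStallRetries`, `makeStallError`, `isStallError`, the `maxRetries` default and the message wording are hypothetical names and values, not code from this PR. Only the idea of reporting the retry through an `onBlockReply`-style callback comes from the description.

```typescript
// Hypothetical sketch: all names, the retry limit, and the message text are
// assumptions for illustration, not the PR's actual implementation.

/** Build an error that marks a run as aborted due to stream inactivity. */
function makeStallError(): Error {
  const err = new Error("stream produced no chunks within the inactivity window");
  err.name = "StreamStallError";
  return err;
}

/** True only for errors created by makeStallError(). */
function isStallError(err: unknown): boolean {
  return err instanceof Error && err.name === "StreamStallError";
}

interface RetryHooks {
  // Stands in for the block-reply callback that surfaces a message to the user.
  onBlockReply: (text: string) => void | Promise<void>;
}

async function runWithStallRetries<T>(
  runAttempt: () => Promise<T>,
  hooks: RetryHooks,
  maxRetries: number = 2, // assumed limit; the PR does not state one
): Promise<T> {
  for (let retry = 0; ; retry++) {
    try {
      return await runAttempt();
    } catch (err) {
      // Only stalls are retried; every other failure propagates unchanged.
      if (!isStallError(err) || retry >= maxRetries) throw err;
      // Tell the user which retry is starting before trying again.
      await hooks.onBlockReply(
        `The model stopped responding. Retrying (${retry + 1}/${maxRetries})...`,
      );
    }
  }
}
```

Keeping the retry loop separate from stall detection means the notification text and retry limit can change without touching the timer logic.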
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds a 90-second streaming inactivity watchdog to detect and recover from stalled LLM API connections. When the upstream LLM API accepts a connection but produces no SSE chunks, the watchdog aborts the run and retries with a user notification. The implementation also refreshes typing indicator TTL on each prompt cycle to maintain user feedback during long-running operations.
**Key Changes:**
- Added streaming inactivity detection in `attempt.ts` with 90s timeout and 30s recheck interval
- `onStreamActivity` callback resets the watchdog timer on every SSE event
- `onPromptCycleStart` callback refreshes typing indicators before each prompt
- User notification shows retry count via `onBlockReply` when stalls are detected
- Timer cleanup in the `finally` block prevents resource leaks
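
The detection logic listed above can be sketched as a small class. This is a hedged illustration, not the code in `attempt.ts`: the class name, option names and injectable clock are hypothetical. The 90s threshold, the 30s recheck interval, and the reset on every stream event are taken from the summary. The sketch also assumes that a stall is reported only while the session says it is streaming.

```typescript
// Hypothetical sketch; the real attempt.ts may be structured differently.
const INACTIVITY_THRESHOLD_MS = 90_000; // 90s threshold from the summary
const RECHECK_INTERVAL_MS = 30_000; // 30s recheck interval from the summary

interface WatchdogOptions {
  isStreaming: () => boolean; // stands in for activeSession.isStreaming
  onStall: (idleMs: number) => void; // e.g. abort the run
  now?: () => number; // injectable clock, useful in tests
}

class StreamInactivityWatchdog {
  private readonly opts: WatchdogOptions;
  private readonly now: () => number;
  private lastActivityAt: number;
  private timer: ReturnType<typeof setTimeout> | undefined;

  constructor(opts: WatchdogOptions) {
    this.opts = opts;
    this.now = opts.now ?? (() => Date.now());
    this.lastActivityAt = this.now();
  }

  /** Call on every SSE event (the role played by onStreamActivity). */
  recordActivity(): void {
    this.lastActivityAt = this.now();
  }

  /** Run one check; returns true if a stall was reported. */
  check(): boolean {
    const idleMs = this.now() - this.lastActivityAt;
    if (idleMs >= INACTIVITY_THRESHOLD_MS && this.opts.isStreaming()) {
      this.opts.onStall(idleMs);
      return true;
    }
    return false;
  }

  /** Recursive timer: re-arm after each check until a stall fires or stop() runs. */
  start(): void {
    this.stop();
    this.timer = setTimeout(() => {
      this.timer = undefined;
      if (!this.check()) this.start();
    }, RECHECK_INTERVAL_MS);
  }

  /** Cancel any pending check; safe to call more than once. */
  stop(): void {
    if (this.timer !== undefined) {
      clearTimeout(this.timer);
      this.timer = undefined;
    }
  }
}
```

With a 30s recheck interval, a stall would be noticed somewhere between 90s and roughly 120s after the last event, depending on where the last event fell relative to the check schedule.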
**Implementation Quality:**
- Properly integrates with existing abort/timeout infrastructure
- Checks `activeSession.isStreaming` before aborting to avoid false positives
- Recursive timer pattern allows monitoring throughout the session lifecycle
- Cleanup is handled correctly in the `finally` block
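
As a rough illustration of how these points fit together, the sketch below wires an inactivity check into an abortable run and clears the timer in `finally`. The function name, its parameters and the use of `AbortController` are assumptions chosen for the example; the PR's actual integration with its abort and timeout infrastructure may look different.

```typescript
// Hypothetical sketch: names, parameters, and the AbortController wiring are
// assumptions for illustration only.
type StreamRun = (
  signal: AbortSignal,
  onStreamActivity: () => void,
) => Promise<string>;

async function runWithInactivityAbort(
  run: StreamRun,
  isStreaming: () => boolean, // stands in for activeSession.isStreaming
  thresholdMs: number = 90_000,
  recheckMs: number = 30_000,
): Promise<string> {
  const controller = new AbortController();
  let lastActivityAt = Date.now();
  let timer: ReturnType<typeof setTimeout> | undefined;

  // Recursive timer: each check schedules the next one unless it aborts.
  const arm = (): void => {
    timer = setTimeout(() => {
      const idleMs = Date.now() - lastActivityAt;
      // Guard on the streaming state so an idle session is not aborted.
      if (idleMs >= thresholdMs && isStreaming()) {
        controller.abort();
        return;
      }
      arm();
    }, recheckMs);
  };

  arm();
  try {
    return await run(controller.signal, () => {
      lastActivityAt = Date.now(); // every stream event resets the idle clock
    });
  } finally {
    // Runs on success, failure, and abort, so no timer outlives the run.
    if (timer !== undefined) clearTimeout(timer);
  }
}
```

Because the cleanup sits in `finally`, the pending check is cancelled regardless of how the run ends, which is the property the review calls out.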
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge with minor risk: it adds defensive timeout logic with proper cleanup.
- The implementation is well-structured, with proper timer cleanup and integration with existing abort mechanisms. The watchdog logic is sound: it checks both the inactivity duration and the streaming state before aborting, and the recursive timer pattern ensures continuous monitoring. User notifications include the retry count. One minor concern is the lack of tests for the new streaming inactivity logic, though the approach follows existing timeout-handling patterns in the codebase.
- No files require special attention; the changes are well-contained and follow existing patterns.
<sub>Last reviewed commit: 2efa1fb</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #17265: fix: abort streaming runs after 90s of inactivity (by jg-noncelogic · 2026-02-15 · 90.5%)
- #23720: Feat/cli backend runtime tuning (by wanmorebot · 2026-02-22 · 75.8%)
- #22454: fix(macos): add re-subscribe loop to gateway stream subscribers (by mandofever78 · 2026-02-21 · 74.5%)
- #19673: fix(telegram): avoid starting streaming replies with only 1-2 words (by emanuelst · 2026-02-18 · 73.2%)
- #11688: feat(telegram): add health check watchdog for long-polling (by rmfalco89 · 2026-02-08 · 73.1%)
- #22367: fix(whatsapp): prevent permanent listener loss after abort during r... (by mcinteerj · 2026-02-21 · 72.6%)
- #6302: fix: Add timeouts to prevent indefinite hangs (issues #4954, #4956,... (by batumilove · 2026-02-01 · 72.2%)
- #7247: fix(telegram): abort stale getUpdates connections after long-poll t... (by JanderV · 2026-02-02 · 72.1%)
- #19648: fix: suppress silent-reply partial tokens during streaming (by bradleypriest · 2026-02-18 · 72.0%)
- #23621: fix(LINE): keep startAccount promise alive to prevent auto-restart ... (by ttakanawa · 2026-02-22 · 72.0%)