#16125: feat(gateway): add stuck session detection
gateway
size: S
Cluster: Gateway and macOS Improvements
## Summary
Adds configurable stuck session detection to the gateway. Sessions that run longer than the configured time threshold are detected and handled according to the configured action (`log`, `notify`, or `abort`).
## Problem
Long-running agent sessions can get stuck due to network issues, API hangs, or infinite loops. Currently there's no mechanism to detect or recover from these situations automatically.
## Solution
- Add `stuckDetection` config to gateway options
- Track `startedAt` timestamp on chat run entries
- Sweep every 60 seconds for sessions that exceed the threshold
- Configurable threshold (default 5 minutes) and action (`log`, `notify`, or `abort`)
- Add `entries()` and `size()` methods to ChatRunRegistry for introspection
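
The sweep described in the list above can be sketched roughly as follows. This is an illustrative outline, not the code in `server-maintenance.ts`: the `ChatRunEntry` shape, the `SweepHandlers` record, and the function names `findStuckRuns` and `sweepStuckRuns` are assumptions. Only `startedAt`, the minute-based threshold, and the three actions come from this PR.

```typescript
// Hypothetical shapes for illustration; only `startedAt` and the three
// action names are taken from the PR description.
type StuckAction = "log" | "notify" | "abort";

interface ChatRunEntry {
  sessionId: string;
  runId: string;
  startedAt: number; // epoch milliseconds, recorded when the run starts
}

// One handler per configured action (assumed structure).
type SweepHandlers = Record<StuckAction, (entry: ChatRunEntry, ageMs: number) => void>;

// Collect every run older than the threshold into a fresh array, so the
// handlers can safely modify the underlying registry afterwards.
function findStuckRuns(
  entries: Iterable<ChatRunEntry>,
  thresholdMs: number,
  now: number,
): ChatRunEntry[] {
  const stuck: ChatRunEntry[] = [];
  for (const entry of entries) {
    if (now - entry.startedAt > thresholdMs) {
      stuck.push(entry);
    }
  }
  return stuck;
}

// Run one sweep and return how many stuck runs were handled.
function sweepStuckRuns(
  entries: Iterable<ChatRunEntry>,
  action: StuckAction,
  thresholdMinutes: number,
  handlers: SweepHandlers,
  now: number = Date.now(),
): number {
  const thresholdMs = thresholdMinutes * 60 * 1000;
  const stuck = findStuckRuns(entries, thresholdMs, now);
  for (const entry of stuck) {
    handlers[action](entry, now - entry.startedAt);
  }
  return stuck.length;
}
```

In a gateway, a function like `sweepStuckRuns` would be called from the 60-second maintenance timer with the registry's `entries()` as input. Passing `now` explicitly keeps the sketch deterministic under test.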
## Configuration
```json
{
  "gateway": {
    "stuckDetection": {
      "enabled": true,
      "thresholdMinutes": 5,
      "action": "abort"
    }
  }
}
```
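
For readers who want to see how this config might map to types, here is a minimal sketch. The type name `GatewayStuckDetectionConfig` and the three field names come from this PR. Everything else is an assumption made for illustration: that the fields are optional, the `resolveStuckDetection` helper, and the defaults for `enabled` (`false`) and `action` (`"log"`). Only the 5-minute threshold default is stated above. The actual validation in this PR uses a zod schema, which is not reproduced here.

```typescript
type StuckDetectionAction = "log" | "notify" | "abort";

// Field names mirror the JSON example; optionality is an assumption.
interface GatewayStuckDetectionConfig {
  enabled?: boolean;
  thresholdMinutes?: number;
  action?: StuckDetectionAction;
}

interface ResolvedStuckDetection {
  enabled: boolean;
  thresholdMs: number;
  action: StuckDetectionAction;
}

// Hypothetical helper: applies defaults and converts minutes to milliseconds.
// The 5-minute default is from the PR; the other two defaults are assumed.
function resolveStuckDetection(
  cfg: GatewayStuckDetectionConfig = {},
): ResolvedStuckDetection {
  const thresholdMinutes = cfg.thresholdMinutes ?? 5;
  if (!Number.isFinite(thresholdMinutes) || thresholdMinutes <= 0) {
    throw new RangeError("thresholdMinutes must be a positive number");
  }
  return {
    enabled: cfg.enabled ?? false,
    thresholdMs: thresholdMinutes * 60 * 1000,
    action: cfg.action ?? "log",
  };
}
```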
## Changes
- `src/config/types.gateway.ts` — Add `GatewayStuckDetectionConfig` type
- `src/config/zod-schema.ts` — Add zod validation schema
- `src/gateway/server-chat.ts` — Track `startedAt`, add registry methods
- `src/gateway/server-maintenance.ts` — Stuck detection sweep logic
- `src/gateway/server.impl.ts` — Wire up config
- Updated test files for new interfaces
## Testing
Existing tests were updated for the new interfaces. The stuck detection sweep runs within the existing maintenance timer infrastructure.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds configurable stuck session detection to the gateway, sweeping every 60 seconds and supporting `log`, `notify`, and `abort` actions. Also includes a correct bug fix in `chat-abort.ts` where `removeChatRun` was called with `runId` instead of `sessionId`.
- **Runtime crash**: `server.impl.ts:433` references `logGateway` which is not defined in scope — will throw a `ReferenceError` the first time the stuck detection sweep triggers. Should be `log`.
- **Skipped entries during abort**: `server-maintenance.ts:131-187` iterates over live `Map` and `Array` iterators while the `abort` action mutates them via `removeChatRun`, causing entries to be skipped. Snapshot the iterables before the loop.
- Good bug fix in `chat-abort.ts`: corrects the first argument to `removeChatRun` from `runId` to `active.sessionId`.
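
The array case of the skipped-entries issue can be reproduced in isolation. The sketch below is not the code in `server-maintenance.ts`. `RunEntry`, `removeRun`, `abortAllLive`, and `abortAllSnapshot` are hypothetical names, and `removeRun` only assumes that removal splices the entry out of the same array being iterated.

```typescript
interface RunEntry {
  runId: string;
}

// Hypothetical stand-in for a registry removal that splices in place.
function removeRun(runs: RunEntry[], runId: string): void {
  const index = runs.findIndex((r) => r.runId === runId);
  if (index !== -1) {
    runs.splice(index, 1);
  }
}

// Buggy variant: iterates the live array while removing from it.
// After each splice the next element shifts into the current index,
// so the iterator steps over it.
function abortAllLive(runs: RunEntry[]): string[] {
  const aborted: string[] = [];
  for (const run of runs) {
    aborted.push(run.runId);
    removeRun(runs, run.runId);
  }
  return aborted;
}

// Fixed variant: iterate a shallow copy, mutate the original.
function abortAllSnapshot(runs: RunEntry[]): string[] {
  const aborted: string[] = [];
  for (const run of runs.slice()) {
    aborted.push(run.runId);
    removeRun(runs, run.runId);
  }
  return aborted;
}

// abortAllLive on [r1, r2, r3] returns ["r1", "r3"] and leaves r2 behind.
// abortAllSnapshot on [r1, r2, r3] returns ["r1", "r2", "r3"].
```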
<h3>Confidence Score: 2/5</h3>
- This PR has a definite runtime crash (`logGateway` ReferenceError) and a mutation-during-iteration bug that must be fixed before merging.
- Score of 2 reflects two confirmed bugs: (1) an undefined variable reference that will crash at runtime on any stuck detection log, and (2) a collection mutation during iteration that silently skips entries when aborting stuck runs. The overall design and remaining changes are sound.
- Files needing attention: `src/gateway/server.impl.ts` (undefined `logGateway` reference) and `src/gateway/server-maintenance.ts` (mutation during iteration in the abort path).
<sub>Last reviewed commit: 7cf7f98</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #12234: gateway: incident tracking, recover command, and ciao ERR_SERVER_CL... (by levineam · 2026-02-09 · 76.1%)
- #10273: fix(agents): detect and auto-compact mid-run context overflow (by terryops · 2026-02-06 · 75.1%)
- #21944: feat(gateway): crash-loop protection with escalating backoff (by Protocol-zero-0 · 2026-02-20 · 74.8%)
- #8713: feat: gateway memory monitor, install linger, docs and failover (by quratus · 2026-02-04 · 74.6%)
- #20394: feat(gateway): make chat history byte limit configurable via gatewa... (by mgratch · 2026-02-18 · 74.2%)
- #14811: feat(gateway): route chat/agent events per-connection instead of glob… (by jiangjin11 · 2026-02-12 · 74.0%)
- #15762: fix(discord): add circuit breaker for WebSocket resume loop (by funmerlin · 2026-02-13 · 73.9%)
- #20431: fix(sessions): add session contamination guards and self-leak lock ... (by marcomarandiz · 2026-02-18 · 73.8%)
- #16330: fix(gateway): preserve conversation history on gateway restart (by openperf · 2026-02-14 · 73.8%)
- #12953: fix: defer gateway restart until all replies are sent (by zoskebutler · 2026-02-10 · 72.8%)