
#16125: feat(gateway): add stuck session detection

by CyberSinister open 2026-02-14 09:20
gateway size: S
## Summary

Adds configurable stuck session detection to the gateway. Sessions that exceed a configurable time threshold are detected and handled according to the configured action.

## Problem

Long-running agent sessions can get stuck due to network issues, API hangs, or infinite loops. Currently there's no mechanism to detect or recover from these situations automatically.

## Solution

- Add `stuckDetection` config to gateway options
- Track `startedAt` timestamp on chat run entries
- Sweep every 60 seconds checking for sessions exceeding threshold
- Configurable threshold (default 5 minutes) and action (`log`, `notify`, or `abort`)
- Add `entries()` and `size()` methods to ChatRunRegistry for introspection

## Configuration

```json
{
  "gateway": {
    "stuckDetection": {
      "enabled": true,
      "thresholdMinutes": 5,
      "action": "abort"
    }
  }
}
```

## Changes

- `src/config/types.gateway.ts`: Add `GatewayStuckDetectionConfig` type
- `src/config/zod-schema.ts`: Add zod validation schema
- `src/gateway/server-chat.ts`: Track `startedAt`, add registry methods
- `src/gateway/server-maintenance.ts`: Stuck detection sweep logic
- `src/gateway/server.impl.ts`: Wire up config
- Updated test files for new interfaces

## Testing

Existing tests updated to accommodate new interface changes. Stuck detection sweep runs within the existing maintenance timer infrastructure.

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Adds configurable stuck session detection to the gateway, sweeping every 60 seconds and supporting `log`, `notify`, and `abort` actions. Also includes a correct bug fix in `chat-abort.ts` where `removeChatRun` was called with `runId` instead of `sessionId`.

- **Runtime crash**: `server.impl.ts:433` references `logGateway`, which is not defined in scope, so it will throw a `ReferenceError` the first time the stuck detection sweep triggers. Should be `log`.
- **Skipped entries during abort**: `server-maintenance.ts:131-187` iterates over live `Map` and `Array` iterators while the `abort` action mutates them via `removeChatRun`, causing entries to be skipped. Snapshot the iterables before the loop.
- Good bug fix in `chat-abort.ts`: corrects the first argument to `removeChatRun` from `runId` to `active.sessionId`.

<h3>Confidence Score: 2/5</h3>

- This PR has a definite runtime crash (`logGateway` ReferenceError) and a mutation-during-iteration bug that must be fixed before merging.
- Score of 2 reflects two confirmed bugs: (1) an undefined variable reference that will crash at runtime on any stuck detection log, and (2) a collection mutation during iteration that silently skips entries when aborting stuck runs. The overall design and remaining changes are sound.
- Files needing attention: `src/gateway/server.impl.ts` (undefined `logGateway` reference) and `src/gateway/server-maintenance.ts` (mutation during iteration in abort path)

<sub>Last reviewed commit: 7cf7f98</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
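The snapshot-before-loop fix suggested in the review can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the `ChatRunEntry` shape, the `add` method, the `sweepStuckRuns` function, and the signatures of `entries()`, `size()`, and `removeChatRun` are all assumptions made for the example, since the description names these members but does not show them.

```typescript
// Hypothetical shapes: the PR names `startedAt`, `entries()`, `size()`, and
// `removeChatRun`, but their real signatures are not shown in the description.
type StuckAction = "log" | "notify" | "abort";

interface ChatRunEntry {
  sessionId: string;
  runId: string;
  startedAt: number; // epoch milliseconds
}

class ChatRunRegistry {
  private runs = new Map<string, ChatRunEntry>();

  add(entry: ChatRunEntry): void {
    this.runs.set(entry.sessionId, entry);
  }

  removeChatRun(sessionId: string): boolean {
    return this.runs.delete(sessionId);
  }

  entries(): IterableIterator<[string, ChatRunEntry]> {
    return this.runs.entries();
  }

  size(): number {
    return this.runs.size;
  }
}

// Hypothetical sweep: returns the session ids that exceeded the threshold.
function sweepStuckRuns(
  registry: ChatRunRegistry,
  thresholdMinutes: number,
  action: StuckAction,
  now: number,
  log: (msg: string) => void,
): string[] {
  const thresholdMs = thresholdMinutes * 60 * 1000;
  // Copy the entries first so removals inside the loop act on the registry,
  // while the loop itself walks a fixed array.
  const snapshot = Array.from(registry.entries());
  const stuck: string[] = [];
  for (const [sessionId, entry] of snapshot) {
    if (now - entry.startedAt <= thresholdMs) continue;
    stuck.push(sessionId);
    log(`stuck session ${sessionId} (run ${entry.runId})`);
    if (action === "abort") {
      registry.removeChatRun(sessionId);
    }
  }
  return stuck;
}
```

Passing `now` and `log` as parameters keeps the sketch testable without timers or a real logger; in the gateway these would presumably come from the maintenance timer and the in-scope `log` that the review mentions.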
