
#4044: fix: release session locks on SIGUSR1 restart + instance nonce for stale detection

by seanb4t open 2026-01-29 15:45
gateway agents
## Summary

Fixes #4043: stale session write locks persist after a SIGUSR1 in-process restart in containers (PID 1), blocking session access for up to 30 minutes.

## Changes

Two-layer fix across 3 files:

### Layer 1: Explicit cleanup on shutdown (`server-close.ts`)

- Call `releaseAllSessionWriteLocks()` after `chatRunState.clear()` during server shutdown
- Wrapped in try-catch so a cleanup failure cannot prevent shutdown
- This is the primary fix: locks are actively released before the new server iteration starts

### Layer 2: Instance nonce for stale detection (`session-write-lock.ts`)

- Add a per-instance nonce (`instanceNonce`) to lock file payloads: `{ pid, nonce, createdAt }`
- Nonce is rotated via `resetInstanceNonce()` at the end of `releaseAllSessionWriteLocks()`
- On lock acquisition, if the on-disk nonce differs from the current instance nonce and the PID matches, the lock is treated as stale and immediately reclaimed
- Defense-in-depth: catches any locks that survive the explicit cleanup (e.g. due to fs errors or race conditions during shutdown)

### Key design decisions

- **Mutable nonce, not const**: ESM modules are cached for the process lifetime and are NOT re-evaluated on SIGUSR1 in-process restart. The nonce must be explicitly rotated.
- **Nonce rotated AFTER lock cleanup**: Prevents a race where a concurrent acquirer could see the old nonce as stale and reclaim a lock before cleanup finishes.
- **Backward compatible**: Lock files without a nonce (from older versions) fall through to the existing `pid + staleMs` checks unchanged.

## Testing

13 tests covering:

- Lock release on all termination signals (SIGINT, SIGTERM, SIGQUIT, SIGABRT)
- `releaseAllSessionWriteLocks()` removes all held locks
- Nonce mismatch reclaims stale locks from previous iterations
- Matching nonce preserves locks from current instance
- No-nonce fallback for backward compatibility
- Nonce rotation on `releaseAllSessionWriteLocks()`

All existing tests continue to pass.
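The Layer 2 decision described above can be sketched roughly as follows. This is an illustration, not the code in this PR. Only `instanceNonce`, `resetInstanceNonce()` and the `{ pid, nonce, createdAt }` payload come from the description; `LockPayload`, `LockVerdict`, `newNonce` and `classifyLock` are hypothetical names. The counter-based nonce generator is a stand-in for whatever the real module uses, and treating a lock with a different PID as a fall-through case is an assumption.

```typescript
// Hypothetical sketch of the nonce-based stale check (not the PR's actual code).

interface LockPayload {
  pid: number;
  nonce?: string; // absent in lock files written by older versions
  createdAt: number; // assumed to be epoch milliseconds
}

// "fallback" means: defer to the existing pid + staleMs checks.
type LockVerdict = "stale" | "live" | "fallback";

// Stand-in generator; the counter guarantees uniqueness within one process.
let nonceCounter = 0;
function newNonce(): string {
  nonceCounter += 1;
  return `${nonceCounter}-${Math.random().toString(36).slice(2)}`;
}

// Mutable on purpose: the module is not re-evaluated on an in-process restart,
// so a const would keep its old value across server iterations.
let instanceNonce = newNonce();

// Meant to be called at the END of lock cleanup, after all locks are released.
function resetInstanceNonce(): void {
  instanceNonce = newNonce();
}

function classifyLock(payload: LockPayload, currentPid: number): LockVerdict {
  // Legacy lock file without a nonce: behaviour is unchanged.
  if (payload.nonce === undefined) return "fallback";
  // Assumption: a lock owned by another PID is left to the existing checks.
  if (payload.pid !== currentPid) return "fallback";
  // Same PID, different nonce: left behind by a previous server iteration.
  return payload.nonce === instanceNonce ? "live" : "stale";
}
```

Under these assumptions, a lock written before `resetInstanceNonce()` is classified as `"live"` until the rotation and `"stale"` afterwards, which mirrors the "nonce mismatch" and "matching nonce" test cases listed under Testing.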
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR addresses stale session write locks that can persist across in-process restarts (SIGUSR1), especially in container setups where the PID may remain 1. It does this by (1) explicitly releasing all held session write locks during gateway shutdown (`src/gateway/server-close.ts`) and (2) adding a per-process “instance nonce” to lock file payloads to detect and reclaim locks left behind by a prior server iteration even when the PID is unchanged (`src/agents/session-write-lock.ts`). Tests in `src/agents/session-write-lock.test.ts` cover nonce mismatch behavior, backward compatibility for nonce-less lock files, and the new “release all locks” behavior.

<h3>Confidence Score: 4/5</h3>

- Mostly safe to merge, with one correctness issue in the new lock-release helper that may leave some locks behind in multi-lock scenarios.
- The changes are small and well-tested, but `releaseAllSessionWriteLocks()` (and the existing sync variant) mutate `HELD_LOCKS` while iterating it, which can skip entries and undermine the primary shutdown cleanup path.
- src/agents/session-write-lock.ts

<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
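One common way to address the review's concern about mutating `HELD_LOCKS` during iteration is to loop over a snapshot of the entries. The sketch below assumes `HELD_LOCKS` is a `Map` keyed by session file path and that each entry exposes a synchronous `release()` callback. The `HeldLock` interface and the `releaseAllHeldLocksSync` name are hypothetical, and the real module may be structured differently.

```typescript
// Hypothetical sketch: release every held lock while iterating over a snapshot.

interface HeldLock {
  // Assumed to close the file handle and remove the lock file; may throw.
  release: () => void;
}

// Assumed shape: session file path -> held lock.
const HELD_LOCKS = new Map<string, HeldLock>();

// Returns the keys whose release threw, so the caller can log them.
function releaseAllHeldLocksSync(): string[] {
  // Copy the entries first so deletions below cannot affect the loop.
  const snapshot = Array.from(HELD_LOCKS.entries());
  const failed: string[] = [];
  for (const [sessionFile, lock] of snapshot) {
    try {
      lock.release();
    } catch {
      // Best effort: one failing lock must not stop the remaining releases.
      failed.push(sessionFile);
    } finally {
      // Forget the entry either way; this is a design choice of the sketch.
      HELD_LOCKS.delete(sessionFile);
    }
  }
  return failed;
}
```

Dropping the map entry even when `release()` throws keeps shutdown moving. Any lock file that then remains on disk would rely on the nonce check described in the PR to be reclaimed by the next server iteration.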
