#4044: fix: release session locks on SIGUSR1 restart + instance nonce for stale detection
gateway
agents
Cluster: Session Lock Improvements
## Summary
Fixes #4043: stale session write locks persist after a SIGUSR1 in-process restart in containers (where the gateway runs as PID 1 and the PID does not change), blocking session access for up to 30 minutes.
## Changes
Two-layer fix across 3 files:
### Layer 1: Explicit cleanup on shutdown (`server-close.ts`)
- Call `releaseAllSessionWriteLocks()` after `chatRunState.clear()` during server shutdown
- Wrapped in try-catch so cleanup failure cannot prevent shutdown
- This is the primary fix — locks are actively released before the new server iteration starts
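
A minimal sketch of the Layer 1 ordering, not the actual `server-close.ts` code. Only `chatRunState.clear()` and `releaseAllSessionWriteLocks()` are named in this PR; the `CloseDeps` bundle, the `warn` callback, the trace strings, and the assumption that the release helper is asynchronous are all hypothetical and exist only for illustration.

```typescript
// Hypothetical dependency bundle; only chatRunState.clear() and
// releaseAllSessionWriteLocks() are named in the PR description.
type CloseDeps = {
  clearChatRunState: () => void;
  releaseAllSessionWriteLocks: () => Promise<void>; // assumed to be async
  warn: (message: string) => void;
};

// Runs the two shutdown steps in the order the PR describes and returns
// a trace of what happened so the behaviour is easy to inspect.
async function closeServerSketch(deps: CloseDeps): Promise<string[]> {
  const trace: string[] = [];

  deps.clearChatRunState();
  trace.push("chat-run-state-cleared");

  try {
    await deps.releaseAllSessionWriteLocks();
    trace.push("locks-released");
  } catch (err) {
    // Cleanup is best-effort: log the failure and keep shutting down.
    deps.warn(`session lock cleanup failed: ${String(err)}`);
    trace.push("lock-release-failed");
  }

  trace.push("shutdown-continued");
  return trace;
}
```

The point of the `try`/`catch` is that a failing release (for example a filesystem error) is logged but never rethrown, so the remaining shutdown steps still run.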
### Layer 2: Instance nonce for stale detection (`session-write-lock.ts`)
- Add a per-instance nonce (`instanceNonce`) to lock file payloads: `{ pid, nonce, createdAt }`
- Nonce is rotated via `resetInstanceNonce()` at the end of `releaseAllSessionWriteLocks()`
- On lock acquisition, if the PID in the lock file matches the current process but the on-disk nonce differs from the current instance nonce, the lock is treated as stale and reclaimed immediately
- Defense-in-depth: catches any locks that survive the explicit cleanup (e.g. due to fs errors or race conditions during shutdown)
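
The stale check described above could look roughly like the following. This is a sketch under stated assumptions, not the code in `session-write-lock.ts`: the payload fields `{ pid, nonce, createdAt }` come from this PR, but the `isLockStale` helper, its input shape, the ISO-string format of `createdAt`, and the exact form of the pre-existing `pid + staleMs` fallback are hypothetical.

```typescript
// Payload fields are taken from the PR; nonce is optional because lock
// files written by older versions do not contain one.
type LockPayload = { pid: number; nonce?: string; createdAt: string };

// Hypothetical input bundle so the check stays a pure function.
type StaleCheckInput = {
  payload: LockPayload;
  currentPid: number;
  currentNonce: string;
  nowMs: number;
  staleMs: number;
  isPidAlive: (pid: number) => boolean;
};

function isLockStale(input: StaleCheckInput): boolean {
  const { payload, currentPid, currentNonce, nowMs, staleMs, isPidAlive } = input;

  // New check: same PID, but the nonce belongs to an earlier server iteration.
  if (
    payload.nonce !== undefined &&
    payload.pid === currentPid &&
    payload.nonce !== currentNonce
  ) {
    return true;
  }

  // Assumed form of the existing fallback: dead owner or lock too old.
  if (!isPidAlive(payload.pid)) {
    return true;
  }
  const createdMs = Date.parse(payload.createdAt); // assumes an ISO timestamp
  return Number.isNaN(createdMs) || nowMs - createdMs > staleMs;
}
```

A lock file without a nonce skips the first branch entirely, which is how the sketch models the backward-compatible path.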
### Key design decisions
- **Mutable nonce, not const**: ESM modules are cached for the process lifetime and are NOT re-evaluated on SIGUSR1 in-process restart. The nonce must be explicitly rotated.
- **Nonce rotated AFTER lock cleanup**: Prevents a race where a concurrent acquirer could see the old nonce as stale and reclaim a lock before cleanup finishes.
- **Backward compatible**: Lock files without a nonce (from older versions) fall through to the existing `pid + staleMs` checks unchanged.
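
To illustrate the first two decisions, here is a small sketch of a module-level mutable nonce that is rotated only after cleanup. The names `instanceNonce`, `resetInstanceNonce()`, and `releaseAllSessionWriteLocks()` come from this PR; everything else (the counter-based `newNonce` stand-in, the `heldLocks` registry of release callbacks, the `getInstanceNonce` accessor, and the synchronous signature used for brevity) is an assumption made for the example.

```typescript
// Stand-in nonce generator (hypothetical); real code would more likely
// use a cryptographically random value such as a UUID.
let nonceCounter = 0;
function newNonce(): string {
  nonceCounter += 1;
  return `nonce-${nonceCounter}`;
}

// Mutable on purpose: the module is evaluated once per process, so a
// const would keep the same value across in-process restarts.
let instanceNonce = newNonce();

function getInstanceNonce(): string {
  return instanceNonce;
}

function resetInstanceNonce(): void {
  instanceNonce = newNonce();
}

// Hypothetical registry: lock path -> callback that removes the lock file.
const heldLocks = new Map<string, () => void>();

function releaseAllSessionWriteLocks(): void {
  // Iterate over a snapshot so deletions cannot affect the loop.
  for (const [lockPath, release] of Array.from(heldLocks.entries())) {
    try {
      release();
    } catch {
      // Best-effort; a surviving lock is left for the nonce check to reclaim.
    }
    heldLocks.delete(lockPath);
  }
  // Rotate only after every held lock has been handled.
  resetInstanceNonce();
}
```

Because rotation is the last statement, any code that reads the nonce while locks are still being released sees the old value, so those locks do not yet look stale to a concurrent acquirer.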
## Testing
13 tests covering:
- Lock release on all termination signals (SIGINT, SIGTERM, SIGQUIT, SIGABRT)
- `releaseAllSessionWriteLocks()` removes all held locks
- Nonce mismatch reclaims stale locks from previous iterations
- Matching nonce preserves locks from current instance
- No-nonce fallback for backward compatibility
- Nonce rotation on `releaseAllSessionWriteLocks()`
All existing tests continue to pass.
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR addresses stale session write locks that can persist across in-process restarts (SIGUSR1), especially in container setups where the PID may remain 1. It does this by (1) explicitly releasing all held session write locks during gateway shutdown (`src/gateway/server-close.ts`) and (2) adding a per-process “instance nonce” to lock file payloads to detect and reclaim locks left behind by a prior server iteration even when the PID is unchanged (`src/agents/session-write-lock.ts`). Tests in `src/agents/session-write-lock.test.ts` cover nonce mismatch behavior, backward compatibility for nonce-less lock files, and the new “release all locks” behavior.
<h3>Confidence Score: 4/5</h3>
- Mostly safe to merge, with one correctness issue in the new lock-release helper that may leave some locks behind in multi-lock scenarios.
- The changes are small and well-tested, but `releaseAllSessionWriteLocks()` (and the existing sync variant) mutate `HELD_LOCKS` while iterating it, which can skip entries and undermine the primary shutdown cleanup path.
- src/agents/session-write-lock.ts
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #15628: fix: resolve session write lock race condition (by 1kuna, 2026-02-13, 82.3% similar)
- #5014: fix(agents): detect PID reuse in session write lock (by shayan919293, 2026-01-30, 82.2% similar)
- #10259: fix(sessions): clean up orphaned .jsonl.lock files on startup (#10170) (by nu-gui, 2026-02-06, 81.4% similar)
- #20431: fix(sessions): add session contamination guards and self-leak lock ... (by marcomarandiz, 2026-02-18, 80.3% similar)
- #10283: fix(agents): close TOCTOU race in session write lock acquisition (by programming-pupil, 2026-02-06, 79.2% similar)
- #4664: fix: per-session metadata files to eliminate lock contention (by tsukhani, 2026-01-30, 79.1% similar)
- #21828: fix: acquire session write lock in delivery mirror and gateway chat... (by inkolin, 2026-02-20, 78.8% similar)
- #15882: fix: move session entry computation inside store lock to prevent ra... (by cloorus, 2026-02-14, 77.7% similar)
- #13881: fix: Address Greptile feedback - test isolation and channel resolution (by trevorgordon981, 2026-02-11, 76.7% similar)
- #11273: fix(telegram): prevent status command crash during thinking (by avirweb, 2026-02-07, 75.3% similar)