#5014: fix(agents): detect PID reuse in session write lock
agents
Cluster: Session Lock Improvements
## Summary
Fixes the "session file locked (timeout 10000ms)" errors that occur after container rebuilds when PIDs get reused by different processes.
## Problem
After a container rebuild, PIDs get reused. A stale `.lock` file from a crashed session may reference a PID that now belongs to a completely different process (e.g., the dashboard server).
`acquireSessionWriteLock` used `process.kill(pid, 0)` to check whether the lock holder is alive, but this only checks whether *any* process with that PID exists, not whether it is the original lock holder.
**Result:** The gateway would see a "live" PID and refuse to break the lock, requiring manual `.lock` file removal.
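
To make the failure mode concrete, here is a minimal sketch of a PID-only liveness check. The helper name `pidExists` is hypothetical and is not taken from the PR; it only illustrates why signal 0 cannot distinguish the original lock holder from an unrelated process that inherited the same PID.

```typescript
// Hypothetical helper (not the PR's code): PID-only liveness check.
function pidExists(pid: number): boolean {
  try {
    // Signal 0 delivers no signal; it only performs the existence and
    // permission checks for the target PID.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means a process with this PID exists but we may not signal it.
    // Any other error (typically ESRCH) is treated as "no such process".
    return (err as { code?: string }).code === "EPERM";
  }
}

// True for any live PID, including one now owned by an unrelated process.
console.log(pidExists(process.pid));
```

Because the check reports only that the PID is occupied, a stale lock whose PID was recycled looks identical to a lock held by a live session.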
## Solution
Store the process command name (`comm`) in the lock payload and validate it when checking if the lock holder is still alive. On Linux, we read `/proc/<pid>/comm` to verify the running process matches the expected command. If they don't match, the lock is treated as stale and reclaimed.
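
The check described above could look roughly like the following sketch. The names `readComm` and `holderLooksAlive` are assumptions made for illustration and may not match the functions in `src/agents/session-write-lock.ts`; the fallback to a PID-only answer when `/proc` cannot be read is also an assumption about the intended behavior.

```typescript
import { readFileSync } from "node:fs";

// Hypothetical helper: read the kernel's command name for a PID on Linux.
function readComm(pid: number): string | null {
  if (process.platform !== "linux") return null;
  try {
    // /proc/<pid>/comm ends with a newline; strip it before comparing.
    return readFileSync(`/proc/${pid}/comm`, "utf8").replace(/\n$/, "");
  } catch {
    return null; // process exited or /proc is unavailable
  }
}

// Hypothetical helper: PID check plus an optional comm comparison.
function holderLooksAlive(pid: number, expectedComm?: string): boolean {
  try {
    process.kill(pid, 0);
  } catch (err) {
    if ((err as { code?: string }).code !== "EPERM") return false;
  }
  if (!expectedComm) return true; // older lock files without comm
  const actual = readComm(pid);
  if (actual === null) return true; // cannot verify; keep PID-only result
  return actual === expectedComm; // mismatch suggests the PID was reused
}
```

With this shape, a lock whose recorded `comm` differs from the name currently reported for that PID is treated as stale, while locks that cannot be verified keep the previous behavior.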
## Changes
- **`src/agents/session-write-lock.ts`**:
  - Add `comm` field to `LockFilePayload` type
  - Add `getProcessComm()` to get current process command name (truncated to 15 chars to match Linux `/proc/pid/comm` format)
  - Add `getProcessCommForPid()` to read `/proc/<pid>/comm` on Linux
  - Update `isAlive()` to verify comm matches when available
  - Update lock creation to include `comm`
  - Update lock validation to pass `comm` to `isAlive()`
- **`src/agents/session-write-lock.test.ts`**:
  - Add test: "includes comm in lock payload"
  - Add test: "reclaims lock when PID is reused by a different process" (Linux-only)
  - Add tests for `getProcessComm()`, `isAlive()` helpers
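
As a rough illustration of the payload change, the sketch below builds a lock payload with a truncated `comm`. The type `LockPayloadSketch` and the helpers `truncateComm` and `buildLockPayload` are hypothetical; the real `LockFilePayload` may carry additional fields that are not shown in this description.

```typescript
// Hypothetical payload shape: only pid and comm are taken from the description.
interface LockPayloadSketch {
  pid: number;
  comm?: string;
}

// Linux stores at most 15 bytes of command name in /proc/<pid>/comm.
// This sketch truncates by characters, which assumes an ASCII name.
function truncateComm(name: string): string {
  return name.slice(0, 15);
}

function buildLockPayload(pid: number, name: string): LockPayloadSketch {
  return { pid, comm: truncateComm(name) };
}

const payload = buildLockPayload(process.pid, "a-very-long-process-name");
console.log(JSON.stringify(payload));
```

Truncating at write time keeps the stored value directly comparable with what the kernel reports later.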
## Platform Support
- **Linux**: Full support - reads `/proc/<pid>/comm` to verify process identity
- **macOS/Windows**: Graceful fallback - continues to use PID-only check (same behavior as before)
## Testing
- All 13 tests pass (8 original + 5 new)
- Type checking passes
- Lint passes
Closes #5006
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR extends the session write-lock mechanism by writing the lock holder’s PID plus a `comm` identifier into the `.lock` payload, then using `/proc/<pid>/comm` (Linux) to detect PID reuse and reclaim locks after container rebuilds.
Core logic lives in `src/agents/session-write-lock.ts`, with accompanying new tests in `src/agents/session-write-lock.test.ts` validating the new payload field and the PID-reuse reclaim behavior on Linux.
<h3>Confidence Score: 3/5</h3>
- This PR is close to safe to merge, but the Linux `comm` matching likely causes false stale-lock reclaims in common Node invocation modes.
- The approach is reasonable, tests are added, and non-Linux behavior is unchanged, but `getProcessComm()` appears to record a script basename whereas `/proc/<pid>/comm` typically reports the executable/task name (often `node`). That mismatch would make `isAlive(pid, expectedComm)` return false for a live lock on Linux, which is a functional regression for locking correctness.
- src/agents/session-write-lock.ts (comm computation vs /proc semantics); src/agents/session-write-lock.test.ts (update tests once comm semantics are corrected)
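
The concern above can be illustrated with a small sketch. It assumes, as the review suggests but this page does not confirm, that the recorded value might be a script basename while the kernel reports the executable or task name. The function `wouldReclaim` and the sample names are hypothetical.

```typescript
// Hypothetical decision function: reclaim only when both values are known
// and differ. A null kernel value means the name could not be read.
function wouldReclaim(recordedComm: string, kernelComm: string | null): boolean {
  if (kernelComm === null) return false;
  return recordedComm !== kernelComm;
}

// If the lock recorded a script basename but the kernel reports the
// executable name, a live holder would be judged stale.
console.log(wouldReclaim("gateway.js", "node")); // true: false stale-lock reclaim
console.log(wouldReclaim("node", "node")); // false: lock is respected
```

Recording the same value that `/proc/<pid>/comm` reports for the writer (for example, by reading `/proc/self/comm` at lock creation) would avoid this kind of mismatch.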
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->