
#5014: fix(agents): detect PID reuse in session write lock

by shayan919293 open 2026-01-30 23:20
agents
## Summary

Fixes the "session file locked (timeout 10000ms)" errors that occur after container rebuilds when PIDs get reused by different processes.

## Problem

After a container rebuild, PIDs get reused. A stale `.lock` file from a crashed session may reference a PID that now belongs to a completely different process (e.g., the dashboard server).

`acquireSessionWriteLock` used `process.kill(pid, 0)` to check if the lock holder is alive, but this only checks if *any* process with that PID exists, not whether it's the original lock holder.

**Result:** The gateway would see a "live" PID and refuse to break the lock, requiring manual `.lock` file removal.

## Solution

Store the process command name (`comm`) in the lock payload and validate it when checking if the lock holder is still alive. On Linux, we read `/proc/<pid>/comm` to verify the running process matches the expected command. If they don't match, the lock is treated as stale and reclaimed.

## Changes

- **`src/agents/session-write-lock.ts`**:
  - Add `comm` field to `LockFilePayload` type
  - Add `getProcessComm()` to get current process command name (truncated to 15 chars to match Linux `/proc/pid/comm` format)
  - Add `getProcessCommForPid()` to read `/proc/<pid>/comm` on Linux
  - Update `isAlive()` to verify comm matches when available
  - Update lock creation to include `comm`
  - Update lock validation to pass `comm` to `isAlive()`
- **`src/agents/session-write-lock.test.ts`**:
  - Add test: "includes comm in lock payload"
  - Add test: "reclaims lock when PID is reused by a different process" (Linux-only)
  - Add tests for `getProcessComm()`, `isAlive()` helpers

## Platform Support

- **Linux**: Full support - reads `/proc/<pid>/comm` to verify process identity
- **macOS/Windows**: Graceful fallback - continues to use PID-only check (same behavior as before)

## Testing

- All 13 tests pass (8 original + 5 new)
- Type checking passes
- Lint passes

Closes #5006

<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>

This PR extends the session write-lock mechanism by writing the lock holder’s PID plus a `comm` identifier into the `.lock` payload, then using `/proc/<pid>/comm` (Linux) to detect PID reuse and reclaim locks after container rebuilds.

Core logic lives in `src/agents/session-write-lock.ts`, with accompanying new tests in `src/agents/session-write-lock.test.ts` validating the new payload field and the PID-reuse reclaim behavior on Linux.

<h3>Confidence Score: 3/5</h3>

- This PR is close to safe to merge, but the Linux `comm` matching likely causes false stale-lock reclaims in common Node invocation modes.
- The approach is reasonable, tests are added, and non-Linux behavior is unchanged, but `getProcessComm()` appears to record a script basename whereas `/proc/<pid>/comm` typically reports the executable/task name (often `node`). That mismatch would make `isAlive(pid, expectedComm)` return false for a live lock on Linux, which is a functional regression for locking correctness.
- src/agents/session-write-lock.ts (comm computation vs /proc semantics); src/agents/session-write-lock.test.ts (update tests once comm semantics are corrected)

<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
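For readers without the diff at hand, the check described above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the `LockFilePayload` fields other than `comm`, and the helpers `truncateComm`, `readCommForPid`, `currentComm`, and `pidExists`, are hypothetical names chosen for this sketch. `currentComm` reads `/proc/self/comm` so that the recorded value and the later lookup come from the same source, which is one possible way to avoid the basename-versus-`node` mismatch raised in the review; whether the PR adopts that approach is not stated here.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Linux exposes at most 15 characters of the task name in /proc/<pid>/comm.
const COMM_MAX_LENGTH = 15;

// Hypothetical payload shape: the PR adds `comm`; the other fields are assumed.
interface LockFilePayload {
  pid: number;
  createdAt: string;
  comm?: string;
}

function truncateComm(name: string): string {
  return name.slice(0, COMM_MAX_LENGTH);
}

// Returns the kernel-reported command name for `pid`, or undefined when it
// cannot be determined (non-Linux platform, no /proc, or the process is gone).
function readCommForPid(pid: number): string | undefined {
  if (process.platform !== "linux") return undefined;
  try {
    return fs.readFileSync(`/proc/${pid}/comm`, "utf8").replace(/\n$/, "");
  } catch {
    return undefined;
  }
}

// Command name to store in the lock payload. Reading /proc/self/comm keeps
// it consistent with what readCommForPid() will see later (assumption: this
// is a sketch of one fix, not necessarily what the PR does).
function currentComm(): string {
  if (process.platform === "linux") {
    try {
      return fs.readFileSync("/proc/self/comm", "utf8").replace(/\n$/, "");
    } catch {
      // fall through to the executable basename below
    }
  }
  return truncateComm(path.basename(process.execPath));
}

// PID-only liveness check, equivalent to the pre-PR behavior.
function pidExists(pid: number): boolean {
  try {
    process.kill(pid, 0); // signal 0 probes for existence without signalling
    return true;
  } catch (err) {
    // EPERM means the process exists but we are not allowed to signal it.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// A lock holder counts as alive only if the PID exists and, where the
// command name can be read, it still matches the one recorded in the lock.
function isAlive(pid: number, expectedComm?: string): boolean {
  if (!pidExists(pid)) return false;
  if (!expectedComm) return true; // legacy lock without comm
  const actual = readCommForPid(pid);
  if (actual === undefined) return true; // cannot verify: PID-only fallback
  return actual === truncateComm(expectedComm);
}

// Build the payload a new lock file would carry.
function buildLockPayload(): LockFilePayload {
  return {
    pid: process.pid,
    createdAt: new Date().toISOString(),
    comm: currentComm(),
  };
}
```

With this shape, a lock whose payload was produced by `buildLockPayload()` in a live process passes `isAlive(payload.pid, payload.comm)`, while a lock whose PID has since been taken by a differently named process is reported as stale on Linux. On macOS and Windows the function degrades to the PID-only check, matching the fallback described under Platform Support.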
