#10283: fix(agents): close TOCTOU race in session write lock acquisition
agents
stale
Cluster:
Session Lock Improvements
## Problem
`acquireSessionWriteLock` has a TOCTOU (Time-of-Check to Time-of-Use) race condition when taking over stale locks:
```typescript
if (stale || !alive) {
await fs.rm(lockPath, { force: true }); // Step 1: Delete stale lock
continue; // Step 2: Loop back to create new lock
}
// RACE WINDOW: Another process can create the lock between steps 1 and 2
```
This can cause two processes to both believe they hold the lock, leading to data corruption in session files.
## Scenario
1. Process A detects stale lock, deletes it
2. Process B (running concurrently) also detects stale lock, deletes it
3. Process A creates new lock file
4. Process B creates new lock file (overwrites A's lock!)
5. Both processes think they have exclusive access
## Solution
Use atomic rename instead of delete+create:
1. Create a temporary lock file with unique name (`{lockPath}.{pid}.{timestamp}`)
2. Use `fs.rename()` to atomically replace the stale lock
3. If rename fails (another process won), clean up temp file and retry
`fs.rename()` is atomic on POSIX systems, closing the race window.
## Impact
- Prevents potential data corruption in concurrent session access
- No behavioral change for single-process usage
- Minimal performance impact (one extra file operation only when taking over stale locks)
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
- Updates `acquireSessionWriteLock` to attempt “atomic” stale-lock takeover by creating a temp lock file and renaming it over the existing lock.
- Adds temp-file cleanup and retries when takeover fails.
- Keeps the existing in-process reentrancy counter via `HELD_LOCKS` and existing release behavior.
- Main concern: the new rename-based takeover does not guarantee exclusivity on POSIX and the post-takeover `r+` open no longer represents an exclusive lock, so concurrent writers can still occur.
<h3>Confidence Score: 2/5</h3>
- This PR is not safe to merge as-is due to a still-present race that can allow dual lock owners during stale-lock takeover.
- The intended TOCTOU fix relies on `rename()` overwriting the destination; on POSIX this can let multiple contenders succeed (last writer wins), and the stored `r+` handle after takeover doesn’t provide exclusivity. That means the core correctness guarantee (single writer) is not restored under contention.
- src/agents/session-write-lock.ts
<!-- greptile_other_comments_section -->
<sub>(5/5) You can turn off certain types of comments like style [here](https://app.greptile.com/review/github)!</sub>
<!-- /greptile_comment -->
Most Similar PRs
#15628: fix: resolve session write lock race condition
by 1kuna · 2026-02-13
84.6%
#4044: fix: release session locks on SIGUSR1 restart + instance nonce for ...
by seanb4t · 2026-01-29
79.2%
#5014: fix(agents): detect PID reuse in session write lock
by shayan919293 · 2026-01-30
77.7%
#15882: fix: move session entry computation inside store lock to prevent ra...
by cloorus · 2026-02-14
77.4%
#21828: fix: acquire session write lock in delivery mirror and gateway chat...
by inkolin · 2026-02-20
77.4%
#20431: fix(sessions): add session contamination guards and self-leak lock ...
by marcomarandiz · 2026-02-18
74.5%
#4664: fix: per-session metadata files to eliminate lock contention
by tsukhani · 2026-01-30
74.4%
#10725: fix: re-read session store inside lock in updateSessionStoreEntry
by Yida-Dev · 2026-02-06
73.5%
#10259: fix(sessions): clean up orphaned .jsonl.lock files on startup (#10170)
by nu-gui · 2026-02-06
73.2%
#11873: fix: eliminate TOCTOU race in readExecApprovalsSnapshot
by Yida-Dev · 2026-02-08
72.6%