#20431: fix(sessions): add session contamination guards and self-leak lock detection
commands
agents
size: S
Cluster:
Session Lock Improvements
## Summary
- Add `isContaminatedSessionId()` and `isContaminatedSessionFile()` guards to detect and recover from corrupted session store entries
- Add self-leak detection in `acquireSessionWriteLock()` so a long-running process can break its own orphaned locks immediately instead of waiting for the 30-minute stale timeout
- Apply guards in both `auto-reply/reply/session.ts` and `commands/agent/session.ts` (the two session resolution entry points)
## Background: how this was discovered
A production gateway running `openclaw-gateway` (PID 74269) started failing all agent activity with:
```
⚠️ Agent failed before reply: session file locked (timeout 10000ms): pid=74269
~/.openclaw/agents/voice/sessions/remediation:t2:414ee52d.jsonl.lock
```
Both haiku and sonnet models failed identically — the lock blocked ALL agent activity on the main session. The gateway was deadlocking against itself.
## Root cause: session store contamination
Tracing the lock contention revealed a **lane/lock mismatch**. The command lane was `session:agent:voice:main` (the main user-facing session), but the lock was on `remediation:t2:414ee52d.jsonl` — a completely different session's transcript file.
Inspecting `sessions.json` showed the source of the mismatch:
```json
{
"agent:voice:main": {
"sessionId": "boot-2026-02-18_21-46-54-108-802b16db",
"sessionFile": "remediation:t2:414ee52d.jsonl"
}
}
```
The `sessionFile` field pointed to a **different session's transcript**. When the main session resolved its file path via `resolveSessionFilePath()`, it used the stored `sessionFile` directly (since the field was already set), acquiring a lock on the wrong `.jsonl` file. When a remediation run legitimately held that same lock, the main session timed out after 10s.
This is a variant of session contamination (related to ADR-0037) that the existing code did not guard against. Two contamination vectors exist:
1. **sessionId contamination**: A delegate/subagent spawn ID (e.g. `delegate-1739...`) leaks into a non-subagent session key's store entry
2. **sessionFile contamination** (new): The sessionId looks normal but the sessionFile points to a completely different session's transcript
The `boot.ts` startup sequence (snapshot → run BOOT.md → restore) preserved the contaminated `sessionFile` because the snapshot/restore only checks for entry presence, not field coherence.
## Cascading failures from sessionFile contamination
When a session entry has a contaminated `sessionFile`:
1. **Wrong lock target**: `acquireSessionWriteLock()` locks the wrong `.jsonl` file, causing cross-session lock contention
2. **Transcript corruption**: Agent turns append to the wrong session's transcript, mixing conversation history across unrelated sessions
3. **Self-deadlock**: If `release()` previously failed after `HELD_LOCKS.delete()` (e.g. `handle.close()` threw), the lock file stays on disk with the gateway's own PID. On the next turn, the gateway sees its own PID as alive and waits forever — stale detection can't help because the PID IS alive
4. **Total agent lockout**: All agent activity on the affected session key blocks until the 30-minute stale timeout expires or the gateway restarts
## Changes
### 1. `session-key-utils.ts` — contamination detection functions
- `SPAWN_SESSION_ID_PATTERN`: Regex matching delegate/routed/subagent spawn IDs
- `isContaminatedSessionId(sessionKey, sessionId)`: Returns true if sessionId matches a spawn pattern but sessionKey is NOT a subagent key
- `isContaminatedSessionFile(sessionId, sessionFile)`: Returns true if sessionFile's basename doesn't contain the sessionId — indicating the file path belongs to a different session
### 2. `auto-reply/reply/session.ts` — gateway message path guard
Adds both contamination checks after freshness evaluation in `initSessionState()`. On detection:
- Logs a warning with the contaminated values
- Clears the `sessionFile` field so `resolveSessionFilePath()` regenerates it from the correct sessionId
- For sessionId contamination: forces a new session (new UUID)
### 3. `commands/agent/session.ts` — CLI/agent command path guard
Same contamination checks in `resolveSession()`, covering the `agentCommand()` entry point used by boot runs, cron jobs, and CLI invocations.
### 4. `session-write-lock.ts` — self-leak detection
In the acquisition polling loop, detects when a lock file was written by the current process (`pid === process.pid`) but isn't tracked in `HELD_LOCKS`. This indicates a leaked lock from a previous agent run whose `release()` failed. Breaks the lock immediately instead of waiting for stale timeout.
This complements the existing try/catch fix in `release()` (merged in #18096) — even with bulletproof release, a crash or unhandled rejection during release could still leave an orphaned lock.
## Related
- #7222 — reports the same class of self-deadlock bug
- #18096 — merged fix for release() error handling (our self-leak detection is defense-in-depth on top of this)
- #15628 — open PR for release race condition (overlapping scope)
## Test plan
- [ ] Verify `pnpm exec tsc` passes (only pre-existing errors in test files, none from these changes)
- [ ] Gateway starts cleanly after deploy
- [ ] No `session file locked` errors during normal agent operation
- [ ] Contamination warning logs appear correctly when injecting a bad `sessionFile` into sessions.json
- [ ] Self-leaked lock detection triggers correctly when a `.lock` file with the current PID exists but isn't in HELD_LOCKS
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Added session contamination guards and self-leak lock detection to prevent deadlocks from corrupted session store entries.
**Key changes:**
- Introduced `isContaminatedSessionId()` and `isContaminatedSessionFile()` guards in `session-key-utils.ts` to detect mismatched session entries
- Applied contamination checks in both gateway message path (`auto-reply/reply/session.ts`) and CLI/agent command path (`commands/agent/session.ts`)
- Added self-leak detection in lock acquisition to break orphaned locks from the same process immediately
- Prevents cross-session lock contention and transcript corruption from sessionFile contamination
**Root cause addressed:**
Production gateway experienced self-deadlocks when a session entry's `sessionFile` pointed to a different session's transcript. The main session would lock the wrong file, blocking legitimate operations on that transcript. The contamination guards clear corrupted `sessionFile` entries, forcing regeneration from the correct `sessionId`.
The self-leak detection complements existing error handling by detecting locks written by the current PID but not tracked in `HELD_LOCKS` (indicating a previous release failure), breaking them immediately instead of waiting 30 minutes.
<h3>Confidence Score: 4/5</h3>
- Safe to merge with low risk - addresses critical production deadlock and adds defensive guards
- The changes directly address a documented production issue with clear root cause analysis. The contamination detection logic is sound and the self-leak detection provides defense-in-depth. Minor style suggestion on cross-platform path handling. No tests added but the guards are conservative (clearing fields to force regeneration rather than throwing errors).
- No files require special attention - all changes are defensive guards with safe fallback behavior
<sub>Last reviewed commit: 21d7d08</sub>
<!-- greptile_other_comments_section -->
<sub>(3/5) Reply to the agent's comments like "Can you suggest a fix for this @greptileai?" or ask follow-up questions!</sub>
<!-- /greptile_comment -->
Most Similar PRs
#21828: fix: acquire session write lock in delivery mirror and gateway chat...
by inkolin · 2026-02-20
84.0%
#17132: fix: filter out invalid session entries with empty sessionFile
by Limitless2023 · 2026-02-15
80.6%
#4044: fix: release session locks on SIGUSR1 restart + instance nonce for ...
by seanb4t · 2026-01-29
80.3%
#15176: fix(sessions): allow channel-routed session IDs and cross-agent paths
by cathrynlavery · 2026-02-13
80.2%
#10259: fix(sessions): clean up orphaned .jsonl.lock files on startup (#10170)
by nu-gui · 2026-02-06
80.1%
#23583: fix(agents): catch session JSONL write failures instead of crashing
by mohandshamada · 2026-02-22
79.5%
#16061: fix(sessions): tolerate invalid sessionFile metadata
by haoyifan · 2026-02-14
79.4%
#15050: fix: transcript corruption resilience — strip aborted tool_use bloc...
by yashchitneni · 2026-02-12
79.3%
#4664: fix: per-session metadata files to eliminate lock contention
by tsukhani · 2026-01-30
79.1%
#15882: fix: move session entry computation inside store lock to prevent ra...
by cloorus · 2026-02-14
79.0%