← Back to PRs

#20431: fix(sessions): add session contamination guards and self-leak lock detection

by marcomarandiz open 2026-02-18 22:33 View on GitHub →
commands agents size: S
## Summary - Add `isContaminatedSessionId()` and `isContaminatedSessionFile()` guards to detect and recover from corrupted session store entries - Add self-leak detection in `acquireSessionWriteLock()` so a long-running process can break its own orphaned locks immediately instead of waiting for the 30-minute stale timeout - Apply guards in both `auto-reply/reply/session.ts` and `commands/agent/session.ts` (the two session resolution entry points) ## Background: how this was discovered A production gateway running `openclaw-gateway` (PID 74269) started failing all agent activity with: ``` ⚠️ Agent failed before reply: session file locked (timeout 10000ms): pid=74269 ~/.openclaw/agents/voice/sessions/remediation:t2:414ee52d.jsonl.lock ``` Both haiku and sonnet models failed identically — the lock blocked ALL agent activity on the main session. The gateway was deadlocking against itself. ## Root cause: session store contamination Tracing the lock contention revealed a **lane/lock mismatch**. The command lane was `session:agent:voice:main` (the main user-facing session), but the lock was on `remediation:t2:414ee52d.jsonl` — a completely different session's transcript file. Inspecting `sessions.json` showed the source of the mismatch: ```json { "agent:voice:main": { "sessionId": "boot-2026-02-18_21-46-54-108-802b16db", "sessionFile": "remediation:t2:414ee52d.jsonl" } } ``` The `sessionFile` field pointed to a **different session's transcript**. When the main session resolved its file path via `resolveSessionFilePath()`, it used the stored `sessionFile` directly (since the field was already set), acquiring a lock on the wrong `.jsonl` file. When a remediation run legitimately held that same lock, the main session timed out after 10s. This is a variant of session contamination (related to ADR-0037) that the existing code did not guard against. Two contamination vectors exist: 1. **sessionId contamination**: A delegate/subagent spawn ID (e.g. `delegate-1739...`) leaks into a non-subagent session key's store entry 2. **sessionFile contamination** (new): The sessionId looks normal but the sessionFile points to a completely different session's transcript The `boot.ts` startup sequence (snapshot → run BOOT.md → restore) preserved the contaminated `sessionFile` because the snapshot/restore only checks for entry presence, not field coherence. ## Cascading failures from sessionFile contamination When a session entry has a contaminated `sessionFile`: 1. **Wrong lock target**: `acquireSessionWriteLock()` locks the wrong `.jsonl` file, causing cross-session lock contention 2. **Transcript corruption**: Agent turns append to the wrong session's transcript, mixing conversation history across unrelated sessions 3. **Self-deadlock**: If `release()` previously failed after `HELD_LOCKS.delete()` (e.g. `handle.close()` threw), the lock file stays on disk with the gateway's own PID. On the next turn, the gateway sees its own PID as alive and waits forever — stale detection can't help because the PID IS alive 4. **Total agent lockout**: All agent activity on the affected session key blocks until the 30-minute stale timeout expires or the gateway restarts ## Changes ### 1. `session-key-utils.ts` — contamination detection functions - `SPAWN_SESSION_ID_PATTERN`: Regex matching delegate/routed/subagent spawn IDs - `isContaminatedSessionId(sessionKey, sessionId)`: Returns true if sessionId matches a spawn pattern but sessionKey is NOT a subagent key - `isContaminatedSessionFile(sessionId, sessionFile)`: Returns true if sessionFile's basename doesn't contain the sessionId — indicating the file path belongs to a different session ### 2. `auto-reply/reply/session.ts` — gateway message path guard Adds both contamination checks after freshness evaluation in `initSessionState()`. On detection: - Logs a warning with the contaminated values - Clears the `sessionFile` field so `resolveSessionFilePath()` regenerates it from the correct sessionId - For sessionId contamination: forces a new session (new UUID) ### 3. `commands/agent/session.ts` — CLI/agent command path guard Same contamination checks in `resolveSession()`, covering the `agentCommand()` entry point used by boot runs, cron jobs, and CLI invocations. ### 4. `session-write-lock.ts` — self-leak detection In the acquisition polling loop, detects when a lock file was written by the current process (`pid === process.pid`) but isn't tracked in `HELD_LOCKS`. This indicates a leaked lock from a previous agent run whose `release()` failed. Breaks the lock immediately instead of waiting for stale timeout. This complements the existing try/catch fix in `release()` (merged in #18096) — even with bulletproof release, a crash or unhandled rejection during release could still leave an orphaned lock. ## Related - #7222 — reports the same class of self-deadlock bug - #18096 — merged fix for release() error handling (our self-leak detection is defense-in-depth on top of this) - #15628 — open PR for release race condition (overlapping scope) ## Test plan - [ ] Verify `pnpm exec tsc` passes (only pre-existing errors in test files, none from these changes) - [ ] Gateway starts cleanly after deploy - [ ] No `session file locked` errors during normal agent operation - [ ] Contamination warning logs appear correctly when injecting a bad `sessionFile` into sessions.json - [ ] Self-leaked lock detection triggers correctly when a `.lock` file with the current PID exists but isn't in HELD_LOCKS 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- greptile_comment --> <h3>Greptile Summary</h3> Added session contamination guards and self-leak lock detection to prevent deadlocks from corrupted session store entries. **Key changes:** - Introduced `isContaminatedSessionId()` and `isContaminatedSessionFile()` guards in `session-key-utils.ts` to detect mismatched session entries - Applied contamination checks in both gateway message path (`auto-reply/reply/session.ts`) and CLI/agent command path (`commands/agent/session.ts`) - Added self-leak detection in lock acquisition to break orphaned locks from the same process immediately - Prevents cross-session lock contention and transcript corruption from sessionFile contamination **Root cause addressed:** Production gateway experienced self-deadlocks when a session entry's `sessionFile` pointed to a different session's transcript. The main session would lock the wrong file, blocking legitimate operations on that transcript. The contamination guards clear corrupted `sessionFile` entries, forcing regeneration from the correct `sessionId`. The self-leak detection complements existing error handling by detecting locks written by the current PID but not tracked in `HELD_LOCKS` (indicating a previous release failure), breaking them immediately instead of waiting 30 minutes. <h3>Confidence Score: 4/5</h3> - Safe to merge with low risk - addresses critical production deadlock and adds defensive guards - The changes directly address a documented production issue with clear root cause analysis. The contamination detection logic is sound and the self-leak detection provides defense-in-depth. Minor style suggestion on cross-platform path handling. No tests added but the guards are conservative (clearing fields to force regeneration rather than throwing errors). - No files require special attention - all changes are defensive guards with safe fallback behavior <sub>Last reviewed commit: 21d7d08</sub> <!-- greptile_other_comments_section --> <sub>(3/5) Reply to the agent's comments like "Can you suggest a fix for this @greptileai?" or ask follow-up questions!</sub> <!-- /greptile_comment -->

Most Similar PRs