#10259: fix(sessions): clean up orphaned .jsonl.lock files on startup (#10170)
Labels: gateway, agents, stale
Cluster: Gateway and Session Fixes
## Summary
- Stale `.lock` files from crashed gateway processes cause "request ended without getting any chunks" errors and permanently stuck sessions
- Add `cleanupOrphanedLocks()` to scan lock files, check if the owning PID is still alive, and remove orphaned locks
- Call cleanup on gateway startup in `startGatewaySidecars()` before other services initialize
Fixes #10170
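
The cleanup flow summarized above could look roughly like the following. This is a minimal sketch, not the code in this PR: the name `cleanupOrphanedLocksSketch`, the `{ pid }` payload shape, the injected `isAlive` callback, and the choice to treat an unparseable payload as orphaned are all assumptions made for illustration.

```typescript
import { promises as fs } from "node:fs";
import * as path from "node:path";

// Hypothetical payload shape; the real lock format is not shown in this PR text.
interface LockPayload {
  pid: number;
}

// Liveness check is injected so the policy can be swapped or stubbed in tests.
type IsAlive = (pid: number) => boolean;

// Scans `dir` for *.lock files and removes those whose owner is not alive.
// Returns the names of the removed lock files.
export async function cleanupOrphanedLocksSketch(
  dir: string,
  isAlive: IsAlive,
): Promise<string[]> {
  let entries: string[];
  try {
    entries = await fs.readdir(dir);
  } catch (err) {
    // A missing sessions directory means there is nothing to clean up.
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return [];
    throw err;
  }

  const removed: string[] = [];
  for (const name of entries) {
    if (!name.endsWith(".lock")) continue; // skip non-lock files

    const file = path.join(dir, name);
    let pid: number | undefined;
    try {
      const parsed = JSON.parse(await fs.readFile(file, "utf8")) as Partial<LockPayload>;
      if (typeof parsed.pid === "number" && Number.isInteger(parsed.pid) && parsed.pid > 0) {
        pid = parsed.pid;
      }
    } catch {
      // Unreadable or corrupted payload: pid stays undefined.
      // Assumption: such a lock is treated as orphaned.
    }

    if (pid !== undefined && isAlive(pid)) continue; // owner still running, keep the lock

    await fs.rm(file, { force: true }); // force: ignore a lock that vanished meanwhile
    removed.push(name);
  }
  return removed;
}
```

Passing the liveness check as a parameter keeps the directory scan independent of how "alive" is decided, which makes the dead-PID and alive-PID cases easy to exercise without spawning real processes.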
## Test plan
- [x] 5 new tests for `cleanupOrphanedLocks()`: dead PID removal, corrupted payload, alive PID preservation, non-lock file filtering, non-existent directory
- [x] All 13 tests pass (8 original + 5 new)
- [x] `pnpm check` passes (0 warnings, 0 errors)
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
- Adds `cleanupOrphanedLocks()` to scan a sessions directory for `*.lock` files, validate their JSON payload, and remove locks deemed orphaned.
- Hooks the cleanup into gateway startup (`startGatewaySidecars`) before other sidecar services initialize to avoid stuck sessions after crashes.
- Extends `session-write-lock` tests with new cases covering dead/alive PIDs, corrupted payloads, non-lock files, and missing directories.
<h3>Confidence Score: 3/5</h3>
- This PR is mostly safe but has a real risk of deleting valid locks in some environments.
- The new startup cleanup improves resilience to crashes, but the liveness check uses `process.kill(pid, 0)` and treats any error as “dead”. Where the gateway user lacks permission to signal another user’s process, the call fails with `EPERM` even though the process is alive, so live processes can be misclassified and their locks removed, potentially allowing concurrent writers.
- src/agents/session-write-lock.ts
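
One way to address the permission concern is to distinguish error codes instead of treating every failure as "dead". The sketch below is an illustration only, not the PR's implementation: `isPidAlive` and the injectable `probe` parameter are hypothetical names, and treating unknown errors as "alive" is a conservative assumption rather than something the PR specifies.

```typescript
// Hypothetical probe type: throws if the signal cannot be delivered.
export type SignalProbe = (pid: number) => void;

// Default probe: signal 0 performs an existence/permission check without
// actually delivering a signal to the target process.
const defaultProbe: SignalProbe = (pid) => {
  process.kill(pid, 0);
};

export function isPidAlive(pid: number, probe: SignalProbe = defaultProbe): boolean {
  try {
    probe(pid);
    return true; // the signal could be sent, so the process exists
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code;
    if (code === "ESRCH") return false; // no such process: safe to call it dead
    if (code === "EPERM") return true; // process exists but belongs to another user
    // Assumption: on any other error, keep the lock rather than risk
    // removing one that is still held.
    return true;
  }
}
```

With this policy a lock is removed only when the operating system positively reports that the process does not exist. Note that a PID-only check still cannot detect PID reuse.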
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #4044: fix: release session locks on SIGUSR1 restart + instance nonce for ... (by seanb4t · 2026-01-29 · 81.4%)
- #9460: fix(gateway): clean up lock file on service stop (by zenchantlive · 2026-02-05 · 80.1%)
- #20431: fix(sessions): add session contamination guards and self-leak lock ... (by marcomarandiz · 2026-02-18 · 80.1%)
- #5014: fix(agents): detect PID reuse in session write lock (by shayan919293 · 2026-01-30 · 76.6%)
- #23760: fix(gateway-lock): use port binding as primary liveness signal (by Operative-001 · 2026-02-22 · 75.2%)
- #8698: fix(cron): default enabled to true for new jobs (by emmick4 · 2026-02-04 · 74.8%)
- #15628: fix: resolve session write lock race condition (by 1kuna · 2026-02-13 · 74.5%)
- #13055: fix: prevent cron RPC stalls with timeout and caching (#13018) (by trevorgordon981 · 2026-02-10 · 74.4%)
- #17132: fix: filter out invalid session entries with empty sessionFile (by Limitless2023 · 2026-02-15 · 74.4%)
- #4664: fix: per-session metadata files to eliminate lock contention (by tsukhani · 2026-01-30 · 74.4%)