
#23760: fix(gateway-lock): use port binding as primary liveness signal

by Operative-001 open 2026-02-22 17:26
size: S trusted-contributor
## Problem

The gateway lock uses a file in `/tmp` to enforce single-instance semantics. On clean shutdown `release()` deletes the lock, but `release()` only runs when Node.js can execute JavaScript. If the process is killed with SIGKILL (e.g. `kill -9`, OOM, WSL reset, or a forced stop before graceful shutdown completes), the lock file is left on disk permanently.

On the next start attempt:

1. Lock file exists → `acquireGatewayLock` enters its retry loop
2. `resolveGatewayOwnerStatus` calls `isPidAlive(pid)`; on Linux this uses `process.kill(pid, 0)`, which returns success for zombie processes
3. If the PID is in zombie state (parent has not yet called `wait()`), the process appears alive
4. `readLinuxStartTime` reads `/proc/<pid>/stat`, which is still intact for zombies, and the start-time matches the lock payload
5. Status returns `"alive"` → gateway refuses to start, even though nothing is actually running

The result is a permanently stuck lock that requires manual intervention.

*(Note: the zombie case in `isPidAlive` was separately fixed in commit 6eaf2baa. This PR addresses the underlying design gap that makes the system fragile against any ungraceful exit, not just zombie timing windows.)*

## Root cause

The lock relies entirely on userspace cleanup. Any kill signal that bypasses cleanup handlers leaves a stale lock, and the PID-based liveness check has no reliable way to distinguish "process is still running" from "process died and PID is being recycled or in zombie state."

## Fix

Port binding is **kernel-managed**: the OS releases it on any form of process termination, including SIGKILL, OOM, and hard resets. If the gateway port is free, no gateway is running, so the lock file is stale by definition.

This PR adds an optional `port` field to `GatewayLockOptions`. When supplied, `resolveGatewayOwnerStatus` runs a non-destructive TCP connect probe first. If the port is free, it returns `"dead"` immediately and the stale lock is cleared.
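The ordering described above can be sketched as a small pure function. This is a simplified illustration, not the PR's code: the names `OwnerStatus`, `LivenessSignals`, and `resolveOwnerStatus` are hypothetical, only the `"alive"` and `"dead"` outcomes mentioned in this description are modelled, and the argv check is left out.

```typescript
// Hypothetical, simplified model of the decision order; the real
// resolveGatewayOwnerStatus is async and may return other statuses.
type OwnerStatus = "alive" | "dead";

interface LivenessSignals {
  /** Result of the port probe; undefined when no `port` option was supplied. */
  portFree?: boolean;
  /** Whether the PID recorded in the lock file currently appears alive. */
  pidAlive: boolean;
  /** Whether the PID's start time matches the one stored in the lock payload. */
  startTimeMatches: boolean;
}

function resolveOwnerStatus(signals: LivenessSignals): OwnerStatus {
  // Port check runs first: a free port means no gateway is listening,
  // regardless of what the PID table says (covers zombies and SIGKILL).
  if (signals.portFree === true) return "dead";

  // Port busy or not supplied: fall back to the PID-based checks, which
  // also act as identity guards when something else holds the port.
  if (!signals.pidAlive) return "dead";
  return signals.startTimeMatches ? "alive" : "dead";
}
```

With these assumptions, a zombie PID whose start time still matches is reported as `"dead"` once the port is known to be free, while omitting `portFree` reproduces the PID-only behaviour.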
The existing PID + startTime checks remain as identity guards (confirming the listening process is actually our gateway, not an unrelated service).

### Changes

- `GatewayLockOptions.port?: number`: optional gateway port
- `checkPortFree(port, host)`: probes via `net.createConnection` (connect-only, never binds)
- `resolveGatewayOwnerStatus`: made `async`, port check added as first branch
- Call site in `acquireGatewayLock` updated to `await` the now-async function

### Backward compatibility

When `port` is not passed, the function skips the port check and falls through to the existing PID/startTime/argv logic unchanged. No existing behaviour is altered.

## Behaviour after this change

| Scenario | Before | After |
|---|---|---|
| Clean shutdown | lock released | lock released |
| SIGKILL | stale lock, manual fix | port free → stale detected instantly |
| WSL reset / OOM | stale lock | port free → stale detected instantly |
| Zombie PID window | detected as alive | port free → dead (port check runs first) |
| Port supplied, unrelated process on port | N/A | PID/startTime identity check catches it |
| Port not supplied | unchanged | unchanged fallback behaviour |

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Implemented kernel-managed port binding as the primary liveness check for gateway locks, replacing fragile PID-based detection that fails on SIGKILL/OOM/WSL resets. Port availability is checked first via a TCP connect probe, with PID/startTime checks serving as identity guards. Backward compatible when the `port` option is omitted.

- Added optional `port` field to `GatewayLockOptions`
- Implemented `checkPortFree()` to probe port availability via TCP connect
- Made `resolveGatewayOwnerStatus` async to support port check as first branch
- Updated call site to await the now-async function

**Critical issue:** `checkPortFree` lacks a timeout and can hang indefinitely on connection attempts in half-open states.
<h3>Confidence Score: 3/5</h3>

- Architectural improvement is sound, but the missing timeout in the port check creates a hanging risk
- The port-based liveness approach is architecturally superior to PID checks and solves real SIGKILL/OOM issues. However, the `checkPortFree` function lacks a connection timeout, which can cause the lock acquisition to hang indefinitely if the connection attempt enters a half-open state. This is a critical flaw in what should be a reliable recovery mechanism.
- Pay close attention to `src/infra/gateway-lock.ts`: the `checkPortFree` function needs a timeout before production use

<sub>Last reviewed commit: 369ed63</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
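One way the timeout concern from the review could be addressed is a connect-only probe bounded by a timer. The following is a minimal sketch under stated assumptions, not the PR's `checkPortFree`: the name `probePortFree`, the default host and timeout, and the choice to report "free" only on `ECONNREFUSED` are all illustrative.

```typescript
import * as net from "node:net";

// Hypothetical helper: resolves true when nothing accepts connections on
// the port, false when something does or when the result is uncertain.
function probePortFree(
  port: number,
  host: string = "127.0.0.1",
  timeoutMs: number = 1000,
): Promise<boolean> {
  return new Promise<boolean>((resolve) => {
    // Connect-only: this never binds the port, so it cannot steal it.
    const socket = net.createConnection({ port, host });
    let settled = false;

    // Bound the probe so a connection that never completes cannot hang
    // lock acquisition.
    const timer = setTimeout(() => finish(false), timeoutMs);

    function finish(free: boolean): void {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.destroy();
      resolve(free);
    }

    // A successful connect means some process is listening.
    socket.once("connect", () => finish(false));

    // Assumption: only an explicit refusal counts as "free"; any other
    // error is treated as "not free" so the PID checks still run.
    socket.once("error", (err: Error & { code?: string }) => {
      finish(err.code === "ECONNREFUSED");
    });
  });
}
```

Treating a timeout as "not free" is the conservative choice in this sketch: the caller then falls through to the PID/startTime checks instead of clearing a lock that may still be valid.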
