#23760: fix(gateway-lock): use port binding as primary liveness signal
size: S
trusted-contributor
Cluster: Gateway and Session Fixes
## Problem
The gateway lock uses a file in `/tmp` to enforce single-instance semantics. On clean shutdown `release()` deletes the lock, but `release()` only runs when Node.js can still execute JavaScript. If the process is terminated without a chance to run its handlers (`kill -9`, the OOM killer, a WSL reset, or a forced stop before graceful shutdown completes), the lock file stays on disk until someone removes it.
On the next start attempt:
1. Lock file exists → `acquireGatewayLock` enters its retry loop
2. `resolveGatewayOwnerStatus` calls `isPidAlive(pid)`. On Linux this uses `process.kill(pid, 0)`, which succeeds (does not throw) for zombie processes
3. If the PID is in zombie state (parent has not yet called `wait()`), the process appears alive
4. `readLinuxStartTime` reads `/proc/<pid>/stat`, which is still intact for zombies, and the start-time matches the lock payload
5. Status returns `"alive"` → gateway refuses to start, even though nothing is actually running
The result is a permanently stuck lock that requires manual intervention.
*(Note: the zombie case in `isPidAlive` was separately fixed in commit 6eaf2baa. This PR addresses the underlying design gap that makes the system fragile against any ungraceful exit, not just zombie timing windows.)*
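To illustrate step 2, here is a minimal sketch of a signal-0 liveness check. The name `isPidAliveSketch` is hypothetical and this is not the repository's `isPidAlive`. It only shows why a zombie, which still has a process-table entry, can pass such a check.

```typescript
// Hypothetical sketch of a signal-0 liveness check; not the project's actual code.
function isPidAliveSketch(pid: number): boolean {
  try {
    // Signal 0 delivers nothing: the OS only checks that the PID exists and
    // that we may signal it. A zombie still has a process-table entry, so the
    // call succeeds for it as well.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but belongs to another user.
    // ESRCH (or anything else): treat as not alive.
    return (err as { code?: string }).code === "EPERM";
  }
}
```

A check of this shape cannot tell a running gateway from an unreaped zombie, or from an unrelated process that later reuses the same PID.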
## Root cause
The lock relies entirely on userspace cleanup. Any termination that bypasses the cleanup handlers leaves a stale lock, and the PID-based liveness check cannot reliably distinguish "process is still running" from "process died and its PID has been recycled or is still a zombie."
## Fix
Port binding is **kernel-managed**: the OS releases it on any form of process termination, including SIGKILL, OOM kills, and hard resets. If the gateway port is free, no gateway is running, so the lock file is stale by definition.
This PR adds an optional `port` field to `GatewayLockOptions`. When supplied, `resolveGatewayOwnerStatus` runs a non-destructive TCP connect probe first. If the port is free, it returns `"dead"` immediately and the stale lock is cleared. The existing PID + startTime checks remain as identity guards (confirming the listening process is actually our gateway, not an unrelated service).
### Changes
- `GatewayLockOptions.port?: number` — optional gateway port
- `checkPortFree(port, host)` — probes via `net.createConnection` (connect-only, never binds)
- `resolveGatewayOwnerStatus` — made `async`, port check added as first branch
- Call site in `acquireGatewayLock` updated to `await` the now-async function
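
A connect-only probe of the kind described for `checkPortFree` might look roughly like the sketch below. The name `checkPortFreeSketch`, the default host, and the choice to treat every error other than `ECONNREFUSED` as "not free" are assumptions made for illustration, not details taken from the PR.

```typescript
import * as net from "node:net";

// Hypothetical sketch: resolves true when nothing accepts connections on host:port.
// It only connects; it never binds, so it cannot steal the port from a real gateway.
function checkPortFreeSketch(port: number, host = "127.0.0.1"): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = net.createConnection({ port, host });
    socket.once("connect", () => {
      // Something is listening, so the port is in use.
      socket.destroy();
      resolve(false);
    });
    socket.once("error", (err: Error & { code?: string }) => {
      socket.destroy();
      // ECONNREFUSED means no listener. Any other error is treated as
      // "not free" (an assumption) so the PID checks still get a say.
      resolve(err.code === "ECONNREFUSED");
    });
  });
}
```

Because the probe opens and immediately closes a client connection, a running gateway would see at most one short-lived inbound connection per check.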
### Backward compatibility
When `port` is not passed, the function skips the port check and falls through to the existing PID/startTime/argv logic unchanged. No existing behaviour is altered.
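
The branch order described above can be sketched as follows. This is a simplified model under stated assumptions: `OwnerProbes`, `resolveOwnerStatusSketch`, and the two-value status type are hypothetical, and `pidMatchesLock` stands in for the existing PID/startTime/argv checks, which may return more states in the real code.

```typescript
type OwnerStatus = "alive" | "dead";

// Hypothetical inputs; the real function reads these from the lock payload and options.
interface OwnerProbes {
  port?: number;                               // optional, as in GatewayLockOptions.port
  isPortFree: (port: number) => Promise<boolean>;
  pidMatchesLock: () => boolean;               // stand-in for the PID/startTime/argv checks
}

async function resolveOwnerStatusSketch(p: OwnerProbes): Promise<OwnerStatus> {
  // Port check runs first, and only when a port was supplied.
  if (p.port !== undefined && (await p.isPortFree(p.port))) {
    return "dead"; // nothing is listening, so the lock is stale
  }
  // Port omitted or busy: fall through to the identity checks.
  return p.pidMatchesLock() ? "alive" : "dead";
}
```

With `port` omitted, `isPortFree` is never called, which mirrors the unchanged fallback path.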
## Behaviour after this change
| Scenario | Before | After |
|---|---|---|
| Clean shutdown | lock released | lock released |
| SIGKILL | stale lock, manual fix | port free → stale detected instantly |
| WSL reset / OOM | stale lock | port free → stale detected instantly |
| Zombie PID window | detected as alive | port free → dead (port check runs first) |
| Port supplied, unrelated process on port | N/A | PID/startTime identity check catches it |
| Port not supplied | unchanged | unchanged fallback behaviour |
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Uses kernel-managed port binding as the primary liveness check for gateway locks, ahead of the fragile PID-based detection that fails on SIGKILL/OOM/WSL resets. Port availability is checked first via a TCP connect probe, with the PID/startTime checks serving as identity guards. Backward compatible when the `port` option is omitted.
- Added optional `port` field to `GatewayLockOptions`
- Implemented `checkPortFree()` to probe port availability via TCP connect
- Made `resolveGatewayOwnerStatus` async to support port check as first branch
- Updated call site to await the now-async function
**Critical issue:** `checkPortFree` has no timeout, so it can hang indefinitely when a connection attempt is left in a half-open state.
<h3>Confidence Score: 3/5</h3>
- Architectural improvement is sound but missing timeout in port check creates hanging risk
- The port-based liveness approach is architecturally superior to PID checks and solves real SIGKILL/OOM issues. However, the `checkPortFree` function lacks a connection timeout, which can cause the lock acquisition to hang indefinitely if the connection attempt enters a half-open state. This is a critical flaw in what should be a reliable recovery mechanism.
- Pay close attention to `src/infra/gateway-lock.ts` - the `checkPortFree` function needs a timeout before production use
<sub>Last reviewed commit: 369ed63</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
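
One way to address the missing-timeout concern raised in the review is to bound the probe with a timer. The sketch below is an assumption-laden illustration, not the fix adopted in the PR: the function name, the 1000 ms default, and the decision to report "not free" on timeout (so the PID checks decide) are all hypothetical.

```typescript
import * as net from "node:net";

// Hypothetical sketch: connect-only probe that always settles within timeoutMs.
function checkPortFreeWithTimeout(
  port: number,
  host = "127.0.0.1",
  timeoutMs = 1000, // assumed default, not from the PR
): Promise<boolean> {
  return new Promise((resolve) => {
    let settled = false;
    const socket = net.createConnection({ port, host });

    const finish = (free: boolean) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.destroy();
      resolve(free);
    };

    // If neither "connect" nor "error" fires in time, give up and report
    // "not free" so the caller falls back to its other checks (an assumption).
    const timer = setTimeout(() => finish(false), timeoutMs);

    socket.once("connect", () => finish(false));
    socket.once("error", (err: Error & { code?: string }) =>
      finish(err.code === "ECONNREFUSED"),
    );
  });
}
```

Returning "not free" on timeout is the conservative choice: a slow or filtered port never causes a live gateway's lock to be cleared.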
Most Similar PRs
- #11147: fix(daemon): stop gateway by port when no daemon service is active (by jasonthewhale · 2026-02-07, 77.9%)
- #17835: Fix misleading gateway stop hints for standalone listeners (by ConnorCallison · 2026-02-16, 77.8%)
- #9460: fix(gateway): clean up lock file on service stop (by zenchantlive · 2026-02-05, 77.6%)
- #13084: fix(daemon): multi-layer defense against zombie gateway processes (by openperf · 2026-02-10, 75.8%)
- #10259: fix(sessions): clean up orphaned .jsonl.lock files on startup (#10170) (by nu-gui · 2026-02-06, 75.2%)
- #8745: fix(gateway): respect gateway.port config and --port CLI flag (by revenuestack · 2026-02-04, 74.9%)
- #5014: fix(agents): detect PID reuse in session write lock (by shayan919293 · 2026-01-30, 74.0%)
- #11455: fix(gateway): default gateway.mode to local when unset (by AnonO6 · 2026-02-07, 73.5%)
- #19437: Gateway: respect custom bind host for local health/RPC target resol... (by frudas24 · 2026-02-17, 73.3%)
- #21459: fix(gateway): resolve port from profile config, not inherited env (by kkeeling · 2026-02-19, 73.1%)