
#13084: fix(daemon): multi-layer defense against zombie gateway processes

by openperf open 2026-02-10 03:27
gateway cli size: S
## Summary

Fixes #13002 (macOS), #7999 (Linux), #11837 (Linux).

When the gateway process gets stuck on a network call (e.g. Telegram long-polling timeout), it ignores `SIGTERM` because the Node.js event loop is blocked. The 5-second `setTimeout` force-exit in `run-loop.ts` also depends on the event loop, so it never fires. This leaves a zombie process holding port 18789, causing an infinite restart loop (one user reported 3,400+ restarts over 24 hours).

## Changes

### Layer 1: Service manager hardening (passive defense)

| File | Change |
|------|--------|
| `src/daemon/launchd-plist.ts` | Add `ExitTimeOut=15` so macOS sends SIGKILL after 15 s |
| `src/daemon/systemd-unit.ts` | Add `TimeoutStopSec=15` so systemd sends SIGKILL after 15 s |
| `src/daemon/systemd-unit.ts` | Add `StartLimitIntervalSec=300` + `StartLimitBurst=10` to cap restarts at 10 per 5 min |

### Layer 2: CLI force-stop (active defense)

| File | Change |
|------|--------|
| `src/cli/daemon-cli/types.ts` | Add `force?: boolean` to `DaemonLifecycleOptions` |
| `src/cli/daemon-cli/register.ts` | Add `--force` flag to `daemon stop` and `daemon restart` |
| `src/cli/daemon-cli/lifecycle.ts` | When `--force` is set, call `forceFreePortAndWait()` (SIGTERM → 700 ms → SIGKILL) to guarantee the port is freed |

### Why two layers?

- **Layer 1** catches the case where the service manager itself restarts the process (e.g. after a crash or `KeepAlive`). No CLI involvement needed.
- **Layer 2** catches the case where a user runs `openclaw daemon stop` or `restart` manually, and the process doesn't die. The existing `forceFreePortAndWait()` helper (already used by `gateway start --force`) is reused here for consistency.

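
The escalation that Layer 2 relies on (SIGTERM, a short grace period, then SIGKILL) can be sketched roughly as follows. This is a minimal illustration, not the actual `forceFreePortAndWait()` implementation, whose code is not shown here. `terminateWithEscalation`, `EscalationDeps`, and the 50 ms poll interval are hypothetical names and values. Only the 700 ms grace period comes from the PR description. Signal delivery and liveness checks are injected so the logic can be exercised without real processes.

```typescript
// Hypothetical sketch of a SIGTERM -> grace period -> SIGKILL escalation.
// None of these names come from the PR; they are illustrative only.
type Signal = "SIGTERM" | "SIGKILL";

interface EscalationDeps {
  sendSignal: (pid: number, signal: Signal) => void; // e.g. a wrapper around process.kill
  isAlive: (pid: number) => boolean; // e.g. a probe using signal 0
  sleep: (ms: number) => Promise<void>;
}

interface EscalationResult {
  signalsSent: Signal[];
  exited: boolean;
}

async function terminateWithEscalation(
  pid: number,
  deps: EscalationDeps,
  graceMs = 700, // grace period mentioned in the PR description
  pollMs = 50, // assumed poll interval, not from the PR
): Promise<EscalationResult> {
  const signalsSent: Signal[] = [];

  // Step 1: ask politely.
  deps.sendSignal(pid, "SIGTERM");
  signalsSent.push("SIGTERM");

  // Step 2: poll until the process exits or the grace period elapses.
  let waited = 0;
  while (waited < graceMs) {
    if (!deps.isAlive(pid)) {
      return { signalsSent, exited: true };
    }
    await deps.sleep(pollMs);
    waited += pollMs;
  }
  if (!deps.isAlive(pid)) {
    return { signalsSent, exited: true };
  }

  // Step 3: the process ignored SIGTERM, so force it.
  deps.sendSignal(pid, "SIGKILL");
  signalsSent.push("SIGKILL");
  await deps.sleep(pollMs);

  return { signalsSent, exited: !deps.isAlive(pid) };
}
```

In a real CLI, `sendSignal` could wrap Node's `process.kill(pid, signal)` and `isAlive` could probe with `process.kill(pid, 0)` inside a `try`/`catch`. SIGKILL cannot be caught or handled by the target, so it does not depend on the target's event loop, unlike the in-process `setTimeout` fallback.
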
## Test plan

- Added `src/daemon/launchd-plist.test.ts` (4 tests)
- Extended `src/daemon/systemd-unit.test.ts` (3 new tests for `buildSystemdUnit`)
- All existing tests pass

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR hardens gateway daemon shutdown/restart behavior to prevent “zombie” processes from causing infinite restart loops.

- Service-manager layer: launchd plists now include `ExitTimeOut=15`, and systemd unit generation adds `TimeoutStopSec=15` plus restart throttling (`StartLimitIntervalSec=300`, `StartLimitBurst=10`).
- CLI layer: `daemon stop`/`daemon restart` gain a `--force` flag which uses the existing `forceFreePortAndWait()` helper to escalate from SIGTERM to SIGKILL to ensure the gateway port is freed.
- Tests: new launchd plist tests and additional systemd unit tests cover the new unit/plist fields.

Integration-wise, the systemd and launchd generators are used by `src/daemon/systemd.ts` and `src/daemon/launchd.ts` during install, so the new timeout/throttling settings will be applied to newly generated service definitions, while the CLI flag provides an explicit operator tool for stuck processes.

<h3>Confidence Score: 4/5</h3>

- This PR is generally safe to merge, with a couple of correctness/clarity issues to address first.
- Core changes are localized (unit/plist generation + optional CLI flag) and covered by new tests, but there is at least one definite typo in a key explanatory comment and the `--force` stop path can produce misleading JSON/service snapshots for consumers because it conflates “port/process killed” with “service unloaded”.
- src/cli/daemon-cli/lifecycle.ts, src/daemon/systemd-unit.ts

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
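
For reference, the systemd settings named above could be rendered along these lines. This is a sketch under assumptions: `renderUnitFragment` and `HardeningOptions` are hypothetical names, and the real `buildSystemdUnit` signature and output are not shown in this PR text. The section placement follows the systemd documentation, where `StartLimitIntervalSec=` and `StartLimitBurst=` are `[Unit]` options and `TimeoutStopSec=` is a `[Service]` option. Whether the PR's generator places them this way is not stated here.

```typescript
// Hypothetical helper; not the PR's buildSystemdUnit().
interface HardeningOptions {
  stopTimeoutSec: number; // TimeoutStopSec: wait this long after the stop signal before SIGKILL
  startLimitIntervalSec: number; // window for counting start attempts
  startLimitBurst: number; // max start attempts allowed within the window
}

function renderUnitFragment(opts: HardeningOptions): string {
  // Start-rate limiting is configured in [Unit] per systemd.unit(5).
  const unitLines = [
    `StartLimitIntervalSec=${opts.startLimitIntervalSec}`,
    `StartLimitBurst=${opts.startLimitBurst}`,
  ];
  // Stop timeout is configured in [Service] per systemd.service(5).
  const serviceLines = [`TimeoutStopSec=${opts.stopTimeoutSec}`];

  return ["[Unit]", ...unitLines, "", "[Service]", ...serviceLines].join("\n");
}

// Values taken from the PR description: 15 s stop timeout, 10 starts per 300 s.
const exampleFragment = renderUnitFragment({
  stopTimeoutSec: 15,
  startLimitIntervalSec: 300,
  startLimitBurst: 10,
});
```

With these values, a unit that keeps failing would be started at most 10 times in any 5-minute window before systemd refuses further starts.
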
