#13084: fix(daemon): multi-layer defense against zombie gateway processes
Labels: gateway, cli, size: S
Cluster: Gateway and macOS Improvements
## Summary
Fixes #13002 (macOS), #7999 (Linux), #11837 (Linux).
When the gateway process gets stuck on a network call (e.g. Telegram long-polling timeout), it ignores `SIGTERM` because the Node.js event loop is blocked. The 5-second `setTimeout` force-exit in `run-loop.ts` also depends on the event loop, so it never fires. This leaves a zombie process holding port 18789, causing an infinite restart loop (one user reported 3,400+ restarts over 24 hours).
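
To illustrate why a timer-based fallback cannot help once the event loop is blocked, here is a minimal sketch (not code from this PR). The busy-wait in the hypothetical `blockFor` helper stands in for whatever is occupying the loop; the `setTimeout` plays the role of the force-exit timer.

```typescript
// Illustrative only: timers cannot fire while the event loop is blocked.
// `blockFor` is a hypothetical stand-in for whatever is occupying the loop.
function blockFor(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // Busy-wait: no timer, signal-handler, or I/O callbacks run in here.
  }
}

let forceExitFired = false;

// Analogous to a force-exit fallback: due in 10 ms.
const timer = setTimeout(() => {
  forceExitFired = true;
}, 10);

blockFor(50);

// The timer was due roughly 40 ms ago, yet its callback has not run.
const firedWhileBlocked = forceExitFired;
console.log(`fired while blocked: ${firedWhileBlocked}`);

clearTimeout(timer);
```

Because the callback only runs when control returns to the event loop, a fallback that lives inside the stuck process cannot be relied on; the kill has to come from outside the process.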
## Changes
### Layer 1: Service manager hardening (passive defense)
| File | Change |
|------|--------|
| `src/daemon/launchd-plist.ts` | Add `ExitTimeOut=15` — macOS sends SIGKILL after 15 s |
| `src/daemon/systemd-unit.ts` | Add `TimeoutStopSec=15` — systemd sends SIGKILL after 15 s |
| `src/daemon/systemd-unit.ts` | Add `StartLimitIntervalSec=300` + `StartLimitBurst=10` — cap restarts at 10 per 5 min |
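
As a rough sketch of what the generated service definitions might contain after this change, the snippet below renders only the relevant fragments. The function names and layout are assumptions for illustration, not the actual output of `launchd-plist.ts` or `systemd-unit.ts`. Note that systemd documents `StartLimitIntervalSec` and `StartLimitBurst` as `[Unit]` options, while `TimeoutStopSec` is a `[Service]` option.

```typescript
// Hypothetical helpers; names and layout are illustrative, not taken from the PR.
const STOP_TIMEOUT_SEC = 15;

function renderLaunchdExitTimeout(seconds: number): string {
  // launchd waits this long after SIGTERM before sending SIGKILL.
  return ["<key>ExitTimeOut</key>", `<integer>${seconds}</integer>`].join("\n");
}

function renderSystemdLimits(seconds: number): string {
  return [
    "[Unit]",
    "StartLimitIntervalSec=300", // rate-limit window: 5 minutes
    "StartLimitBurst=10", // at most 10 starts per window
    "",
    "[Service]",
    `TimeoutStopSec=${seconds}`, // SIGKILL if stopping takes longer than this
  ].join("\n");
}

const plistFragment = renderLaunchdExitTimeout(STOP_TIMEOUT_SEC);
const unitFragment = renderSystemdLimits(STOP_TIMEOUT_SEC);
console.log(plistFragment);
console.log(unitFragment);
```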
### Layer 2: CLI force-stop (active defense)
| File | Change |
|------|--------|
| `src/cli/daemon-cli/types.ts` | Add `force?: boolean` to `DaemonLifecycleOptions` |
| `src/cli/daemon-cli/register.ts` | Add `--force` flag to `daemon stop` and `daemon restart` |
| `src/cli/daemon-cli/lifecycle.ts` | When `--force` is set, call `forceFreePortAndWait()` (SIGTERM → 700 ms → SIGKILL) to guarantee port is freed |
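
The SIGTERM → 700 ms → SIGKILL escalation can be sketched as follows. This is not the project's `forceFreePortAndWait()`; every name below (`terminateWithEscalation`, `KillDeps`, the fake dependencies) is hypothetical, and the dependencies are injected so the flow can be exercised without touching a real process.

```typescript
// Sketch of a SIGTERM -> wait -> SIGKILL escalation with injected dependencies.
// NOT the project's forceFreePortAndWait(); all names here are hypothetical.
type Signal = "SIGTERM" | "SIGKILL";

interface KillDeps {
  sendSignal: (pid: number, signal: Signal) => void;
  isAlive: (pid: number) => boolean;
  sleep: (ms: number) => Promise<void>;
}

async function terminateWithEscalation(
  pid: number,
  deps: KillDeps,
  graceMs = 700,
): Promise<Signal> {
  deps.sendSignal(pid, "SIGTERM"); // polite request first
  await deps.sleep(graceMs);
  if (!deps.isAlive(pid)) {
    return "SIGTERM"; // exited within the grace period
  }
  deps.sendSignal(pid, "SIGKILL"); // cannot be caught or ignored
  return "SIGKILL";
}

// Demo with a fake process that ignores SIGTERM.
const sent: Signal[] = [];
let alive = true;
const fakeDeps: KillDeps = {
  sendSignal: (_pid, signal) => {
    sent.push(signal);
    if (signal === "SIGKILL") alive = false;
  },
  isAlive: () => alive,
  sleep: () => Promise.resolve(), // skip real waiting in the demo
};

const outcome = terminateWithEscalation(4242, fakeDeps);
outcome.then((signal) => console.log(`stopped by ${signal}; sent: ${sent.join(", ")}`));
```

In real code, `sendSignal` would typically wrap Node's `process.kill(pid, signal)` and `isAlive` could probe with `process.kill(pid, 0)`, which throws if the process no longer exists.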
### Why two layers?
- **Layer 1** catches the case where the service manager itself restarts the process (e.g. after a crash or `KeepAlive`). No CLI involvement needed.
- **Layer 2** catches the case where a user runs `openclaw daemon stop` or `restart` manually, and the process doesn't die. The existing `forceFreePortAndWait()` helper (already used by `gateway start --force`) is reused here for consistency.
## Test plan
- Added `src/daemon/launchd-plist.test.ts` (4 tests)
- Extended `src/daemon/systemd-unit.test.ts` (3 new tests for `buildSystemdUnit`)
- All existing tests pass
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR hardens gateway daemon shutdown/restart behavior to prevent “zombie” processes from causing infinite restart loops.
- Service-manager layer: launchd plists now include `ExitTimeOut=15`, and systemd unit generation adds `TimeoutStopSec=15` plus restart throttling (`StartLimitIntervalSec=300`, `StartLimitBurst=10`).
- CLI layer: `daemon stop`/`daemon restart` gain a `--force` flag which uses the existing `forceFreePortAndWait()` helper to escalate from SIGTERM to SIGKILL to ensure the gateway port is freed.
- Tests: new launchd plist tests and additional systemd unit tests cover the new unit/plist fields.
Integration-wise, the systemd and launchd generators are used by `src/daemon/systemd.ts` and `src/daemon/launchd.ts` during install, so the new timeout/throttling settings will be applied to newly generated service definitions, while the CLI flag provides an explicit operator tool for stuck processes.
<h3>Confidence Score: 4/5</h3>
- This PR is generally safe to merge, with a couple of correctness/clarity issues to address first.
- Core changes are localized (unit/plist generation plus an optional CLI flag) and covered by new tests. However, a key explanatory comment contains a definite typo, and the `--force` stop path can produce misleading JSON/service snapshots for consumers because it conflates “port/process killed” with “service unloaded”.
- Files needing attention: `src/cli/daemon-cli/lifecycle.ts`, `src/daemon/systemd-unit.ts`
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
| PR | Author | Date | Similarity |
|----|--------|------|------------|
| #11147: fix(daemon): stop gateway by port when no daemon service is active | jasonthewhale | 2026-02-07 | 84.4% |
| #6273: fix: handle EPIPE errors gracefully in daemon operations | batumilove | 2026-02-01 | 80.8% |
| #16845: fix(daemon): gateway auto-restart on SIGTERM + agent restart guidel... | kiminbean | 2026-02-15 | 79.2% |
| #23584: fix(daemon): improve gateway service detection to avoid false posit... | mohandshamada | 2026-02-22 | 78.8% |
| #12804: fix(daemon): use wrapper script for pnpm global installs in service... | odinho | 2026-02-09 | 78.4% |
| #8260: fix(macOS): gateway readiness detection + reversible Configure later | xksteven | 2026-02-03 | 78.4% |
| #18236: macOS daemon: bootstrap LaunchAgent on gateway start after stop | agisilaos | 2026-02-16 | 78.2% |
| #20629: fix: use KillMode=mixed to prevent orphaned child processes | alexander-morris | 2026-02-19 | 78.1% |
| #6302: fix: Add timeouts to prevent indefinite hangs (issues #4954, #4956,... | batumilove | 2026-02-01 | 78.1% |
| #4653: fix(gateway): improve crash resilience for mDNS and network errors | AyedAlmudarra | 2026-01-30 | 77.9% |