
#20555: fix(gateway): detect launchd supervision via XPC_SERVICE_NAME

by dimat open 2026-02-19 02:55
size: XS
## Summary

- Problem: On macOS, when the gateway receives SIGUSR1 (config reload, update), it spawns a detached child process instead of letting launchd handle the restart. This creates a duplicate gateway that fights with launchd's `KeepAlive` respawn, producing a lock-timeout error roughly every 10 seconds (thousands in total).
- Why it matters: The error log fills with 8000+ failures ("gateway already running; lock timeout after 5000ms"), launchd shows 1198 runs, and the gateway restart is unreliable under launchd supervision.
- What changed: Added `XPC_SERVICE_NAME`, `OPENCLAW_LAUNCHD_LABEL`, and `OPENCLAW_SYSTEMD_UNIT` to the `SUPERVISOR_HINT_ENV_VARS` list in `process-respawn.ts`. On macOS, launchd sets `XPC_SERVICE_NAME` on every managed process (confirmed via `launchctl print`) but does **not** set `LAUNCH_JOB_LABEL` or `LAUNCH_JOB_NAME`. Without this check, `isLikelySupervisedProcess()` returns `false`, and the gateway forks a detached child via `spawn()` instead of returning `"supervised"`. Additionally, `OPENCLAW_LAUNCHD_LABEL` and `OPENCLAW_SYSTEMD_UNIT` are OpenClaw-propagated env vars from daemon/service env flows (`src/daemon/service-env.ts`, `src/daemon/node-service.ts`, restart helpers), serving as reliable fallback signals when platform-native vars are absent in some launch paths.
- What did NOT change (scope boundary): No changes to the lock mechanism, launchd plist generation, or restart-helper scripts.

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [x] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Related: similar launchd environment issue as #20512

## User-visible / Behavior Changes

- Gateway SIGUSR1 restarts under launchd no longer spawn a duplicate detached process.
  launchd cleanly restarts the single managed process.
- Eliminates the "gateway already running; lock timeout" error storm in logs.

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`

## Repro + Verification

### Environment

- OS: macOS 15 (Darwin 25.2.0)
- Runtime/container: Node 25.6.1 via Homebrew
- Model/provider: N/A
- Integration/channel (if any): N/A
- Relevant config (redacted): gateway managed via `ai.openclaw.gateway` LaunchAgent with `KeepAlive=true`

### Steps

1. Install gateway as LaunchAgent (`openclaw gateway install`)
2. Trigger a config change or update that sends SIGUSR1 to the gateway
3. Observe gateway error log (`~/.openclaw/logs/gateway.err.log`)

### Expected

- Gateway restarts cleanly via launchd; no lock errors.

### Actual

- Gateway spawns a detached child (holds lock + port), then launchd also restarts the managed process → lock timeout → exit(1) → launchd restarts again → infinite loop every ~10s.

Observed error log:

```
Gateway failed to start: gateway already running (pid 85206); lock timeout after 5000ms
Port 18789 is already in use.
- pid 85206 dmitry: openclaw-gateway (127.0.0.1:18789)
Gateway service appears loaded. Stop it first.
```

## Evidence

- [x] Failing test/log before + passing after
- [x] Trace/log snippets

`launchctl print` confirms `XPC_SERVICE_NAME` is set but `LAUNCH_JOB_LABEL` is not:

```
environment = {
    XPC_SERVICE_NAME => ai.openclaw.gateway
}
```

Gateway stdout log shows the spawn-based restart path was taken:

```
[gateway] restart mode: full process restart (spawned pid 67308)
```

After this fix, the restart returns `mode: "supervised"` and launchd handles it.
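
For reviewers who want the detection logic in one place, here is a minimal sketch of the check described in the Summary. It is an illustration under assumptions, not the code in `process-respawn.ts`: the real `SUPERVISOR_HINT_ENV_VARS` list may contain further entries, treating any non-empty value as a hint is assumed, and `chooseRestartMode` and the `"spawned"` mode name are hypothetical names introduced here.

```typescript
// Sketch only: names marked "hypothetical" do not come from the PR.
type Env = Record<string, string | undefined>;

// Variables named in this PR. The real list may include others.
const SUPERVISOR_HINT_ENV_VARS: readonly string[] = [
  "LAUNCH_JOB_LABEL", // existing check; not set by launchd in the reported setup
  "LAUNCH_JOB_NAME", // assumed to be an existing check; also not set
  "XPC_SERVICE_NAME", // added: set by launchd on managed processes
  "OPENCLAW_LAUNCHD_LABEL", // added: propagated by OpenClaw service env flows
  "OPENCLAW_SYSTEMD_UNIT", // added: propagated by OpenClaw service env flows
];

// Assumption: a hint counts when the variable holds a non-blank string.
function isLikelySupervisedProcess(env: Env): boolean {
  return SUPERVISOR_HINT_ENV_VARS.some((name) => {
    const value = env[name];
    return typeof value === "string" && value.trim().length > 0;
  });
}

// Hypothetical helper: "supervised" means exit and let the supervisor
// restart the process; "spawned" means fork a detached replacement.
type RestartMode = "supervised" | "spawned";

function chooseRestartMode(env: Env): RestartMode {
  return isLikelySupervisedProcess(env) ? "supervised" : "spawned";
}
```

With an environment like the one `launchctl print` reports above (`XPC_SERVICE_NAME` set, `LAUNCH_JOB_LABEL` absent), this sketch picks the supervised path. With none of the listed variables set, it falls back to spawning.
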
## Human Verification (required)

- Verified scenarios:
  - All 7 process-respawn tests pass
  - All 15 restart-helper tests pass
  - Confirmed via `launchctl print` that `XPC_SERVICE_NAME` is set on the live launchd-managed gateway
- Edge cases checked:
  - `clearSupervisorHints()` test helper updated to also clear `XPC_SERVICE_NAME`, `OPENCLAW_LAUNCHD_LABEL`, and `OPENCLAW_SYSTEMD_UNIT`
  - Existing `LAUNCH_JOB_LABEL` detection path unchanged
- What you did **not** verify:
  - End-to-end SIGUSR1 restart under launchd with the fix deployed (manual verification TODO)

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly: Revert the two commits on this branch
- Files/config to restore: `src/infra/process-respawn.ts`
- Known bad symptoms reviewers should watch for: If a non-launchd macOS process happens to have `XPC_SERVICE_NAME` or `OPENCLAW_LAUNCHD_LABEL` set, it would incorrectly return `"supervised"` instead of spawning a child. In practice this is extremely unlikely outside of supervised contexts.

## Risks and Mitigations

- Risk: `XPC_SERVICE_NAME` could theoretically be set in non-launchd contexts (e.g. XPC services embedded in apps).
  - Mitigation: The variable name is specific to Apple's XPC/launchd infrastructure. Any process with it set is effectively supervised. The existing `LAUNCH_JOB_LABEL` check has the same theoretical concern.
- Risk: `OPENCLAW_LAUNCHD_LABEL` / `OPENCLAW_SYSTEMD_UNIT` could be set manually by a user outside of a supervised context.
  - Mitigation: These are internal OpenClaw env vars only propagated by daemon/service env flows. A user would have to explicitly set them, which would be an intentional override.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
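
The `clearSupervisorHints()` edge case and the manual-override risk above can be illustrated with a small sketch. Everything here is an assumption for illustration: the real test helper may mutate `process.env` rather than return a copy, and `HINT_VARS` and `hasSupervisorHint` are hypothetical stand-ins for the list and check in `src/infra/process-respawn.ts`.

```typescript
// Sketch only: not the helper from the test suite.
type Env = Record<string, string | undefined>;

// Variables named in this PR. The real list may include others.
const HINT_VARS = [
  "LAUNCH_JOB_LABEL",
  "LAUNCH_JOB_NAME",
  "XPC_SERVICE_NAME",
  "OPENCLAW_LAUNCHD_LABEL",
  "OPENCLAW_SYSTEMD_UNIT",
] as const;

// Hypothetical stand-in for the supervision check.
function hasSupervisorHint(env: Env): boolean {
  return HINT_VARS.some((name) => (env[name] ?? "").trim().length > 0);
}

// Returns a copy of env with every hint variable removed, so a test started
// from a supervised shell does not inherit the supervised path by accident.
function clearSupervisorHints(env: Env): Env {
  const copy: Env = { ...env };
  for (const name of HINT_VARS) {
    delete copy[name];
  }
  return copy;
}

// An environment resembling the launchd-managed gateway described above.
const launchdEnv: Env = {
  PATH: "/usr/bin",
  XPC_SERVICE_NAME: "ai.openclaw.gateway",
  OPENCLAW_LAUNCHD_LABEL: "ai.openclaw.gateway",
};

const cleaned = clearSupervisorHints(launchdEnv);

// Setting one variable by hand restores the hint, which is the
// intentional-override case listed under Risks and Mitigations.
const manualOverride: Env = { ...cleaned, OPENCLAW_LAUNCHD_LABEL: "manual" };
```

In this sketch `hasSupervisorHint(launchdEnv)` and `hasSupervisorHint(manualOverride)` are true, while `hasSupervisorHint(cleaned)` is false. If the helper missed any one of the new variables, a test meant to exercise the spawn path could take the supervised path instead.
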
