#12234: gateway: incident tracking, recover command, and ciao ERR_SERVER_CLOSED hardening
Labels: docs, cli, stale
Cluster: Gateway and macOS Improvements
## Problem
On supervised setups (launchd/systemd), the gateway can enter fast restart churn (SIGTERM restarts, crash→recover cycles). Two gaps show up operationally:
1) **No durable incident trail** for signals/crashes/recovery attempts (hard to correlate with cron/config activity).
2) `@homebridge/ciao` can throw an **uncaught** error during teardown:
```
ERR_SERVER_CLOSED: Cannot send packets on a closed mdns server!
```
…which turns a graceful stop into a crash/restart loop.
## Changes
### 1) Native gateway incident tracking (JSONL + summary state)
- Append-only incident log: `~/.openclaw/state/gateway-incidents.jsonl`
- Summary state: `~/.openclaw/state/gateway-incidents-state.json`
- Records: `start`, `signal`, `crash`, `recover`
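A minimal sketch of how an append-only JSONL log plus a summary file could fit together. The record fields (`ts`, `detail`), the summary shape, and the helper name `recordIncidentSync` are assumptions for illustration; the PR only names the two files and the four record kinds. Writes are synchronous here so that a record can still be persisted from crash and signal paths.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// The four kinds come from the PR; everything else in these types is assumed.
type IncidentKind = "start" | "signal" | "crash" | "recover";

interface IncidentRecord {
  ts: string; // ISO timestamp (assumed field)
  kind: IncidentKind;
  detail?: string; // e.g. signal name or error message (assumed field)
}

interface IncidentSummary {
  counts: Record<IncidentKind, number>;
  last?: IncidentRecord;
}

function emptySummary(): IncidentSummary {
  return { counts: { start: 0, signal: 0, crash: 0, recover: 0 } };
}

// Hypothetical helper: append one JSON line, then refresh the summary file.
function recordIncidentSync(stateDir: string, kind: IncidentKind, detail?: string): IncidentRecord {
  fs.mkdirSync(stateDir, { recursive: true });
  const record: IncidentRecord = { ts: new Date().toISOString(), kind, detail };
  fs.appendFileSync(path.join(stateDir, "gateway-incidents.jsonl"), JSON.stringify(record) + "\n");

  const summaryPath = path.join(stateDir, "gateway-incidents-state.json");
  const summary = emptySummary();
  try {
    const parsed = JSON.parse(fs.readFileSync(summaryPath, "utf8")) as Partial<IncidentSummary>;
    if (parsed.counts) summary.counts = { ...summary.counts, ...parsed.counts };
  } catch {
    // Missing or unreadable summary: rebuild it from zeroed counts.
  }
  summary.counts[kind] += 1;
  summary.last = record;
  fs.writeFileSync(summaryPath, JSON.stringify(summary, null, 2));
  return record;
}

// Usage against a throwaway directory rather than ~/.openclaw/state.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "incidents-"));
recordIncidentSync(demoDir, "start");
recordIncidentSync(demoDir, "signal", "SIGTERM");
```

Keeping the log append-only means a partially written summary can always be rebuilt from the JSONL file, which is one reason to separate the two.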
### 2) New CLI surfaces
- `openclaw gateway incidents` (human + `--json`)
- `openclaw gateway recover` (best-effort, non-destructive by default)
  - includes cooldown (`--cooldown-ms`) to avoid restart loops
  - records recovery attempts to the incident log
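The cooldown idea can be sketched as a pure check against the timestamp of the last recovery attempt. The function name, its parameters, and the decision shape are hypothetical; the PR only states that `--cooldown-ms` exists and that attempts are recorded in the incident log.

```typescript
// Hypothetical decision type; not taken from the PR.
interface CooldownDecision {
  allowed: boolean;
  waitMs: number; // remaining cooldown in milliseconds, 0 when allowed
}

// Decide whether a new recovery attempt may proceed.
// lastRecoverAtMs would come from the most recent `recover` record, if any.
function checkRecoverCooldown(
  lastRecoverAtMs: number | undefined,
  nowMs: number,
  cooldownMs: number,
): CooldownDecision {
  if (lastRecoverAtMs === undefined || cooldownMs <= 0) {
    return { allowed: true, waitMs: 0 };
  }
  const elapsed = nowMs - lastRecoverAtMs;
  if (elapsed >= cooldownMs) {
    return { allowed: true, waitMs: 0 };
  }
  // Clamp so a clock that moved backwards never reports more than one full cooldown.
  return { allowed: false, waitMs: Math.min(cooldownMs, cooldownMs - elapsed) };
}

// Example: last attempt 30 s ago with a 60 s cooldown.
const decision = checkRecoverCooldown(1_000, 31_000, 60_000);
console.log(decision.allowed ? "recovering" : `cooling down, retry in ${decision.waitMs} ms`);
```

Passing the current time in as an argument keeps the check deterministic and easy to unit test.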
### 3) mDNS hardening
- Suppress ciao `ERR_SERVER_CLOSED` as a non-fatal uncaught exception (warn + continue)
  - prevents crash/restart loops during shutdown/interface churn
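A rough sketch of what a narrowly targeted suppression could look like. It assumes the error exposes either a `code` of `ERR_SERVER_CLOSED` or the message quoted in the Problem section; the helper names and the exact matching rule are illustrative and may differ from what the PR implements in `src/index.ts`.

```typescript
// Assumed matcher: true only for the specific ciao teardown error.
function isCiaoServerClosedError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  const code = (err as { code?: unknown }).code;
  return (
    code === "ERR_SERVER_CLOSED" ||
    err.message.includes("Cannot send packets on a closed mdns server")
  );
}

// Route an uncaught error: warn and continue for the known-benign case,
// otherwise hand it to the normal fatal path.
function handleUncaught(
  err: unknown,
  warn: (msg: string) => void,
  fatal: (err: unknown) => void,
): void {
  if (isCiaoServerClosedError(err)) {
    warn(`suppressed non-fatal mDNS error: ${(err as Error).message}`);
    return;
  }
  fatal(err);
}

// Hypothetical wiring; call once at startup.
function installUncaughtHandler(): void {
  process.on("uncaughtException", (err) =>
    handleUncaught(err, console.warn, (e) => {
      console.error(e);
      process.exit(1);
    }),
  );
}
```

Everything that does not match still reaches the fatal path, so unrelated crashes keep their existing behavior.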
### 4) Cron UX
- `openclaw cron list` prints a note when disabled jobs are hidden (use `--all`)
### 5) Alert plumbing (minimal)
- Extend restart sentinel kinds with `crash` + `recover` and add a sync writer for crash paths
  - enables post-restart wake/notification flows to include crash/recover events
## Tests
Ran locally:
```bash
pnpm vitest run \
src/infra/gateway-incidents.test.ts \
src/infra/bonjour-uncaught.test.ts \
src/cli/gateway.sigterm.test.ts
```
## Docs
- `docs/incident-hardening.md`
- `docs/incident-model.md`
## Notes / compatibility
- New files/fields are additive; no protocol changes required.
- `ERR_SERVER_CLOSED` suppression is narrowly targeted (code/message match).
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adds a durable gateway incident trail (JSONL log + summary state under the state dir), wires synchronous incident recording into the gateway run loop (start/signal/crash), and adds two new CLI commands: `openclaw gateway incidents` (reads and prints the log) and `openclaw gateway recover` (best-effort service restart with cooldown and sentinel/incident recording). It also hardens shutdown by suppressing a specific `@homebridge/ciao` `ERR_SERVER_CLOSED` uncaught exception and adds tests covering the new incident logging and suppression behavior.
Key integration points are the global `process.on("uncaughtException")` handler in `src/index.ts`, the gateway lifecycle hooks in `src/cli/gateway-cli/run-loop.ts`, and the new persistence layer in `src/infra/gateway-incidents.ts`.
<h3>Confidence Score: 3/5</h3>
- This PR is close, but has a few correctness/runtime issues to address before merging.
- Core design looks consistent and tests cover the new incident primitives, but there are (1) a definite bug in `gateway recover` always calling `restart`, (2) a likely runtime-compat issue using `toReversed()`, and (3) the uncaught-exception gateway detection is broad enough to misclassify CLI crashes as gateway crashes, impacting incident/sentinel outputs.
- src/cli/gateway-cli/register.ts, src/infra/gateway-incidents.ts, src/index.ts
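On the `toReversed()` point: `Array.prototype.toReversed` is an ES2023 addition and is only available from Node.js 20, so older runtimes throw when it is called. Assuming it is used to list incidents newest-first (the review does not say), a copy-then-reverse helper gives the same non-mutating result on older runtimes. The helper name is hypothetical.

```typescript
// Non-mutating reverse that works on runtimes without Array.prototype.toReversed.
function newestFirst<T>(items: readonly T[], limit?: number): T[] {
  const copy = items.slice().reverse(); // slice() copies, so the input stays untouched
  return limit === undefined ? copy : copy.slice(0, limit);
}

// Example: records are appended oldest-first, displayed newest-first.
const kinds = ["start", "signal", "crash", "recover"];
console.log(newestFirst(kinds, 2));
```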
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->