
#12234: gateway: incident tracking, recover command, and ciao ERR_SERVER_CLOSED hardening

by levineam open 2026-02-09 01:46
docs cli stale
## Problem

On supervised setups (launchd/systemd), the gateway can enter fast restart churn (SIGTERM restarts, crash→recover cycles). Two gaps show up operationally:

1) **No durable incident trail** for signals/crashes/recovery attempts (hard to correlate with cron/config activity).
2) `@homebridge/ciao` can throw an **uncaught** error during teardown:

```
ERR_SERVER_CLOSED: Cannot send packets on a closed mdns server!
```

…which turns a graceful stop into a crash/restart loop.

## Changes

### 1) Native gateway incident tracking (JSONL + summary state)

- Append-only incident log: `~/.openclaw/state/gateway-incidents.jsonl`
- Summary state: `~/.openclaw/state/gateway-incidents-state.json`
- Records: `start`, `signal`, `crash`, `recover`

### 2) New CLI surfaces

- `openclaw gateway incidents` (human + `--json`)
- `openclaw gateway recover` (best-effort, non-destructive by default)
  - includes cooldown (`--cooldown-ms`) to avoid restart loops
  - records recovery attempts to the incident log

### 3) mDNS hardening

- Suppress ciao `ERR_SERVER_CLOSED` as a non-fatal uncaught exception (warn + continue)
  - prevents crash/restart loops during shutdown/interface churn

### 4) Cron UX

- `openclaw cron list` prints a note when disabled jobs are hidden (use `--all`)

### 5) Alert plumbing (minimal)

- Extend restart sentinel kinds with `crash` + `recover` and add a sync writer for crash paths
  - enables post-restart wake/notification flows to include crash/recover events

## Tests

Ran locally:

```bash
pnpm vitest run \
  src/infra/gateway-incidents.test.ts \
  src/infra/bonjour-uncaught.test.ts \
  src/cli/gateway.sigterm.test.ts
```

## Docs

- `docs/incident-hardening.md`
- `docs/incident-model.md`

## Notes / compatibility

- New files/fields are additive; no protocol changes required.
- `ERR_SERVER_CLOSED` suppression is narrowly targeted (code/message match).
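For reviewers, a minimal sketch of what a "narrowly targeted (code/message match)" suppression could look like. This is not the code from the PR: the names `isCiaoServerClosedError` and `handleUncaught` are hypothetical, and the sketch assumes the ciao error exposes either a `code` property equal to `ERR_SERVER_CLOSED` or the message text quoted in the Problem section.

```typescript
// Hypothetical predicate: true only for the specific mDNS teardown error.
// Assumption: the error carries code "ERR_SERVER_CLOSED" or the quoted message.
function isCiaoServerClosedError(err: unknown): boolean {
  if (!(err instanceof Error)) return false;
  const code = (err as Error & { code?: unknown }).code;
  if (code === "ERR_SERVER_CLOSED") return true;
  return err.message.includes("Cannot send packets on a closed mdns server");
}

// Hypothetical decision helper: warn and continue for the targeted error,
// report everything else as fatal so the caller can crash as before.
function handleUncaught(
  err: unknown,
  warn: (msg: string) => void,
): "suppressed" | "fatal" {
  if (isCiaoServerClosedError(err)) {
    warn(`ignoring non-fatal mDNS teardown error: ${(err as Error).message}`);
    return "suppressed";
  }
  return "fatal";
}
```

A global handler would then call `handleUncaught(err, console.warn)` inside `process.on("uncaughtException", ...)` and exit only on `"fatal"`. Keeping the match to one code and one message fragment means unrelated uncaught exceptions still crash the process.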
<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR adds a durable gateway incident trail (JSONL log + summary state under the state dir), wires synchronous incident recording into the gateway run loop (start/signal/crash), and adds two new CLI commands: `openclaw gateway incidents` (reads and prints the log) and `openclaw gateway recover` (best-effort service restart with cooldown and sentinel/incident recording). It also hardens shutdown by suppressing a specific `@homebridge/ciao` `ERR_SERVER_CLOSED` uncaught exception and adds tests covering the new incident logging and suppression behavior.

Key integration points are the global `process.on("uncaughtException")` handler in `src/index.ts`, the gateway lifecycle hooks in `src/cli/gateway-cli/run-loop.ts`, and the new persistence layer in `src/infra/gateway-incidents.ts`.

<h3>Confidence Score: 3/5</h3>

- This PR is close, but has a few correctness/runtime issues to address before merging.
- Core design looks consistent and tests cover the new incident primitives, but there are (1) a definite bug in `gateway recover` always calling `restart`, (2) a likely runtime-compat issue using `toReversed()`, and (3) the uncaught-exception gateway detection is broad enough to misclassify CLI crashes as gateway crashes, impacting incident/sentinel outputs.
- src/cli/gateway-cli/register.ts, src/infra/gateway-incidents.ts, src/index.ts

<!-- /greptile_comment -->
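To make the incident-log and cooldown discussion concrete, here is a rough sketch of an append-only JSONL log with a newest-first reader and a cooldown check. It is an illustration under stated assumptions, not the contents of `src/infra/gateway-incidents.ts`: the record fields (`ts`, `kind`, `detail`) and all function names are hypothetical, and the cooldown rule (block a new recovery while the last `recover` record is younger than the cooldown) is one plausible reading of `--cooldown-ms`. The reader copies with `slice()` before `reverse()`, which sidesteps the `toReversed()` runtime concern raised above.

```typescript
import { appendFileSync, existsSync, mkdirSync, readFileSync } from "node:fs";
import { dirname } from "node:path";

// The four kinds come from the PR description; the record shape is assumed.
type IncidentKind = "start" | "signal" | "crash" | "recover";

interface Incident {
  ts: number; // epoch milliseconds (hypothetical field)
  kind: IncidentKind;
  detail?: string; // free-form context (hypothetical field)
}

// Append one incident as a single JSON line. Synchronous on purpose,
// so it can still run from a crash path before the process exits.
function appendIncident(logPath: string, incident: Incident): void {
  mkdirSync(dirname(logPath), { recursive: true });
  appendFileSync(logPath, JSON.stringify(incident) + "\n", "utf8");
}

// Read the log and return records newest first.
function readIncidentsNewestFirst(logPath: string): Incident[] {
  if (!existsSync(logPath)) return [];
  const incidents = readFileSync(logPath, "utf8")
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Incident);
  // slice() copies, reverse() flips the copy; no toReversed() needed.
  return incidents.slice().reverse();
}

// True when the most recent recover attempt is younger than cooldownMs.
function inCooldown(incidents: Incident[], cooldownMs: number, now: number): boolean {
  const lastRecover = incidents
    .filter((i) => i.kind === "recover")
    .reduce((latest, i) => Math.max(latest, i.ts), -Infinity);
  return now - lastRecover < cooldownMs;
}
```

With no prior `recover` record, `lastRecover` stays at `-Infinity`, so `inCooldown` returns `false` and a first recovery attempt is always allowed.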
