#17702: feat: crash-loop detection and last-known-good config rollback
gateway
size: M
Cluster: Gateway and macOS Improvements
## Summary
Phase 1 implementation of atomic config management ([#17700](https://github.com/openclaw/openclaw/issues/17700)).
### Motivation
When AI agents modify `openclaw.json` with invalid values (unresolvable env vars, bad API keys), the gateway enters a crash loop with no automated recovery. This happened on 2026-02-16 when a sub-agent replaced API keys with `${ENV_VAR}` references that didn't exist in the systemd environment.
### What this adds
**`src/config/crash-tracker.ts`** — New module with:
- **Crash-loop detection**: Tracks startup timestamps in `crash-tracker.json`. If 3 or more startups occur within 60s, flags a crash loop.
- **Last-known-good (LKG) management**: `saveLastKnownGood()` copies the current config once the gateway has stayed up for a 10s health window. `revertToLastKnownGood()` restores that copy and saves the failed config for debugging.
- **`scheduleLastKnownGoodSave()`**: Timer-based hook. Call it on gateway startup and it saves the LKG config once the health window passes.
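The detection rule above can be sketched as a pure function over the recorded timestamps. This is a minimal illustration, not the code in `crash-tracker.ts`: the `CrashTrackerState` shape and the `recordStartup` name are assumptions, and the threshold of 3 follows the "3+ in 60s" step in the flow diagram.

```typescript
// Hypothetical shape of the tracker state; the real crash-tracker.json layout may differ.
interface CrashTrackerState {
  startupTimestamps: number[]; // epoch milliseconds
}

const CRASH_WINDOW_MS = 60_000; // 60s detection window, from the description
const CRASH_THRESHOLD = 3; // assumed: 3 or more startups inside the window count as a loop

// Record a startup at `now` and report whether recent history looks like a crash loop.
function recordStartup(
  state: CrashTrackerState,
  now: number,
): { state: CrashTrackerState; crashLoop: boolean } {
  // Drop startups that fell out of the window, then append the current one.
  const recent = state.startupTimestamps.filter((t) => now - t < CRASH_WINDOW_MS);
  recent.push(now);
  return {
    state: { startupTimestamps: recent },
    crashLoop: recent.length >= CRASH_THRESHOLD,
  };
}
```

Keeping the check free of file I/O makes the window and threshold easy to unit-test; persisting the returned state to `crash-tracker.json` would be a separate step.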
**`src/config/crash-tracker.test.ts`** — Full test coverage for all functions.
### Integration points (not in this PR)
The gateway startup code needs to:
1. Call `recordStartupAndCheckCrashLoop()` on start
2. If crash loop detected + LKG exists → call `revertToLastKnownGood()` and log
3. Call `scheduleLastKnownGoodSave()` after successful init
This PR provides the building blocks. Integration into the gateway startup flow can be a follow-up.
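A rough sketch of how that follow-up wiring might look is below. The three function names come from this PR, but their signatures and return types are assumed here, and `hasLastKnownGood`, `startGateway`, `init`, and `log` are hypothetical names used only for illustration.

```typescript
// Assumed surface of src/config/crash-tracker.ts; the real signatures may differ.
interface CrashTracker {
  recordStartupAndCheckCrashLoop(): boolean; // assumed: true when a crash loop is detected
  hasLastKnownGood(): boolean; // hypothetical helper
  revertToLastKnownGood(): boolean; // assumed: true when the revert succeeded
  scheduleLastKnownGoodSave(): void;
}

// Hypothetical startup wrapper; `init` stands in for the existing gateway boot code.
function startGateway(
  tracker: CrashTracker,
  init: () => void,
  log: (msg: string) => void,
): void {
  // Step 1: record this startup and check for a crash loop.
  if (tracker.recordStartupAndCheckCrashLoop()) {
    // Step 2: revert only when an LKG config exists, and log the outcome.
    if (tracker.hasLastKnownGood()) {
      const reverted = tracker.revertToLastKnownGood();
      log(reverted ? "crash loop detected: reverted to LKG config" : "crash loop detected: revert failed");
    } else {
      log("crash loop detected: no LKG config to revert to");
    }
  }
  init();
  // Step 3: only reached when init did not throw.
  tracker.scheduleLastKnownGoodSave();
}
```

Passing the tracker in as an argument keeps the sketch independent of the real module and lets the call order be checked with stubs.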
### How crash-loop revert works
```
Start → record timestamp → 3+ in 60s? → YES → revert to LKG → restart
                                      → NO  → continue boot
                                              → healthy 10s → save as LKG
```
Closes #17700 (Phase 1)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Phase 1 of atomic config management: adds crash-loop detection and last-known-good config rollback as a standalone module in `src/config/crash-tracker.ts` with colocated tests. The module tracks startup timestamps, detects rapid-restart loops (>=3 within 60s), and can revert to a previously saved good config while preserving the failed config for debugging.
- The core crash-loop detection and LKG save/revert logic is sound and well-tested.
- `scheduleLastKnownGoodSave` unconditionally clears crash tracker even when the LKG save fails, which can silently disable crash-loop protection on subsequent restarts.
- Integration into the gateway startup flow is deferred to a follow-up PR as noted in the description.
<h3>Confidence Score: 3/5</h3>
- New standalone module with no existing code changes; safe to merge with the noted logic fix in `scheduleLastKnownGoodSave`.
- Score reflects one logic bug in `scheduleLastKnownGoodSave` that could silently disable crash-loop protection when the LKG save fails. The module is self-contained with no changes to existing code, limiting blast radius. Core detection and revert logic is correct and well-tested.
- `src/config/crash-tracker.ts` — the `scheduleLastKnownGoodSave` function needs the crash-tracker clear to be conditional on a successful LKG save.
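One way the suggested fix could look is sketched below. It is not the code at the reviewed commit: `LkgDeps`, `clearCrashTracker`, and `onHealthWindowElapsed` are hypothetical names, and `saveLastKnownGood` is assumed to signal failure either by returning `false` or by throwing.

```typescript
// Hypothetical dependencies, injected so the sketch runs without touching the filesystem.
interface LkgDeps {
  saveLastKnownGood(): boolean; // assumed: true on success, false or throw on failure
  clearCrashTracker(): void; // hypothetical name for resetting crash-tracker.json
}

const HEALTH_WINDOW_MS = 10_000; // 10s health window, from the description

// Runs once the gateway has stayed up for the health window.
function onHealthWindowElapsed(deps: LkgDeps): boolean {
  let saved = false;
  try {
    saved = deps.saveLastKnownGood();
  } catch {
    saved = false; // treat a thrown error as a failed save
  }
  // Clear the startup history only when the LKG copy was written,
  // so a failed save leaves crash-loop protection armed.
  if (saved) {
    deps.clearCrashTracker();
  }
  return saved;
}

// Timer-based hook; the returned handle lets callers cancel it on shutdown.
function scheduleLastKnownGoodSave(
  deps: LkgDeps,
  delayMs: number = HEALTH_WINDOW_MS,
): ReturnType<typeof setTimeout> {
  return setTimeout(() => onHealthWindowElapsed(deps), delayMs);
}
```

Splitting the timer from the callback keeps the conditional clear testable without waiting for the timer to fire.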
<sub>Last reviewed commit: 7e4761c</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #21944: feat(gateway): crash-loop protection with escalating backoff (by Protocol-zero-0 · 2026-02-20 · 83.8% similar)
- #21931: feat(config): auto-rollback to last known-good backup on invalid co... (by Protocol-zero-0 · 2026-02-20 · 81.7% similar)
- #12234: gateway: incident tracking, recover command, and ciao ERR_SERVER_CL... (by levineam · 2026-02-09 · 78.0% similar)
- #11455: fix(gateway): default gateway.mode to local when unset (by AnonO6 · 2026-02-07 · 74.5% similar)
- #5823: fix(config): exit cleanly on invalid config instead of high CPU loop (by gavinbmoore · 2026-02-01 · 73.6% similar)
- #22424: fix: prevent crash when onUpdate is truthy but not callable (fixes ... (by mcaxtr · 2026-02-21 · 73.5% similar)
- #14313: feat: Atomic OpenClaw Configuration Management (by aronchick · 2026-02-11 · 73.2% similar)
- #21994: Config: load valid backup when primary config is invalid (by islavutin · 2026-02-20 · 73.0% similar)
- #15050: fix: transcript corruption resilience — strip aborted tool_use bloc... (by yashchitneni · 2026-02-12 · 73.0% similar)
- #14564: fix(gateway): crashes on startup when tailscale meets non-loopback ... (by yinghaosang · 2026-02-12 · 72.9% similar)