
#17702: feat: crash-loop detection and last-known-good config rollback

by aronchick open 2026-02-16 03:16
gateway size: M
## Summary

Phase 1 implementation of atomic config management ([#17700](https://github.com/openclaw/openclaw/issues/17700)).

### Motivation

When AI agents modify `openclaw.json` with invalid values (unresolvable env vars, bad API keys), the gateway enters a crash loop with no automated recovery. This happened on 2026-02-16 when a sub-agent replaced API keys with `${ENV_VAR}` references that didn't exist in the systemd environment.

### What this adds

**`src/config/crash-tracker.ts`** (new module) provides:

- **Crash-loop detection**: Tracks startup timestamps in `crash-tracker.json`. If >3 startups occur within 60s, flags a crash loop.
- **Last-known-good management**: `saveLastKnownGood()` copies current config after healthy startup (10s). `revertToLastKnownGood()` restores it and saves the failed config for debugging.
- **`scheduleLastKnownGoodSave()`**: Timer-based hook. Call it on gateway startup, and it saves LKG after the health window passes.

**`src/config/crash-tracker.test.ts`**: Full test coverage for all functions.

### Integration points (not in this PR)

The gateway startup code needs to:

1. Call `recordStartupAndCheckCrashLoop()` on start
2. If crash loop detected + LKG exists → call `revertToLastKnownGood()` and log
3. Call `scheduleLastKnownGoodSave()` after successful init

This PR provides the building blocks. Integration into the gateway startup flow can be a follow-up.

### How crash-loop revert works

```
Start → record timestamp → 3+ in 60s?
  → YES → revert to LKG → restart
  → NO  → continue boot → healthy 10s → save as LKG
```

Closes #17700 (Phase 1)

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Phase 1 of atomic config management: adds crash-loop detection and last-known-good config rollback as a standalone module in `src/config/crash-tracker.ts` with colocated tests. The module tracks startup timestamps, detects rapid-restart loops (>=3 within 60s), and can revert to a previously saved good config while preserving the failed config for debugging.

- The core crash-loop detection and LKG save/revert logic is sound and well-tested.
- `scheduleLastKnownGoodSave` unconditionally clears crash tracker even when the LKG save fails, which can silently disable crash-loop protection on subsequent restarts.
- Integration into the gateway startup flow is deferred to a follow-up PR as noted in the description.

<h3>Confidence Score: 3/5</h3>

- New standalone module with no existing code changes; safe to merge with the noted logic fix in `scheduleLastKnownGoodSave`.
- Score reflects one logic bug in `scheduleLastKnownGoodSave` that could silently disable crash-loop protection when the LKG save fails. The module is self-contained with no changes to existing code, limiting blast radius. Core detection and revert logic is correct and well-tested.
- `src/config/crash-tracker.ts`: the `scheduleLastKnownGoodSave` function needs the crash-tracker clear to be conditional on a successful LKG save.

<sub>Last reviewed commit: 7e4761c</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
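
To make the review point concrete, here is a minimal sketch of the two behaviours discussed above: the sliding-window startup check, and clearing the tracker only after a successful LKG save. It is not the code in `src/config/crash-tracker.ts`. The names `recordStartup`, `completeHealthyStartup`, `TrackerState`, and the callback parameters are hypothetical, and the logic is kept free of file I/O so it can be read in isolation. The description and the review state the threshold in slightly different ways, so the sketch takes it as a parameter and assumes "3 or more within the window".

```typescript
// Hypothetical sketch: names, shapes, and defaults are assumptions for
// illustration, not the actual contents of src/config/crash-tracker.ts.

interface TrackerState {
  startupTimestamps: number[]; // epoch milliseconds of recent startups
}

interface StartupCheck {
  state: TrackerState; // pruned history, including the current startup
  crashLoop: boolean;
}

// Sliding-window check: drop startups older than the window, add the current
// one, and flag a loop once the count reaches the threshold.
function recordStartup(
  state: TrackerState,
  nowMs: number,
  windowMs: number = 60_000,
  threshold: number = 3, // assumed to mean "3 or more"
): StartupCheck {
  const recent = state.startupTimestamps.filter((t) => nowMs - t < windowMs);
  recent.push(nowMs);
  return {
    state: { startupTimestamps: recent },
    crashLoop: recent.length >= threshold,
  };
}

interface HealthyStartupResult {
  saved: boolean;
  trackerCleared: boolean;
}

// Intended to run once the health window has elapsed. The tracker is cleared
// only when the last-known-good save succeeded, so a failed save leaves the
// startup history (and therefore crash-loop detection) intact.
function completeHealthyStartup(
  saveLkg: () => boolean,
  clearTracker: () => void,
): HealthyStartupResult {
  let saved = false;
  try {
    saved = saveLkg();
  } catch {
    saved = false; // treat a thrown error as a failed save
  }
  if (saved) {
    clearTracker();
  }
  return { saved, trackerCleared: saved };
}
```

In a real integration, `saveLkg` and `clearTracker` would presumably wrap the module's own save and reset routines, and `completeHealthyStartup` would be invoked from the timer that `scheduleLastKnownGoodSave()` sets up. That wiring is an assumption here, not something this PR shows.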
