#21944: feat(gateway): crash-loop protection with escalating backoff

by Protocol-zero-0 open 2026-02-20 15:20

docs cli size: M

## TL;DR for maintainers

Adds built-in restart rate-limiting so the gateway stops burning CPU/disk when it hits a fatal error that can't self-heal. Escalating backoff (0 → 30 s → 5 min → hard stop) based on crash frequency in a 15-minute sliding window. Zero behavioural change for healthy startups.

Closes #16810

---

## Problem

When the gateway process crashes, there is no rate limiting on restart attempts. A single bad flag in `NODE_OPTIONS` or an invalid config entry causes an unbounded crash loop: 18 restarts in 5 minutes was observed in production, saturating CPU and disk I/O and taking down all channels (Telegram, Discord, WhatsApp) for 30+ minutes.

System-level supervisors (`launchd`, `systemd`) handle process restarts but cannot diagnose or fix the root cause; they just restart into the same crash.

## Solution

A new self-contained module `src/cli/gateway-cli/crash-loop-guard.ts` that:

1. **Records** crash timestamps to `<stateDir>/gateway-crash-history.json` via a synchronous `process.on('exit')` handler (safe for exit hooks).
2. **Checks** crash history at the very start of `runGatewayCommand` and applies an escalating backoff schedule:

   | Recent crashes (15 min window) | Action |
   |---|---|
   | 1–3 | Immediate restart (current behaviour) |
   | 4–6 | 30-second delay before starting |
   | 7–9 | 5-minute delay before starting |
   | 10+ | Refuse to start (`CrashLoopError`) |

3. **Clears** the crash history once `startGatewayServer` completes successfully, so normal restarts (SIGUSR1, config reload) are never penalised.
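The backoff schedule in the table above could be sketched as a pure function over the recorded timestamps. This is a minimal illustration, not the code from `crash-loop-guard.ts`: the names `WINDOW_MS`, `GuardAction`, `recentCrashes` and `decideAction` are hypothetical, and treating the 15-minute boundary as inclusive is an assumption.

```typescript
// Hypothetical sketch: names and boundary handling are assumptions,
// not taken from the PR's crash-loop-guard.ts.
const WINDOW_MS = 15 * 60 * 1000; // 15-minute sliding window

type GuardAction =
  | { kind: "start"; delayMs: number }
  | { kind: "refuse"; recentCrashes: number };

/** Keep only crash timestamps (ms since epoch) that fall inside the window. */
function recentCrashes(timestamps: number[], now: number): number[] {
  return timestamps.filter((t) => t <= now && now - t <= WINDOW_MS);
}

/** Map the number of recent crashes to the tier listed in the table. */
function decideAction(timestamps: number[], now: number): GuardAction {
  const count = recentCrashes(timestamps, now).length;
  if (count >= 10) return { kind: "refuse", recentCrashes: count };
  if (count >= 7) return { kind: "start", delayMs: 5 * 60 * 1000 };
  if (count >= 4) return { kind: "start", delayMs: 30 * 1000 };
  return { kind: "start", delayMs: 0 };
}
```

Keeping the tier decision separate from file I/O and sleeping means each tier can be exercised with fixed timestamps, without touching the state directory or waiting for real delays.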
### Files changed

| File | Change |
|---|---|
| `src/cli/gateway-cli/crash-loop-guard.ts` | New module: `recordGatewayCrash`, `clearGatewayCrashHistory`, `applyCrashLoopGuard`, `CrashLoopError` |
| `src/cli/gateway-cli/crash-loop-guard.test.ts` | 8 unit tests covering all tiers + edge cases |
| `src/cli/gateway-cli/run.ts` | Integrate guard at startup + exit hook + clear on success |

## What's NOT in this PR

- Notification via messaging channel on crash loop (could be a follow-up)
- Systemd unit file template (docs-only, separate PR)

## Testing

- 8 new unit tests: empty history, below threshold, 30 s tier, 5 min tier, hard stop (`CrashLoopError`), corrupt file recovery
- Existing run-loop tests pass (4/4)

## AI disclosure

This change was AI-assisted (research + implementation). All code was manually reviewed and tested by the author.

Made with [Cursor](https://cursor.com)

<h3>Greptile Summary</h3>

Adds escalating crash-loop protection for the gateway process to prevent unbounded restart attempts when hitting fatal errors. Tracks crash history in a 15-minute sliding window and applies graduated backoff (30s → 5min → hard stop) to protect against CPU/disk saturation.

**Key changes:**

- New `crash-loop-guard.ts` module with `recordGatewayCrash`, `clearGatewayCrashHistory`, and `applyCrashLoopGuard`
- Integration in `run.ts`: guard check at startup, exit handler to record crashes, clear history on successful server start
- 8 unit tests covering all backoff tiers, corrupt file recovery, and edge cases

**Issues found:**

- Exit handler registered before validation checks, causing config errors to be recorded as crashes (see inline comment)

<h3>Confidence Score: 3/5</h3>

- This PR is mostly safe but has a logic issue that needs fixing before merge
- The implementation is well-structured with good test coverage, but the exit handler placement causes validation errors to be incorrectly counted as crashes. This could lead to false crash-loop detection after config mistakes. The core crash-loop protection logic is sound, file I/O is safely handled, and the backoff tiers are reasonable. Fix the exit handler timing and this becomes a solid PR.
- Pay close attention to `src/cli/gateway-cli/run.ts`: the exit handler needs to be moved after validation checks

<sub>Last reviewed commit: 123ebcd</sub>
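
The ordering the review asks for might look roughly like the sketch below. It is an illustration of the suggestion only, under the assumption that validation throws on a bad config; `StartupSteps`, `validateConfig`, `armCrashRecorder` and `prepareStartup` are hypothetical names and do not come from `run.ts`.

```typescript
// Hypothetical sketch of the review's suggested ordering; none of these
// names are taken from the PR.
type StartupSteps = {
  validateConfig: () => void; // assumed to throw on an invalid config or flag
  armCrashRecorder: () => void; // stands in for registering the exit handler
};

function prepareStartup(steps: StartupSteps): void {
  // Validate first: a validation failure throws before the recorder exists,
  // so it is never written to the crash history.
  steps.validateConfig();
  // Reached only when validation passed; later exits can count as crashes.
  steps.armCrashRecorder();
}
```

With this order, a run that fails validation leaves the crash history untouched, while a run that passes validation and later exits abnormally is still recorded.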