#21944: feat(gateway): crash-loop protection with escalating backoff
docs
cli
size: M
Cluster:
Gateway and macOS Improvements
## TL;DR for maintainers
Adds built-in restart rate-limiting so the gateway stops burning CPU/disk
when it hits a fatal error that can't self-heal. Escalating backoff
(0 → 30 s → 5 min → hard stop) based on crash frequency in a 15-minute
sliding window. Zero behavioural change for healthy startups.
Closes #16810
---
## Problem
When the gateway process crashes, there is no rate limiting on restart
attempts. A single bad flag in `NODE_OPTIONS` or an invalid config entry
causes an unbounded crash loop — 18 restarts in 5 minutes was observed
in production, saturating CPU and disk I/O and taking down all channels
(Telegram, Discord, WhatsApp) for 30+ minutes.
System-level supervisors (`launchd`, `systemd`) handle process restarts
but cannot diagnose or fix the root cause — they just restart into the
same crash.
## Solution
A new self-contained module `src/cli/gateway-cli/crash-loop-guard.ts`
that:
1. **Records** crash timestamps to `<stateDir>/gateway-crash-history.json`
via a synchronous `process.on('exit')` handler (safe for exit hooks).
2. **Checks** crash history at the very start of `runGatewayCommand` and
applies an escalating backoff schedule:
| Recent crashes (15 min window) | Action |
|---|----|
| 1–3 | Immediate restart (current behaviour) |
| 4–6 | 30-second delay before starting |
| 7–9 | 5-minute delay before starting |
| 10+ | Refuse to start (`CrashLoopError`) |
3. **Clears** the crash history once `startGatewayServer` completes
successfully, so normal restarts (SIGUSR1, config reload) are never
penalised.
### Files changed
| File | Change |
|---|---|
| `src/cli/gateway-cli/crash-loop-guard.ts` | New module: `recordGatewayCrash`, `clearGatewayCrashHistory`, `applyCrashLoopGuard`, `CrashLoopError` |
| `src/cli/gateway-cli/crash-loop-guard.test.ts` | 8 unit tests covering all tiers + edge cases |
| `src/cli/gateway-cli/run.ts` | Integrate guard at startup + exit hook + clear on success |
## What's NOT in this PR
- Notification via messaging channel on crash loop (could be a follow-up)
- Systemd unit file template (docs-only, separate PR)
## Testing
- 8 new unit tests: empty history, below threshold, 30 s tier, 5 min tier,
hard stop (CrashLoopError), corrupt file recovery
- Existing run-loop tests pass (4/4)
## AI disclosure
This change was AI-assisted (research + implementation). All code was
manually reviewed and tested by the author.
Made with [Cursor](https://cursor.com)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds escalating crash-loop protection for the gateway process to prevent unbounded restart attempts when hitting fatal errors. Tracks crash history in a 15-minute sliding window and applies graduated backoff (30s → 5min → hard stop) to protect against CPU/disk saturation.
**Key changes:**
- New `crash-loop-guard.ts` module with `recordGatewayCrash`, `clearGatewayCrashHistory`, and `applyCrashLoopGuard`
- Integration in `run.ts`: guard check at startup, exit handler to record crashes, clear history on successful server start
- 8 unit tests covering all backoff tiers, corrupt file recovery, and edge cases
**Issues found:**
- Exit handler registered before validation checks, causing config errors to be recorded as crashes (see inline comment)
<h3>Confidence Score: 3/5</h3>
- This PR is mostly safe but has a logic issue that needs fixing before merge
- The implementation is well-structured with good test coverage, but the exit handler placement causes validation errors to be incorrectly counted as crashes. This could lead to false crash-loop detection after config mistakes. The core crash-loop protection logic is sound, file I/O is safely handled, and the backoff tiers are reasonable. Fix the exit handler timing and this becomes a solid PR.
- Pay close attention to `src/cli/gateway-cli/run.ts` - the exit handler needs to be moved after validation checks
<sub>Last reviewed commit: 123ebcd</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#17702: feat: crash-loop detection and last-known-good config rollback
by aronchick · 2026-02-16
83.8%
#12234: gateway: incident tracking, recover command, and ciao ERR_SERVER_CL...
by levineam · 2026-02-09
81.0%
#21931: feat(config): auto-rollback to last known-good backup on invalid co...
by Protocol-zero-0 · 2026-02-20
80.1%
#4653: fix(gateway): improve crash resilience for mDNS and network errors
by AyedAlmudarra · 2026-01-30
77.7%
#14564: fix(gateway): crashes on startup when tailscale meets non-loopback ...
by yinghaosang · 2026-02-12
77.2%
#22424: fix: prevent crash when onUpdate is truthy but not callable (fixes ...
by mcaxtr · 2026-02-21
77.0%
#17758: Fix crash on transient Discord gateway zombie connection errors
by DoyoDia · 2026-02-16
76.0%
#13084: fix(daemon): multi-layer defense against zombie gateway processes
by openperf · 2026-02-10
75.9%
#23364: Gateway: add risk-ack interlock for dangerous Control UI flags
by bmendonca3 · 2026-02-22
75.2%
#21529: Gateway: allow node health and throttle repeated unauthorized role ...
by doomsday616 · 2026-02-20
75.1%