#10829: fix: prevent cron scheduler permanent death on transient startup/runtime errors
stale
Cluster: Cron Scheduler Improvements
## Problem
The cron scheduler can permanently die and never recover after transient errors (file I/O failures, disk contention during gateway restarts, etc.), causing all scheduled jobs to silently stop firing with zero indication in the logs.
This was observed in production: the scheduler went completely silent for 20+ hours despite the gateway process running normally and responding to API calls (e.g. `cron.status`, `cron.run`).
## Root Cause
Three bugs in the self-rearming timer chain (`armTimer → setTimeout → onTimer → armTimer → ...`):
### Bug 1: `start()` silent failure (PRIMARY)
`start()` in `ops.ts` places `armTimer()` inside the `locked()` callback. If `ensureLoaded()`, `runMissedJobs()`, `recomputeNextRuns()`, or `persist()` throws a transient error (e.g. file I/O during gateway restart), `armTimer()` is never called. The timer is never created. The "cron: started" log is never printed. The scheduler simply never exists — with no error trace.
**Evidence from logs:** A 20+ hour gap with no "cron: started" entries was observed, despite multiple gateway restarts during that window. The gateway was serving requests normally but the cron scheduler was completely dead.
### Bug 2: `armTimer()` gives up when store is not loaded
When `ensureLoaded()` fails in `start()`, `state.store` remains `null`. When `armTimer()` is then called, `nextWakeAtMs(state)` returns `undefined` (no store = no jobs = no next wake time), so `armTimer()` returns without setting any timer. Even with the `onTimer()` finally-block fix already in place, this means there is no recovery path — the scheduler dies permanently.
### Bug 3: `.catch` handler doesn't re-arm (minor)
The `.catch` handler on `armTimer`'s setTimeout callback only logs the error without re-arming the timer. While `onTimer()` already has `armTimer` in its `finally` block, this is a gap in the belt-and-suspenders defense.
## Fix
### 1. `ops.ts` — `start()` resilience
Move `armTimer()` and `log.info("cron: started")` **outside** the `locked()` callback. Wrap the locked block in `try/catch` so startup errors are logged but don't prevent the timer from being created. The timer's first tick will retry `ensureLoaded`, allowing the scheduler to self-heal.
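The restructured `start()` can be sketched roughly as follows. This is a minimal illustration, not the actual `ops.ts` code: the `StartDeps` interface, the dependency-injection shape, and the error log message are assumptions made so the example is self-contained. Only the function names and the "cron: started" log line come from the description above.

```typescript
// Hypothetical, simplified stand-ins for the real cron service internals.
interface Logger {
  info(msg: string): void;
  error(msg: string, err: unknown): void;
}

interface StartDeps {
  locked(fn: () => Promise<void>): Promise<void>; // runs fn under the store lock
  ensureLoaded(): Promise<void>;
  runMissedJobs(): Promise<void>;
  recomputeNextRuns(): void;
  persist(): Promise<void>;
  armTimer(): void;
  log: Logger;
}

async function start(deps: StartDeps): Promise<void> {
  try {
    await deps.locked(async () => {
      await deps.ensureLoaded();
      await deps.runMissedJobs();
      deps.recomputeNextRuns();
      await deps.persist();
    });
  } catch (err) {
    // Assumed message text. The error object is passed through unchanged
    // so the logger can keep the stack trace.
    deps.log.error("cron: startup failed, first timer tick will retry", err);
  }
  // Outside the lock and outside the try block: runs even if startup threw.
  deps.armTimer();
  deps.log.info("cron: started");
}
```

Passing the error object to the logger rather than a stringified message keeps the stack trace available, which matters most for exactly the transient I/O failures this change targets.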
### 2. `timer.ts` — `armTimer()` store retry
When `armTimer()` has no next wake time AND `state.store` is `null` (indicating the store couldn't be loaded), schedule a retry timer using `MAX_TIMER_DELAY_MS` (60s). This ensures the scheduler retries loading the store instead of silently giving up.
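A minimal sketch of the retry decision, under stated assumptions: the `CronState` and `Job` shapes, the `nextDelayMs` helper, and the clamping of normal delays to `MAX_TIMER_DELAY_MS` are hypothetical simplifications, not the real `timer.ts`. Only `armTimer`, `nextWakeAtMs`, `state.store`, and the 60s constant are named in the description above.

```typescript
const MAX_TIMER_DELAY_MS = 60_000; // 60s, as stated in the PR description

// Hypothetical, simplified state shapes.
interface Job {
  nextRunAtMs?: number;
}
interface CronState {
  store: { jobs: Job[] } | null; // null means the store was never loaded
  timer: ReturnType<typeof setTimeout> | null;
}

function nextWakeAtMs(state: CronState): number | undefined {
  const times = (state.store?.jobs ?? [])
    .map((job) => job.nextRunAtMs)
    .filter((t): t is number => typeof t === "number");
  return times.length > 0 ? Math.min(...times) : undefined;
}

// Pure helper (an assumption of this sketch): delay until the next tick,
// or undefined when the scheduler should stay idle.
function nextDelayMs(state: CronState, nowMs: number): number | undefined {
  const wakeAt = nextWakeAtMs(state);
  if (wakeAt === undefined) {
    // Store not loaded: retry later instead of giving up.
    // Store loaded but nothing scheduled: no timer is needed.
    return state.store === null ? MAX_TIMER_DELAY_MS : undefined;
  }
  return Math.min(Math.max(wakeAt - nowMs, 0), MAX_TIMER_DELAY_MS);
}

function armTimer(state: CronState, onTimer: () => void, nowMs: number = Date.now()): void {
  if (state.timer !== null) clearTimeout(state.timer);
  state.timer = null;
  const delay = nextDelayMs(state, nowMs);
  if (delay === undefined) return;
  state.timer = setTimeout(onTimer, delay);
}
```

Separating the delay calculation from the `setTimeout` call is a choice made for this sketch so the retry rule can be checked without waiting on real timers.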
### 3. `timer.ts` — `.catch` handler re-arm
Add `armTimer(state)` in the `.catch` handler as a last-resort safety net, in case `onTimer()` throws past its `finally` block.
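The safety net might look something like this. It is a sketch under assumptions: the `ChainState` type, the fixed `delayMs` parameter, and the `logError` callback are hypothetical, and the real `onTimer()` also re-arms in its own `finally` block, which is omitted here to isolate the `.catch` path.

```typescript
// Hypothetical minimal state: at most one pending timer at a time.
interface ChainState {
  timer: ReturnType<typeof setTimeout> | null;
}

function armTimer(
  state: ChainState,
  onTimer: () => Promise<void>,
  delayMs: number,
  logError: (err: unknown) => void,
): void {
  // Clearing first makes re-arming idempotent, so a re-arm from both
  // onTimer's finally block and this .catch handler would not stack timers.
  if (state.timer !== null) clearTimeout(state.timer);
  state.timer = setTimeout(() => {
    state.timer = null;
    onTimer().catch((err: unknown) => {
      logError(err);
      // Last-resort re-arm: keeps the chain alive if onTimer rejected
      // before (or despite) its own re-arm.
      armTimer(state, onTimer, delayMs, logError);
    });
  }, delayMs);
}
```

Because `armTimer` clears any pending timer before scheduling a new one, the extra call in `.catch` is harmless in the common case where `onTimer()` has already re-armed.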
## Behavior After Fix
| Scenario | Before | After |
|---|---|---|
| `start()` throws | Timer never created, scheduler permanently dead, no error log | Error logged, timer still arms, next tick retries |
| `ensureLoaded` fails (store=null) | `armTimer` returns silently, no timer | `armTimer` schedules 60s retry, `onTimer` retries load |
| `onTimer` throws past finally | `.catch` logs only, timer chain dead | `.catch` re-arms timer |
## Test plan
- [ ] Verify `pnpm build` passes
- [ ] Verify `pnpm test` passes
- [ ] Manual: confirm cron jobs fire on schedule after gateway restart
- [ ] Manual: confirm "cron: started" always appears in logs after gateway boot
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
- Makes cron scheduler startup/timer loop resilient to transient failures by ensuring `armTimer()` is called even when initialization throws.
- Adds a retry timer when the cron store isn’t loaded (store is `null`), preventing the scheduler from going permanently idle.
- Adds a belt-and-suspenders re-arm in the timer tick error handler so the timer chain can recover from unexpected exceptions.
<h3>Confidence Score: 4/5</h3>
- This PR is likely safe to merge, with one logging issue that could hinder diagnosing the very failures it targets.
- Behavioral changes are localized to cron startup and timer arming logic and are consistent with the stated failure modes. The main concern is that the new `start()` catch block stringifies errors before logging them, which drops stack traces and structured info and reduces observability during transient I/O failures.
- src/cron/service/ops.ts (startup error logging)
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #12086: fix(cron): ensure timer callback fires for scheduled jobs (by divol89 · 2026-02-08 · 87.5%)
- #12122: fix(cron): ensure timer callback fires for scheduled jobs (by divol89 · 2026-02-08 · 86.6%)
- #14430: Cron: anti-zombie scheduler recovery and in-flight job persistence (by philga7 · 2026-02-12 · 85.4%)
- #23650: fix(cron): arm maintenance timer when enabled jobs lack nextRunAtMs (by taw0002 · 2026-02-22 · 85.0%)
- #12131: fix(cron): ensure timer callback fires for scheduled jobs (by divol89 · 2026-02-08 · 84.8%)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey · 2026-02-09 · 84.8%)
- #16888: fix(cron): execute missed jobs outside the lock to unblock list/sta... (by hou-rong · 2026-02-15 · 84.3%)
- #18191: fix(cron): prevent scheduler freeze during rapid create/delete cycl... (by BinHPdev · 2026-02-16 · 84.1%)
- #8034: fix(cron): run past-due one-shot jobs immediately on startup (by FelixFoster · 2026-02-03 · 84.0%)
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 83.5%)