#10829: fix: prevent cron scheduler permanent death on transient startup/runtime errors

by meaadore1221-afk open 2026-02-07 02:28

stale

## Problem

The cron scheduler can die permanently after transient errors (file I/O failures, disk contention during gateway restarts, etc.). When that happens, all scheduled jobs silently stop firing and nothing in the logs indicates a problem.

This was observed in production: the scheduler went completely silent for 20+ hours even though the gateway process was running normally and responding to API calls (e.g. `cron.status`, `cron.run`).

## Root Cause

Three bugs in the self-rearming timer chain (`armTimer → setTimeout → onTimer → armTimer → ...`):

### Bug 1: `start()` silent failure (PRIMARY)

`start()` in `ops.ts` places `armTimer()` inside the `locked()` callback. If `ensureLoaded()`, `runMissedJobs()`, `recomputeNextRuns()`, or `persist()` throws a transient error (e.g. file I/O during gateway restart), `armTimer()` is never called. The timer is never created and the "cron: started" log is never printed. The scheduler simply never exists, and there is no error trace.

**Evidence from logs:** A 20+ hour gap with no "cron: started" entries was observed, despite multiple gateway restarts during that window. The gateway was serving requests normally but the cron scheduler was completely dead.

### Bug 2: `armTimer()` gives up when store is not loaded

When `ensureLoaded()` fails in `start()`, `state.store` remains `null`. When `armTimer()` is then called, `nextWakeAtMs(state)` returns `undefined` (no store = no jobs = no next wake time), so `armTimer()` returns without setting any timer. Even with the `onTimer()` finally-block fix already in place, there is no recovery path: the scheduler dies permanently.

### Bug 3: `.catch` handler doesn't re-arm (minor)

The `.catch` handler on `armTimer`'s setTimeout callback only logs the error and does not re-arm the timer. `onTimer()` already calls `armTimer` in its `finally` block, so this is only a gap in the belt-and-suspenders defense.

## Fix

### 1. `ops.ts`: `start()` resilience

Move `armTimer()` and `log.info("cron: started")` **outside** the `locked()` callback. Wrap the locked block in `try/catch` so startup errors are logged but don't prevent the timer from being created. The timer's first tick will retry `ensureLoaded`, allowing the scheduler to self-heal.

### 2. `timer.ts`: `armTimer()` store retry

When `armTimer()` has no next wake time AND `state.store` is `null` (indicating the store couldn't be loaded), schedule a retry timer using `MAX_TIMER_DELAY_MS` (60s). This ensures the scheduler retries loading the store instead of silently giving up.

### 3. `timer.ts`: `.catch` handler re-arm

Add `armTimer(state)` in the `.catch` handler as a last-resort safety net, in case `onTimer()` throws past its `finally` block.

## Behavior After Fix

| Scenario | Before | After |
|---|---|---|
| `start()` throws | Timer never created, scheduler permanently dead, no error log | Error logged, timer still arms, next tick retries |
| `ensureLoaded` fails (store=null) | `armTimer` returns silently, no timer | `armTimer` schedules 60s retry, `onTimer` retries load |
| `onTimer` throws past finally | `.catch` logs only, timer chain dead | `.catch` re-arms timer |

## Test plan

- [ ] Verify `pnpm build` passes
- [ ] Verify `pnpm test` passes
- [ ] Manual: confirm cron jobs fire on schedule after gateway restart
- [ ] Manual: confirm "cron: started" always appears in logs after gateway boot

Made with [Cursor](https://cursor.com)

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

- Makes cron scheduler startup/timer loop resilient to transient failures by ensuring `armTimer()` is called even when initialization throws.
- Adds a retry timer when the cron store isn’t loaded (store is `null`), preventing the scheduler from going permanently idle.
- Adds a belt-and-suspenders re-arm in the timer tick error handler so the timer chain can recover from unexpected exceptions.
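
The three changes summarized above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the PR: `Store`, `CronState`, `computeDelay`, the `load` parameter, and the array-based `log` are hypothetical stand-ins, the real `locked()` wrapper is omitted, and clamping the delay to `MAX_TIMER_DELAY_MS` is an assumption. As in the PR description, `onTimer` is assumed to re-arm the timer itself on the normal path.

```typescript
// Hypothetical, simplified stand-ins for the real cron service types.
interface Store {
  nextWakeAtMs?: number; // earliest pending job, if any
}

interface CronState {
  store: Store | null; // null means the store has not been loaded yet
  timer: ReturnType<typeof setTimeout> | null;
  log: Array<{ msg: string; err?: unknown }>; // stand-in for the real logger
}

type Tick = (state: CronState) => Promise<void>;

const MAX_TIMER_DELAY_MS = 60000; // 60s, as stated in the PR description

// Returns the delay until the next tick, or undefined when nothing should be scheduled.
function computeDelay(state: CronState, nowMs: number): number | undefined {
  if (state.store === null) {
    // Fix 2: store not loaded, so schedule a retry instead of giving up.
    return MAX_TIMER_DELAY_MS;
  }
  const next = state.store.nextWakeAtMs;
  if (next === undefined) {
    return undefined; // store loaded but no jobs: no timer needed
  }
  // Assumption: delays are clamped to the range [0, MAX_TIMER_DELAY_MS].
  return Math.min(Math.max(next - nowMs, 0), MAX_TIMER_DELAY_MS);
}

function armTimer(state: CronState, onTimer: Tick): void {
  if (state.timer !== null) {
    clearTimeout(state.timer);
    state.timer = null;
  }
  const delay = computeDelay(state, Date.now());
  if (delay === undefined) {
    return;
  }
  state.timer = setTimeout(() => {
    onTimer(state).catch((err: unknown) => {
      state.log.push({ msg: "cron: timer tick failed", err });
      // Fix 3: last-resort re-arm so the timer chain survives a rejected tick.
      armTimer(state, onTimer);
    });
  }, delay);
}

async function start(
  state: CronState,
  load: () => Promise<Store>,
  onTimer: Tick,
): Promise<void> {
  try {
    state.store = await load();
  } catch (err) {
    // Fix 1: log the startup error but fall through to armTimer().
    state.log.push({ msg: "cron: startup failed, first tick will retry", err });
  }
  armTimer(state, onTimer);
  state.log.push({ msg: "cron: started" });
}
```

Keeping the delay calculation in a pure function makes the "store is `null`" branch easy to unit-test without waiting on real timers.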
<h3>Confidence Score: 4/5</h3>

- This PR is likely safe to merge, with one logging issue that could hinder diagnosing the very failures it targets.
- Behavioral changes are localized to cron startup and timer arming logic and are consistent with the stated failure modes. The main concern is that the new `start()` catch block stringifies errors when logging them, which drops stack traces and structured info and reduces observability during transient I/O failures.
- src/cron/service/ops.ts (startup error logging)
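
To illustrate the logging concern, here is a small sketch of one way to keep stack traces and error codes instead of stringifying the error. The helper `describeError` and the `ErrorDetails` shape are hypothetical and do not come from the PR; the actual logging call in `ops.ts` is not shown in this excerpt, so treat this only as an example of the general technique.

```typescript
// Hypothetical shape for a structured log payload.
interface ErrorDetails {
  message: string;
  name?: string;
  stack?: string;
  code?: string; // e.g. a Node.js system error code, when present
}

// Extracts the fields that String(err) would otherwise discard.
function describeError(err: unknown): ErrorDetails {
  if (err instanceof Error) {
    const code = (err as Error & { code?: unknown }).code;
    return {
      name: err.name,
      message: err.message,
      stack: err.stack,
      code: typeof code === "string" ? code : undefined,
    };
  }
  // Non-Error throwables (strings, numbers, ...) have no stack to preserve.
  return { message: String(err) };
}

// Example: a transient I/O failure carrying a system error code.
const failure = Object.assign(new Error("disk busy"), { code: "EBUSY" });

// String(failure) keeps only the name and message: "Error: disk busy".
const flattened = String(failure);

// describeError(failure) also keeps the code and the stack trace.
const detailed = describeError(failure);
```

Passing an object like `detailed` to a structured logger would keep the stack available when diagnosing the transient I/O failures this PR targets.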