#23650: fix(cron): arm maintenance timer when enabled jobs lack nextRunAtMs
size: S
Cluster: Cron Scheduler Improvements
## Problem
Fixes #23628
The cron scheduler timer can silently stop when `armTimer()` is called and `nextWakeAtMs()` returns `undefined`. This happens when enabled jobs exist but none have a valid `nextRunAtMs` value — a transient state that can occur when:
- The store is reloaded with stale/corrupt data during `forceReload`
- Schedule computation fails (but error count hasn't reached the auto-disable threshold)
- All enabled jobs have `runningAtMs` set and `nextRunAtMs` was cleared or never recomputed
When this happens, `armTimer()` returns without setting a timer, the scheduler stops, and jobs never fire until the next gateway restart.
## Root Cause
`armTimer()` checks `nextWakeAtMs(state)` which returns `undefined` when no enabled jobs have a numeric `nextRunAtMs`. The function then exits early without arming any timer:
```ts
const nextAt = nextWakeAtMs(state);
if (!nextAt) {
  // logs and returns: NO timer set!
  return;
}
```
This is correct when there are genuinely no jobs to schedule. But when enabled jobs exist (just temporarily lacking `nextRunAtMs`), the scheduler should keep ticking so it can self-heal on the next `onTimer()` tick when `recomputeNextRunsForMaintenance()` fills in missing values.
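To make the ambiguity concrete, here is a minimal sketch of the lookup, assuming a simplified job shape. `CronJobSketch` and `nextWakeAtMsSketch` are hypothetical stand-ins, not the real types or the real `nextWakeAtMs()`; the point is only that an empty store and a store full of enabled jobs without `nextRunAtMs` produce the same `undefined`.

```typescript
// Hypothetical, simplified job shape; the real types are not shown in this PR description.
interface CronJobSketch {
  id: string;
  enabled: boolean;
  nextRunAtMs?: number;
  runningAtMs?: number;
}

// Returns the earliest numeric nextRunAtMs among enabled jobs, or undefined if there is none.
function nextWakeAtMsSketch(jobs: CronJobSketch[]): number | undefined {
  let earliest: number | undefined;
  for (const job of jobs) {
    const next = job.nextRunAtMs;
    if (!job.enabled || typeof next !== "number" || !Number.isFinite(next)) continue;
    if (earliest === undefined || next < earliest) earliest = next;
  }
  return earliest;
}

// Case 1: genuinely nothing to schedule.
const noJobs: CronJobSketch[] = [];
// Case 2: an enabled job that is running and has no nextRunAtMs (the transient state).
const transientJobs: CronJobSketch[] = [{ id: "a", enabled: true, runningAtMs: 1_000 }];

// Both calls yield undefined, so the caller cannot tell the two cases apart
// from the return value alone.
const wakeForNoJobs = nextWakeAtMsSketch(noJobs);
const wakeForTransient = nextWakeAtMsSketch(transientJobs);
```

Under these assumptions, an early return on `undefined` treats the transient state exactly like an empty schedule.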
## Fix
When `nextWakeAtMs()` returns `undefined` but enabled jobs exist, arm a maintenance fallback timer at `MAX_TIMER_DELAY_MS` (60s). This ensures:
1. The scheduler keeps ticking even in transient states
2. `onTimer()` → `recomputeNextRunsForMaintenance()` can repair missing `nextRunAtMs` values
3. No behavior change for the normal case (jobs with valid `nextRunAtMs`)
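The fallback can be sketched roughly as follows. This is an illustration under stated assumptions, not the patched `armTimer()`: `SchedulerSketch`, `earliestRun`, and `armTimerSketch` are hypothetical names, and clamping the normal-case delay to `MAX_TIMER_DELAY_MS` is an assumption suggested by the constant's name. Only the 60s value and the `enabledCount > 0` condition come from this PR.

```typescript
// 60s, as stated in the PR description.
const MAX_TIMER_DELAY_MS = 60_000;

// Hypothetical, simplified shapes for illustration only.
interface TimerJob {
  enabled: boolean;
  nextRunAtMs?: number;
}
interface SchedulerSketch {
  jobs: TimerJob[];
  timer: ReturnType<typeof setTimeout> | null;
}

function earliestRun(jobs: TimerJob[]): number | undefined {
  const runs = jobs
    .filter((j) => j.enabled && typeof j.nextRunAtMs === "number")
    .map((j) => j.nextRunAtMs as number);
  return runs.length > 0 ? Math.min(...runs) : undefined;
}

// Returns the delay that was armed, or undefined when no timer was set.
function armTimerSketch(s: SchedulerSketch, nowMs: number, onTimer: () => void): number | undefined {
  if (s.timer !== null) clearTimeout(s.timer);
  s.timer = null;

  const nextAt = earliestRun(s.jobs);
  if (nextAt === undefined) {
    const enabledCount = s.jobs.filter((j) => j.enabled).length;
    if (enabledCount === 0) return undefined; // genuinely nothing to schedule
    // Enabled jobs exist but none has a run time: keep ticking so maintenance can repair them.
    s.timer = setTimeout(onTimer, MAX_TIMER_DELAY_MS);
    return MAX_TIMER_DELAY_MS;
  }

  // Normal case (clamp is an assumption, see lead-in).
  const delay = Math.min(Math.max(nextAt - nowMs, 0), MAX_TIMER_DELAY_MS);
  s.timer = setTimeout(onTimer, delay);
  return delay;
}

// Usage: one enabled job without nextRunAtMs arms the fallback timer.
const demo: SchedulerSketch = { jobs: [{ enabled: true }], timer: null };
const demoDelay = armTimerSketch(demo, Date.now(), () => {});
if (demo.timer !== null) clearTimeout(demo.timer); // do not keep the process alive in this demo
```

The branch for `enabledCount === 0` keeps the original behavior, so a store with only disabled jobs still arms nothing.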
## Tests
- Added `service.timer-maintenance-fallback.test.ts` with 2 test cases:
  - Verifies maintenance timer is armed when enabled jobs lack `nextRunAtMs`
  - Verifies no timer is armed when all jobs are disabled (no false positive)
- Existing timer tests continue to pass (17 tests)
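
The two new cases can be approximated with a recording fake timer. This is a sketch only: the actual test file uses the project's own harness, which is not shown here, and `TestJob`, `makeFakeTimer`, and `armTimerUnderTest` are hypothetical stand-ins for the real code under test.

```typescript
// Hypothetical stand-ins; not the contents of service.timer-maintenance-fallback.test.ts.
type TestJob = { enabled: boolean; nextRunAtMs?: number };

// Mirrors MAX_TIMER_DELAY_MS (60s) from the PR description.
const FALLBACK_DELAY_MS = 60_000;

// Records requested delays instead of scheduling real timers.
function makeFakeTimer() {
  const delays: number[] = [];
  return {
    delays,
    set: (ms: number): void => {
      delays.push(ms);
    },
  };
}

// Simplified arming logic with the timer injected so it can be observed.
function armTimerUnderTest(jobs: TestJob[], nowMs: number, setTimer: (ms: number) => void): void {
  const runs = jobs
    .filter((j) => j.enabled && typeof j.nextRunAtMs === "number")
    .map((j) => j.nextRunAtMs as number);
  if (runs.length > 0) {
    setTimer(Math.max(Math.min(...runs) - nowMs, 0));
    return;
  }
  // Fallback: enabled jobs exist but none has a run time.
  if (jobs.some((j) => j.enabled)) setTimer(FALLBACK_DELAY_MS);
}

// Case 1: an enabled job lacks nextRunAtMs, so the maintenance timer should be armed.
const caseArmed = makeFakeTimer();
armTimerUnderTest([{ enabled: true }], 0, caseArmed.set);

// Case 2: all jobs are disabled, so no timer should be armed.
const caseIdle = makeFakeTimer();
armTimerUnderTest([{ enabled: false }], 0, caseIdle.set);
```

Injecting the timer function keeps the check deterministic and avoids waiting on a real 60s timeout.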
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Prevents silent scheduler stoppage by arming a maintenance fallback timer when enabled jobs exist but lack `nextRunAtMs` values. The fix addresses a critical bug where transient states (store reloads, schedule computation failures) could permanently halt the cron scheduler until gateway restart.
**Key changes:**
- Added fallback timer (60s) in `armTimer()` when `enabledCount > 0` but `nextWakeAtMs()` returns `undefined`
- Ensures scheduler continues ticking so `recomputeNextRunsForMaintenance()` can repair missing `nextRunAtMs` values
- Comprehensive test coverage validates both the fix and the no-false-positive case
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The fix is surgical, well-reasoned, and thoroughly tested. It addresses a critical scheduler bug with a simple fallback mechanism that preserves existing behavior while preventing silent failures. The logic is sound: when enabled jobs exist but lack nextRunAtMs, arm a maintenance timer to allow self-healing. Both positive and negative test cases validate the implementation.
- No files require special attention
<sub>Last reviewed commit: 25952d8</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #12131: fix(cron): ensure timer callback fires for scheduled jobs (by divol89 · 2026-02-08 · 87.1%)
- #18191: fix(cron): prevent scheduler freeze during rapid create/delete cycl... (by BinHPdev · 2026-02-16 · 85.5%)
- #12122: fix(cron): ensure timer callback fires for scheduled jobs (by divol89 · 2026-02-08 · 85.3%)
- #10829: fix: prevent cron scheduler permanent death on transient startup/ru... (by meaadore1221-afk · 2026-02-07 · 85.0%)
- #12086: fix(cron): ensure timer callback fires for scheduled jobs (by divol89 · 2026-02-08 · 84.2%)
- #8034: fix(cron): run past-due one-shot jobs immediately on startup (by FelixFoster · 2026-02-03 · 82.8%)
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 82.7%)
- #19541: fix: schedule nextWakeAtMs for isolated sessionTarget cron jobs (by guirguispierre · 2026-02-17 · 80.4%)
- #16132: fix(cron): prevent duplicate job fires via MIN_REFIRE_GAP_MS guard (by widingmarcus-cyber · 2026-02-14 · 79.9%)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey · 2026-02-09 · 79.0%)