
#23650: fix(cron): arm maintenance timer when enabled jobs lack nextRunAtMs

by taw0002 open 2026-02-22 15:08
size: S
## Problem

Fixes #23628

The cron scheduler timer can silently stop when `armTimer()` is called and `nextWakeAtMs()` returns `undefined`. This happens when enabled jobs exist but none have a valid `nextRunAtMs` value, a transient state that can occur when:

- The store is reloaded with stale/corrupt data during `forceReload`
- Schedule computation fails (but error count hasn't reached the auto-disable threshold)
- All enabled jobs have `runningAtMs` set and `nextRunAtMs` was cleared or never recomputed

When this happens, `armTimer()` returns without setting a timer, the scheduler stops, and jobs never fire until the next gateway restart.

## Root Cause

`armTimer()` checks `nextWakeAtMs(state)`, which returns `undefined` when no enabled jobs have a numeric `nextRunAtMs`. The function then exits early without arming any timer:

```ts
const nextAt = nextWakeAtMs(state);
if (!nextAt) {
  // logs and returns: NO timer set!
  return;
}
```

This is correct when there are genuinely no jobs to schedule. But when enabled jobs exist (just temporarily lacking `nextRunAtMs`), the scheduler should keep ticking so it can self-heal on the next `onTimer()` tick when `recomputeNextRunsForMaintenance()` fills in missing values.

## Fix

When `nextWakeAtMs()` returns `undefined` but enabled jobs exist, arm a maintenance fallback timer at `MAX_TIMER_DELAY_MS` (60s). This ensures:

1. The scheduler keeps ticking even in transient states
2. `onTimer()` → `recomputeNextRunsForMaintenance()` can repair missing `nextRunAtMs` values
3. No behavior change for the normal case (jobs with valid `nextRunAtMs`)

## Tests

- Added `service.timer-maintenance-fallback.test.ts` with 2 test cases:
  - Verifies maintenance timer is armed when enabled jobs lack `nextRunAtMs`
  - Verifies no timer is armed when all jobs are disabled (no false positive)
- Existing timer tests continue to pass (17 tests)

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Prevents silent scheduler stoppage by arming a maintenance fallback timer when enabled jobs exist but lack `nextRunAtMs` values. The fix addresses a critical bug where transient states (store reloads, schedule computation failures) could permanently halt the cron scheduler until gateway restart.

**Key changes:**

- Added fallback timer (60s) in `armTimer()` when `enabledCount > 0` but `nextWakeAtMs()` returns `undefined`
- Ensures scheduler continues ticking so `recomputeNextRunsForMaintenance()` can repair missing `nextRunAtMs` values
- Comprehensive test coverage validates both the fix and the no-false-positive case

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with minimal risk
- The fix is surgical, well-reasoned, and thoroughly tested. It addresses a critical scheduler bug with a simple fallback mechanism that preserves existing behavior while preventing silent failures. The logic is sound: when enabled jobs exist but lack `nextRunAtMs`, arm a maintenance timer to allow self-healing. Both positive and negative test cases validate the implementation.
- No files require special attention

<sub>Last reviewed commit: 25952d8</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
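
For readers skimming the description, the decision the fix adds to `armTimer()` can be sketched as a pure function. This is only an illustration, not the code in this PR: the `CronJob` and `SchedulerState` shapes and the helper `nextTimerDelayMs` are hypothetical, and clamping the normal-case delay to `MAX_TIMER_DELAY_MS` is an assumption inferred from the constant's name. Only `nextWakeAtMs`, `nextRunAtMs`, and the 60s value come from the description above.

```ts
// Hypothetical, simplified shapes; the real scheduler state is not shown in the PR text.
interface CronJob {
  id: string;
  enabled: boolean;
  nextRunAtMs?: number;
}

interface SchedulerState {
  jobs: CronJob[];
}

// 60s, as stated in the PR description.
const MAX_TIMER_DELAY_MS = 60_000;

// Earliest numeric nextRunAtMs among enabled jobs, or undefined if there is none.
function nextWakeAtMs(state: SchedulerState): number | undefined {
  const times = state.jobs
    .filter((job) => job.enabled && typeof job.nextRunAtMs === "number")
    .map((job) => job.nextRunAtMs as number);
  return times.length > 0 ? Math.min(...times) : undefined;
}

// Hypothetical helper: returns the delay a timer should be armed with,
// or undefined when no timer should be armed at all.
function nextTimerDelayMs(state: SchedulerState, nowMs: number): number | undefined {
  const nextAt = nextWakeAtMs(state);
  if (nextAt === undefined) {
    const enabledCount = state.jobs.filter((job) => job.enabled).length;
    // Genuinely nothing to schedule: leave the timer unarmed.
    if (enabledCount === 0) return undefined;
    // Enabled jobs exist but none has nextRunAtMs: arm the maintenance
    // fallback so a later tick can recompute the missing values.
    return MAX_TIMER_DELAY_MS;
  }
  // Normal case. The upper clamp is an assumption of this sketch.
  return Math.min(Math.max(nextAt - nowMs, 0), MAX_TIMER_DELAY_MS);
}
```

Keeping the decision separate from the `setTimeout` call makes both cases from the Tests section easy to check without fake timers: a state with one enabled job and no `nextRunAtMs` yields `MAX_TIMER_DELAY_MS`, while a state with only disabled jobs yields `undefined`.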
