#18144: fix(cron): clear stuck runningAtMs after timeout and add maintenance ticks
stale
size: XS
Cluster: Cron Job Stability Fixes
## Problem
When a cron job fires and the spawned session times out or crashes, `runningAtMs` is never cleared, permanently blocking that cron job from executing again. The only recovery is manually toggling the job's `enabled` flag.
## Root Cause
Three gaps in the cron scheduler:
### 1. `executeJob()` has no timeout
The `onTimer()` code path wraps `executeJobCore()` in `Promise.race` with a timeout — but `executeJob()` (used by manual `cron run` and `runMissedJobs` on startup) calls `executeJobCore()` directly without any timeout. A hung execution leaves `runningAtMs` set indefinitely.
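The timeout pattern described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: `JobState`, `withTimeout`, and `runWithMarker` are hypothetical names, the `core` callback stands in for `executeJobCore()`, and clearing the marker in a `finally` block is an assumption about how the marker should be handled.

```typescript
// Hypothetical shape; the real cron job state type is not shown in this PR.
interface JobState {
  runningAtMs?: number;
}

const DEFAULT_JOB_TIMEOUT_MS = 10 * 60_000; // 10 min, as stated in the PR

// Reject if `work` does not settle within `timeoutMs`.
async function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`job timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([work, timeout]);
  } finally {
    if (timer !== undefined) clearTimeout(timer); // do not leak the timer
  }
}

// Stand-in for executeJob(): sets the marker, races the core against the
// timeout, and removes the marker on success, failure, and timeout alike.
async function runWithMarker(
  state: JobState,
  core: () => Promise<void>,
  timeoutSeconds?: number,
): Promise<"ok" | "error"> {
  state.runningAtMs = Date.now();
  const timeoutMs =
    timeoutSeconds !== undefined ? timeoutSeconds * 1000 : DEFAULT_JOB_TIMEOUT_MS;
  try {
    await withTimeout(core(), timeoutMs);
    return "ok";
  } catch {
    return "error";
  } finally {
    delete state.runningAtMs;
  }
}
```

Note that `Promise.race` only stops waiting on the hung work; it does not cancel it. The sketch therefore frees the marker without guaranteeing that the underlying session has actually stopped.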
### 2. `armTimer()` stops scheduling when only stuck jobs remain
When the only enabled job has `runningAtMs` set, `nextWakeAtMs()` returns `undefined` (the stuck job isn't counted as having a valid `nextRunAtMs`). `armTimer()` then skips scheduling entirely, so the `STUCK_RUN_MS` safety net in `recomputeNextRuns()` never fires — no timer tick, no cleanup.
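To make the stall concrete, here is a simplified sketch of the wake-time computation with a maintenance fallback. The `CronJob` shape, `MAINTENANCE_TICK_MS`, and `computeWakeAtMs` are hypothetical names, and this `nextWakeAtMs` is a stand-in that simply ignores jobs with `runningAtMs` set; the real scheduler may differ in detail.

```typescript
// Hypothetical job shape for illustration only.
interface CronJob {
  enabled: boolean;
  nextRunAtMs?: number;
  runningAtMs?: number;
}

// 5-minute maintenance interval from the PR; the constant name is assumed.
const MAINTENANCE_TICK_MS = 5 * 60_000;

// Earliest nextRunAtMs among enabled jobs that are not marked as running.
function nextWakeAtMs(jobs: CronJob[]): number | undefined {
  const due = jobs
    .filter((j) => j.enabled && j.runningAtMs === undefined && j.nextRunAtMs !== undefined)
    .map((j) => j.nextRunAtMs as number);
  return due.length > 0 ? Math.min(...due) : undefined;
}

// When the timer should fire next, or undefined if no tick is needed.
function computeWakeAtMs(jobs: CronJob[], nowMs: number): number | undefined {
  const regular = nextWakeAtMs(jobs);
  const hasRunning = jobs.some((j) => j.enabled && j.runningAtMs !== undefined);
  if (!hasRunning) return regular;
  // A running (possibly stuck) job exists: guarantee a tick so that stale
  // markers can be re-examined even when no regular job is due.
  const maintenance = nowMs + MAINTENANCE_TICK_MS;
  return regular === undefined ? maintenance : Math.min(regular, maintenance);
}
```

With a single enabled job whose `runningAtMs` is set, `nextWakeAtMs` yields `undefined` while `computeWakeAtMs` still returns a wake time five minutes out, which is the behavior the maintenance tick is meant to provide.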
### 3. `STUCK_RUN_MS` is too conservative (2 hours)
Most cron jobs complete in seconds to minutes, so waiting 2 hours before a stuck marker expires is unnecessarily long.
## Fix
1. **Add `Promise.race` timeout to `executeJob()`** — mirrors the existing pattern in `onTimer()`, using `payload.timeoutSeconds` or `DEFAULT_JOB_TIMEOUT_MS` (10 min).
2. **Schedule maintenance ticks in `armTimer()` when stuck jobs exist** — detects enabled jobs with `runningAtMs` set and schedules a 5-minute maintenance tick so `recomputeNextRuns()` can clear expired markers.
3. **Reduce `STUCK_RUN_MS` from 2 hours to 30 minutes** — still generous as a safety net, but much more practical.
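The safety net that item 3 tunes can be pictured as a sweep that drops markers older than the threshold. This is a sketch under assumptions: `CronJobState` and `clearStuckMarkers` are hypothetical names, and the actual check inside `recomputeNextRuns()` is not shown in this PR.

```typescript
// Hypothetical job state for illustration only.
interface CronJobState {
  runningAtMs?: number;
}

const STUCK_RUN_MS = 30 * 60_000; // 30 minutes, the new value from this PR

// Remove runningAtMs markers older than STUCK_RUN_MS.
// Returns how many markers were cleared.
function clearStuckMarkers(jobs: CronJobState[], nowMs: number): number {
  let cleared = 0;
  for (const job of jobs) {
    if (job.runningAtMs !== undefined && nowMs - job.runningAtMs > STUCK_RUN_MS) {
      delete job.runningAtMs; // the job becomes eligible to run again
      cleared++;
    }
  }
  return cleared;
}
```

Since the sweep only runs when a timer tick fires, it depends on the maintenance tick from item 2 to be reached when no regular job is due.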
## Testing
The existing test suites for cron service/jobs should continue to pass. The `executeJob` timeout follows the exact same pattern already tested in `onTimer`.
Fixes #18120
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Fixes a critical issue where cron jobs become permanently blocked when execution times out or crashes. The PR addresses three gaps: adds timeout protection to `executeJob()`, schedules maintenance ticks to clear stuck markers when regular jobs aren't due, and reduces the stuck-marker threshold from 2 hours to 30 minutes for faster recovery.
<h3>Confidence Score: 4/5</h3>
- Safe to merge - fixes critical recovery gap with well-tested timeout pattern
- The changes follow existing patterns, address a real production issue, and the timeout logic is actually safer than the existing implementation. Minor deduction because the maintenance tick adds a new periodic scheduling path that hasn't been battle-tested yet.
- No files require special attention
<sub>Last reviewed commit: 63413e1</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #19414: fix: respect job timeoutSeconds for stuck runningAtMs detection (by namabile · 2026-02-17 · 88.2%)
- #18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120) (by BinHPdev · 2026-02-16 · 88.1%)
- #17643: fix: clear stale runningAtMs in cron.run to allow manual triggers (by MisterGuy420 · 2026-02-16 · 87.9%)
- #17895: fix(cron): add staleness check for runningAtMs on manual trigger (by PlayerGhost · 2026-02-16 · 87.9%)
- #12018: fix(cron): clear stale running markers based on job timeout (by benzer25 · 2026-02-08 · 87.8%)
- #17561: fix(cron): add runtime staleness guard for runningAtMs (#17554) (by robbyczgw-cla · 2026-02-15 · 87.6%)
- #17949: fix: clear stale runningAtMs in cron.run() before already-running c... (by yasumorishima · 2026-02-16 · 86.4%)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey · 2026-02-09 · 86.2%)
- #17664: fix(cron): detect and clear stale runningAtMs marker in manual run ... (by echoVic · 2026-02-16 · 86.1%)
- #17064: fix(cron): prevent control-plane starvation during startup catch-up... (by donggyu9208 · 2026-02-15 · 85.4%)