#18144: fix(cron): clear stuck runningAtMs after timeout and add maintenance ticks

by taw0002 open 2026-02-16 15:05

stale size: XS

## Problem

When a cron job fires and the spawned session times out or crashes, `runningAtMs` is never cleared, permanently blocking that cron job from executing again. The only recovery is manually toggling the job's `enabled` flag.

## Root Cause

Three gaps in the cron scheduler:

### 1. `executeJob()` has no timeout

The `onTimer()` code path wraps `executeJobCore()` in `Promise.race` with a timeout, but `executeJob()` (used by manual `cron run` and `runMissedJobs` on startup) calls `executeJobCore()` directly without any timeout. A hung execution leaves `runningAtMs` set indefinitely.

### 2. `armTimer()` stops scheduling when only stuck jobs remain

When the only enabled job has `runningAtMs` set, `nextWakeAtMs()` returns `undefined` (the stuck job isn't counted as having a valid `nextRunAtMs`). `armTimer()` then skips scheduling entirely, so the `STUCK_RUN_MS` safety net in `recomputeNextRuns()` never fires: no timer tick, no cleanup.

### 3. `STUCK_RUN_MS` is too conservative (2 hours)

Most cron jobs complete in seconds to minutes. A 2-hour stuck marker window is unnecessarily long.

## Fix

1. **Add `Promise.race` timeout to `executeJob()`**: mirrors the existing pattern in `onTimer()`, using `payload.timeoutSeconds` or `DEFAULT_JOB_TIMEOUT_MS` (10 min).
2. **Schedule maintenance ticks in `armTimer()` when stuck jobs exist**: detects enabled jobs with `runningAtMs` set and schedules a 5-minute maintenance tick so `recomputeNextRuns()` can clear expired markers.
3. **Reduce `STUCK_RUN_MS` from 2 hours to 30 minutes**: still generous as a safety net, but much more practical.

## Testing

The existing test suites for cron service/jobs should continue to pass. The `executeJob` timeout follows the exact same pattern already tested in `onTimer`.

Fixes #18120

<h3>Greptile Summary</h3>

Fixes a critical issue where cron jobs become permanently blocked when execution times out or crashes.
The PR addresses three gaps: adds timeout protection to `executeJob()`, schedules maintenance ticks to clear stuck markers when regular jobs aren't due, and reduces the stuck-marker threshold from 2 hours to 30 minutes for faster recovery.

<h3>Confidence Score: 4/5</h3>

- Safe to merge: fixes critical recovery gap with well-tested timeout pattern
- The changes follow existing patterns, address a real production issue, and the timeout logic is actually safer than the existing implementation. Minor deduction because the maintenance tick adds a new periodic scheduling path that hasn't been battle-tested yet.
- No files require special attention

<sub>Last reviewed commit: 63413e1</sub>
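
For readers without the diff at hand, the three fixes described in this PR can be sketched roughly as below. This is an illustrative sketch, not the PR's actual code: the `CronJob` shape, `MAINTENANCE_TICK_MS`, and the helpers `runWithTimeout`, `executeJobSketch`, `clearStuckMarkers`, and `nextTimerDelayMs` are hypothetical names. Only `runningAtMs`, `nextRunAtMs`, `STUCK_RUN_MS`, `DEFAULT_JOB_TIMEOUT_MS`, and the 10 min / 5 min / 30 min values come from the description. Clearing the marker in a `finally` block and using the maintenance tick only when no regular wake time exists are also assumptions.

```typescript
// Hypothetical job shape; the real scheduler's types are not shown in the PR text.
interface CronJob {
  id: string;
  enabled: boolean;
  nextRunAtMs?: number;
  runningAtMs?: number;
}

const DEFAULT_JOB_TIMEOUT_MS = 10 * 60 * 1000; // 10 min, per the PR
const STUCK_RUN_MS = 30 * 60 * 1000;           // 30 min, per the PR
const MAINTENANCE_TICK_MS = 5 * 60 * 1000;     // 5 min, per the PR (name assumed)

// Fix 1: race the run against a timer so a hung execution always settles.
function runWithTimeout<T>(run: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`cron job timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
  });
  // Cancel the timer once either side settles so it does not keep the process alive.
  return Promise.race([run(), timeout]).finally(() => clearTimeout(timer));
}

// Assumed wrapper: set the marker, run with a timeout, always clear the marker.
async function executeJobSketch(
  job: CronJob,
  run: () => Promise<void>,
  timeoutMs: number = DEFAULT_JOB_TIMEOUT_MS,
): Promise<"ok" | "error"> {
  job.runningAtMs = Date.now();
  try {
    await runWithTimeout(run, timeoutMs);
    return "ok";
  } catch {
    return "error"; // covers both a timeout and a failed run
  } finally {
    job.runningAtMs = undefined;
  }
}

// Fix 3: safety net that drops markers older than STUCK_RUN_MS; returns how many were cleared.
function clearStuckMarkers(jobs: CronJob[], nowMs: number): number {
  let cleared = 0;
  for (const job of jobs) {
    if (job.runningAtMs !== undefined && nowMs - job.runningAtMs > STUCK_RUN_MS) {
      job.runningAtMs = undefined;
      cleared++;
    }
  }
  return cleared;
}

// Fix 2: pick the next timer delay, falling back to a maintenance tick
// when the only enabled jobs are ones still marked as running.
function nextTimerDelayMs(jobs: CronJob[], nowMs: number): number | undefined {
  const enabled = jobs.filter((j) => j.enabled);
  const wakeTimes = enabled
    .filter((j) => j.runningAtMs === undefined && j.nextRunAtMs !== undefined)
    .map((j) => j.nextRunAtMs as number);
  if (wakeTimes.length > 0) {
    return Math.max(0, Math.min(...wakeTimes) - nowMs);
  }
  const hasRunningMarker = enabled.some((j) => j.runningAtMs !== undefined);
  return hasRunningMarker ? MAINTENANCE_TICK_MS : undefined;
}
```

In this sketch, a scheduler with one enabled job whose `runningAtMs` is set gets a 5-minute delay from `nextTimerDelayMs` rather than `undefined`, so a later tick can call `clearStuckMarkers` once the marker is more than 30 minutes old.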