#5179: fix(cron): recover stale running markers

by thatdaveb open 2026-01-31 04:49

## What

Cron jobs can be left stuck with `state.runningAtMs` still set (e.g. process restart/crash mid-run). When that happens, the scheduler permanently treats the job as "already running" and will never run it again.

This PR recovers from that stale marker by clearing `runningAtMs` after the existing stuck-run threshold and recording an error status so operators can see what happened.

## Changes

- When a job's `runningAtMs` is older than the stuck-run threshold, clear the marker and set:
  - `lastStatus="error"`
  - `lastError="Recovered from stale running state…"` (if not already set)
  - `lastRunAtMs` + `lastDurationMs`
- Add unit test covering recovery on `cron.start()`.

## Why

Prevents a single interrupted run from wedging a cron job forever.

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR adds recovery for cron jobs that were left in a stuck/running state (e.g. crash mid-run). During `recomputeNextRuns`, if `state.runningAtMs` is older than the stuck-run threshold, the code now clears the running marker and records the interrupted run (`lastRunAtMs`, `lastDurationMs`, `lastStatus="error"`, and a default `lastError` message if one isn’t already set), so the job can be scheduled again and operators can see what happened.

A new unit test writes a persisted job with an old `runningAtMs`, starts the cron service, and asserts the stale marker is cleared and the error metadata is recorded.

<h3>Confidence Score: 4/5</h3>

- This PR looks safe to merge and addresses a real wedged-job scenario; remaining concerns are mainly test brittleness and small naming clarity issues.
- The recovery logic is localized to `recomputeNextRuns` and matches the rest of the cron state model (timer code treats `lastRunAtMs` as start time and sets duration similarly). The added test covers the intended behavior at startup.
I didn’t find a functional regression in the changed logic, but the test’s hard-coded threshold could make it brittle under future refactors, and some naming/wording is slightly inconsistent.
- src/cron/service.clears-stale-running-marker.test.ts (threshold coupling); src/cron/service/jobs.ts (naming/terminology consistency).
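
For readers who want to see the recovery step concretely, here is a minimal sketch of the behavior described under "Changes". It is not the PR's actual code. Only the field names (`runningAtMs`, `lastRunAtMs`, `lastDurationMs`, `lastStatus`, `lastError`) come from the description. The `CronJobState` interface, the `recoverStaleRunningMarker` helper, and the `STUCK_RUN_MS` value are hypothetical, and the error message is kept truncated as it appears above.

```typescript
// Hypothetical state shape: only the field names are taken from the PR description.
interface CronJobState {
  runningAtMs?: number;
  lastRunAtMs?: number;
  lastDurationMs?: number;
  lastStatus?: "ok" | "error" | "skipped";
  lastError?: string;
}

// Assumed value for illustration only; the real stuck-run threshold is not stated in this PR.
const STUCK_RUN_MS = 2 * 60 * 60 * 1000;

// Clears a stale running marker and records the interrupted run.
// Returns true when a recovery happened, false when the state was left untouched.
function recoverStaleRunningMarker(
  state: CronJobState,
  nowMs: number,
  thresholdMs: number = STUCK_RUN_MS,
): boolean {
  const startedAtMs = state.runningAtMs;
  if (typeof startedAtMs !== "number") {
    return false; // job is not marked as running
  }
  if (nowMs - startedAtMs <= thresholdMs) {
    return false; // marker is still within the threshold, so the run may be live
  }

  state.runningAtMs = undefined; // unwedge the job
  state.lastRunAtMs = startedAtMs; // treat the marker as the run's start time
  state.lastDurationMs = Math.max(0, nowMs - startedAtMs);
  state.lastStatus = "error";
  if (!state.lastError) {
    // Message is truncated in the PR description; the full text is not shown there.
    state.lastError = "Recovered from stale running state…";
  }
  return true;
}
```

Passing `thresholdMs` as a parameter, rather than reading a constant inside the function, would also let a test derive its stale timestamp from the same value, which may ease the threshold-coupling concern raised in the review.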