#5179: fix(cron): recover stale running markers
Cluster: Cron Job Stability Fixes
## What
Cron jobs can be left with `state.runningAtMs` still set (e.g. when the process restarts or crashes mid-run). When that happens, the scheduler permanently treats the job as "already running" and never runs it again.
This PR recovers from that stale marker by clearing `runningAtMs` after the existing stuck-run threshold and recording an error status so operators can see what happened.
## Changes
- When a job's `runningAtMs` is older than the stuck-run threshold, clear the marker and set:
  - `lastStatus="error"`
  - `lastError="Recovered from stale running state…"` (if not already set)
  - `lastRunAtMs` + `lastDurationMs`
- Add a unit test covering recovery on `cron.start()`.
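
The recovery rule above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `CronJobState` shape, the `recoverStaleRunningMarker` helper, and the `STUCK_RUN_MS` value are all assumptions made for the example, and the error message is kept exactly as abbreviated in this description. It also assumes `lastRunAtMs` records the start of the interrupted run and `lastDurationMs` the time elapsed until recovery.

```typescript
// Hypothetical state shape; the real cron job state type may differ.
interface CronJobState {
  runningAtMs?: number;
  lastRunAtMs?: number;
  lastDurationMs?: number;
  lastStatus?: "ok" | "error" | "skipped";
  lastError?: string;
}

// Assumed value for illustration only; the PR reuses the existing stuck-run threshold.
const STUCK_RUN_MS = 2 * 60 * 60 * 1000;

// Message text as abbreviated in the PR description (the full text is elided there).
const DEFAULT_RECOVERY_ERROR = "Recovered from stale running state…";

// Clears a stale running marker in place. Returns true when recovery happened.
function recoverStaleRunningMarker(
  state: CronJobState,
  nowMs: number,
  stuckRunMs: number = STUCK_RUN_MS,
): boolean {
  const startedAtMs = state.runningAtMs;
  if (typeof startedAtMs !== "number") return false; // not marked as running
  if (nowMs - startedAtMs <= stuckRunMs) return false; // may still be a live run

  delete state.runningAtMs; // the job becomes schedulable again
  state.lastRunAtMs = startedAtMs; // assumption: start time of the interrupted run
  state.lastDurationMs = Math.max(0, nowMs - startedAtMs);
  state.lastStatus = "error";
  if (state.lastError === undefined) {
    state.lastError = DEFAULT_RECOVERY_ERROR; // keep any error recorded earlier
  }
  return true;
}
```

In this sketch a marker exactly at the threshold is left alone, so only runs strictly older than the threshold are treated as interrupted.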
## Why
Prevents a single interrupted run from wedging a cron job forever.
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adds recovery for cron jobs that were left in a stuck/running state (e.g. crash mid-run). During `recomputeNextRuns`, if `state.runningAtMs` is older than the stuck-run threshold, the code now clears the running marker and records the interrupted run (`lastRunAtMs`, `lastDurationMs`, `lastStatus="error"`, and a default `lastError` message if one isn’t already set), so the job can be scheduled again and operators can see what happened.
A new unit test writes a persisted job with an old `runningAtMs`, starts the cron service, and asserts the stale marker is cleared and the error metadata is recorded.
<h3>Confidence Score: 4/5</h3>
- This PR looks safe to merge and addresses a real wedged-job scenario; remaining concerns are mainly test brittleness and minor naming clarity issues.
- The recovery logic is localized to `recomputeNextRuns` and matches the rest of the cron state model (timer code treats `lastRunAtMs` as start time and sets duration similarly). The added test covers the intended behavior at startup. I didn’t find a functional regression in the changed logic, but the test’s hard-coded threshold could make it brittle under future refactors, and some naming/wording is slightly inconsistent.
- src/cron/service.clears-stale-running-marker.test.ts (threshold coupling); src/cron/service/jobs.ts (naming/terminology consistency).
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #12018: fix(cron): clear stale running markers based on job timeout (by benzer25 · 2026-02-08 · 87.0%)
- #18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120) (by BinHPdev · 2026-02-16 · 85.9%)
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 85.2%)
- #17949: fix: clear stale runningAtMs in cron.run() before already-running c... (by yasumorishima · 2026-02-16 · 84.4%)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey · 2026-02-09 · 83.1%)
- #12982: fix(cron): prevent status/list from advancing overdue job nextRunAtMs (by hclsys · 2026-02-10 · 83.0%)
- #17643: fix: clear stale runningAtMs in cron.run to allow manual triggers (by MisterGuy420 · 2026-02-16 · 82.9%)
- #11857: fix: recompute stale cron nextRunAtMs on gateway restart (by Yida-Dev · 2026-02-08 · 82.8%)
- #17895: fix(cron): add staleness check for runningAtMs on manual trigger (by PlayerGhost · 2026-02-16 · 82.6%)
- #12443: fix(cron): don't advance past-due jobs that haven't been executed (by rummangeminicode · 2026-02-09 · 81.9%)