#12018: fix(cron): clear stale running markers based on job timeout
stale
Cluster:
Cron Job Stability Fixes
## Problem
Cron jobs can get stuck as "already running" if a run crashes after setting `runningAtMs` but before `applyJobResult` clears it. The lock only clears after 2h (`STUCK_RUN_MS`), which causes false "already running" rejections for short-interval jobs.

## Fix
Clear stale running markers when:
- `now - runningAtMs > (job timeout + 30s grace)`

This is applied in:
- cron run (manual) path
- `findDueJobs` (scheduler)

## Notes
This keeps long jobs safe by honoring the job's `timeoutSeconds` when set, or `DEFAULT_JOB_TIMEOUT_MS` otherwise.
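
The staleness rule described above could be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `CronJobLike` shape, the helper names, and the numeric value given to `DEFAULT_JOB_TIMEOUT_MS` are assumptions made for the example. Only `runningAtMs`, `timeoutSeconds`, the 30s grace, and the constant's name come from the description.

```typescript
// Hypothetical job shape; the real cron service types may differ.
interface CronJobLike {
  timeoutSeconds?: number;
  state: { runningAtMs?: number };
}

// Assumed value for illustration; the PR names this constant but not its value.
const DEFAULT_JOB_TIMEOUT_MS = 600000;
// 30s grace period from the PR description.
const STALE_GRACE_MS = 30000;

// Age after which a running marker counts as stale for this job.
function staleAfterMs(job: CronJobLike): number {
  const timeoutMs =
    typeof job.timeoutSeconds === "number" && job.timeoutSeconds > 0
      ? job.timeoutSeconds * 1000
      : DEFAULT_JOB_TIMEOUT_MS;
  return timeoutMs + STALE_GRACE_MS;
}

// True when a marker exists and is older than (job timeout + grace).
function isRunningMarkerStale(job: CronJobLike, nowMs: number): boolean {
  const startedAt = job.state.runningAtMs;
  if (typeof startedAt !== "number") return false;
  return nowMs - startedAt > staleAfterMs(job);
}

// Clears a stale marker in memory and reports whether anything changed,
// so the caller can decide whether the store needs to be persisted.
function clearStaleRunningMarker(job: CronJobLike, nowMs: number): boolean {
  if (!isRunningMarkerStale(job, nowMs)) return false;
  delete job.state.runningAtMs;
  return true;
}
```

Routing both the manual run path and the scheduler through one helper like `staleAfterMs` would keep the fallback timeout identical in both places.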
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR attempts to reduce false “already running” cron locks by clearing `runningAtMs` markers once they exceed a job-specific timeout (+30s grace), applying the logic both in the manual `run` path and in the scheduler’s due/missed job selection.
Main concern: the scheduler-side clearing is currently performed inside filter predicates without guaranteeing a subsequent `persist()`, so in common cases where a job is stale-but-not-due the marker may be cleared only in-memory and then reappear after the next reload, continuing to block execution.
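
One way to address this concern is to sweep stale markers in a separate pass before filtering, and to report whether anything changed so the caller can persist. The sketch below is illustrative only: `SchedJob`, `collectDueJobs`, the `nextRunAtMs` and `timeoutMs` fields, and the `dirty` flag are hypothetical names, not the actual code in `src/cron/service/timer.ts`.

```typescript
// Hypothetical, flattened job shape used only for this sketch.
interface SchedJob {
  id: string;
  timeoutMs: number;
  nextRunAtMs: number;
  runningAtMs?: number;
}

const GRACE_MS = 30000; // 30s grace, as in the PR description

// Pass 1: mutate state explicitly, outside of any filter predicate.
// Returns how many stale markers were cleared.
function sweepStaleMarkers(jobs: SchedJob[], nowMs: number): number {
  let cleared = 0;
  for (const job of jobs) {
    if (
      job.runningAtMs !== undefined &&
      nowMs - job.runningAtMs > job.timeoutMs + GRACE_MS
    ) {
      delete job.runningAtMs;
      cleared += 1;
    }
  }
  return cleared;
}

// Pass 2: a side-effect-free filter. `dirty` tells the caller that the
// in-memory store changed and must be persisted, even when `due` is empty.
function collectDueJobs(
  jobs: SchedJob[],
  nowMs: number,
): { due: SchedJob[]; dirty: boolean } {
  const dirty = sweepStaleMarkers(jobs, nowMs) > 0;
  const due = jobs.filter(
    (job) => job.runningAtMs === undefined && job.nextRunAtMs <= nowMs,
  );
  return { due, dirty };
}
```

A caller would then persist whenever `dirty` is true, for example `if (dirty) await persist();`, which covers the stale-but-not-due case where no job is selected to run.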
<h3>Confidence Score: 3/5</h3>
- This PR is mergeable after fixing the stale-lock persistence issue and aligning timeout constants.
- Core idea is sound, but clearing `runningAtMs` without persisting can leave the system effectively unchanged after reload; additionally, differing fallback timeout constants between manual and scheduler paths can cause inconsistent staleness behavior.
- Files needing attention: `src/cron/service/timer.ts`, `src/cron/service/ops.ts`
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 87.8%)
- #5179: fix(cron): recover stale running markers (by thatdaveb · 2026-01-31 · 87.0%)
- #18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120) (by BinHPdev · 2026-02-16 · 86.6%)
- #17895: fix(cron): add staleness check for runningAtMs on manual trigger (by PlayerGhost · 2026-02-16 · 85.6%)
- #17664: fix(cron): detect and clear stale runningAtMs marker in manual run ... (by echoVic · 2026-02-16 · 85.3%)
- #17643: fix: clear stale runningAtMs in cron.run to allow manual triggers (by MisterGuy420 · 2026-02-16 · 85.2%)
- #17561: fix(cron): add runtime staleness guard for runningAtMs (#17554) (by robbyczgw-cla · 2026-02-15 · 85.0%)
- #19414: fix: respect job timeoutSeconds for stuck runningAtMs detection (by namabile · 2026-02-17 · 84.0%)
- #11108: fix(cron): prevent missed jobs from being skipped on timer recompute (by Bentlybro · 2026-02-07 · 83.4%)
- #10918: fix(cron): add tolerance for timer precision and skip due jobs in r... (by Cherwayway · 2026-02-07 · 82.6%)