
#12018: fix(cron): clear stale running markers based on job timeout

by benzer25 · open · 2026-02-08 18:26
stale
## Problem

Cron jobs can get stuck as "already running" if a run crashes after setting `runningAtMs` but before `applyJobResult` clears it. The lock is only cleared after 2h (`STUCK_RUN_MS`), so short-interval jobs are falsely reported as "already running" for up to two hours.

## Fix

Clear stale running markers when:

- `now - runningAtMs > (job timeout + 30s grace)`

This check is applied in:

- the manual `cron run` path
- the scheduler's `findDueJobs`

## Notes

Long jobs stay protected: the timeout used is the job's `timeoutSeconds` when set, otherwise `DEFAULT_JOB_TIMEOUT_MS`.

## Greptile Overview

### Greptile Summary

This PR attempts to reduce false “already running” cron locks by clearing `runningAtMs` markers once they exceed a job-specific timeout (+30s grace), applying the logic both in the manual `run` path and in the scheduler’s due/missed job selection.

Main concern: the scheduler-side clearing happens inside filter predicates, with no guaranteed `persist()` afterwards. In the common case where a job is stale but not yet due, the marker is cleared only in memory and reappears after the next reload, so it keeps blocking execution.

### Confidence Score: 3/5

- This PR is mergeable after fixing the stale-lock persistence issue and aligning the timeout constants.
- The core idea is sound, but clearing `runningAtMs` without persisting can leave the system effectively unchanged after a reload; in addition, differing fallback timeout constants between the manual and scheduler paths can make staleness detection inconsistent between them.
- `src/cron/service/timer.ts`, `src/cron/service/ops.ts`
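A minimal TypeScript sketch of the staleness rule, written to address both review points. It assumes a simplified job shape: `CronJobSketch`, the placement of `timeoutSeconds`, the helper names, and the `DEFAULT_JOB_TIMEOUT_MS` value are all hypothetical, not the actual `src/cron` code. The timeout is resolved through one shared helper, so the manual and scheduler paths cannot disagree on the fallback, and markers are cleared in a dedicated pass that reports whether anything changed.

```typescript
// Hypothetical job shape; the real cron job type is not shown in this PR.
interface CronJobSketch {
  id: string;
  timeoutSeconds?: number; // assumed location of the per-job timeout
  state: { runningAtMs?: number };
}

// Placeholder value: the PR names DEFAULT_JOB_TIMEOUT_MS but not its value.
const DEFAULT_JOB_TIMEOUT_MS = 10 * 60_000;
// 30s grace on top of the job timeout, as described in the PR.
const STALE_GRACE_MS = 30_000;

// Single source of truth for the timeout, shared by the manual and scheduler paths.
function jobTimeoutMs(job: CronJobSketch): number {
  const secs = job.timeoutSeconds;
  return secs !== undefined && secs > 0 ? secs * 1000 : DEFAULT_JOB_TIMEOUT_MS;
}

// True when a running marker is older than the job timeout plus grace.
function isRunMarkerStale(job: CronJobSketch, nowMs: number): boolean {
  const startedAt = job.state.runningAtMs;
  if (startedAt === undefined) return false;
  return nowMs - startedAt > jobTimeoutMs(job) + STALE_GRACE_MS;
}

// Clears stale markers in a dedicated pass (not inside a filter predicate)
// and returns true if any job changed, so the caller knows to persist.
function clearStaleRunMarkers(jobs: CronJobSketch[], nowMs: number): boolean {
  let changed = false;
  for (const job of jobs) {
    if (isRunMarkerStale(job, nowMs)) {
      delete job.state.runningAtMs;
      changed = true;
    }
  }
  return changed;
}

// Example at t=100s: "crashed" is stale (100s > 60s + 30s), "in-flight" is not (20s).
const jobs: CronJobSketch[] = [
  { id: "crashed", timeoutSeconds: 60, state: { runningAtMs: 0 } },
  { id: "in-flight", timeoutSeconds: 60, state: { runningAtMs: 80_000 } },
];
const changed = clearStaleRunMarkers(jobs, 100_000);
```

In the scheduler this pass would run before due-job selection, for example `if (clearStaleRunMarkers(jobs, Date.now())) await persist();`, where `persist` stands in for whatever save routine the service actually uses. Keeping the mutation out of `filter` predicates means a job that is stale but not yet due still has its cleared marker written to disk.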
