#17664: fix(cron): detect and clear stale runningAtMs marker in manual run (#17554)
size: XS
Cluster:
Cron Job Stability Fixes
## Problem
A cron job can complete successfully (`lastStatus: 'ok'`) but retain a stale `runningAtMs` timestamp in its state. This causes `cron.run` (manual trigger) to return `{ran: false, reason: 'already-running'}` indefinitely, blocking the job from ever being triggered again.
The startup path (`ops.start`) already clears stale markers, but if the gateway doesn't restart, the job stays stuck.
## Fix
Add a staleness check in `ops.run`: if `runningAtMs` is older than 2× the job's configured timeout (or 20 min default), treat it as stale, log a warning, clear it, and allow the job to run.
This is a defensive fix that handles edge cases where `runningAtMs` is persisted but never cleared (e.g. process crash between persist and `applyJobResult`, or store reload race in the timer tick path).
## Checklist
- [x] Local validation: `oxlint` (type-aware) + `oxfmt` passed
- [x] Focused scope: Single staleness check added to `ops.run`
- [x] AI-assistance: AI-assisted for codebase exploration; manually verified
Fixes #17554
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds a defensive staleness check to `ops.run` (manual cron job trigger) that detects and clears stale `runningAtMs` markers. When a job's `runningAtMs` is older than 2× its configured timeout (default 20 minutes), the marker is treated as stale, a warning is logged, and the job is allowed to run again.
- Exports `DEFAULT_JOB_TIMEOUT_MS` from `timer.ts` and reuses it in `ops.run` for consistent timeout computation
- The staleness check mirrors the timeout computation pattern already used in the timer tick execution path (`timer.ts:230-233`)
- Complements the existing startup-time cleanup in `ops.start` and the 2-hour `STUCK_RUN_MS` cleanup in `normalizeJobTickState` (used during timer ticks)
- No test coverage was added for the new `ops.run` staleness path; existing tests cover the `start` path cleanup (`service.restart-catchup.test.ts`)
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — it adds a well-scoped defensive check with no impact on the happy path.
- The change is small, focused, and follows existing patterns exactly. The staleness detection logic is sound and the timeout computation mirrors the timer tick path. The only concern is the absence of test coverage for the new code path, but the logic is straightforward and the risk of regression is low.
- No files require special attention. Both changes are minimal and well-contained.
<sub>Last reviewed commit: 2e9a59b</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#17561: fix(cron): add runtime staleness guard for runningAtMs (#17554)
by robbyczgw-cla · 2026-02-15
93.4%
#17895: fix(cron): add staleness check for runningAtMs on manual trigger
by PlayerGhost · 2026-02-16
92.6%
#17643: fix: clear stale runningAtMs in cron.run to allow manual triggers
by MisterGuy420 · 2026-02-16
91.6%
#18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120)
by BinHPdev · 2026-02-16
88.1%
#17949: fix: clear stale runningAtMs in cron.run() before already-running c...
by yasumorishima · 2026-02-16
87.0%
#18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc...
by taw0002 · 2026-02-16
86.1%
#12018: fix(cron): clear stale running markers based on job timeout
by benzer25 · 2026-02-08
85.3%
#19414: fix: respect job timeoutSeconds for stuck runningAtMs detection
by namabile · 2026-02-17
84.2%
#17064: fix(cron): prevent control-plane starvation during startup catch-up...
by donggyu9208 · 2026-02-15
83.2%
#5179: fix(cron): recover stale running markers
by thatdaveb · 2026-01-31
81.9%