← Back to PRs

#17664: fix(cron): detect and clear stale runningAtMs marker in manual run (#17554)

by echoVic open 2026-02-16 02:16 View on GitHub →
size: XS
## Problem A cron job can complete successfully (`lastStatus: 'ok'`) but retain a stale `runningAtMs` timestamp in its state. This causes `cron.run` (manual trigger) to return `{ran: false, reason: 'already-running'}` indefinitely, blocking the job from ever being triggered again. The startup path (`ops.start`) already clears stale markers, but if the gateway doesn't restart, the job stays stuck. ## Fix Add a staleness check in `ops.run`: if `runningAtMs` is older than 2× the job's configured timeout (or 20 min default), treat it as stale, log a warning, clear it, and allow the job to run. This is a defensive fix that handles edge cases where `runningAtMs` is persisted but never cleared (e.g. process crash between persist and `applyJobResult`, or store reload race in the timer tick path). ## Checklist - [x] Local validation: `oxlint` (type-aware) + `oxfmt` passed - [x] Focused scope: Single staleness check added to `ops.run` - [x] AI-assistance: AI-assisted for codebase exploration; manually verified Fixes #17554 <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR adds a defensive staleness check to `ops.run` (manual cron job trigger) that detects and clears stale `runningAtMs` markers. When a job's `runningAtMs` is older than 2× its configured timeout (default 20 minutes), the marker is treated as stale, a warning is logged, and the job is allowed to run again. - Exports `DEFAULT_JOB_TIMEOUT_MS` from `timer.ts` and reuses it in `ops.run` for consistent timeout computation - The staleness check mirrors the timeout computation pattern already used in the timer tick execution path (`timer.ts:230-233`) - Complements the existing startup-time cleanup in `ops.start` and the 2-hour `STUCK_RUN_MS` cleanup in `normalizeJobTickState` (used during timer ticks) - No test coverage was added for the new `ops.run` staleness path; existing tests cover the `start` path cleanup (`service.restart-catchup.test.ts`) <h3>Confidence Score: 4/5</h3> - This PR is safe to merge — it adds a well-scoped defensive check with no impact on the happy path. - The change is small, focused, and follows existing patterns exactly. The staleness detection logic is sound and the timeout computation mirrors the timer tick path. The only concern is the absence of test coverage for the new code path, but the logic is straightforward and the risk of regression is low. - No files require special attention. Both changes are minimal and well-contained. <sub>Last reviewed commit: 2e9a59b</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->

Most Similar PRs