← Back to PRs

#17561: fix(cron): add runtime staleness guard for runningAtMs (#17554)

by robbyczgw-cla open 2026-02-15 22:20 View on GitHub →
size: S experienced-contributor
## Problem Cron jobs can retain a stale `runningAtMs` timestamp after completion, permanently blocking future runs. Manual triggers via `cron.run` return `{"ran": false, "reason": "already-running"}` even though the job finished successfully (`lastStatus: "ok"`, `consecutiveErrors: 0`). ### Root cause All four scheduling/run paths unconditionally skip jobs where `runningAtMs` is set. A staleness check exists on **startup** (`ops.ts:27-32`) but not at **runtime**, so if a job completes but the marker persists (e.g. due to a race between timer ticks, a session timeout not propagating back, or an unhandled edge case in `executeJobCore`), the job is stuck until manual intervention or a full restart. ### How users hit this 1. Job runs on an `every` schedule and completes successfully 2. `runningAtMs` is not cleared despite `lastStatus: "ok"" 3. All subsequent timer ticks and manual `cron.run` calls skip the job 4. Only workaround: manually edit `jobs.json` or restart the gateway Closes #17554 ## Fix Add a runtime staleness guard: if `runningAtMs` is older than `2 × job timeout`, clear it with a warning log and allow the job to run again. This mirrors the existing startup behavior in `ops.ts`. ### Changes **`src/cron/service/timer.ts`:** - `resolveJobTimeoutMs(job, default)` — extracts inline timeout calculation (deduplicates from `onTimer`) - `isStaleRunning(job, nowMs, default)` — checks if `runningAtMs` exceeds `2 × timeout` - `clearStaleRunningMarker(state, job, nowMs, check)` — clears marker + warning log (exported for use in ops.ts) - Applied in all 3 filter functions: `findDueJobs`, `runMissedJobs`, `runDueJobs` **`src/cron/service/ops.ts`:** - `run()` function now also applies the staleness guard before returning `already-running` — this was the exact user-facing path reported in #17554 The `2×` multiplier provides a generous buffer — a job must be "running" for twice its timeout before being considered stale, avoiding false positives for legitimately long-running jobs. ## Testing All 110 existing cron tests pass. --- ## Local Validation - `pnpm build` ✅ - `pnpm check` (oxlint) ✅ - Relevant test suites pass ✅ ## Contribution Checklist - [x] Focused scope — single fix per PR - [x] Clear "what" + "why" in description - [x] AI-assisted (Codex/Claude) — reviewed and tested by human - [x] Local validation run (`pnpm build && pnpm check`) *AI-assisted (Codex). Reviewed by human.* <!-- greptile_comment --> <h3>Greptile Summary</h3> Adds a runtime staleness guard for the per-job `runningAtMs` marker in the cron scheduler. Previously, a stale `runningAtMs` could only be cleared on gateway startup (`ops.ts:start()`); this PR extends that protection to runtime by checking if `runningAtMs` exceeds `2× job timeout` and clearing it with a warning log. - Extracts `resolveJobTimeoutMs()` helper to deduplicate the inline timeout calculation from `onTimer` - Adds `isStaleRunning()` and `clearStaleRunningMarker()` functions in `timer.ts` - Applies the staleness guard in `isRunnableJob()`, covering all three automatic scheduling paths (`findDueJobs`, `runMissedJobs`, `runDueJobs`) - Applies the staleness guard in `ops.ts:run()` before returning `already-running`, directly fixing the user-facing symptom from #17554 - No new tests were added for the runtime staleness guard; existing tests all pass <h3>Confidence Score: 4/5</h3> - This PR is safe to merge — it adds a well-scoped defensive guard that correctly mirrors existing startup behavior. - The changes are focused, logically sound, and correctly address all four code paths where a stale `runningAtMs` could block job execution. The new staleness threshold (2× job timeout) provides a reasonable buffer against false positives. The helpers are clean and the refactoring of `resolveJobTimeoutMs` reduces duplication. Score is 4 rather than 5 because no new tests were added for the runtime staleness behavior, though the existing 110 tests continue to pass. - No files require special attention — both changed files are straightforward and well-structured. <sub>Last reviewed commit: 58490fe</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->

Most Similar PRs