#17561: fix(cron): add runtime staleness guard for runningAtMs (#17554)
size: S
experienced-contributor
Cluster:
Cron Job Stability Fixes
## Problem
Cron jobs can retain a stale `runningAtMs` timestamp after completion, permanently blocking future runs. Manual triggers via `cron.run` return `{"ran": false, "reason": "already-running"}` even though the job finished successfully (`lastStatus: "ok"`, `consecutiveErrors: 0`).
### Root cause
All four scheduling/run paths unconditionally skip jobs where `runningAtMs` is set. A staleness check exists on **startup** (`ops.ts:27-32`) but not at **runtime**, so if a job completes but the marker persists (e.g. due to a race between timer ticks, a session timeout not propagating back, or an unhandled edge case in `executeJobCore`), the job is stuck until manual intervention or a full restart.
### How users hit this
1. Job runs on an `every` schedule and completes successfully
2. `runningAtMs` is not cleared despite `lastStatus: "ok""
3. All subsequent timer ticks and manual `cron.run` calls skip the job
4. Only workaround: manually edit `jobs.json` or restart the gateway
Closes #17554
## Fix
Add a runtime staleness guard: if `runningAtMs` is older than `2 × job timeout`, clear it with a warning log and allow the job to run again. This mirrors the existing startup behavior in `ops.ts`.
### Changes
**`src/cron/service/timer.ts`:**
- `resolveJobTimeoutMs(job, default)` — extracts inline timeout calculation (deduplicates from `onTimer`)
- `isStaleRunning(job, nowMs, default)` — checks if `runningAtMs` exceeds `2 × timeout`
- `clearStaleRunningMarker(state, job, nowMs, check)` — clears marker + warning log (exported for use in ops.ts)
- Applied in all 3 filter functions: `findDueJobs`, `runMissedJobs`, `runDueJobs`
**`src/cron/service/ops.ts`:**
- `run()` function now also applies the staleness guard before returning `already-running` — this was the exact user-facing path reported in #17554
The `2×` multiplier provides a generous buffer — a job must be "running" for twice its timeout before being considered stale, avoiding false positives for legitimately long-running jobs.
## Testing
All 110 existing cron tests pass.
---
## Local Validation
- `pnpm build` ✅
- `pnpm check` (oxlint) ✅
- Relevant test suites pass ✅
## Contribution Checklist
- [x] Focused scope — single fix per PR
- [x] Clear "what" + "why" in description
- [x] AI-assisted (Codex/Claude) — reviewed and tested by human
- [x] Local validation run (`pnpm build && pnpm check`)
*AI-assisted (Codex). Reviewed by human.*
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds a runtime staleness guard for the per-job `runningAtMs` marker in the cron scheduler. Previously, a stale `runningAtMs` could only be cleared on gateway startup (`ops.ts:start()`); this PR extends that protection to runtime by checking if `runningAtMs` exceeds `2× job timeout` and clearing it with a warning log.
- Extracts `resolveJobTimeoutMs()` helper to deduplicate the inline timeout calculation from `onTimer`
- Adds `isStaleRunning()` and `clearStaleRunningMarker()` functions in `timer.ts`
- Applies the staleness guard in `isRunnableJob()`, covering all three automatic scheduling paths (`findDueJobs`, `runMissedJobs`, `runDueJobs`)
- Applies the staleness guard in `ops.ts:run()` before returning `already-running`, directly fixing the user-facing symptom from #17554
- No new tests were added for the runtime staleness guard; existing tests all pass
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — it adds a well-scoped defensive guard that correctly mirrors existing startup behavior.
- The changes are focused, logically sound, and correctly address all four code paths where a stale `runningAtMs` could block job execution. The new staleness threshold (2× job timeout) provides a reasonable buffer against false positives. The helpers are clean and the refactoring of `resolveJobTimeoutMs` reduces duplication. Score is 4 rather than 5 because no new tests were added for the runtime staleness behavior, though the existing 110 tests continue to pass.
- No files require special attention — both changed files are straightforward and well-structured.
<sub>Last reviewed commit: 58490fe</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#17664: fix(cron): detect and clear stale runningAtMs marker in manual run ...
by echoVic · 2026-02-16
93.4%
#18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120)
by BinHPdev · 2026-02-16
90.8%
#17895: fix(cron): add staleness check for runningAtMs on manual trigger
by PlayerGhost · 2026-02-16
90.6%
#17643: fix: clear stale runningAtMs in cron.run to allow manual triggers
by MisterGuy420 · 2026-02-16
89.4%
#19414: fix: respect job timeoutSeconds for stuck runningAtMs detection
by namabile · 2026-02-17
87.8%
#18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc...
by taw0002 · 2026-02-16
87.6%
#17949: fix: clear stale runningAtMs in cron.run() before already-running c...
by yasumorishima · 2026-02-16
86.8%
#12018: fix(cron): clear stale running markers based on job timeout
by benzer25 · 2026-02-08
85.0%
#17064: fix(cron): prevent control-plane starvation during startup catch-up...
by donggyu9208 · 2026-02-15
84.4%
#16132: fix(cron): prevent duplicate job fires via MIN_REFIRE_GAP_MS guard
by widingmarcus-cyber · 2026-02-14
83.9%