#17949: fix: clear stale runningAtMs in cron.run() before already-running check (#17554)
stale
size: S
Cluster:
Cron Job Stability Fixes
## Summary
- `run()` in `ops.ts` now calls `recomputeNextRunsForMaintenance(state)` before the `already-running` guard
- This reuses the existing `normalizeJobTickState` logic (STUCK_RUN_MS = 2h) to clear stale `runningAtMs` markers
- Minimal change: 1 line in `ops.ts`, import already existed
## Problem
After a Phase-1 crash (job marked as running, persist written, then crash before Phase-3 clear), a manual `cron.run()` was blocked for up to 2 hours with `"already-running"` even though no job was actually running.
The timer path (`onTimer`) called `recomputeNextRuns` → `walkSchedulableJobs` → `normalizeJobTickState` which cleared the stale marker after STUCK_RUN_MS. But `run()` used `ensureLoaded({ skipRecompute: true })`, bypassing this normalization entirely.
## Fix
```typescript
await ensureLoaded(state, { skipRecompute: true });
// Normalize job tick state (clears stale runningAtMs markers) before
// checking if already running, so a stale marker from a crashed Phase-1
// persist does not block manual triggers for up to STUCK_RUN_MS (#17554).
recomputeNextRunsForMaintenance(state);
const job = findJobOrThrow(state, id);
```
`recomputeNextRunsForMaintenance` is the maintenance-only variant that clears stuck markers and fills missing `nextRunAtMs`, but does NOT advance past-due values — safe to call on read/write paths.
## Test
Added regression test in `service.issue-regressions.test.ts`:
- Writes a job with `runningAtMs` set to `now - STUCK_RUN_MS - 1` (stale by 1ms)
- Calls `run()` directly without `start()` (start clears all markers on boot, which would hide the bug)
- Asserts `{ ok: true, ran: true }` and that `enqueueSystemEvent` was called
## Note
This PR was developed with the assistance of [Claude Code](https://claude.ai/claude-code).
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Fixes a bug where `run()` in `ops.ts` could be blocked by a stale `runningAtMs` marker left over from a Phase-1 crash. The fix adds a single call to `recomputeNextRunsForMaintenance(state)` before the already-running guard, reusing the existing `normalizeJobTickState` logic (with `STUCK_RUN_MS` = 2h threshold) to clear stuck markers. This mirrors the pattern already used by `ensureLoadedForRead` and the timer path (`onTimer`).
- Added `recomputeNextRunsForMaintenance(state)` call in `run()` between `ensureLoaded` and the `runningAtMs` check
- Added a regression test that writes a job with a stale `runningAtMs` (expired by 1ms past `STUCK_RUN_MS`), calls `run()` directly (bypassing `start()` which would mask the bug), and asserts successful execution
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge — minimal, well-scoped fix that reuses existing normalization logic with a targeted regression test.
- The change is a single line addition in `ops.ts` that calls an existing, well-tested maintenance function (`recomputeNextRunsForMaintenance`) before the already-running guard in `run()`. The import already existed. The function is the same one used by `ensureLoadedForRead` for read-only operations, and its behavior (clearing stuck markers older than 2 hours, filling missing `nextRunAtMs` without advancing past-due values) is safe for the `run()` write path. The regression test directly targets the bug scenario and follows existing test patterns in the file.
- No files require special attention.
<sub>Last reviewed commit: 3d44252</sub>
<!-- greptile_other_comments_section -->
<sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub>
<!-- /greptile_comment -->
Most Similar PRs
#17643: fix: clear stale runningAtMs in cron.run to allow manual triggers
by MisterGuy420 · 2026-02-16
87.9%
#17664: fix(cron): detect and clear stale runningAtMs marker in manual run ...
by echoVic · 2026-02-16
87.0%
#17561: fix(cron): add runtime staleness guard for runningAtMs (#17554)
by robbyczgw-cla · 2026-02-15
86.8%
#18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc...
by taw0002 · 2026-02-16
86.4%
#18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120)
by BinHPdev · 2026-02-16
85.0%
#17895: fix(cron): add staleness check for runningAtMs on manual trigger
by PlayerGhost · 2026-02-16
84.6%
#19414: fix: respect job timeoutSeconds for stuck runningAtMs detection
by namabile · 2026-02-17
84.5%
#5179: fix(cron): recover stale running markers
by thatdaveb · 2026-01-31
84.4%
#14667: fix: preserve missed cron runs when updating job schedule
by WalterSumbon · 2026-02-12
84.3%
#12982: fix(cron): prevent status/list from advancing overdue job nextRunAtMs
by hclsys · 2026-02-10
83.8%