#17949: fix: clear stale runningAtMs in cron.run() before already-running check (#17554)

by yasumorishima open 2026-02-16 10:12 View on GitHub →

stale size: S

## Summary - `run()` in `ops.ts` now calls `recomputeNextRunsForMaintenance(state)` before the `already-running` guard - This reuses the existing `normalizeJobTickState` logic (STUCK_RUN_MS = 2h) to clear stale `runningAtMs` markers - Minimal change: 1 line in `ops.ts`, import already existed ## Problem After a Phase-1 crash (job marked as running, persist written, then crash before Phase-3 clear), a manual `cron.run()` was blocked for up to 2 hours with `"already-running"` even though no job was actually running. The timer path (`onTimer`) called `recomputeNextRuns` → `walkSchedulableJobs` → `normalizeJobTickState` which cleared the stale marker after STUCK_RUN_MS. But `run()` used `ensureLoaded({ skipRecompute: true })`, bypassing this normalization entirely. ## Fix ```typescript await ensureLoaded(state, { skipRecompute: true }); // Normalize job tick state (clears stale runningAtMs markers) before // checking if already running, so a stale marker from a crashed Phase-1 // persist does not block manual triggers for up to STUCK_RUN_MS (#17554). recomputeNextRunsForMaintenance(state); const job = findJobOrThrow(state, id); ``` `recomputeNextRunsForMaintenance` is the maintenance-only variant that clears stuck markers and fills missing `nextRunAtMs`, but does NOT advance past-due values — safe to call on read/write paths. ## Test Added regression test in `service.issue-regressions.test.ts`: - Writes a job with `runningAtMs` set to `now - STUCK_RUN_MS - 1` (stale by 1ms) - Calls `run()` directly without `start()` (start clears all markers on boot, which would hide the bug) - Asserts `{ ok: true, ran: true }` and that `enqueueSystemEvent` was called ## Note This PR was developed with the assistance of [Claude Code](https://claude.ai/claude-code).  <h3>Greptile Summary</h3> Fixes a bug where `run()` in `ops.ts` could be blocked by a stale `runningAtMs` marker left over from a Phase-1 crash. The fix adds a single call to `recomputeNextRunsForMaintenance(state)` before the already-running guard, reusing the existing `normalizeJobTickState` logic (with `STUCK_RUN_MS` = 2h threshold) to clear stuck markers. This mirrors the pattern already used by `ensureLoadedForRead` and the timer path (`onTimer`). - Added `recomputeNextRunsForMaintenance(state)` call in `run()` between `ensureLoaded` and the `runningAtMs` check - Added a regression test that writes a job with a stale `runningAtMs` (expired by 1ms past `STUCK_RUN_MS`), calls `run()` directly (bypassing `start()` which would mask the bug), and asserts successful execution <h3>Confidence Score: 5/5</h3> - This PR is safe to merge — minimal, well-scoped fix that reuses existing normalization logic with a targeted regression test. - The change is a single line addition in `ops.ts` that calls an existing, well-tested maintenance function (`recomputeNextRunsForMaintenance`) before the already-running guard in `run()`. The import already existed. The function is the same one used by `ensureLoadedForRead` for read-only operations, and its behavior (clearing stuck markers older than 2 hours, filling missing `nextRunAtMs` without advancing past-due values) is safe for the `run()` write path. The regression test directly targets the bug scenario and follows existing test patterns in the file. - No files require special attention. <sub>Last reviewed commit: 3d44252</sub>  <sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub>