#17064: fix(cron): prevent control-plane starvation during startup catch-up and manual runs
stale
size: M
Cluster: Cron Job Stability Fixes
## Bug fix: prevent cron control-plane starvation during startup catch-up and manual runs
Fresh PR as requested, **with no workflow-file changes**.
### Scope
- `src/cron/service/ops.ts`
- `src/cron/service/timer.ts`
- `src/cron/service.runs-one-shot-main-job-disables-it.test.ts`
### What’s fixed
- Fixes control-plane starvation in the cron service: cron list / cron status / cron run could time out while heavy job execution was in progress (especially during startup catch-up of missed jobs or long manual runs).
- Startup catch-up replay moved out of startup lock scope.
- Manual run path split into lock-safe phases (prepare locked / execute unlocked / finalize locked).
- Stale unlocked store-read risk addressed by execution snapshot + re-find under lock in finalize.
- Startup replay result persistence aligned with safe run-state transitions.
- Wake-mode related tests stabilized for persisted state assertions.
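
To illustrate the manual-run split described above (prepare locked / execute unlocked / finalize locked, with an execution snapshot and a re-find in finalize), here is a minimal sketch. It is not the code from `src/cron/service/ops.ts`: `Job`, `Locked`, `createLock`, and `CronSketch` are hypothetical names, and the in-memory `jobs` array stands in for the persisted store.

```ts
// Illustrative sketch only: all names here are hypothetical stand-ins,
// not the actual implementation in src/cron/service/ops.ts.
type Job = { id: string; name: string; runningAtMs?: number; lastStatus?: "ok" | "error" };
type Locked = <T>(fn: () => T | Promise<T>) => Promise<T>;

// Tiny promise-chain lock: each callback runs after the previous one settles.
function createLock(): Locked {
  let tail: Promise<unknown> = Promise.resolve();
  return <T>(fn: () => T | Promise<T>): Promise<T> => {
    const result = tail.then(fn);
    tail = result.catch(() => undefined); // a failed callback must not wedge the queue
    return result;
  };
}

class CronSketch {
  private readonly locked: Locked = createLock();
  private jobs: Job[];

  constructor(jobs: Job[]) {
    this.jobs = jobs;
  }

  // Control-plane read: only needs the lock for a quick copy.
  list(): Promise<Job[]> {
    return this.locked(() => this.jobs.map((j) => ({ ...j })));
  }

  remove(id: string): Promise<void> {
    return this.locked(() => {
      this.jobs = this.jobs.filter((j) => j.id !== id);
    });
  }

  async run(id: string, execute: (snapshot: Job) => Promise<"ok" | "error">) {
    // Phase 1 (locked): reserve the run and take a deep-cloned snapshot.
    const snapshot = await this.locked(() => {
      const job = this.jobs.find((j) => j.id === id);
      if (!job || job.runningAtMs !== undefined) return undefined;
      job.runningAtMs = Date.now();
      return JSON.parse(JSON.stringify(job)) as Job;
    });
    if (!snapshot) return { ran: false as const };

    // Phase 2 (unlocked): the long-running part works only on the snapshot.
    const status = await execute(snapshot);

    // Phase 3 (locked): re-find the job, since it may have been deleted meanwhile.
    return this.locked(() => {
      const job = this.jobs.find((j) => j.id === id);
      if (job) {
        delete job.runningAtMs;
        job.lastStatus = status;
      }
      return { ran: true as const, status, jobStillExists: job !== undefined };
    });
  }
}
```

In this sketch `list()` and `remove()` can complete while `execute` is still running, because the lock is held only for the short prepare and finalize callbacks.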
### Symptoms seen by users
- cron list / cron status / cron run hanging or timing out
- gateway timeout errors (e.g. 30s/120s) even though the process was alive
- control-plane felt “stuck” until long-running cron execution finished
### Root Cause
- Control-plane operations were starved by execution paths (startup catch-up and forced/manual runs) that kept the cron lock scope occupied for the full duration of job execution, so simple management RPCs such as list and status could not respond in time.
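
The effect can be reproduced in isolation. The sketch below is a simplified model, not the service code: `createLock` and `demo` are hypothetical helpers, and the "job" is a promise that finishes only when told to. It records the order in which a long job and a `list`-style request complete, with and without the lock held during execution.

```ts
// Simplified model of the starvation; `createLock` and `demo` are hypothetical helpers.
type Locked = <T>(fn: () => T | Promise<T>) => Promise<T>;

function createLock(): Locked {
  let tail: Promise<unknown> = Promise.resolve();
  return <T>(fn: () => T | Promise<T>): Promise<T> => {
    const result = tail.then(fn);
    tail = result.catch(() => undefined);
    return result;
  };
}

async function demo(holdLockWhileExecuting: boolean): Promise<string[]> {
  const events: string[] = [];
  const locked = createLock();

  // A "long-running job" that finishes only when finishJob() is called.
  let finishJob: () => void = () => {};
  const jobDone = new Promise<void>((resolve) => { finishJob = resolve; });
  const execute = async (): Promise<void> => {
    await jobDone;
    events.push("job finished");
  };

  const run = holdLockWhileExecuting
    ? locked(execute) // old shape: lock held for the whole execution
    : locked(() => undefined).then(execute); // new shape: short locked step, then run unlocked

  // A control-plane request arriving while the job is in flight.
  const list = locked(() => { events.push("list answered"); });

  // Give the request a chance to be served before the job is allowed to finish.
  await new Promise<void>((resolve) => setTimeout(resolve, 0));
  finishJob();
  await Promise.all([run, list]);
  return events;
}
```

With `demo(true)` the request is answered only after the job finishes (`["job finished", "list answered"]`); with `demo(false)` it is answered while the job is still running (`["list answered", "job finished"]`). In the real service the first ordering shows up as a gateway timeout when the job takes minutes.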
### How to reproduce
1. Configure at least one cron job that takes a long time (e.g. several minutes).
2. Trigger one of these pressure scenarios:
   - Startup catch-up: restart service after missed runs so catch-up executes many due jobs.
   - Manual run pressure: invoke cron run on a long-running job (especially forced/default-force behavior).
3. While execution is in progress, repeatedly call:
   - cron list
   - cron status
   - cron run (another job)
4. Observe control-plane timeouts / unresponsiveness.
### Local verification
- `corepack pnpm exec vitest run src/cron/**/*.test.ts` (106 passed)
- `corepack pnpm tsgo`
- `corepack pnpm format:check`
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes cron control-plane starvation by splitting long-running operations (startup catch-up replay and manual runs) out of the lock scope using a prepare/execute/finalize pattern.
- **`start()` refactored** into three locked phases: (1) clear stale running markers and persist, (2) run missed jobs outside the lock via `runMissedJobs`, (3) force-reload store, recompute next runs, persist, and arm the timer. This prevents startup catch-up from blocking other control-plane operations.
- **`run()` split** into lock-safe phases: prepare (reserve run atomically + snapshot job under lock), execute (run job core unlocked with the snapshot), finalize (re-find job under lock, apply results, handle deletion, recompute, persist, re-arm timer).
- **`runMissedJobs()` refactored** with the same three-phase pattern per missed job, using `locked()` for prepare and finalize while executing unlocked.
- **`applyJobResult` and `executeJobCore` exported** from `timer.ts` to be reused in `ops.ts` for the new finalize phases.
- **Tests stabilized**: assertions now re-fetch job state via `cron.list()` instead of using stale in-memory references, `deleteAfterRun: false` added where needed, and `vi.waitFor()` used for async heartbeat assertions.
- Removed `collectRunnableJobs`/`isRunnableJob` helpers (inlined) and `ensureLoadedForRead` (inlined into `status`/`list`).
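
As a rough illustration of the startup flow summarized above (clear stale running markers under lock, replay missed jobs outside the lock with a locked prepare and finalize per job, then recompute next runs under lock), consider the sketch below. It is an assumption-laden model, not the PR's `start()` or `runMissedJobs()`: the `Job` shape, `createLock`, the fixed `everyMs` schedule, and returning the earliest next run in place of arming a timer are all hypothetical simplifications.

```ts
// Hypothetical model of the startup phases; not the code in src/cron/service/ops.ts.
type Job = { id: string; everyMs: number; nextRunAtMs: number; runningAtMs?: number };
type Locked = <T>(fn: () => T | Promise<T>) => Promise<T>;

function createLock(): Locked {
  let tail: Promise<unknown> = Promise.resolve();
  return <T>(fn: () => T | Promise<T>): Promise<T> => {
    const result = tail.then(fn);
    tail = result.catch(() => undefined);
    return result;
  };
}

async function start(
  store: { jobs: Job[] },
  locked: Locked,
  runJob: (snapshot: Job) => Promise<void>,
  now: number,
): Promise<number> {
  // Phase 1 (locked): clear markers left over from a previous process and
  // collect the ids of jobs whose run time has already passed.
  const missedIds = await locked(() => {
    for (const job of store.jobs) delete job.runningAtMs;
    return store.jobs.filter((j) => j.nextRunAtMs <= now).map((j) => j.id);
  });

  // Phase 2: replay missed jobs; the lock is taken only around prepare and finalize.
  for (const id of missedIds) {
    const snapshot = await locked(() => {
      const job = store.jobs.find((j) => j.id === id);
      if (!job) return undefined;
      job.runningAtMs = now;
      return { ...job };
    });
    if (!snapshot) continue;

    await runJob(snapshot); // unlocked: other callers can take the lock here

    await locked(() => {
      const job = store.jobs.find((j) => j.id === id); // re-find; it may be gone
      if (job) {
        delete job.runningAtMs;
        job.nextRunAtMs = now + job.everyMs;
      }
    });
  }

  // Phase 3 (locked): compute the earliest next run, i.e. what a timer would be
  // armed for. Yields Infinity when the store has no jobs.
  return locked(() => Math.min(...store.jobs.map((j) => j.nextRunAtMs)));
}
```

Because `runJob` executes between the locked sections, a job body (or any other caller) can itself take the lock during catch-up without deadlocking.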
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — the core locking pattern is sound, defensive checks cover edge cases, and tests have been adapted to the new execution model.
- The three-phase prepare/execute/finalize pattern is a well-established concurrency approach. The finalize phases correctly force-reload state under lock to avoid stale reads. The execution snapshot via JSON deep clone prevents data corruption. Tests pass and cover the key scenarios. Minor style concern with unreachable defensive guards in `run()`, but no logical issues found.
- `src/cron/service/ops.ts` deserves careful review of the `run()` function's three-phase split, particularly the type narrowing guards at lines 247-255 which are unreachable but harmless defensive code.
<sub>Last reviewed commit: 8a42c01</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #16888: fix(cron): execute missed jobs outside the lock to unblock list/sta... (by hou-rong · 2026-02-15 · 87.7%)
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 85.4%)
- #17561: fix(cron): add runtime staleness guard for runningAtMs (#17554) (by robbyczgw-cla · 2026-02-15 · 84.4%)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey · 2026-02-09 · 83.4%)
- #17949: fix: clear stale runningAtMs in cron.run() before already-running c... (by yasumorishima · 2026-02-16 · 83.4%)
- #17664: fix(cron): detect and clear stale runningAtMs marker in manual run ... (by echoVic · 2026-02-16 · 83.2%)
- #10829: fix: prevent cron scheduler permanent death on transient startup/ru... (by meaadore1221-afk · 2026-02-07 · 83.1%)
- #19414: fix: respect job timeoutSeconds for stuck runningAtMs detection (by namabile · 2026-02-17 · 82.5%)
- #13065: fix(cron): Fix "every" schedule not re-arming after gateway restart (by trevorgordon981 · 2026-02-10 · 82.0%)
- #13055: fix: prevent cron RPC stalls with timeout and caching (#13018) (by trevorgordon981 · 2026-02-10 · 81.8%)