#16888: fix(cron): execute missed jobs outside the lock to unblock list/status queries
size: S
Cluster:
Cron Job Stability Fixes
Fixes #16890
## Problem
When the cron service starts, `start()` calls `runMissedJobs()` **inside** the `locked()` block. Each missed job can take up to its full timeout (default 10 min, typically configured to 2 min) to execute. During this time, every other operation that needs the lock — including `cron.list()` and `cron.status()` used by the WebUI — is blocked waiting for the promise chain to drain.
In practice this means the WebUI shows no cron jobs (or hangs) for minutes after a gateway restart whenever there are past-due jobs.
## Root Cause
`start()` in `ops.ts` acquires the serializing lock via `locked(state, ...)` and then calls `runMissedJobs()`, which iterates over all missed jobs and awaits `executeJob()` for each one — all while still holding the lock.
By contrast, `onTimer()` in `timer.ts` already follows a correct 3-phase pattern:
1. **Lock** briefly to find due jobs and mark them as running
2. **Release** the lock and execute jobs
3. **Lock** briefly again to persist results
`start()` was not following this pattern.
## Fix
Restructure `start()` to use the same 3-phase approach as `onTimer()`:
- **Phase 1 (locked):** Load state, clear stale running markers, identify missed jobs, mark them as `runningAtMs`, persist, arm timer, then release the lock.
- **Phase 2 (unlocked):** Execute each missed job via `executeJobCore()` with a `Promise.race` timeout — identical to how `onTimer()` runs jobs.
- **Phase 3 (locked):** Re-acquire the lock, reload state, apply results via `applyJobResult()`, emit events, handle `deleteAfterRun`, recompute next runs, persist, and re-arm the timer.
To support this, `applyJobResult` and `executeJobCore` are now exported from `timer.ts` (previously module-private).
## Testing
- Verified on a production gateway with 4 cron jobs (each with 120s timeout)
- Before fix: `cron.list` / `cron.status` WebSocket calls blocked for 80-110+ seconds after restart
- After fix: cron service starts in ~1 second, WebUI immediately shows all jobs
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes a lock-contention issue where `start()` held the cron serialization lock for the entire duration of missed job execution, blocking `list()` and `status()` queries used by the WebUI. The fix restructures `start()` into the same 3-phase pattern already used by `onTimer()`: (1) lock briefly to identify missed jobs and mark them as running, (2) execute jobs outside the lock, (3) lock briefly to persist results.
- The core fix is sound and mirrors the existing `onTimer()` pattern in `timer.ts`
- `applyJobResult` and `executeJobCore` are exported from `timer.ts` to support the refactored `start()`
- `runMissedJobs` and `runDueJobs` in `timer.ts` are now dead code (no remaining callers) and could be cleaned up
- `DEFAULT_JOB_TIMEOUT_MS` is duplicated between `ops.ts` and `timer.ts`
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — it correctly applies an established pattern from the same codebase to fix a real lock-contention issue.
- The 3-phase lock-execute-lock pattern is already proven in `onTimer()`, and the new `start()` closely follows it. The lock serialization via `locked()` ensures Phase 3 sections cannot overlap. Minor style issues (dead code, duplicated constant) don't affect correctness.
- No files require special attention — both changes are straightforward. `src/cron/service/timer.ts` has leftover dead code that can be cleaned up in a follow-up.
<sub>Last reviewed commit: 39cc557</sub>
<!-- greptile_other_comments_section -->
<sub>(3/5) Reply to the agent's comments like "Can you suggest a fix for this @greptileai?" or ask follow-up questions!</sub>
<!-- /greptile_comment -->
Most Similar PRs
#17064: fix(cron): prevent control-plane starvation during startup catch-up...
by donggyu9208 · 2026-02-15
87.7%
#10829: fix: prevent cron scheduler permanent death on transient startup/ru...
by meaadore1221-afk · 2026-02-07
84.3%
#18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc...
by taw0002 · 2026-02-16
84.2%
#13065: fix(cron): Fix "every" schedule not re-arming after gateway restart
by trevorgordon981 · 2026-02-10
83.0%
#18925: fix(cron): stagger missed jobs on restart to prevent gateway overload
by rexlunae · 2026-02-17
82.9%
#8034: fix(cron): run past-due one-shot jobs immediately on startup
by FelixFoster · 2026-02-03
82.6%
#17561: fix(cron): add runtime staleness guard for runningAtMs (#17554)
by robbyczgw-cla · 2026-02-15
82.5%
#12086: fix(cron): ensure timer callback fires for scheduled jobs
by divol89 · 2026-02-08
82.2%
#12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall
by colddonkey · 2026-02-09
82.2%
#11108: fix(cron): prevent missed jobs from being skipped on timer recompute
by Bentlybro · 2026-02-07
82.1%