← Back to PRs

#16888: fix(cron): execute missed jobs outside the lock to unblock list/status queries

by hou-rong open 2026-02-15 06:37 View on GitHub →
size: S
Fixes #16890 ## Problem When the cron service starts, `start()` calls `runMissedJobs()` **inside** the `locked()` block. Each missed job can take up to its full timeout (default 10 min, typically configured to 2 min) to execute. During this time, every other operation that needs the lock — including `cron.list()` and `cron.status()` used by the WebUI — is blocked waiting for the promise chain to drain. In practice this means the WebUI shows no cron jobs (or hangs) for minutes after a gateway restart whenever there are past-due jobs. ## Root Cause `start()` in `ops.ts` acquires the serializing lock via `locked(state, ...)` and then calls `runMissedJobs()`, which iterates over all missed jobs and awaits `executeJob()` for each one — all while still holding the lock. By contrast, `onTimer()` in `timer.ts` already follows a correct 3-phase pattern: 1. **Lock** briefly to find due jobs and mark them as running 2. **Release** the lock and execute jobs 3. **Lock** briefly again to persist results `start()` was not following this pattern. ## Fix Restructure `start()` to use the same 3-phase approach as `onTimer()`: - **Phase 1 (locked):** Load state, clear stale running markers, identify missed jobs, mark them as `runningAtMs`, persist, arm timer, then release the lock. - **Phase 2 (unlocked):** Execute each missed job via `executeJobCore()` with a `Promise.race` timeout — identical to how `onTimer()` runs jobs. - **Phase 3 (locked):** Re-acquire the lock, reload state, apply results via `applyJobResult()`, emit events, handle `deleteAfterRun`, recompute next runs, persist, and re-arm the timer. To support this, `applyJobResult` and `executeJobCore` are now exported from `timer.ts` (previously module-private). ## Testing - Verified on a production gateway with 4 cron jobs (each with 120s timeout) - Before fix: `cron.list` / `cron.status` WebSocket calls blocked for 80-110+ seconds after restart - After fix: cron service starts in ~1 second, WebUI immediately shows all jobs <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR fixes a lock-contention issue where `start()` held the cron serialization lock for the entire duration of missed job execution, blocking `list()` and `status()` queries used by the WebUI. The fix restructures `start()` into the same 3-phase pattern already used by `onTimer()`: (1) lock briefly to identify missed jobs and mark them as running, (2) execute jobs outside the lock, (3) lock briefly to persist results. - The core fix is sound and mirrors the existing `onTimer()` pattern in `timer.ts` - `applyJobResult` and `executeJobCore` are exported from `timer.ts` to support the refactored `start()` - `runMissedJobs` and `runDueJobs` in `timer.ts` are now dead code (no remaining callers) and could be cleaned up - `DEFAULT_JOB_TIMEOUT_MS` is duplicated between `ops.ts` and `timer.ts` <h3>Confidence Score: 4/5</h3> - This PR is safe to merge — it correctly applies an established pattern from the same codebase to fix a real lock-contention issue. - The 3-phase lock-execute-lock pattern is already proven in `onTimer()`, and the new `start()` closely follows it. The lock serialization via `locked()` ensures Phase 3 sections cannot overlap. Minor style issues (dead code, duplicated constant) don't affect correctness. - No files require special attention — both changes are straightforward. `src/cron/service/timer.ts` has leftover dead code that can be cleaned up in a follow-up. <sub>Last reviewed commit: 39cc557</sub> <!-- greptile_other_comments_section --> <sub>(3/5) Reply to the agent's comments like "Can you suggest a fix for this @greptileai?" or ask follow-up questions!</sub> <!-- /greptile_comment -->

Most Similar PRs