#13065: fix(cron): Fix "every" schedule not re-arming after gateway restart
docs
stale
Cluster: Cron Job Management Fixes
## Summary
Fixes #13036
## Problem
Cron jobs using `schedule.kind: "every"` (interval-based scheduling) silently stop firing after a gateway restart. `nextRunAtMs` is set in the job state, but the timer never fires: jobs remain `enabled: true` with a `nextRunAtMs` that may be in the future or already in the past, yet no runs occur.
## Root Cause
During gateway restart, the cron service startup sequence had a race condition:
1. `runMissedJobs` was called within a lock that loaded state from disk
2. Job execution updated `nextRunAtMs` in memory but changes were not properly persisted
3. The timer was armed with stale state, causing "every" schedules to not fire
Additionally, the `recomputeNextRuns` function would skip recomputing "every" schedules if their `nextRunAtMs` was still in the future, even if the timing had drifted during downtime.
## Solution
This PR restructures the cron service startup to properly handle state persistence:
### 1. Split `start()` into 3 distinct phases:
- **Phase 1**: Load store and clear stale markers (under lock)
- **Phase 2**: Run missed jobs (outside lock with its own locking)
- **Phase 3**: Recompute and arm timer (under lock)
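
The phase ordering above can be sketched as follows. This is a minimal illustration, not the actual `CronService` code: the class name, the `locked` helper, the `Job` shape, and the log strings are all hypothetical. The lock here is a simple, non-re-entrant promise chain, so in this sketch phase 2 has to run outside the outer lock because it acquires the lock itself.

```typescript
// Hypothetical sketch: none of these names are taken from the real CronService source.
type Job = { id: string; runningAtMs?: number; nextRunAtMs?: number };

class SketchCronService {
  private tail: Promise<void> = Promise.resolve();
  private jobs: Job[] = [];
  readonly log: string[] = [];

  // Minimal promise-chain mutex: each caller waits for the previous holder.
  private locked<T>(fn: () => Promise<T> | T): Promise<T> {
    const run = this.tail.then(fn);
    // Keep the chain usable even if fn rejects.
    this.tail = run.then(() => undefined, () => undefined);
    return run;
  }

  async start(): Promise<void> {
    // Phase 1: load the store and clear stale "running" markers (under lock).
    await this.locked(() => {
      this.jobs = [{ id: "a", runningAtMs: 1 }]; // stand-in for reading from disk
      for (const job of this.jobs) delete job.runningAtMs;
      this.log.push("phase1:load+clear");
    });

    // Phase 2: run missed jobs outside the outer lock; it locks internally.
    await this.runMissedJobs();

    // Phase 3: recompute next runs and arm the timer (under lock).
    await this.locked(() => {
      this.log.push("phase3:recompute+arm");
    });
  }

  private async runMissedJobs(): Promise<void> {
    await this.locked(() => {
      this.log.push("phase2:mark");
    });
    this.log.push("phase2:execute"); // no lock is held while jobs execute
    await this.locked(() => {
      this.log.push("phase2:persist");
    });
  }
}
```

Because the timer is armed only in phase 3, after phase 2 has persisted its results, it sees the post-execution state rather than the state loaded in phase 1.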
### 2. Update `runMissedJobs` to manage its own locking:
- Find and mark missed jobs as running under lock
- Execute jobs
- Persist updated state after execution under lock
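
A rough sketch of that mark / execute / persist pattern, under stated assumptions: the `Job` fields other than `nextRunAtMs`, the `locked` and `persist` helpers, and the `execute` callback are hypothetical stand-ins, and the real implementation may record results differently.

```typescript
// Hypothetical sketch: the job shape, store, and helper names are assumptions.
type Job = {
  id: string;
  everyMs: number;
  nextRunAtMs: number;
  runningAtMs?: number;
  lastRunAtMs?: number;
};

let tail: Promise<void> = Promise.resolve();

// Minimal promise-chain mutex (not re-entrant).
function locked<T>(fn: () => Promise<T> | T): Promise<T> {
  const run = tail.then(fn);
  tail = run.then(() => undefined, () => undefined);
  return run;
}

const persisted: string[] = []; // stand-in for writes to the on-disk store
function persist(jobs: Job[]): void {
  persisted.push(JSON.stringify(jobs));
}

async function runMissedJobs(
  jobs: Job[],
  nowMs: number,
  execute: (job: Job) => Promise<void>,
): Promise<string[]> {
  // Step 1 (under lock): find overdue jobs and mark them as running.
  const missed = await locked(() => {
    const due = jobs.filter(
      (j) => j.runningAtMs === undefined && j.nextRunAtMs <= nowMs,
    );
    for (const job of due) job.runningAtMs = nowMs;
    return due;
  });

  // Step 2 (no lock held): execute, so other operations are not blocked.
  for (const job of missed) {
    await execute(job);
  }

  // Step 3 (under lock): record the results and persist them.
  await locked(() => {
    for (const job of missed) {
      delete job.runningAtMs;
      job.lastRunAtMs = nowMs;
      job.nextRunAtMs = nowMs + job.everyMs;
    }
    persist(jobs);
  });

  return missed.map((j) => j.id);
}
```

Marking jobs as running in step 1 is what keeps a concurrent caller from picking up the same job while it executes without the lock.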
### 3. Always revalidate "every" schedules in `recomputeNextRuns`:
- "every" schedules are always recomputed to ensure correct intervals
- Other schedule types preserve future `nextRunAtMs` as before
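
To illustrate the rule, here is a small sketch. The `everyMs` and `anchorMs` fields, the `"cron"` kind, and the `computeCron` callback are assumptions made for the example. Aligning interval jobs to an `anchorMs + k * everyMs` grid is one plausible way to recompute them and may not match the PR's actual arithmetic.

```typescript
// Hypothetical sketch: field names other than schedule.kind and nextRunAtMs are assumed.
type Schedule =
  | { kind: "every"; everyMs: number; anchorMs: number }
  | { kind: "cron"; expr: string };

type Job = { id: string; schedule: Schedule; nextRunAtMs?: number };

// Next tick on the anchorMs + k * everyMs grid that is strictly after nowMs.
function nextEveryRun(anchorMs: number, everyMs: number, nowMs: number): number {
  if (nowMs < anchorMs) return anchorMs;
  const steps = Math.floor((nowMs - anchorMs) / everyMs) + 1;
  return anchorMs + steps * everyMs;
}

function recomputeNextRuns(
  jobs: Job[],
  nowMs: number,
  computeCron: (expr: string, nowMs: number) => number,
): void {
  for (const job of jobs) {
    const schedule = job.schedule;
    if (schedule.kind === "every") {
      // Always recompute, even if the stored value is still in the future.
      job.nextRunAtMs = nextEveryRun(schedule.anchorMs, schedule.everyMs, nowMs);
      continue;
    }
    // Other kinds: keep a nextRunAtMs that is still in the future.
    if (job.nextRunAtMs !== undefined && job.nextRunAtMs > nowMs) continue;
    job.nextRunAtMs = computeCron(schedule.expr, nowMs);
  }
}
```

With `anchorMs = 0`, `everyMs = 5000`, and `nowMs = 11000`, a stored `nextRunAtMs` of 12500 is replaced by 15000, while a cron job whose stored value is still in the future is left untouched.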
## Testing
Tested with a reproduction script that:
1. Creates an "every" schedule job running every few seconds
2. Lets it run normally
3. Stops the service (simulating gateway shutdown)
4. Restarts the service after a delay
5. Verifies jobs resume firing at the correct interval
- **Before fix**: Jobs stop firing after restart
- **After fix**: Jobs correctly resume and maintain proper intervals
## Related Issues
- Resolves the interval drift issue mentioned in #13036 where jobs were firing at incorrect intervals
- Improves overall cron scheduler reliability after gateway restarts
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR fixes a cron scheduler restart bug by restructuring `CronService.start()` into explicit phases (load/cleanup under lock → run missed jobs outside lock → recompute + arm timer under lock), updating `runMissedJobs` to handle its own locking and persistence, and forcing recomputation for interval (`schedule.kind: "every"`) jobs to avoid drift after downtime.
Separately, it introduces a new “guard model” concept for prompt-injection sanitization, adds config/schema/types for it, and includes a basic `sanitizeWithGuardModel` implementation with tests.
Key integration points are the cron service (`src/cron/service/*`) and agent defaults config schema (`src/config/*`).
<h3>Confidence Score: 2/5</h3>
- Not safe to merge as-is due to a definite config schema import/export mismatch that will break validation/builds.
- Cron changes look directionally correct, but the new guard-model config wiring imports `GuardModelConfigSchema` from a module that doesn’t export it in this PR, which will cause a runtime/build error. There are also smaller correctness/maintainability issues (unused imports, inconsistent timestamp usage) that should be cleaned up before merge.
- Files needing attention: `src/config/zod-schema.agent-defaults.ts`, `src/config/zod-schema.providers-core.ts` (or wherever `GuardModelConfigSchema` should live), `src/security/guard-model.ts`
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #9060: Fix: Preserve scheduled cron jobs after gateway restart (by vishaltandale00 · 2026-02-04 · 85.4% similar)
- #7022: fix(cron): prevent schedule drift on gateway restart for 'every' jobs (by marciob · 2026-02-02 · 84.3% similar)
- #11857: fix: recompute stale cron nextRunAtMs on gateway restart (by Yida-Dev · 2026-02-08 · 83.8% similar)
- #10829: fix: prevent cron scheduler permanent death on transient startup/ru... (by meaadore1221-afk · 2026-02-07 · 83.5% similar)
- #16888: fix(cron): execute missed jobs outside the lock to unblock list/sta... (by hou-rong · 2026-02-15 · 83.0% similar)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey · 2026-02-09 · 83.0% similar)
- #12747: fix: catch up missed cron-expression job runs on restart (by obin94-commits · 2026-02-09 · 82.9% similar)
- #8698: fix(cron): default enabled to true for new jobs (by emmick4 · 2026-02-04 · 82.9% similar)
- #14667: fix: preserve missed cron runs when updating job schedule (by WalterSumbon · 2026-02-12 · 82.3% similar)
- #17064: fix(cron): prevent control-plane starvation during startup catch-up... (by donggyu9208 · 2026-02-15 · 82.0% similar)