
#13065: fix(cron): Fix "every" schedule not re-arming after gateway restart

by trevorgordon981 · open · 2026-02-10 02:57
Labels: docs, stale
## Summary

Fixes #13036

## Problem

Cron jobs using `schedule.kind: "every"` (interval-based scheduling) silently stop firing after a gateway restart. The `nextRunAtMs` is set in the job state, but the timer never actually fires. Jobs remain `enabled: true` with a valid `nextRunAtMs` in the past or future, yet no runs occur.

## Root Cause

During gateway restart, the cron service startup sequence had a race condition:

1. `runMissedJobs` was called within a lock that loaded state from disk.
2. Job execution updated `nextRunAtMs` in memory, but the changes were not properly persisted.
3. The timer was armed with stale state, so "every" schedules never fired.

Additionally, `recomputeNextRuns` skipped recomputing "every" schedules whose `nextRunAtMs` was still in the future, even when the timing had drifted during downtime.

## Solution

This PR restructures the cron service startup to handle state persistence correctly.

### 1. Split `start()` into three distinct phases

- **Phase 1**: Load the store and clear stale markers (under lock).
- **Phase 2**: Run missed jobs (outside the lock, with its own locking).
- **Phase 3**: Recompute next runs and arm the timer (under lock).

### 2. Make `runMissedJobs` manage its own locking

- Find and mark missed jobs as running (under lock).
- Execute the jobs.
- Persist the updated state after execution (under lock).

### 3. Always revalidate "every" schedules in `recomputeNextRuns`

- "every" schedules are always recomputed to ensure correct intervals.
- Other schedule types preserve a future `nextRunAtMs`, as before.

## Testing

Tested with a reproduction script that:

1. Creates an "every"-schedule job that runs every few seconds.
2. Lets it run normally.
3. Stops the service (simulating a gateway shutdown).
4. Restarts the service after a delay.
5. Verifies that jobs resume firing at the correct interval.

Before the fix, jobs stop firing after restart. After the fix, jobs resume correctly and maintain proper intervals.

## Related Issues

- Resolves the interval-drift issue mentioned in #13036, where jobs fired at incorrect intervals.
- Improves overall cron scheduler reliability after gateway restarts.

## Greptile Overview

### Greptile Summary

This PR fixes a cron scheduler restart bug by restructuring `CronService.start()` into explicit phases (load/cleanup under lock → run missed jobs outside the lock → recompute + arm timer under lock), updating `runMissedJobs` to handle its own locking and persistence, and forcing recomputation for interval (`schedule.kind: "every"`) jobs to avoid drift after downtime. Separately, it introduces a new "guard model" concept for prompt-injection sanitization, adds config/schema/types for it, and includes a basic `sanitizeWithGuardModel` implementation with tests. Key integration points are the cron service (`src/cron/service/*`) and the agent defaults config schema (`src/config/*`).

### Confidence Score: 2/5

- Not safe to merge as-is due to a definite config schema import/export mismatch that will break validation/builds.
- The cron changes look directionally correct, but the new guard-model config wiring imports `GuardModelConfigSchema` from a module that doesn't export it in this PR, which will cause a runtime/build error. There are also smaller correctness/maintainability issues (unused imports, inconsistent timestamp usage) that should be cleaned up before merge.
- Files flagged: `src/config/zod-schema.agent-defaults.ts`, `src/config/zod-schema.providers-core.ts` (or wherever `GuardModelConfigSchema` should live), `src/security/guard-model.ts`
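The phased startup and the self-locking `runMissedJobs` described above can be sketched as follows. This is a minimal illustrative model, not the actual code from `src/cron/service/*`: the `CronService` shape, `withLock`, the in-memory store, and the stubbed job execution are all assumptions for the sketch.

```typescript
// Hypothetical sketch of the three-phase startup; names are illustrative.
type Schedule = { kind: "every"; everyMs: number };
type CronJob = {
  id: string;
  enabled: boolean;
  schedule: Schedule;
  nextRunAtMs?: number;
  running?: boolean;
};

class CronService {
  private jobs: CronJob[] = [];
  private locked = false;

  constructor(private store: CronJob[]) {}

  // Simplified async mutex: real code would queue waiters instead of throwing.
  private async withLock<T>(fn: () => Promise<T> | T): Promise<T> {
    if (this.locked) throw new Error("lock already held");
    this.locked = true;
    try {
      return await fn();
    } finally {
      this.locked = false;
    }
  }

  async start(now: number): Promise<void> {
    // Phase 1: load the store and clear stale "running" markers, under lock.
    await this.withLock(() => {
      this.jobs = this.store.map((j) => ({ ...j, running: false }));
    });

    // Phase 2: run missed jobs OUTSIDE the startup lock; runMissedJobs
    // takes the lock itself only for the short mark/persist sections.
    await this.runMissedJobs(now);

    // Phase 3: recompute next runs under lock, so the timer would be armed
    // from freshly persisted state rather than stale state.
    await this.withLock(() => {
      this.recomputeNextRuns(now);
      // armTimer() omitted: would setTimeout until the soonest nextRunAtMs.
    });
  }

  private async runMissedJobs(now: number): Promise<void> {
    // Find and mark missed jobs as running, under lock.
    const missed = await this.withLock(() => {
      const due = this.jobs.filter(
        (j) => j.enabled && (j.nextRunAtMs ?? Infinity) <= now,
      );
      for (const j of due) j.running = true;
      return due;
    });

    // Execute outside the lock (stubbed; real code runs each job here).
    for (const j of missed) j.running = false;

    // Persist updated state after execution, under lock.
    await this.withLock(() => {
      this.store = this.jobs.map((j) => ({ ...j }));
    });
  }

  recomputeNextRuns(now: number): void {
    for (const job of this.jobs) {
      if (!job.enabled) continue;
      // "every" schedules are always re-anchored to "now" in this sketch.
      job.nextRunAtMs = now + job.schedule.everyMs;
    }
  }

  getJobs(): CronJob[] {
    return this.jobs;
  }
}
```

The key point the phases encode is lock hygiene: job execution (which is slow and re-entrant) never runs while the startup lock is held, while every state mutation that must survive a crash happens inside a short critical section.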
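The "always revalidate `every` schedules" rule can also be shown in isolation. Again a hedged sketch: the `Job` shape and the `computeNextCronFire` stub are assumptions, not the PR's real types or cron parser.

```typescript
// Illustrative model of the recompute rule: "every" schedules are always
// re-derived from the current time, while other kinds keep a future
// nextRunAtMs untouched.
type Job = {
  id: string;
  enabled: boolean;
  schedule:
    | { kind: "every"; everyMs: number }
    | { kind: "cron"; expr: string };
  nextRunAtMs?: number;
};

function recomputeNextRuns(jobs: Job[], nowMs: number): void {
  for (const job of jobs) {
    if (!job.enabled) continue;
    if (job.schedule.kind === "every") {
      // Always recompute: a nextRunAtMs persisted before a restart may
      // reflect a drifted interval, so re-anchor the schedule to "now".
      job.nextRunAtMs = nowMs + job.schedule.everyMs;
    } else if (job.nextRunAtMs === undefined || job.nextRunAtMs <= nowMs) {
      // Other kinds: recompute only when the stored time is missing or past.
      job.nextRunAtMs = computeNextCronFire(job.schedule.expr, nowMs);
    }
  }
}

// Stub for illustration; a real service would parse the cron expression.
function computeNextCronFire(_expr: string, nowMs: number): number {
  return nowMs + 60_000;
}
```

The asymmetry is deliberate: a future `nextRunAtMs` is trustworthy for absolute schedules like cron expressions, but for relative "every" schedules it silently encodes whatever drift accumulated during downtime.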
