#7022: fix(cron): prevent schedule drift on gateway restart for 'every' jobs

by marciob open 2026-02-02 09:20

## Problem

Jobs with `kind: 'every'` schedules drift forward on gateway restarts:

1. Job created at 23:40 with 4h interval → first run scheduled for 03:40
2. Gateway restarts at 03:29 (11 min before scheduled run)
3. Schedule recalculated using restart time as anchor → next run at 07:29
4. Job never runs at the originally intended time

**Root cause:** `anchorMs` defaults to `nowMs` in `computeNextRunAtMs()` when not provided. On restart, `nowMs` is the restart time, not the creation time.

## Solution

Two fixes:

1. **Persist anchor at creation**: In `createJob()`, set `schedule.anchorMs` to creation time for 'every' schedules when not explicitly provided.
2. **Catch-up missed jobs**: In `recomputeNextRuns()`, detect if a job should have run (based on `lastRunAtMs` or `createdAtMs`) but hasn't, and schedule it immediately.

## Tests

Added `jobs.anchor-fix.test.ts` covering:

- anchorMs auto-set on job creation
- anchorMs preserved when explicitly provided
- Catch-up scheduling for missed jobs
- No false catch-up when job ran recently

All existing cron tests pass (56/56).

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR addresses drift for `kind: "every"` cron jobs across gateway restarts by (1) persisting `schedule.anchorMs` at job creation when not provided and (2) adding a catch-up path in `recomputeNextRuns()` intended to schedule missed intervals immediately.

The anchor persistence in `src/cron/service/jobs.ts` fits well with the existing `computeNextRunAtMs()` semantics (which otherwise default `anchorMs` to `nowMs`). However, the new catch-up logic appears ineffective for the real “restart before a scheduled run” scenario because `computeNextRunAtMs()` always returns a timestamp >= `now`, so comparing `nextRunAtMs` to `now` doesn’t reliably detect missed runs.
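The drift and the ">= `now`" observation can be reproduced with a minimal sketch. `nextRunSketch` below is a hypothetical stand-in written for illustration, assuming the anchor-plus-whole-intervals behavior described above; it is not the repository's `computeNextRunAtMs()`, and the calendar dates are made up to match the 23:40 / 03:29 times from the PR description.

```typescript
// Hypothetical stand-in for the scheduler's next-run computation; the real
// computeNextRunAtMs() may differ in details.
type EverySchedule = { kind: "every"; everyMs: number; anchorMs?: number };

function nextRunSketch(schedule: EverySchedule, nowMs: number): number {
  // With no persisted anchor, "now" becomes the anchor (the reported root cause).
  const anchorMs = schedule.anchorMs ?? nowMs;
  if (nowMs < anchorMs) return anchorMs;
  const elapsedMs = nowMs - anchorMs;
  // Smallest whole number of intervals (at least one) that lands at or after now.
  const steps = Math.max(1, Math.ceil(elapsedMs / schedule.everyMs));
  return anchorMs + steps * schedule.everyMs;
}

const HOUR_MS = 60 * 60 * 1000;
const createdAtMs = Date.UTC(2026, 1, 1, 23, 40); // job created at 23:40 (illustrative date)
const restartAtMs = Date.UTC(2026, 1, 2, 3, 29); // gateway restarts at 03:29

// No anchor stored: the restart time becomes the anchor and the run slides to 07:29.
const driftedNextRunMs = nextRunSketch({ kind: "every", everyMs: 4 * HOUR_MS }, restartAtMs);

// Anchor persisted at creation: the run stays at 03:40.
const anchoredNextRunMs = nextRunSketch(
  { kind: "every", everyMs: 4 * HOUR_MS, anchorMs: createdAtMs },
  restartAtMs,
);

// Either way the result is never in the past, so a `nextRunAtMs < now` check
// on a freshly computed value cannot by itself reveal a missed run.
const neverInPast = driftedNextRunMs >= restartAtMs && anchoredNextRunMs >= restartAtMs;
```

Under these assumptions, persisting the anchor alone restores the 03:40 run for the restart-before-run case, while any missed-run detection has to look at something other than the recomputed next-run time.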
Tests were added to cover anchor behavior and catch-up scheduling, but the catch-up test currently uses an interval-boundary restart time, which doesn’t reflect the drift scenario described in the PR and may pass even without catch-up behavior.

<h3>Confidence Score: 2/5</h3>

- Not safe to merge as-is due to likely non-functional catch-up behavior for missed `every` runs.
- Anchor persistence looks correct and low risk, but the added catch-up condition appears to not trigger for the intended restart-drift case, and the associated test doesn’t exercise the realistic scenario. This could leave the bug partially fixed (or create a false sense of coverage).
- src/cron/service/jobs.ts (catch-up logic), src/cron/service/jobs.anchor-fix.test.ts (catch-up test scenario)
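One possible shape for a catch-up check that does not depend on the recomputed next-run time is sketched below, exercised with restart times that fall off the interval boundary. The `EveryJobSketch` type and `isRunOverdue` helper are hypothetical names introduced for illustration, assuming a job record that carries `createdAtMs` and an optional `lastRunAtMs`; this is not the PR's implementation, and it measures from the last actual run rather than from the anchor grid.

```typescript
// Hypothetical job shape and helper; names are illustrative, not the PR's code.
type EveryJobSketch = {
  everyMs: number;
  createdAtMs: number;
  lastRunAtMs?: number;
};

// A run counts as missed when at least one full interval has elapsed since the
// last run (or since creation, if the job has never run).
function isRunOverdue(job: EveryJobSketch, nowMs: number): boolean {
  const referenceMs = job.lastRunAtMs ?? job.createdAtMs;
  return nowMs - referenceMs >= job.everyMs;
}

const HOUR = 60 * 60 * 1000;
const MIN = 60 * 1000;
const created = Date.UTC(2026, 1, 1, 23, 40); // illustrative date, 23:40 creation
const job: EveryJobSketch = { everyMs: 4 * HOUR, createdAtMs: created };

// Restart at 03:29, before the 03:40 run: nothing was missed.
const beforeDue = isRunOverdue(job, created + 3 * HOUR + 49 * MIN);

// Gateway was down across 03:40 and comes back at 03:50, off the interval boundary.
const afterMissed = isRunOverdue(job, created + 4 * HOUR + 10 * MIN);

// Job ran at 03:40; a restart at 03:50 must not trigger a false catch-up.
const ranRecently = isRunOverdue(
  { ...job, lastRunAtMs: created + 4 * HOUR },
  created + 4 * HOUR + 10 * MIN,
);
```

A test built on the 03:50 restart would fail if the catch-up path were removed, which is the property the interval-boundary scenario does not guarantee.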