#18925: fix(cron): stagger missed jobs on restart to prevent gateway overload
size: M
Cluster: Cron Job Management Fixes
When the gateway restarts with many overdue cron jobs, it now runs them with staggered delays instead of all at once, so the catch-up burst does not overwhelm the gateway.
- Add `missedJobStaggerMs` config (default: 5s between jobs)
- Add `maxMissedJobsPerRestart` limit (default: 5 jobs run immediately)
- Prioritize the most overdue jobs by sorting on `nextRunAtMs`
- Reschedule deferred jobs so they fire gradually via the normal timer
Fixes #18892
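The split described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `MissedJob` shape, the `planMissedJobs` helper, and the `id` field are hypothetical, and only `nextRunAtMs` and `maxMissedJobsPerRestart` come from the description.

```typescript
// Hypothetical job shape; only nextRunAtMs is named in the PR description.
interface MissedJob {
  id: string;
  nextRunAtMs: number;
}

interface StaggerPlan {
  immediate: MissedJob[]; // run right away, one after another
  deferred: MissedJob[]; // rescheduled for the normal timer
}

// Assumed helper: pick overdue jobs, order them most overdue first,
// and cap how many run immediately after a restart.
function planMissedJobs(
  jobs: MissedJob[],
  nowMs: number,
  maxMissedJobsPerRestart: number = 5,
): StaggerPlan {
  const overdue = jobs
    .filter((job) => job.nextRunAtMs <= nowMs)
    // Smallest nextRunAtMs is the most overdue, so sort ascending.
    .sort((a, b) => a.nextRunAtMs - b.nextRunAtMs);

  return {
    immediate: overdue.slice(0, maxMissedJobsPerRestart),
    deferred: overdue.slice(maxMissedJobsPerRestart),
  };
}
```

With the default limit of 5, a restart that finds seven overdue jobs would put the five oldest in `immediate` and the remaining two in `deferred`.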
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR addresses #18892 by staggering the execution of missed cron jobs when the gateway restarts with many overdue jobs. Previously, all missed jobs fired simultaneously, potentially overwhelming the gateway.
- Adds `missedJobStaggerMs` (default 5s) and `maxMissedJobsPerRestart` (default 5) as configurable options on `CronServiceDeps`
- Sorts missed jobs by `nextRunAtMs` (most overdue first), splits into immediate and deferred batches
- Immediate batch runs sequentially with stagger delays between executions; deferred batch gets rescheduled to staggered future `nextRunAtMs` values picked up by the normal timer
- The stagger scheduling is safe with respect to the `locked()` mechanism and `recomputeNextRuns` (which preserves future `nextRunAtMs` values)
- **Issue**: deferred job stagger offsets start at `staggerMs` instead of `maxImmediate * staggerMs`, so the first deferred jobs will be past-due by the time `armTimer` runs (after the immediate batch completes), causing them to fire back-to-back without the intended staggering
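To make the timing issue concrete, here is a small sketch of the offset arithmetic under the default settings. It is a simplified model rather than the code in `runMissedJobs`: the `deferredRunTimes` helper and its `firstSlot` parameter are hypothetical, and the batch duration assumes one `staggerMs` delay between consecutive immediate jobs and ignores the jobs' own execution time.

```typescript
// Hypothetical helper: compute rescheduled nextRunAtMs values for deferred jobs.
// firstSlot is the multiple of staggerMs at which the first deferred job fires.
function deferredRunTimes(
  nowMs: number,
  deferredCount: number,
  staggerMs: number,
  firstSlot: number,
): number[] {
  return Array.from(
    { length: deferredCount },
    (_, i) => nowMs + (firstSlot + i) * staggerMs,
  );
}

const staggerMs = 5000; // default missedJobStaggerMs
const maxImmediate = 5; // default maxMissedJobsPerRestart

// Lower bound on how long the immediate batch takes: one delay between
// each pair of consecutive jobs, ignoring execution time (assumption).
const immediateBatchMinMs = (maxImmediate - 1) * staggerMs;

// Behaviour described in the review: offsets start at 1 * staggerMs.
const reviewedTimes = deferredRunTimes(0, 3, staggerMs, 1);

// Suggested fix: offsets start at maxImmediate * staggerMs.
const suggestedTimes = deferredRunTimes(0, 3, staggerMs, maxImmediate);

// Deferred jobs that are already past-due when the timer is re-armed.
const pastDueReviewed = reviewedTimes.filter((t) => t <= immediateBatchMinMs);
const pastDueSuggested = suggestedTimes.filter((t) => t <= immediateBatchMinMs);

console.log({ immediateBatchMinMs, pastDueReviewed, pastDueSuggested });
```

In this model the immediate batch takes at least 20s, so deferred jobs scheduled at 5s, 10s, and 15s are all past-due when `armTimer` runs, while offsets starting at 25s remain in the future.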
<h3>Confidence Score: 3/5</h3>
- The PR is safe to merge but the deferred job stagger timing has an overlap issue that reduces its effectiveness under default settings.
- The overall approach is sound and well-structured — sorting by overdue priority, splitting into immediate/deferred batches, and leveraging the existing timer mechanism for deferred execution. However, the deferred stagger offset calculation starts too early (at `staggerMs` from `now`), meaning the first several deferred jobs will be past-due by the time `armTimer` runs after the immediate batch completes, causing them to fire back-to-back and partially defeating the stagger purpose. The fix is straightforward (start offset at `maxImmediate * staggerMs`).
- `src/cron/service/timer.ts` — the deferred job stagger offset calculation in `runMissedJobs` needs adjustment to account for immediate batch execution time.
<sub>Last reviewed commit: eabeee0</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #8034: fix(cron): run past-due one-shot jobs immediately on startup (by FelixFoster · 2026-02-03 · 85.0%)
- #12747: fix: catch up missed cron-expression job runs on restart (by obin94-commits · 2026-02-09 · 83.7%)
- #16888: fix(cron): execute missed jobs outside the lock to unblock list/sta... (by hou-rong · 2026-02-15 · 82.9%)
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 82.8%)
- #11108: fix(cron): prevent missed jobs from being skipped on timer recompute (by Bentlybro · 2026-02-07 · 82.3%)
- #19414: fix: respect job timeoutSeconds for stuck runningAtMs detection (by namabile · 2026-02-17 · 82.2%)
- #23290: fix(cron): use lastRunAtMs for next schedule of interval jobs after... (by SidQin-cyber · 2026-02-22 · 82.1%)
- #12443: fix(cron): don't advance past-due jobs that haven't been executed (by rummangeminicode · 2026-02-09 · 82.1%)
- #10918: fix(cron): add tolerance for timer precision and skip due jobs in r... (by Cherwayway · 2026-02-07 · 81.9%)
- #16132: fix(cron): prevent duplicate job fires via MIN_REFIRE_GAP_MS guard (by widingmarcus-cyber · 2026-02-14 · 81.8%)