#18925: fix(cron): stagger missed jobs on restart to prevent gateway overload

by rexlunae open 2026-02-17 05:53

size: M

When the gateway restarts with many overdue cron jobs, they are now executed with staggered delays to prevent overwhelming the gateway.

- Add `missedJobStaggerMs` config (default 5s between jobs)
- Add `maxMissedJobsPerRestart` limit (default 5 jobs immediately)
- Prioritize most overdue jobs by sorting by `nextRunAtMs`
- Reschedule deferred jobs to fire gradually via normal timer

Fixes #18892

## Summary

Describe the problem and fix in 2–5 bullets:

- Problem:
- Why it matters:
- What changed:
- What did NOT change (scope boundary):

## Change Type (select all)

- [ ] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [ ] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #
- Related #

## User-visible / Behavior Changes

List user-visible changes (including defaults/config). If none, write `None`.

## Security Impact (required)

- New permissions/capabilities? (`Yes/No`)
- Secrets/tokens handling changed? (`Yes/No`)
- New/changed network calls? (`Yes/No`)
- Command/tool execution surface changed? (`Yes/No`)
- Data access scope changed? (`Yes/No`)
- If any `Yes`, explain risk + mitigation:

## Repro + Verification

### Environment

- OS:
- Runtime/container:
- Model/provider:
- Integration/channel (if any):
- Relevant config (redacted):

### Steps

1.
2.
3.

### Expected

-

### Actual

-

## Evidence

Attach at least one:

- [ ] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)

## Human Verification (required)

What you personally verified (not just CI), and how:

- Verified scenarios:
- Edge cases checked:
- What you did **not** verify:

## Compatibility / Migration

- Backward compatible? (`Yes/No`)
- Config/env changes? (`Yes/No`)
- Migration needed?
(`Yes/No`)
- If yes, exact upgrade steps:

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly:
- Files/config to restore:
- Known bad symptoms reviewers should watch for:

## Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write `None`.

- Risk:
- Mitigation:

<h3>Greptile Summary</h3>

This PR addresses #18892 by staggering the execution of missed cron jobs when the gateway restarts with many overdue jobs. Previously, all missed jobs fired simultaneously, potentially overwhelming the gateway.

- Adds `missedJobStaggerMs` (default 5s) and `maxMissedJobsPerRestart` (default 5) as configurable options on `CronServiceDeps`
- Sorts missed jobs by `nextRunAtMs` (most overdue first), splits into immediate and deferred batches
- Immediate batch runs sequentially with stagger delays between executions; deferred batch gets rescheduled to staggered future `nextRunAtMs` values picked up by the normal timer
- The stagger scheduling is safe with respect to the `locked()` mechanism and `recomputeNextRuns` (which preserves future `nextRunAtMs` values)
- **Issue**: deferred job stagger offsets start at `staggerMs` instead of `maxImmediate * staggerMs`, so the first deferred jobs will be past-due by the time `armTimer` runs (after the immediate batch completes), causing them to fire back-to-back without the intended staggering

<h3>Confidence Score: 3/5</h3>

- The PR is safe to merge but the deferred job stagger timing has an overlap issue that reduces its effectiveness under default settings.
- The overall approach is sound and well-structured: sorting by overdue priority, splitting into immediate/deferred batches, and leveraging the existing timer mechanism for deferred execution.
However, the deferred stagger offset calculation starts too early (at `staggerMs` from `now`), meaning the first several deferred jobs will be past-due by the time `armTimer` runs after the immediate batch completes, causing them to fire back-to-back and partially defeating the stagger purpose. The fix is straightforward (start offset at `maxImmediate * staggerMs`).
- `src/cron/service/timer.ts`: the deferred job stagger offset calculation in `runMissedJobs` needs adjustment to account for immediate batch execution time.

<sub>Last reviewed commit: eabeee0</sub>
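
To make the offset issue concrete, here is a minimal sketch of the split-and-reschedule step with the suggested starting offset applied. It is not the code from `src/cron/service/timer.ts`: the `MissedJob` and `StaggerPlan` shapes and the `planMissedJobs` helper are hypothetical names invented for illustration, and only the two defaults and the `maxImmediate * staggerMs` offset come from the review above.

```typescript
// Hypothetical, simplified job shape; the real cron job type in the PR is not shown here.
interface MissedJob {
  id: string;
  nextRunAtMs: number;
}

// Hypothetical result shape for this sketch.
interface StaggerPlan {
  immediate: MissedJob[]; // run now, one after another, with staggerMs between them
  deferred: MissedJob[]; // rescheduled into the future for the normal timer
}

// Sketch of the planning step, assuming the reviewer's suggested offset.
function planMissedJobs(
  missed: MissedJob[],
  now: number,
  staggerMs: number = 5000, // mirrors the missedJobStaggerMs default
  maxImmediate: number = 5, // mirrors the maxMissedJobsPerRestart default
): StaggerPlan {
  // Most overdue first: the smallest nextRunAtMs is the oldest missed run.
  const sorted = [...missed].sort((a, b) => a.nextRunAtMs - b.nextRunAtMs);

  const immediate = sorted.slice(0, maxImmediate);

  // Start deferred offsets after the window reserved for the immediate batch
  // (maxImmediate * staggerMs) instead of at staggerMs from `now`, so the
  // first deferred jobs are not already past-due when the timer is re-armed.
  const deferred = sorted.slice(maxImmediate).map((job, i) => ({
    ...job,
    nextRunAtMs: now + (maxImmediate + i) * staggerMs,
  }));

  return { immediate, deferred };
}
```

With the defaults and eight overdue jobs, the five most overdue run immediately and the remaining three are rescheduled to 25s, 30s, and 35s after `now`. This sketch only reserves the stagger delays of the immediate batch; it does not account for how long each immediate job takes to execute, so a long-running batch could still leave some deferred jobs past-due.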