#18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120)

by BinHPdev open 2026-02-16 15:56

channel: mattermost size: M

## Summary

- Problem: When a cron job times out or the process crashes mid-execution, `runningAtMs` persists in the job state. Since nothing clears this marker, the job is permanently blocked from future runs: `isRunnableJob()` always returns false.
- Why it matters: A single timeout or crash can permanently disable a cron job until manual intervention. Users may not notice for hours or days, causing missed heartbeats and scheduled tasks.
- What changed: Added stale marker detection in `isRunnableJob()`. When `runningAtMs` has been set for longer than 2× the job's timeout threshold, it is automatically cleared with a warning log. This allows the job to resume scheduling. Also updated `collectRunnableJobs()` to pass the service state for logging.
- What did NOT change (scope boundary): No changes to job execution, timeout handling, or the normal `runningAtMs` lifecycle. The auto-clear only triggers for markers that exceed 2× the configured timeout, which is a conservative threshold that avoids interfering with legitimately running jobs.

## Change Type (select all)

- [x] Bug fix

## Scope (select all touched areas)

- [x] Gateway / orchestration

## Linked Issue/PR

- Closes #18120

## User-visible / Behavior Changes

Cron jobs that were stuck due to stale `runningAtMs` markers (from timeouts or crashes) will now automatically resume after 2× the job's timeout period. A warning is logged when this occurs.

## Security Impact (required)

- New permissions/capabilities? No
- Secrets/tokens handling changed? No
- New/changed network calls? No
- Command/tool execution surface changed? No
- Data access scope changed? No

## Repro + Verification

### Environment

- OS: macOS (arm64)
- Runtime: Node.js

### Steps

1. Configure a cron job with a timeout (e.g., 60s)
2. Trigger the job and simulate a timeout/crash (kill the process mid-execution)
3. Wait for 2× the timeout period
4. Observe that the job resumes scheduling automatically

### Expected

- Job resumes after the stale threshold with a warning log

### Actual (before fix)

- Job is permanently blocked; never runs again until manual state reset

## Evidence

- [x] Failing test/log before + passing after
- All 125 cron tests pass: `pnpm vitest run src/cron/`

## Human Verification (required)

- Verified scenarios: All 125 cron test cases pass. Verified stale detection threshold calculation for both `agentTurn` jobs (with custom `timeoutSeconds`) and default jobs (using `DEFAULT_JOB_TIMEOUT_MS`).
- Edge cases checked: Jobs with `runningAtMs` that are still within the threshold are correctly skipped (not runnable). The 2× multiplier provides a safety margin so legitimately running jobs aren't interrupted.
- What you did **not** verify: Live gateway with an actual cron job timeout/crash scenario.

## Compatibility / Migration

- Backward compatible? Yes
- Config/env changes? No
- Migration needed? No

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly: Remove the stale marker detection block from `isRunnableJob()` in `src/cron/service/timer.ts`
- Known bad symptoms: If the threshold is too aggressive, a legitimately slow-running job could have its marker cleared and be scheduled for a duplicate run

## Risks and Mitigations

- Risk: A job running longer than 2× its timeout could be incorrectly treated as stale.
  - Mitigation: The 2× multiplier is conservative. Jobs that run 2× over their timeout are almost certainly stuck. The default timeout is already generous (5 minutes), so the stale threshold is 10+ minutes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<h3>Greptile Summary</h3>

This PR adds auto-recovery for stale `runningAtMs` markers in cron jobs. When a job times out or crashes mid-execution, its `runningAtMs` marker persists, permanently blocking future runs. The fix detects markers exceeding 2× the job's timeout and clears them with a warning log.

The implementation adds stale marker detection in `isRunnableJob()` (`src/cron/service/timer.ts:363-381`). When `runningAtMs` duration exceeds the 2× timeout threshold, it's cleared and the job continues normal runnability checks. The change is conservative: the 2× multiplier provides safety margin so legitimately running jobs aren't interrupted.

The fix follows existing patterns: startup already clears stale markers in `ops.ts:40-46`, and `normalizeJobTickState()` has a 2-hour fallback at `jobs.ts:159`. This extends that recovery to runtime checks.

<h3>Confidence Score: 3/5</h3>

- This PR is relatively safe but has a persistence issue that could cause repeated warning logs
- The core logic is sound and well-tested (125 passing tests), but there's a persistence edge case identified in the previous thread that remains unresolved. The stale marker clearing may not persist when a job isn't due, causing the warning to repeat every timer tick (≤60s) until the job becomes due or the 2-hour `STUCK_RUN_MS` threshold in `normalizeJobTickState` kicks in. This doesn't break functionality but creates log noise.
- `src/cron/service/timer.ts` needs attention for the persistence issue with stale marker clearing

<sub>Last reviewed commit: 87c3ce0</sub>
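
For readers without the diff open, the stale-marker check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/cron/service/timer.ts`: the `CronJob` and `JobState` shapes, the `staleThresholdMs` helper, the `STALE_MULTIPLIER` constant, the `warn` parameter, and the `nextRunAtMs` field are all hypothetical, and the 5-minute value for `DEFAULT_JOB_TIMEOUT_MS` is taken from the PR description rather than from the source.

```typescript
// Hypothetical shapes for illustration; the real types in src/cron/ may differ.
interface JobState {
  runningAtMs?: number; // set when a run starts, cleared when it finishes
  nextRunAtMs?: number; // assumed field: when the job is next due
}

interface CronJob {
  id: string;
  enabled: boolean;
  timeoutSeconds?: number; // assumed per-job timeout override
  state: JobState;
}

// Assumed value: the PR description mentions a 5-minute default timeout.
const DEFAULT_JOB_TIMEOUT_MS = 5 * 60 * 1000;
// The PR treats a marker as stale once it is older than 2x the timeout.
const STALE_MULTIPLIER = 2;

// Stale threshold = 2x the job's own timeout, or 2x the default if none is set.
function staleThresholdMs(job: CronJob): number {
  const timeoutMs =
    job.timeoutSeconds !== undefined && job.timeoutSeconds > 0
      ? job.timeoutSeconds * 1000
      : DEFAULT_JOB_TIMEOUT_MS;
  return timeoutMs * STALE_MULTIPLIER;
}

function isRunnableJob(
  job: CronJob,
  nowMs: number,
  warn: (msg: string) => void = console.warn,
): boolean {
  if (!job.enabled) return false;

  const runningAtMs = job.state.runningAtMs;
  if (typeof runningAtMs === "number") {
    const elapsedMs = nowMs - runningAtMs;
    if (elapsedMs <= staleThresholdMs(job)) {
      // Still inside the window: assume the job is legitimately running.
      return false;
    }
    // Marker outlived 2x the timeout: clear it (in memory only) and log.
    warn(`cron: clearing stale runningAtMs for job ${job.id} after ${elapsedMs}ms`);
    delete job.state.runningAtMs;
  }

  // Normal runnability check: the job must have a due time that has passed.
  return typeof job.state.nextRunAtMs === "number" && nowMs >= job.state.nextRunAtMs;
}
```

With `timeoutSeconds: 60` and a marker set at time 0, a check at 90 s leaves the job blocked, while a check after 120 s clears the marker and lets the usual due-time check decide. Note that this sketch only mutates in-memory state. If the caller does not write the cleared state back to storage when the job is not yet due, the warning would fire again on the next tick, which is the log-noise concern raised in the review.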