#18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120)
channel: mattermost
size: M
Cluster: Cron Job Stability Fixes
## Summary
- Problem: When a cron job times out or the process crashes mid-execution, `runningAtMs` persists in the job state. Since nothing clears this marker, the job is permanently blocked from future runs — `isRunnableJob()` always returns false.
- Why it matters: A single timeout or crash can permanently disable a cron job until manual intervention. Users may not notice for hours or days, causing missed heartbeats and scheduled tasks.
- What changed: Added stale marker detection in `isRunnableJob()`. When `runningAtMs` has been set for longer than 2× the job's timeout, it is automatically cleared with a warning log, which allows the job to resume scheduling. Also updated `collectRunnableJobs()` to pass the service state so the warning can be logged.
- What did NOT change (scope boundary): No changes to job execution, timeout handling, or the normal `runningAtMs` lifecycle. The auto-clear only triggers for markers that exceed 2× the configured timeout, which is a conservative threshold that avoids interfering with legitimately running jobs.
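The stale-marker check described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/cron/service/timer.ts`: the `CronJobSketch` shape, the pre-resolved `timeoutMs` field, the `WarnLogger` interface, and the `nextRunAtMs` due check are all assumptions made for the example.

```typescript
// Illustrative sketch only. All names below are hypothetical and do not
// mirror the real types in src/cron/service/timer.ts.
interface WarnLogger {
  warn(message: string): void;
}

interface CronJobSketch {
  id: string;
  enabled: boolean;
  timeoutMs: number; // assumed to be resolved per job before this check
  state: {
    runningAtMs?: number;
    nextRunAtMs?: number;
  };
}

// A marker is treated as stale once it is older than 2x the job's timeout.
const STALE_MULTIPLIER = 2;

function isRunnableJobSketch(
  job: CronJobSketch,
  nowMs: number,
  log: WarnLogger,
): boolean {
  if (!job.enabled) return false;

  const startedAtMs = job.state.runningAtMs;
  if (typeof startedAtMs === "number") {
    const staleAfterMs = STALE_MULTIPLIER * job.timeoutMs;
    if (nowMs - startedAtMs <= staleAfterMs) {
      // Still within the threshold: treat the job as legitimately running.
      return false;
    }
    // Marker is older than the threshold: clear it and fall through to the
    // normal runnability check.
    log.warn(`cron: clearing stale runningAtMs for job ${job.id}`);
    delete job.state.runningAtMs;
  }

  const nextRunAtMs = job.state.nextRunAtMs;
  return typeof nextRunAtMs === "number" && nowMs >= nextRunAtMs;
}
```

With a 60s timeout and a marker set at `t = 0`, this sketch keeps the job blocked at `t = 120s` and clears the marker on the first check after that point.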
## Change Type (select all)
- [x] Bug fix
## Scope (select all touched areas)
- [x] Gateway / orchestration
## Linked Issue/PR
- Closes #18120
## User-visible / Behavior Changes
Cron jobs that were stuck due to stale `runningAtMs` markers (from timeouts or crashes) will now automatically resume after 2× the job's timeout period. A warning is logged when this occurs.
## Security Impact (required)
- New permissions/capabilities? No
- Secrets/tokens handling changed? No
- New/changed network calls? No
- Command/tool execution surface changed? No
- Data access scope changed? No
## Repro + Verification
### Environment
- OS: macOS (arm64)
- Runtime: Node.js
### Steps
1. Configure a cron job with a timeout (e.g., 60s)
2. Trigger the job and simulate a timeout/crash (kill the process mid-execution)
3. Wait for more than 2× the timeout period (e.g., over 120s for a 60s timeout)
4. Observe that the job resumes scheduling automatically
### Expected
- Job resumes after the stale threshold with a warning log
### Actual (before fix)
- Job is permanently blocked; never runs again until manual state reset
## Evidence
- [x] Failing test/log before + passing after
- All 125 cron tests pass: `pnpm vitest run src/cron/`
## Human Verification (required)
- Verified scenarios: All 125 cron test cases pass. Verified stale detection threshold calculation for both `agentTurn` jobs (with custom `timeoutSeconds`) and default jobs (using `DEFAULT_JOB_TIMEOUT_MS`).
- Edge cases checked: Jobs with `runningAtMs` that are still within the threshold are correctly skipped (not runnable). The 2× multiplier provides a safety margin so legitimately running jobs aren't interrupted.
- What you did **not** verify: Live gateway with an actual cron job timeout/crash scenario.
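The threshold calculation verified above could look something like the sketch below. It assumes that `agentTurn` payloads carry an optional `timeoutSeconds` and that every other job falls back to `DEFAULT_JOB_TIMEOUT_MS`; the payload type, the `"other"` kind, and the `staleThresholdMs` helper are hypothetical, and the 5-minute constant is taken from the default quoted in this description rather than from the source file.

```typescript
// Illustrative sketch only. The constant's value is assumed from the
// "5 minutes" default mentioned in this PR description.
const DEFAULT_JOB_TIMEOUT_MS = 5 * 60 * 1000;

// Hypothetical payload shape; "other" stands in for any non-agentTurn job.
type PayloadSketch =
  | { kind: "agentTurn"; timeoutSeconds?: number }
  | { kind: "other" };

function staleThresholdMs(payload: PayloadSketch): number {
  const timeoutMs =
    payload.kind === "agentTurn" && typeof payload.timeoutSeconds === "number"
      ? payload.timeoutSeconds * 1000 // custom per-job timeout
      : DEFAULT_JOB_TIMEOUT_MS; // fallback for everything else
  return 2 * timeoutMs; // marker is stale only after 2x the timeout
}
```

Under these assumptions, an `agentTurn` job with `timeoutSeconds: 60` gets a 120,000 ms threshold, while a default job gets 600,000 ms.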
## Compatibility / Migration
- Backward compatible? Yes
- Config/env changes? No
- Migration needed? No
## Failure Recovery (if this breaks)
- How to disable/revert this change quickly: Remove the stale marker detection block from `isRunnableJob()` in `src/cron/service/timer.ts`
- Known bad symptoms: If the threshold is too aggressive, a legitimately slow-running job could have its marker cleared and be scheduled for a duplicate run
## Risks and Mitigations
- Risk: A job running longer than 2× its timeout could be incorrectly treated as stale.
- Mitigation: The 2× multiplier is conservative. Jobs that run for more than 2× their timeout are almost certainly stuck. The default timeout is already generous (5 minutes), so the stale threshold for default jobs is 10 minutes.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds auto-recovery for stale `runningAtMs` markers in cron jobs. When a job times out or crashes mid-execution, its `runningAtMs` marker persists, permanently blocking future runs. The fix detects markers exceeding 2× the job's timeout and clears them with a warning log.
The implementation adds stale marker detection in `isRunnableJob()` (`src/cron/service/timer.ts:363-381`). When `runningAtMs` duration exceeds the 2× timeout threshold, it's cleared and the job continues normal runnability checks. The change is conservative - the 2× multiplier provides safety margin so legitimately running jobs aren't interrupted.
The fix follows existing patterns: startup already clears stale markers in `ops.ts:40-46`, and `normalizeJobTickState()` has a 2-hour fallback at `jobs.ts:159`. This extends that recovery to runtime checks.
<h3>Confidence Score: 3/5</h3>
- This PR is relatively safe but has a persistence issue that could cause repeated warning logs
- The core logic is sound and well-tested (125 passing tests), but there's a persistence edge case identified in the previous thread that remains unresolved. The stale marker clearing may not persist when a job isn't due, causing the warning to repeat every timer tick (≤60s) until the job becomes due or the 2-hour `STUCK_RUN_MS` threshold in `normalizeJobTickState` kicks in. This doesn't break functionality but creates log noise.
- `src/cron/service/timer.ts` needs attention for the persistence issue with stale marker clearing
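The persistence edge case can be illustrated with a small model. This is a hypothetical simulation, not the real store or timer code: it assumes each tick works on a copy reloaded from persisted state and that state is only written back when the job is due, which is one reading of the behaviour described above. The `persistOnClear` flag shows what changes if the clear is saved immediately.

```typescript
// Hypothetical model of the edge case; names and persistence behaviour
// are assumptions made for illustration.
interface PersistedJobSketch {
  runningAtMs?: number;
  nextRunAtMs: number;
  timeoutMs: number;
}

// Returns how many stale-marker warnings would be emitted across the ticks.
function countStaleWarnings(
  initial: PersistedJobSketch,
  tickTimesMs: number[],
  persistOnClear: boolean,
): number {
  let persisted: PersistedJobSketch = { ...initial };
  let warnings = 0;

  for (const nowMs of tickTimesMs) {
    // Each tick starts from the persisted state (assumption).
    const job: PersistedJobSketch = { ...persisted };

    if (job.runningAtMs !== undefined && nowMs - job.runningAtMs > 2 * job.timeoutMs) {
      warnings += 1; // stands in for the warning log
      delete job.runningAtMs;
      if (persistOnClear) persisted = { ...job };
    }

    // State is otherwise only saved when the job is due and gets picked up.
    const due = job.runningAtMs === undefined && nowMs >= job.nextRunAtMs;
    if (due) persisted = { ...job };
  }

  return warnings;
}
```

In this model, a job that is not due for another hour logs one warning per tick when the clear is not persisted, and a single warning when it is.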
<sub>Last reviewed commit: 87c3ce0</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #17561: fix(cron): add runtime staleness guard for runningAtMs (#17554) (by robbyczgw-cla · 2026-02-15 · 90.8%)
- #17895: fix(cron): add staleness check for runningAtMs on manual trigger (by PlayerGhost · 2026-02-16 · 89.4%)
- #17664: fix(cron): detect and clear stale runningAtMs marker in manual run ... (by echoVic · 2026-02-16 · 88.1%)
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 88.1%)
- #17643: fix: clear stale runningAtMs in cron.run to allow manual triggers (by MisterGuy420 · 2026-02-16 · 87.3%)
- #12018: fix(cron): clear stale running markers based on job timeout (by benzer25 · 2026-02-08 · 86.6%)
- #19414: fix: respect job timeoutSeconds for stuck runningAtMs detection (by namabile · 2026-02-17 · 86.0%)
- #5179: fix(cron): recover stale running markers (by thatdaveb · 2026-01-31 · 85.9%)
- #17949: fix: clear stale runningAtMs in cron.run() before already-running c... (by yasumorishima · 2026-02-16 · 85.0%)
- #17064: fix(cron): prevent control-plane starvation during startup catch-up... (by donggyu9208 · 2026-02-15 · 81.4%)