#17895: fix(cron): add staleness check for runningAtMs on manual trigger
stale
size: XS
Cluster: Cron Job Stability Fixes
## Summary
Add staleness check for `runningAtMs` in cron `ops.ts:run()` to prevent jobs from being permanently blocked if a previous execution crashed without clearing the running marker.
**Issue:** #17554
## Change Type
- [x] Bug fix (non-breaking change which fixes an issue)
## Scope
- **Files changed:** `src/cron/service/ops.ts`
- **Risk:** Low — adds safety check before existing logic
## Security Impact
- No security implications — internal state management only
## Steps to Reproduce
1. A cron job starts executing (`runningAtMs` is set)
2. The process crashes mid-execution (marker not cleared)
3. User triggers `cron run <jobId>` manually
4. **Before fix:** Job refuses to run ("already running") indefinitely
5. **After fix:** Stale marker (>10min) is cleared, job executes normally
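
The flow in the steps above can be illustrated with a minimal sketch. This is not the code from `src/cron/service/ops.ts`: the `CronJob` and `JobState` shapes, the `decideManualRun` helper, and the log message are hypothetical, and the 10-minute constant is assumed from the value cited in this description.

```typescript
// Hypothetical, simplified shapes; the real types in src/cron/service/ops.ts may differ.
interface JobState {
  runningAtMs?: number;
}

interface CronJob {
  id: string;
  state: JobState;
}

type RunDecision = "run" | "already-running";

// Assumed value: this description cites a 10-minute DEFAULT_JOB_TIMEOUT_MS.
const DEFAULT_JOB_TIMEOUT_MS = 10 * 60 * 1000;

// Decide whether a manual trigger may proceed (hypothetical helper).
function decideManualRun(
  job: CronJob,
  nowMs: number,
  staleAfterMs: number = DEFAULT_JOB_TIMEOUT_MS,
  warn: (msg: string) => void = console.warn,
): RunDecision {
  const startedAt = job.state.runningAtMs;
  if (typeof startedAt === "number") {
    const ageMs = nowMs - startedAt;
    if (ageMs <= staleAfterMs) {
      // Marker is recent: keep the existing "already running" behavior.
      return "already-running";
    }
    // Marker is older than the threshold: log, clear it, and allow the run.
    warn(`cron: clearing stale runningAtMs for job ${job.id} (age ${ageMs}ms)`);
    delete job.state.runningAtMs;
  }
  return "run";
}
```

With this sketch, a marker set one minute ago still yields `"already-running"`, while a marker set eleven minutes ago is cleared and the manual run proceeds.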
## Evidence
- `pnpm build` ✅
- `pnpm check` (lint+format) ✅
- Targeted tests: 118 tests passed (26 files, cron service) on Windows host ✅
- Full baseline: 542 tests passed ✅
## Human Verification
- [x] Code reviewed by human contributor
## Compatibility
- Backward compatible — only adds staleness detection
- Uses the same `DEFAULT_JOB_TIMEOUT_MS` (10min) as `timer.ts`
## Failure & Recovery
- If the staleness check is wrong (the job is actually still running), the worst case is a duplicate execution
- 10min threshold matches existing timeout, minimizing false positives
## Risks
- Minimal — conservative threshold prevents premature clearing
---
🤖 AI-assisted (Opus 4.6 implementation, reviewed by Opus 4.6 + GPT 5.3 Codex with thinking high)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds staleness detection for `runningAtMs` markers in manual cron job triggers to prevent jobs from being permanently blocked after crashes.
- Uses 2× job timeout (or 2× 10min default) as the staleness threshold, which is tighter than the 2-hour `STUCK_RUN_MS` used by automatic tick-based cleanup in `jobs.ts:118`
- Correctly extracts `timeoutSeconds` from `agentTurn` payloads and falls back to `DEFAULT_JOB_TIMEOUT_MS`
- Logs a warning before clearing stale markers, preserving observability
- Maintains existing behavior for non-stale running jobs (returns `"already-running"`)
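
The threshold rule described in the bullets above could look roughly like the sketch below. It is an illustration only: the `JobPayload` union, the `"other"` placeholder kind, the `kind` discriminator, the `STALE_MULTIPLIER` constant, and the `resolveStaleThresholdMs` helper are hypothetical names. Only `agentTurn`, `timeoutSeconds`, and `DEFAULT_JOB_TIMEOUT_MS` come from the review text.

```typescript
// Hypothetical payload shapes; only `agentTurn` and `timeoutSeconds` are named in the review.
type JobPayload =
  | { kind: "agentTurn"; timeoutSeconds?: number }
  | { kind: "other" }; // placeholder for any payload without a timeout

// Assumed value: the review cites a 10-minute default.
const DEFAULT_JOB_TIMEOUT_MS = 10 * 60 * 1000;
// The review describes the threshold as 2x the job timeout.
const STALE_MULTIPLIER = 2;

// Resolve the staleness threshold for a job payload (hypothetical helper).
function resolveStaleThresholdMs(payload: JobPayload): number {
  const timeoutMs =
    payload.kind === "agentTurn" &&
    typeof payload.timeoutSeconds === "number" &&
    payload.timeoutSeconds > 0
      ? payload.timeoutSeconds * 1000 // job-specific timeout
      : DEFAULT_JOB_TIMEOUT_MS; // fallback when none is set
  return STALE_MULTIPLIER * timeoutMs;
}
```

Under these assumptions, a job with no explicit timeout gets a 20-minute threshold, well below the 2-hour `STUCK_RUN_MS` mentioned above, and a job with `timeoutSeconds: 300` gets a 10-minute threshold.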
<h3>Confidence Score: 5/5</h3>
- Safe to merge - conservative threshold prevents false positives, worst-case duplicate execution is acceptable
- The change adds defensive recovery logic with a conservative threshold (2× timeout) that aligns with existing patterns in the codebase. The implementation correctly handles job-specific timeouts and maintains backward compatibility.
- No files require special attention
<sub>Last reviewed commit: 710a4af</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #17664: fix(cron): detect and clear stale runningAtMs marker in manual run ... (by echoVic · 2026-02-16 · 92.6%)
- #17643: fix: clear stale runningAtMs in cron.run to allow manual triggers (by MisterGuy420 · 2026-02-16 · 91.8%)
- #17561: fix(cron): add runtime staleness guard for runningAtMs (#17554) (by robbyczgw-cla · 2026-02-15 · 90.6%)
- #18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120) (by BinHPdev · 2026-02-16 · 89.4%)
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 87.9%)
- #12018: fix(cron): clear stale running markers based on job timeout (by benzer25 · 2026-02-08 · 85.6%)
- #17949: fix: clear stale runningAtMs in cron.run() before already-running c... (by yasumorishima · 2026-02-16 · 84.6%)
- #19414: fix: respect job timeoutSeconds for stuck runningAtMs detection (by namabile · 2026-02-17 · 84.5%)
- #5179: fix(cron): recover stale running markers (by thatdaveb · 2026-01-31 · 82.6%)
- #17064: fix(cron): prevent control-plane starvation during startup catch-up... (by donggyu9208 · 2026-02-15 · 81.2%)