#17895: fix(cron): add staleness check for runningAtMs on manual trigger

by PlayerGhost open 2026-02-16 08:40

stale size: XS

## Summary

Add staleness check for `runningAtMs` in cron `ops.ts:run()` to prevent jobs from being permanently blocked if a previous execution crashed without clearing the running marker.

**Issue:** #17554

## Change Type

- [x] Bug fix (non-breaking change which fixes an issue)

## Scope

- **Files changed:** `src/cron/service/ops.ts`
- **Risk:** Low, adds safety check before existing logic

## Security Impact

- No security implications: internal state management only

## Steps to Reproduce

1. A cron job starts executing (`runningAtMs` is set)
2. The process crashes mid-execution (marker not cleared)
3. User triggers `cron run <jobId>` manually
4. **Before fix:** Job refuses to run ("already running") indefinitely
5. **After fix:** Stale marker (>10min) is cleared, job executes normally

## Evidence

- `pnpm build` ✅
- `pnpm check` (lint+format) ✅
- Targeted tests: 118 tests passed (26 files, cron service) on Windows host ✅
- Full baseline: 542 tests passed ✅

## Human Verification

- [x] Code reviewed by human contributor

## Compatibility

- Backward compatible: only adds staleness detection
- Uses same `DEFAULT_JOB_TIMEOUT_MS` (10min) as `timer.ts`

## Failure & Recovery

- If staleness check is wrong (job actually running), worst case is a duplicate execution
- 10min threshold matches existing timeout, minimizing false positives

## Risks

- Minimal: conservative threshold prevents premature clearing

---

🤖 AI-assisted (Opus 4.6 implementation, reviewed by Opus 4.6 + GPT 5.3 Codex with thinking high)

<h3>Greptile Summary</h3>

Adds staleness detection for `runningAtMs` markers in manual cron job triggers to prevent jobs from being permanently blocked after crashes.
- Uses 2× job timeout (or 2× 10min default) as the staleness threshold, which is tighter than the 2-hour `STUCK_RUN_MS` used by automatic tick-based cleanup in `jobs.ts:118`
- Correctly extracts `timeoutSeconds` from `agentTurn` payloads and falls back to `DEFAULT_JOB_TIMEOUT_MS`
- Logs a warning before clearing stale markers, preserving observability
- Maintains existing behavior for non-stale running jobs (returns `"already-running"`)

<h3>Confidence Score: 5/5</h3>

- Safe to merge - conservative threshold prevents false positives, worst-case duplicate execution is acceptable
- The change adds defensive recovery logic with a conservative threshold (2× timeout) that aligns with existing patterns in the codebase. The implementation correctly handles job-specific timeouts and maintains backward compatibility.
- No files require special attention

<sub>Last reviewed commit: 710a4af</sub>
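
For readers without the diff at hand, the check described in the review summary can be sketched roughly as follows. This is an illustration only, not the code in `src/cron/service/ops.ts`: the `CronJob` shape and the helper names `stalenessThresholdMs` and `gateManualRun` are hypothetical, and the 2× multiplier follows the review summary's description. Only `runningAtMs`, `timeoutSeconds`, `agentTurn`, `DEFAULT_JOB_TIMEOUT_MS`, and `"already-running"` come from the PR text.

```typescript
// Sketch under stated assumptions; types and helper names are hypothetical.
const DEFAULT_JOB_TIMEOUT_MS = 10 * 60 * 1000; // 10 min, as stated in the PR

interface CronJob {
  id: string;
  payload: { kind: string; timeoutSeconds?: number };
  state: { runningAtMs?: number };
}

type RunGate = "ok" | "already-running";

// Threshold = 2x the job's own timeout (agentTurn payloads only),
// otherwise 2x the default timeout.
function stalenessThresholdMs(job: CronJob): number {
  const seconds = job.payload.timeoutSeconds;
  const hasOwnTimeout =
    job.payload.kind === "agentTurn" &&
    typeof seconds === "number" &&
    seconds > 0;
  const timeoutMs = hasOwnTimeout ? (seconds as number) * 1000 : DEFAULT_JOB_TIMEOUT_MS;
  return 2 * timeoutMs;
}

// Decide whether a manual trigger may proceed. A stale marker is logged
// and cleared; a fresh marker still blocks the run.
function gateManualRun(
  job: CronJob,
  nowMs: number,
  warn: (msg: string) => void = console.warn,
): RunGate {
  const startedAt = job.state.runningAtMs;
  if (typeof startedAt !== "number") {
    return "ok"; // no marker, nothing to check
  }
  const ageMs = nowMs - startedAt;
  if (ageMs <= stalenessThresholdMs(job)) {
    return "already-running"; // existing behavior for non-stale markers
  }
  warn(`cron: clearing stale runningAtMs for job ${job.id} (age ${ageMs}ms)`);
  delete job.state.runningAtMs;
  return "ok";
}
```

In this sketch the caller passes `nowMs` explicitly (for example `Date.now()`), which keeps the check deterministic in tests. As the PR notes, if the marker is cleared while the job is in fact still running, the worst case is a duplicate execution.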