#19414: fix: respect job timeoutSeconds for stuck runningAtMs detection
size: M
Cluster: Cron Job Stability Fixes
## Summary
Fixes #18120
The scheduler's stuck-run detection previously used a blanket 2-hour threshold (`STUCK_RUN_MS`) regardless of the job's configured timeout. A job with a 5-minute timeout would remain blocked for up to 2 hours instead of being unblocked after ~10 minutes (the 5-minute timeout plus a 5-minute buffer).
### Changes
- **`src/cron/service/jobs.ts`**: Replace the hardcoded `STUCK_RUN_MS` constant with a per-job `resolveStuckRunMs()` function that computes the threshold from `payload.timeoutSeconds` (+ 5 min buffer) when available, falling back to `DEFAULT_JOB_TIMEOUT_MS` (10 min) + buffer for jobs without an explicit timeout.
- **`src/cron/service.stuck-running-timeout.test.ts`**: 8 regression tests covering all branches of the new logic:
  - agentTurn jobs with short timeout (5 min): clear vs. retain
  - agentTurn jobs with long timeout (30 min): clear vs. retain
  - agentTurn jobs without `timeoutSeconds`: fallback behavior
  - systemEvent jobs (no timeout field): fallback behavior
### How it works
```
resolveStuckRunMs(job):
  if agentTurn && timeoutSeconds is set:
    return timeoutSeconds * 1000 + STUCK_RUN_BUFFER_MS (5 min)
  else:
    return DEFAULT_JOB_TIMEOUT_MS (10 min) + STUCK_RUN_BUFFER_MS (5 min)
```
This ensures that a job with `timeoutSeconds: 300` (5 min) gets its `runningAtMs` cleared after 10 minutes instead of waiting 2 hours.
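For reviewers who prefer concrete code, the pseudocode above could look roughly like the following in TypeScript. This is a minimal sketch, not the code in `src/cron/service/jobs.ts`: the `CronJob` shape, the `payload.kind` discriminator, the `state.runningAtMs` location, and the `clearIfStuck` helper are all assumptions made for illustration, as is the choice of a strict `>` comparison.

```typescript
// Hypothetical, simplified job shape; the real types in src/cron are richer.
type CronPayload =
  | { kind: "agentTurn"; timeoutSeconds?: number }
  | { kind: "systemEvent" };

interface CronJob {
  payload: CronPayload;
  state: { runningAtMs?: number };
}

const DEFAULT_JOB_TIMEOUT_MS = 10 * 60 * 1000; // 10 min fallback
const STUCK_RUN_BUFFER_MS = 5 * 60 * 1000; // 5 min grace period

// Per-job threshold: explicit timeout + buffer, else default + buffer.
function resolveStuckRunMs(job: CronJob): number {
  const payload = job.payload;
  if (payload.kind === "agentTurn" && typeof payload.timeoutSeconds === "number") {
    return payload.timeoutSeconds * 1000 + STUCK_RUN_BUFFER_MS;
  }
  return DEFAULT_JOB_TIMEOUT_MS + STUCK_RUN_BUFFER_MS;
}

// Hypothetical helper: clears the marker when the run has exceeded its
// threshold and reports whether it did so.
function clearIfStuck(job: CronJob, nowMs: number): boolean {
  const startedAt = job.state.runningAtMs;
  if (typeof startedAt !== "number") return false;
  if (nowMs - startedAt <= resolveStuckRunMs(job)) return false;
  job.state.runningAtMs = undefined;
  return true;
}
```

Under these assumptions, a job with `timeoutSeconds: 300` resolves to a 600,000 ms threshold, while a systemEvent job falls back to 900,000 ms.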
### Test plan
- [x] All 8 new regression tests pass
- [x] All 152 existing cron tests pass (no regressions)
- [x] `pnpm format:check` — clean
- [x] `pnpm tsgo` — clean
- [x] `pnpm lint` — 0 warnings, 0 errors
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes a long-standing scheduler bug where stuck-run detection used a blanket 2-hour threshold regardless of a job's configured `timeoutSeconds`. A job with a 5-minute timeout would remain blocked for up to 2 hours instead of ~10 minutes.
The fix introduces `resolveStuckRunMs()` in `src/cron/service/jobs.ts`, which computes a per-job threshold from `payload.timeoutSeconds` (+ 5-minute buffer) for `agentTurn` jobs, and falls back to `DEFAULT_JOB_TIMEOUT_MS` (10 min) + buffer for jobs without an explicit timeout. This logic mirrors the execution timeout already applied in `timer.ts` and is consistent with how the scheduler itself enforces deadlines.
Key observations:
- The behavioral change is correct and well-targeted: the threshold is now proportional to the job's actual timeout.
- The new `stuckThresholdMs` field in the log payload improves observability.
- 8 regression tests cover all branches cleanly and confirm both the clear and retain cases for each job type.
- **One maintenance concern**: `DEFAULT_JOB_TIMEOUT_MS` is now defined independently in both `jobs.ts` and `timer.ts`. Both values are identical today (`10 * 60 * 1000`), but the duplication could silently diverge if the execution timeout in `timer.ts` is changed in the future. Exporting the constant from a single location would eliminate this risk.
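One way the duplication could be removed is sketched below, under stated assumptions. The single shared definition and the `resolveExecutionTimeoutMs` helper are hypothetical and do not appear in this PR; in a real layout the constants would be exported from one module and imported by both `jobs.ts` and `timer.ts`.

```typescript
// Hypothetical shared definitions; in the repo these would live in one
// module and be exported to both timer.ts and jobs.ts.
const DEFAULT_JOB_TIMEOUT_MS = 10 * 60 * 1000; // 10 min
const STUCK_RUN_BUFFER_MS = 5 * 60 * 1000; // 5 min

// Sketch of the timer.ts side: the execution deadline for a job.
function resolveExecutionTimeoutMs(timeoutSeconds?: number): number {
  return typeof timeoutSeconds === "number"
    ? timeoutSeconds * 1000
    : DEFAULT_JOB_TIMEOUT_MS;
}

// Sketch of the jobs.ts side: the stuck threshold is derived from the
// same execution timeout, so the two values cannot drift apart.
function resolveStuckRunMs(timeoutSeconds?: number): number {
  return resolveExecutionTimeoutMs(timeoutSeconds) + STUCK_RUN_BUFFER_MS;
}
```

Deriving the stuck threshold from the execution timeout means a future change to the default only has to be made in one place.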
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge; the fix is correct and well-tested, with one minor maintainability concern about a duplicated constant.
- The logic change is small, well-understood, and directly mirrors existing timeout handling already present in `timer.ts`. All 8 new regression tests pass, and the test coverage is thorough across all code paths. The only concern is the duplication of `DEFAULT_JOB_TIMEOUT_MS` between `jobs.ts` and `timer.ts`, which is a style/maintainability issue rather than a functional bug today.
- No files require special attention, though reviewers should note the duplicated `DEFAULT_JOB_TIMEOUT_MS` constant in `src/cron/service/jobs.ts` vs `src/cron/service/timer.ts`.
<sub>Last reviewed commit: 05608da</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002 · 2026-02-16 · 88.2%)
- #17561: fix(cron): add runtime staleness guard for runningAtMs (#17554) (by robbyczgw-cla · 2026-02-15 · 87.8%)
- #17643: fix: clear stale runningAtMs in cron.run to allow manual triggers (by MisterGuy420 · 2026-02-16 · 86.1%)
- #18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120) (by BinHPdev · 2026-02-16 · 86.0%)
- #16880: fix(cron): respect per-job timeoutSeconds in executeJob path (#16841) (by echoVic · 2026-02-15 · 84.7%)
- #17895: fix(cron): add staleness check for runningAtMs on manual trigger (by PlayerGhost · 2026-02-16 · 84.5%)
- #17949: fix: clear stale runningAtMs in cron.run() before already-running c... (by yasumorishima · 2026-02-16 · 84.5%)
- #17664: fix(cron): detect and clear stale runningAtMs marker in manual run ... (by echoVic · 2026-02-16 · 84.2%)
- #12018: fix(cron): clear stale running markers based on job timeout (by benzer25 · 2026-02-08 · 84.0%)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey · 2026-02-09 · 83.1%)