#19414: fix: respect job timeoutSeconds for stuck runningAtMs detection

by namabile open 2026-02-17 19:35
size: M
## Summary

Fixes #18120

The scheduler's stuck-run detection previously used a blanket 2-hour threshold (`STUCK_RUN_MS`) regardless of the job's configured timeout. A job with a 5-minute timeout would remain blocked for up to 2 hours instead of being unblocked after ~10 minutes (its 5-minute timeout plus the 5-minute buffer).

### Changes

- **`src/cron/service/jobs.ts`**: Replace the hardcoded `STUCK_RUN_MS` constant with a per-job `resolveStuckRunMs()` function that computes the threshold from `payload.timeoutSeconds` (+ 5 min buffer) when available, falling back to `DEFAULT_JOB_TIMEOUT_MS` (10 min) + buffer for jobs without an explicit timeout.
- **`src/cron/service.stuck-running-timeout.test.ts`**: 8 regression tests covering all branches of the new logic:
  - agentTurn jobs with short timeout (5 min): clear vs. retain
  - agentTurn jobs with long timeout (30 min): clear vs. retain
  - agentTurn jobs without `timeoutSeconds`: fallback behavior
  - systemEvent jobs (no timeout field): fallback behavior

### How it works

```
resolveStuckRunMs(job):
  if agentTurn && timeoutSeconds is set:
    return timeoutSeconds * 1000 + STUCK_RUN_BUFFER_MS (5 min)
  else:
    return DEFAULT_JOB_TIMEOUT_MS (10 min) + STUCK_RUN_BUFFER_MS (5 min)
```

This ensures that a job with `timeoutSeconds: 300` (5 min) gets its `runningAtMs` cleared after 10 minutes instead of waiting 2 hours.

### Test plan

- [x] All 8 new regression tests pass
- [x] All 152 existing cron tests pass (no regressions)
- [x] `pnpm format:check`: clean
- [x] `pnpm tsgo`: clean
- [x] `pnpm lint`: 0 warnings, 0 errors

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR fixes a long-standing scheduler bug where stuck-run detection used a blanket 2-hour threshold regardless of a job's configured `timeoutSeconds`. A job with a 5-minute timeout would remain blocked for up to 2 hours instead of ~10 minutes.

The fix introduces `resolveStuckRunMs()` in `src/cron/service/jobs.ts`, which computes a per-job threshold from `payload.timeoutSeconds` (+ 5-minute buffer) for `agentTurn` jobs, and falls back to `DEFAULT_JOB_TIMEOUT_MS` (10 min) + buffer for jobs without an explicit timeout. This logic mirrors the execution timeout already applied in `timer.ts` and is consistent with how the scheduler itself enforces deadlines.

Key observations:

- The behavioral change is correct and well-targeted: the threshold is now proportional to the job's actual timeout.
- The new `stuckThresholdMs` field in the log payload improves observability.
- 8 regression tests cover all branches and confirm both the clear and retain cases for each job type.
- **One maintenance concern**: `DEFAULT_JOB_TIMEOUT_MS` is now defined independently in both `jobs.ts` and `timer.ts`. Both values are identical today (`10 * 60 * 1000`), but the two copies could silently diverge if the execution timeout in `timer.ts` is changed in the future. Exporting the constant from a single location would eliminate this risk.

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge; the fix is correct and well-tested, with one minor maintainability concern about a duplicated constant.
- The logic change is small, well-understood, and directly mirrors existing timeout handling already present in `timer.ts`. All 8 new regression tests pass, and the test coverage is thorough across all code paths. The only concern is the duplication of `DEFAULT_JOB_TIMEOUT_MS` between `jobs.ts` and `timer.ts`, which is a style/maintainability issue rather than a functional bug today.
- No files require special attention, though reviewers should note the duplicated `DEFAULT_JOB_TIMEOUT_MS` constant in `src/cron/service/jobs.ts` vs `src/cron/service/timer.ts`.
<sub>Last reviewed commit: 05608da</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->
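
For readers who want the threshold arithmetic spelled out, here is a minimal TypeScript sketch of the "How it works" pseudocode. It is not the code from `src/cron/service/jobs.ts`: the `CronJob` and `CronPayload` shapes, the `kind` discriminant, the `state.runningAtMs` location, and the `isStuck` helper are assumptions made for illustration. Only the constant names, their values, and the branch structure come from the PR description.

```typescript
// Hypothetical, simplified job shapes; the real types in the repo may differ.
type CronPayload =
  | { kind: "agentTurn"; timeoutSeconds?: number }
  | { kind: "systemEvent" };

interface CronJob {
  payload: CronPayload;
  state: { runningAtMs?: number };
}

// Values as stated in the PR description.
const DEFAULT_JOB_TIMEOUT_MS = 10 * 60 * 1000; // 10 min
const STUCK_RUN_BUFFER_MS = 5 * 60 * 1000; // 5 min

// Per-job stuck threshold: the job's own timeout plus the buffer when an
// agentTurn job sets timeoutSeconds, otherwise the default timeout plus the buffer.
function resolveStuckRunMs(job: CronJob): number {
  const payload = job.payload;
  if (payload.kind === "agentTurn" && typeof payload.timeoutSeconds === "number") {
    return payload.timeoutSeconds * 1000 + STUCK_RUN_BUFFER_MS;
  }
  return DEFAULT_JOB_TIMEOUT_MS + STUCK_RUN_BUFFER_MS;
}

// Hypothetical helper: a run counts as stuck once it has been marked running
// for longer than its threshold. The strict ">" comparison is an assumption.
function isStuck(job: CronJob, nowMs: number): boolean {
  const startedAtMs = job.state.runningAtMs;
  return typeof startedAtMs === "number" && nowMs - startedAtMs > resolveStuckRunMs(job);
}
```

Under these assumptions, a job with `timeoutSeconds: 300` resolves to 600,000 ms (10 min), a job with `timeoutSeconds: 1800` resolves to 2,100,000 ms (35 min), and a job without a timeout resolves to 900,000 ms (15 min). The review's maintenance concern would be addressed by exporting `DEFAULT_JOB_TIMEOUT_MS` from one module and importing it in the other, rather than declaring it twice.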
