#22411: fix(cron): cancel timed-out runs before side effects
gateway
agents
maintainer
size: M
Cluster:
Cron Job Stability Fixes
## Summary
Describe the problem and fix in 2–5 bullets:
- Problem: Cron timer timeouts marked runs as failed, but isolated execution could still finish and emit downstream side effects (announce/main-session summary).
- Why it matters: This can produce inconsistent outcomes (error state + late delivery) and duplicate/confusing notifications.
- What changed: Added abort-signal plumbing from timer timeout into cron isolated execution path and guard checks before post-run side effects; wired signal through gateway cron service.
- What did NOT change (scope boundary): No scheduler concurrency behavior changes; no CLI/API surface changes beyond internal optional signal threading.
## Change Type (select all)
- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra
## Scope (select all touched areas)
- [x] Gateway / orchestration
- [x] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra
## Linked Issue/PR
- Closes #
- Related #20217
- Related #18040
- Related #18120
- Related #16841
- Related PR #16880
## User-visible / Behavior Changes
- Timed-out cron isolated runs are now prevented from emitting late follow-up side effects after timeout.
- Reduces cases where users see timeout/error while delayed follow-up delivery still appears later.
## Security Impact (required)
- New permissions/capabilities? (`Yes/No`) No
- Secrets/tokens handling changed? (`Yes/No`) No
- New/changed network calls? (`Yes/No`) No
- Command/tool execution surface changed? (`Yes/No`) No
- Data access scope changed? (`Yes/No`) No
- If any `Yes`, explain risk + mitigation:
## Repro + Verification
### Environment
- OS: macOS (darwin-arm64)
- Runtime/container: Node + Vitest
- Model/provider: N/A (unit/integration tests)
- Integration/channel (if any): Cron service internals
- Relevant config (redacted): default test config
### Steps
1. Create an isolated cron job with a very small timeout and delayed runner completion.
2. Trigger scheduler tick.
3. Confirm timed-out status and absence of follow-up enqueue side effects.
### Expected
- Run is marked timed out/error and does not emit late follow-up side effects.
### Actual
- Matches expected after fix.
## Evidence
Attach at least one:
- [x] Failing test/log before + passing after
- [ ] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)
## Human Verification (required)
What you personally verified (not just CI), and how:
- Verified scenarios:
- `pnpm vitest run src/cron/service.issue-regressions.test.ts src/cron/service.rearm-timer-when-running.test.ts src/cron/service.restart-catchup.test.ts --maxWorkers=1`
- `pnpm vitest run src/cron/service.delivery-plan.test.ts src/cron/service.runs-one-shot-main-job-disables-it.test.ts --maxWorkers=1`
- Edge cases checked:
- timeoutSeconds=0 semantics remain intact
- existing timer rearm/restart catch-up behavior still passes
- What you did **not** verify:
- Full live provider/channel end-to-end with real network delivery
## Compatibility / Migration
- Backward compatible? (`Yes/No`) Yes
- Config/env changes? (`Yes/No`) No
- Migration needed? (`Yes/No`) No
- If yes, exact upgrade steps:
## Failure Recovery (if this breaks)
- How to disable/revert this change quickly:
- Revert this PR commit from cron timeout-cancellation branch.
- Files/config to restore:
- `src/cron/service/state.ts`, `src/cron/service/timer.ts`, `src/cron/isolated-agent/run.ts`, `src/gateway/server-cron.ts`, `src/cron/service.issue-regressions.test.ts`
- Known bad symptoms reviewers should watch for:
- Timed-out jobs still emitting late announce/main-session summaries.
## Risks and Mitigations
List only real risks for this PR. Add/remove entries as needed. If none, write `None`.
- Risk:
- Over-cancellation could suppress legitimate post-run side effects in borderline timeout races.
- Mitigation:
- Abort checks are limited to timeout path and covered with a dedicated regression test for side-effect suppression after timeout.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds `AbortSignal` plumbing through the cron execution stack to prevent timed-out jobs from emitting late side effects (system events, announces) after timeout. The timeout controller now aborts the signal when jobs exceed their `timeoutSeconds` limit, and abort checks guard the main post-run side effects (`enqueueSystemEvent`, `requestHeartbeatNow`) in `src/cron/service/timer.ts:554-581`.
**Key changes:**
- Threads optional `signal?: AbortSignal` parameter through `runIsolatedAgentJob` → `runCronIsolatedAgentTurn`
- Adds abort checks at two critical points in `isolated-agent/run.ts:514` (after agent execution) and `:574` (before side effects logic)
- Guards post-run side effects in `timer.ts:554` (before `enqueueSystemEvent`) and `:502` (in heartbeat loop)
- Normalizes timeout error messages via `timeoutErrorMessage()` helper
**Test coverage:**
- New regression test validates that `enqueueSystemEvent` is not called when timeout fires
- Existing timer/restart/delivery tests continue to pass
<h3>Confidence Score: 3/5</h3>
- Safe to merge with moderate caution - improves timeout handling but has edge case gaps
- The PR correctly implements abort signal plumbing and guards the critical post-run side effects (`enqueueSystemEvent`) in `timer.ts`. However, there are two potential race conditions in `isolated-agent/run.ts` where in-run delivery side effects (`deliverOutboundPayloads:654`, `runSubagentAnnounceFlow:747`) could execute after timeout if the timeout fires during their execution. The test only validates post-run behavior, not these in-run cases. This is a partial fix that significantly improves the situation but doesn't fully eliminate late deliveries.
- Pay close attention to `src/cron/isolated-agent/run.ts` lines 654 and 747 - verify whether in-run delivery side effects should be guarded with abort checks
<sub>Last reviewed commit: 76fd2f8</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#17064: fix(cron): prevent control-plane starvation during startup catch-up...
by donggyu9208 · 2026-02-15
80.3%
#18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc...
by taw0002 · 2026-02-16
80.0%
#19414: fix: respect job timeoutSeconds for stuck runningAtMs detection
by namabile · 2026-02-17
79.4%
#18192: fix(cron): auto-clear stale runningAtMs markers after timeout (#18120)
by BinHPdev · 2026-02-16
79.3%
#3693: fix(cron): delete deleteAfterRun jobs regardless of execution status
by HirokiKobayashi-R · 2026-01-29
78.4%
#11522: Fix #10904: Add hard timeout to lane tasks to prevent cron wedging
by divol89 · 2026-02-07
78.3%
#10829: fix: prevent cron scheduler permanent death on transient startup/ru...
by meaadore1221-afk · 2026-02-07
78.2%
#22719: fix(agents): make subagent announce timeout configurable (restore 6...
by Valadon · 2026-02-21
78.0%
#18432: fix(agents): clear active run state immediately on embedded timeout
by BinHPdev · 2026-02-16
78.0%
#18191: fix(cron): prevent scheduler freeze during rapid create/delete cycl...
by BinHPdev · 2026-02-16
77.7%