#14430: Cron: anti-zombie scheduler recovery and in-flight job persistence
docs
gateway
stale
size: L
Cluster: Cron Scheduler Improvements
#### Summary
Cron scheduler reliability: prevent and recover from a stuck (zombie) timer so one-shot reminders and recurring jobs keep running. When the event loop is blocked or `onTimer` throws, the scheduler now re-arms itself, and a 60s anti-zombie check re-initializes the timer if no tick completes. Stale in-flight jobs (`runningAtMs`) are cleared and re-enqueued so one-shot `--at` reminders are retried instead of dropped.
I use cron jobs heavily for both news gathering and calendar reminders, and I got tired of dealing with failures of various sorts on a nearly daily basis.
If this helps out the OpenClaw codebase, that's fantastic. It's really helped me in my fork.
Code word: lobster-biscuit.
#### Use Cases
- Gateway runs for days; timer occasionally stops (e.g. event loop blocked). Jobs show "Next: Xm ago" but never run. User wants recovery without a full restart.
- One-shot reminder fires but the process freezes mid-delivery; after recovery, the reminder should run again instead of being lost.
#### Behavior Changes
- Re-arm timer in catch block when `onTimer` throws.
- Re-arm on `openclaw cron list` / `openclaw cron status` when timer is dead (zombie recovery).
- Watchdog timer (2.5 min) re-arms if main timer dies.
- Anti-zombie self-healing: if no timer tick completes within 60s, scheduler re-initializes and clears/re-enqueues stale in-flight jobs (`runningAtMs`).
- Per-job dynamic stuck threshold for `runningAtMs` based on job timeout; startup clears only obviously stale markers.
- Docs: [Cron stuck (zombie scheduler)](https://docs.openclaw.ai/automation/troubleshooting), [cron-jobs](https://docs.openclaw.ai/automation/cron-jobs), [gateway troubleshooting](https://docs.openclaw.ai/gateway/troubleshooting) updated with anti-zombie and in-flight recovery.
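The re-arm and watchdog bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `src/cron/service/timer.ts`: `SchedulerState`, `TickDeps`, `armTimer`, `runTick`, `startWatchdog`, `lastTickCompletedAtMs`, and the `onError` hook are hypothetical names. Only `onTimer` and the 2.5 min watchdog interval come from the description.

```typescript
// Illustrative names only; the PR's real state and helpers live in src/cron/service/.
type Handle = ReturnType<typeof setTimeout>;

interface SchedulerState {
  timer: Handle | null;                 // pending one-shot timer, null while unarmed
  running: boolean;                     // true while a tick is executing
  lastTickCompletedAtMs: number | null; // bookkeeping for health checks
}

interface TickDeps {
  delayMs: number;
  onTimer: () => Promise<void>;
  onError: (err: unknown) => void; // hypothetical error hook
}

// Arm (or re-arm) the timer; clearing first avoids stacking two timers.
function armTimer(state: SchedulerState, deps: TickDeps): void {
  if (state.timer !== null) clearTimeout(state.timer);
  state.timer = setTimeout(() => {
    state.timer = null;
    void runTick(state, deps);
  }, deps.delayMs);
}

// Run one tick. The timer is re-armed on success and in the catch block,
// so a throwing onTimer cannot leave the scheduler without a timer.
async function runTick(state: SchedulerState, deps: TickDeps): Promise<void> {
  state.running = true;
  try {
    await deps.onTimer();
    state.lastTickCompletedAtMs = Date.now();
    state.running = false;
    armTimer(state, deps);
  } catch (err) {
    state.running = false;
    armTimer(state, deps); // re-arm before reporting, in case the hook throws
    deps.onError(err);
  }
}

// Watchdog: periodically re-arm if the main timer has gone missing while idle.
// 150_000 ms matches the 2.5 min interval mentioned above.
function startWatchdog(
  state: SchedulerState,
  deps: TickDeps,
  intervalMs = 150_000,
): ReturnType<typeof setInterval> {
  return setInterval(() => {
    if (state.timer === null && !state.running) armTimer(state, deps);
  }, intervalMs);
}
```

In this sketch the watchdog deliberately ignores a tick that is still running. A hung tick is the case the separate 60s anti-zombie check is meant to cover.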
#### Existing Functionality Check
- [x] I searched the codebase for existing functionality. Searches performed:
  - Cron scheduler and timer in `src/cron/service/`; no prior anti-zombie or in-flight persistence.
  - Upstream does not have `src/cron/service.anti-zombie.test.ts` or the 60s check-in / watchdog logic.
#### Tests
- `src/cron/service.anti-zombie.test.ts`: re-init when no tick in 60s, no false positive when recent tick completed, stale vs fresh `runningAtMs` recovery.
- `src/cron/service.restart-catchup.test.ts`: startup clears only stale `runningAtMs`.
- `src/cron/service.every-jobs-fire.test.ts`, `src/cron/service/jobs.ts`: per-job stuck threshold and re-arm in catch.
- All 112 cron tests pass (`pnpm test -- src/cron/`).
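The cases these tests exercise (re-init after 60s without a tick, no false positive after a recent tick, stale vs fresh `runningAtMs`) boil down to two small decisions. The sketch below illustrates them as pure functions. It is not the PR's implementation: the `CronJob` shape, the `timeoutMs` field, the default timeout, and the grace period are assumptions made for illustration. Only `runningAtMs` and the 60s window come from the description.

```typescript
// Hypothetical job shape; only `runningAtMs` is named in the PR description.
interface CronJob {
  id: string;
  timeoutMs?: number;   // assumed per-job timeout field
  runningAtMs?: number; // set while a run is in flight
}

const ANTI_ZOMBIE_WINDOW_MS = 60_000;       // from the description: 60s check-in
const DEFAULT_JOB_TIMEOUT_MS = 10 * 60_000; // assumed fallback, not from the PR
const STUCK_GRACE_MS = 60_000;              // assumed slack on top of the timeout

// True when no tick has completed within the anti-zombie window.
function needsReinit(lastTickCompletedAtMs: number, nowMs: number): boolean {
  return nowMs - lastTickCompletedAtMs > ANTI_ZOMBIE_WINDOW_MS;
}

// Per-job stuck threshold derived from the job's own timeout.
function stuckThresholdMs(job: CronJob): number {
  return (job.timeoutMs ?? DEFAULT_JOB_TIMEOUT_MS) + STUCK_GRACE_MS;
}

// A marker is stale only if it is older than that job's threshold.
function isStaleRunning(job: CronJob, nowMs: number): boolean {
  return (
    job.runningAtMs !== undefined &&
    nowMs - job.runningAtMs > stuckThresholdMs(job)
  );
}

// Clear stale markers and return the jobs to re-enqueue.
// Fresh in-flight jobs and idle jobs are left untouched.
function recoverStale(jobs: CronJob[], nowMs: number): CronJob[] {
  const recovered: CronJob[] = [];
  for (const job of jobs) {
    if (isStaleRunning(job, nowMs)) {
      delete job.runningAtMs;
      recovered.push(job);
    }
  }
  return recovered;
}
```

Keeping the decisions free of timers and I/O means they can be checked with plain timestamps, without fake clocks.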
#### Manual Testing
- Run the gateway; `openclaw cron list` / `openclaw cron status` re-arms the timer if it was dead. Expected log lines, when applicable: `cron: anti-zombie: no tick in 60s, re-initializing scheduler`, `cron: anti-zombie: recovering stale-running job`, `cron: watchdog re-arming timer`.
##### Prerequisites
- Node 22+, pnpm.
##### Steps
1. `pnpm install && pnpm build`
2. Run gateway; add a one-shot or recurring cron job.
3. `openclaw cron list` / `openclaw cron status`; check logs for anti-zombie/watchdog messages if scheduler was stuck.
**Sign-Off**
- Models used: Cursor IDE models
- Submitter effort: Roughly a week of use so far :)
- Agent notes: Scoped to cron/scheduler only; no FORK-CHANGES or other fork-only files.
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
These changes harden the cron scheduler against “zombie” timers and hung ticks by (a) re-arming the timer after tick failures, (b) adding a watchdog to re-arm when the main timer is missing, and (c) adding an anti-zombie check-in that reinitializes the scheduler if no tick completes within 60s and optionally recovers stale `runningAtMs` jobs. Startup behavior was also adjusted to be more conservative about clearing `runningAtMs`, and tests were added to cover anti-zombie and restart catch-up behavior.
Most logic lives in `src/cron/service/timer.ts` (anti-zombie + watchdog + tick bookkeeping) and `src/cron/service/ops.ts` (startup/status/list hooks), with supporting state additions in `src/cron/service/state.ts` and updated stuck-marker thresholds in `src/cron/service/jobs.ts`.
<h3>Confidence Score: 3/5</h3>
- Moderately safe, but has recovery-loop edge cases that could cause repeated churn or leave jobs stuck after restart.
- Core approach is reasonable and is covered by new tests, but there are two correctness issues: startup clears `runningAtMs` using a fixed 20-minute constant that can conflict with per-job timeout-based thresholds, and the anti-zombie watchdog can backlog async interval runs causing repeated reinitialization/log spam under lock contention or slow persistence.
- Pay close attention to `src/cron/service/ops.ts` and `src/cron/service/timer.ts`.
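One common way to keep an async interval callback from backlogging, as flagged above, is an in-flight guard that skips a run while the previous one is still pending. The sketch below shows that general technique only. `nonOverlapping` is a hypothetical helper, and nothing here claims to match what the PR's `timer.ts` does or should do.

```typescript
// Wrap an async check so overlapping invocations are skipped, not queued.
// Resolves to true if the check ran, false if it was skipped.
function nonOverlapping(check: () => Promise<void>): () => Promise<boolean> {
  let inFlight = false;
  return async () => {
    if (inFlight) return false; // previous run still pending: skip this interval
    inFlight = true;
    try {
      await check();
      return true;
    } finally {
      inFlight = false; // always release, even if the check throws
    }
  };
}
```

Passing the wrapped function to `setInterval` means a check that is slow because of lock contention or persistence delays causes later intervals to be dropped instead of stacking up, which bounds both repeated re-initialization and log volume.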
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #10829: fix: prevent cron scheduler permanent death on transient startup/ru... (by meaadore1221-afk, 2026-02-07, 85.4%)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey, 2026-02-09, 83.3%)
- #12086: fix(cron): ensure timer callback fires for scheduled jobs (by divol89, 2026-02-08, 81.6%)
- #8698: fix(cron): default enabled to true for new jobs (by emmick4, 2026-02-04, 80.1%)
- #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... (by taw0002, 2026-02-16, 79.9%)
- #8578: fix(cron): add failure limit and exponential backoff for isolated t... (by Baoxd123, 2026-02-04, 79.9%)
- #5179: fix(cron): recover stale running markers (by thatdaveb, 2026-01-31, 79.8%)
- #10918: fix(cron): add tolerance for timer precision and skip due jobs in r... (by Cherwayway, 2026-02-07, 79.7%)
- #12122: fix(cron): ensure timer callback fires for scheduled jobs (by divol89, 2026-02-08, 79.7%)
- #13065: fix(cron): Fix "every" schedule not re-arming after gateway restart (by trevorgordon981, 2026-02-10, 79.6%)