#14430: Cron: anti-zombie scheduler recovery and in-flight job persistence

by philga7 open 2026-02-12 03:59

docs gateway stale size: L

#### Summary

Cron scheduler reliability: prevent and recover from a stuck (zombie) timer so one-shot reminders and recurring jobs keep running. When the event loop is blocked or `onTimer` throws, the scheduler now re-arms itself, and a 60s anti-zombie check re-initializes the timer if no tick completes. Stale in-flight jobs (`runningAtMs`) are cleared and re-enqueued so one-shot `--at` reminders are retried instead of dropped.

I currently use cron jobs pretty heavily for both news-gathering and calendar reminders; I got tired of dealing with failures of varying sorts on a nearly daily basis. If this helps out the OpenClaw codebase, that's fantastic. It's really helped me in my fork.

Code word: lobster-biscuit.

#### Use Cases

- Gateway runs for days; timer occasionally stops (e.g. event loop blocked). Jobs show "Next: Xm ago" but never run. User wants recovery without a full restart.
- One-shot reminder fires but the process freezes mid-delivery; after recovery, the reminder should run again instead of being lost.

#### Behavior Changes

- Re-arm timer in catch block when `onTimer` throws.
- Re-arm on `openclaw cron list` / `openclaw cron status` when timer is dead (zombie recovery).
- Watchdog timer (2.5 min) re-arms if main timer dies.
- Anti-zombie self-healing: if no timer tick completes within 60s, scheduler re-initializes and clears/re-enqueues stale in-flight jobs (`runningAtMs`).
- Per-job dynamic stuck threshold for `runningAtMs` based on job timeout; startup clears only obviously stale markers.
- Docs: [Cron stuck (zombie scheduler)](https://docs.openclaw.ai/automation/troubleshooting), [cron-jobs](https://docs.openclaw.ai/automation/cron-jobs), [gateway troubleshooting](https://docs.openclaw.ai/gateway/troubleshooting) updated with anti-zombie and in-flight recovery.

#### Existing Functionality Check

- [x] I searched the codebase for existing functionality.

Searches performed:

- Cron scheduler and timer in `src/cron/service/`; no prior anti-zombie or in-flight persistence.
- Upstream does not have `src/cron/service.anti-zombie.test.ts` or the 60s check-in / watchdog logic.

#### Tests

- `src/cron/service.anti-zombie.test.ts`: re-init when no tick in 60s, no false positive when recent tick completed, stale vs fresh `runningAtMs` recovery.
- `src/cron/service.restart-catchup.test.ts`: startup clears only stale `runningAtMs`.
- `src/cron/service.every-jobs-fire.test.ts`, `src/cron/service/jobs.ts`: per-job stuck threshold and re-arm in catch.
- All 112 cron tests pass (`pnpm test -- src/cron/`).

#### Manual Testing (omit if N/A)

- Run gateway; `openclaw cron list` / `openclaw cron status` re-arms if timer was dead. Logs: `cron: anti-zombie: no tick in 60s, re-initializing scheduler`, `cron: anti-zombie: recovering stale-running job`, `cron: watchdog re-arming timer` when applicable.

### Prerequisites

- Node 22+, pnpm.

### Steps

1. `pnpm install && pnpm build`
2. Run gateway; add a one-shot or recurring cron job.
3. `openclaw cron list` / `openclaw cron status`; check logs for anti-zombie/watchdog messages if scheduler was stuck.

**Sign-Off**

- Models used: Cursor IDE models
- Submitter effort: Roughly a week's worth of using so far :)
- Agent notes: Scoped to cron/scheduler only; no FORK-CHANGES or other fork-only files.

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

These changes harden the cron scheduler against “zombie” timers and hung ticks by (a) re-arming the timer after tick failures, (b) adding a watchdog to re-arm when the main timer is missing, and (c) adding an anti-zombie check-in that reinitializes the scheduler if no tick completes within 60s and optionally recovers stale `runningAtMs` jobs. Startup behavior was also adjusted to be more conservative about clearing `runningAtMs`, and tests were added to cover anti-zombie and restart catch-up behavior.
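To make the check-in in (c) concrete, here is a minimal sketch of how a 60s anti-zombie check with per-job stale-marker recovery might look. It is not the code from this PR: only `runningAtMs` and the 60s window come from the description, while `SchedulerState`, `timeoutMs`, `stuckThresholdMs`, `antiZombieCheck`, the 30s grace period, and the 20-minute fallback are assumptions made for illustration.

```typescript
// Illustrative sketch only: every name except `runningAtMs` is hypothetical.
interface Job {
  id: string;
  runningAtMs?: number; // set while a run is in flight
  timeoutMs?: number;   // assumed per-job timeout field
}

interface SchedulerState {
  lastTickCompletedAtMs: number; // assumed to be updated at the end of each completed tick
  jobs: Job[];
}

const ANTI_ZOMBIE_WINDOW_MS = 60_000; // "no tick completes within 60s"
const STUCK_GRACE_MS = 30_000;        // assumed grace added on top of the job timeout
const DEFAULT_STUCK_MS = 20 * 60_000; // assumed fallback for jobs without a timeout

// Per-job threshold: how long a `runningAtMs` marker may live before it counts as stale.
function stuckThresholdMs(job: Job): number {
  return job.timeoutMs !== undefined ? job.timeoutMs + STUCK_GRACE_MS : DEFAULT_STUCK_MS;
}

// Clear stale in-flight markers and return the ids that should be re-enqueued.
function recoverStaleJobs(state: SchedulerState, nowMs: number): string[] {
  const recovered: string[] = [];
  for (const job of state.jobs) {
    if (job.runningAtMs === undefined) continue; // not in flight
    if (nowMs - job.runningAtMs >= stuckThresholdMs(job)) {
      delete job.runningAtMs; // marker is stale: allow the job to run again
      recovered.push(job.id);
    }
  }
  return recovered;
}

// Check-in: if no tick completed inside the window, recover stale jobs and re-arm.
function antiZombieCheck(
  state: SchedulerState,
  nowMs: number,
  rearm: () => void,
): string[] {
  if (nowMs - state.lastTickCompletedAtMs < ANTI_ZOMBIE_WINDOW_MS) return [];
  const recovered = recoverStaleJobs(state, nowMs);
  rearm();
  return recovered;
}
```

Passing `nowMs` and `rearm` in as arguments keeps the check free of real timers, so the "no false positive when a recent tick completed" and "stale vs fresh `runningAtMs`" cases can be exercised with plain numbers.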
Most logic lives in `src/cron/service/timer.ts` (anti-zombie + watchdog + tick bookkeeping) and `src/cron/service/ops.ts` (startup/status/list hooks), with supporting state additions in `src/cron/service/state.ts` and updated stuck-marker thresholds in `src/cron/service/jobs.ts`.

<h3>Confidence Score: 3/5</h3>

- Moderately safe, but has recovery-loop edge cases that could cause repeated churn or leave jobs stuck after restart.
- Core approach is reasonable and is covered by new tests, but there are two correctness issues: startup clears `runningAtMs` using a fixed 20-minute constant that can conflict with per-job timeout-based thresholds, and the anti-zombie watchdog can backlog async interval runs causing repeated reinitialization/log spam under lock contention or slow persistence.
- `src/cron/service/ops.ts`, `src/cron/service/timer.ts`
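
The backlog concern above (async interval callbacks piling up while an earlier run is still waiting on a lock or on persistence) is commonly handled with an in-flight guard. The following is a sketch of that general technique, not something this PR is known to implement; `nonOverlapping` and `runAntiZombieCheck` are hypothetical names.

```typescript
// Hypothetical helper: wraps an async task so that a call made while a
// previous call is still pending is skipped instead of queued.
function nonOverlapping(task: () => Promise<void>): () => Promise<boolean> {
  let inFlight = false;
  return async () => {
    if (inFlight) return false; // previous run still pending: skip this one
    inFlight = true;
    try {
      await task();
      return true; // this call actually ran the task
    } finally {
      inFlight = false; // always release, even if the task throws
    }
  };
}

// Usage sketch (assumed wiring, with a hypothetical `runAntiZombieCheck`):
//   const check = nonOverlapping(runAntiZombieCheck);
//   setInterval(() => { check().catch((err) => console.error(err)); }, 60_000);
```

With a guard like this, a slow check produces at most one reinitialization per completed run rather than one per elapsed interval, which would also bound the repeated log output.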