#23683: docs: RFC for standalone task scheduler with implicit heartbeat
docs
size: XS
Cluster: Cron Scheduler Improvements
## Problem
OpenClaw's built-in cron stores jobs in a flat JSON file. No run history, no stale detection, no overlap control, no failure recovery. The heartbeat system fires agent turns on a timer but has no concept of job lifecycle — if something hangs, you find out when you notice the silence.
Specific gaps:
- **No run history.** Job fires, result disappears.
- **No stale detection.** Agent hangs mid-job? Only signal is silence until the next heartbeat.
- **No overlap control.** A job still running when its next fire time arrives gets dispatched again.
- **No backoff.** Broken jobs fire every cycle, burning tokens.
- **No inter-agent messaging.** Isolated sessions can't coordinate.
- **Flat file storage.** No transactions, no queryability, no crash safety.
## Proposal
Standalone scheduler process that replaces both built-in cron and heartbeat. Runs as a separate service alongside the gateway — dispatches via the existing chat completions API. Zero gateway code changes required.
### What it does
| Capability | How |
|---|---|
| **Job storage** | SQLite (ACID, queryable, crash-safe) |
| **Run tracking** | Every execution logged: status, duration, agent response, error detail |
| **Implicit heartbeat** | Monitors session activity instead of applying a flat timeout. A crashed 20s job is detected in ~90s. A legitimate 4min research task isn't killed at 5min. |
| **Overlap control** | Per-job policy: skip / allow / queue |
| **Failure backoff** | Escalating (roughly exponential) delay on consecutive failures: 30s → 1min → 5min → 15min → 60min |
| **Job chains** | Parent/child relationships with trigger-on-success, depth limits |
| **Inter-agent messaging** | Priority queue with threading, read receipts, TTL, delivery tracking |
| **CLI** | Full job/run/message management without touching the DB directly |
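As a rough illustration of the overlap-control and failure-backoff rows above, the two decisions can be written as small pure functions. This is a sketch, not the scheduler's actual code: the function and type names are hypothetical, and holding the delay at 60min after the fifth failure is an assumption. Only the delay steps and the three policy names come from the table. The steps are a fixed ladder rather than a strict doubling, so the sketch uses a lookup.

```typescript
// Hypothetical names throughout; only the delay steps and the three
// policy names (skip / allow / queue) are taken from the table.
const SECOND_MS = 1000;
const MINUTE_MS = 60 * SECOND_MS;

// Delay ladder from the table: 30s, 1min, 5min, 15min, 60min.
const BACKOFF_STEPS_MS: readonly number[] = [
  30 * SECOND_MS,
  1 * MINUTE_MS,
  5 * MINUTE_MS,
  15 * MINUTE_MS,
  60 * MINUTE_MS,
];

/** Delay before the next attempt after `consecutiveFailures` failures in a row. */
function backoffDelayMs(consecutiveFailures: number): number {
  if (consecutiveFailures <= 0) return 0; // healthy job: no extra delay
  // Assumption: the delay stays at the last step once the ladder is exhausted.
  const index = Math.min(consecutiveFailures, BACKOFF_STEPS_MS.length) - 1;
  return BACKOFF_STEPS_MS[index];
}

type OverlapPolicy = "skip" | "allow" | "queue";
type DispatchDecision = "dispatch" | "skip" | "enqueue";

/** What to do with a due job, given whether a previous run is still active. */
function decideDispatch(policy: OverlapPolicy, hasActiveRun: boolean): DispatchDecision {
  if (!hasActiveRun || policy === "allow") return "dispatch";
  return policy === "queue" ? "enqueue" : "skip";
}
```

Under these assumptions, a job that has failed three times in a row waits 5min before its next attempt, and a `skip` job with an active run is not dispatched again.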
### Architecture
```
Dispatcher Loop (10s tick)
├── Gateway health check
├── Find due jobs (next_run_at <= now, enabled=1)
├── Dispatch
│ ├── main session → openclaw system event CLI
│ └── isolated → POST /v1/chat/completions
├── Stale run detection (implicit heartbeat)
├── Deliver pending inter-agent messages
└── Housekeeping (expire messages, prune old runs)
```
All state lives in a single SQLite database. Dispatcher is a Node.js process managed by launchd/systemd. Jobs are cron-scheduled or chain-triggered.
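The "find due jobs" step in the loop above amounts to one filter over the jobs table. The sketch below shows the same predicate over in-memory rows so it runs without a database. The `JobRow` shape, the epoch-millisecond unit, and the oldest-first ordering are assumptions; only `next_run_at` and `enabled` appear in the diagram.

```typescript
// Hypothetical row shape; only next_run_at and enabled come from the diagram.
interface JobRow {
  id: string;
  next_run_at: number; // epoch milliseconds (assumed unit)
  enabled: 0 | 1;
}

/**
 * In-memory equivalent of a query such as:
 *   SELECT * FROM jobs WHERE enabled = 1 AND next_run_at <= :now ORDER BY next_run_at
 * Table and column names other than next_run_at/enabled are assumptions.
 */
function findDueJobs(jobs: readonly JobRow[], now: number): JobRow[] {
  return jobs
    .filter((job) => job.enabled === 1 && job.next_run_at <= now)
    .sort((a, b) => a.next_run_at - b.next_run_at); // oldest due job first (assumed order)
}
```

A disabled job is never returned, even when its `next_run_at` is in the past, and a job scheduled for a later tick is left alone until `now` reaches it.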
### Integration point
One config change: `gateway.http.endpoints.chatCompletions.enabled: true` (already exists, disabled by default). Everything else uses public APIs — chat completions for dispatch, tool invocation for delivery, system events for main-session injection.
No monkey-patching, no gateway forks, no internal API dependencies.
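For illustration, an isolated-session dispatch could be assembled roughly as follows. Only the `/v1/chat/completions` path comes from this RFC. The OpenAI-style body (`model`, `messages`), the bearer-token header, the use of the `model` field to name the target agent, and the base URL in the usage note are all assumptions, so check them against the gateway's documentation before relying on them.

```typescript
// Sketch only: request shape and auth scheme are assumptions, not confirmed gateway behavior.
interface DispatchRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildDispatchRequest(
  baseUrl: string, // gateway base URL (placeholder value in the usage note)
  token: string,   // assumed bearer token
  agentId: string, // assumed to travel in the `model` field
  prompt: string,  // the job's instruction text
): DispatchRequest {
  return {
    // Strip trailing slashes so the path is not doubled up.
    url: baseUrl.replace(/\/+$/, "") + "/v1/chat/completions",
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${token}`,
      },
      body: JSON.stringify({
        model: agentId,
        messages: [{ role: "user", content: prompt }],
      }),
    },
  };
}
```

With Node.js 18 or later, the result can be passed to the built-in `fetch`, for example `const req = buildDispatchRequest("http://localhost:8080", token, "main", "Run the hourly backup"); await fetch(req.url, req.init);`. Building the request separately from sending it keeps the dispatch payload easy to unit test.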
## Production status
Running in production for 3+ weeks:
- **91 unit tests passing** (v4, up from 47 at initial deploy)
- 4 active scheduled jobs (morning status, workspace audit, hourly backup, ad-hoc)
- Zero missed fires, zero undetected stale runs
- Survived gateway restarts, scheduler restarts, and network blips without data loss
v4 additions since initial deploy: retry logic with configurable max retries + delay, chain depth limits (prevent infinite recursion), chain cancellation (kill parent → kills children), delete-after-run for one-shot jobs.
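The chain depth limit and chain cancellation can be pictured with a parent-pointer model. This is a minimal sketch under assumed data shapes (`ChainJob`, `parentId`), not the v4 implementation: depth is counted by walking up to the root, and cancellation collects a job plus all of its descendants.

```typescript
// Hypothetical shape: each job optionally points at the parent that triggers it.
interface ChainJob {
  id: string;
  parentId: string | null;
}

/** Number of ancestors above `jobId`. Throws once the walk exceeds `maxDepth`,
 *  which also stops a parent cycle from looping forever. */
function chainDepth(jobs: readonly ChainJob[], jobId: string, maxDepth: number): number {
  let depth = 0;
  let current: ChainJob | undefined = jobs.find((j) => j.id === jobId);
  while (current !== undefined && current.parentId !== null) {
    depth += 1;
    if (depth > maxDepth) {
      throw new Error(`chain depth exceeds ${maxDepth} for job ${jobId}`);
    }
    const parentId = current.parentId;
    current = jobs.find((j) => j.id === parentId);
  }
  return depth;
}

/** Ids to cancel when `rootId` is killed: the job itself, then every descendant. */
function cancellationSet(jobs: readonly ChainJob[], rootId: string): string[] {
  const result: string[] = [rootId];
  // Breadth-first walk: `result` doubles as the work queue.
  for (let i = 0; i < result.length; i++) {
    for (const job of jobs) {
      if (job.parentId === result[i] && result.indexOf(job.id) === -1) {
        result.push(job.id);
      }
    }
  }
  return result;
}
```

In this model, registering a child whose depth would exceed the limit fails up front, and killing a parent yields the full list of runs to cancel in one pass.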
## What I'm proposing
Two paths, not mutually exclusive:
1. **Docs/reference** — publish the RFC as reference architecture for anyone who needs scheduling beyond flat-file cron. The pattern (SQLite + chat completions API + implicit heartbeat) is reusable.
2. **Native integration** — port the core concepts into built-in cron: run history table, implicit heartbeat, overlap policies, exponential backoff. The inter-agent messaging could stay external or become a first-class primitive.
Either way, the implicit heartbeat model (infer liveness from session activity, not wall-clock timeout) is the key insight. It eliminates both false-positive kills on long tasks and delayed detection of actual crashes.
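A minimal sketch of that liveness check follows, assuming the scheduler can read a "last session activity" timestamp for each active run. The field names, and the 90-second idle threshold inferred from the "~90s" detection figure above, are assumptions rather than the RFC's actual values.

```typescript
// Hypothetical run record; timestamps are epoch milliseconds.
interface ActiveRun {
  runId: string;
  startedAt: number;
  lastSessionActivityAt: number | null; // null when no activity has been observed yet
}

/**
 * Implicit heartbeat: a run is stale when its session has been silent for
 * longer than `idleThresholdMs`, regardless of how long the run has lasted.
 */
function isStale(run: ActiveRun, now: number, idleThresholdMs: number): boolean {
  // Fall back to the start time when the session never produced any activity.
  const lastSignal = run.lastSessionActivityAt ?? run.startedAt;
  return now - lastSignal > idleThresholdMs;
}

// Assumed threshold, inferred from the "~90s" figure; not a documented default.
const IDLE_THRESHOLD_MS = 90 * 1000;
```

With this rule, a run that crashed without ever producing activity is flagged once 90 seconds have passed since it started, while a research task four minutes in whose session was active ten seconds ago is left alone. A wall-clock timeout would have to pick one number for both cases.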
Full RFC document: `docs/concepts/standalone-scheduler.md`