#23683: docs: RFC for standalone task scheduler with implicit heartbeat

by amittell open 2026-02-22 15:52

docs size: XS

## Problem

OpenClaw's built-in cron stores jobs in a flat JSON file. No run history, no stale detection, no overlap control, no failure recovery. The heartbeat system fires agent turns on a timer but has no concept of job lifecycle: if something hangs, you find out when you notice the silence.

Specific gaps:

- **No run history.** Job fires, result disappears.
- **No stale detection.** Agent hangs mid-job? Only signal is silence until the next heartbeat.
- **No overlap control.** Long-running jobs double-dispatch.
- **No backoff.** Broken jobs fire every cycle, burning tokens.
- **No inter-agent messaging.** Isolated sessions can't coordinate.
- **Flat file storage.** No transactions, no queryability, no crash safety.

## Proposal

Standalone scheduler process that replaces both built-in cron and heartbeat. Runs as a separate service alongside the gateway and dispatches via the existing chat completions API. Zero gateway code changes required.

### What it does

| Capability | How |
|---|---|
| **Job storage** | SQLite (ACID, queryable, crash-safe) |
| **Run tracking** | Every execution logged: status, duration, agent response, error detail |
| **Implicit heartbeat** | Monitors session activity instead of flat timeout. A crashed 20s job is detected in ~90s. A legitimate 4min research task isn't killed at 5min. |
| **Overlap control** | Per-job policy: skip / allow / queue |
| **Failure backoff** | Exponential: 30s → 1min → 5min → 15min → 60min on consecutive failures |
| **Job chains** | Parent/child relationships with trigger-on-success, depth limits |
| **Inter-agent messaging** | Priority queue with threading, read receipts, TTL, delivery tracking |
| **CLI** | Full job/run/message management without touching the DB directly |

### Architecture

```
Dispatcher Loop (10s tick)
├── Gateway health check
├── Find due jobs (next_run_at <= now, enabled=1)
├── Dispatch
│   ├── main session → openclaw system event CLI
│   └── isolated → POST /v1/chat/completions
├── Stale run detection (implicit heartbeat)
├── Deliver pending inter-agent messages
└── Housekeeping (expire messages, prune old runs)
```

All state lives in a single SQLite database. Dispatcher is a Node.js process managed by launchd/systemd. Jobs are cron-scheduled or chain-triggered.

### Integration point

One config change: `gateway.http.endpoints.chatCompletions.enabled: true` (already exists, disabled by default).

Everything else uses public APIs: chat completions for dispatch, tool invocation for delivery, system events for main-session injection. No monkey-patching, no gateway forks, no internal API dependencies.

## Production status

Running in production for 3+ weeks:

- **91 unit tests passing** (v4, up from 47 at initial deploy)
- 4 active scheduled jobs (morning status, workspace audit, hourly backup, ad-hoc)
- Zero missed fires, zero undetected stale runs
- Survived gateway restarts, scheduler restarts, and network blips without data loss

v4 additions since initial deploy: retry logic with configurable max retries + delay, chain depth limits (prevent infinite recursion), chain cancellation (kill parent → kills children), delete-after-run for one-shot jobs.

## What I'm proposing

Two paths, not mutually exclusive:

1. **Docs/reference**: publish the RFC as reference architecture for anyone who needs scheduling beyond flat-file cron. The pattern (SQLite + chat completions API + implicit heartbeat) is reusable.
2. **Native integration**: port the core concepts into built-in cron: run history table, implicit heartbeat, overlap policies, exponential backoff. The inter-agent messaging could stay external or become a first-class primitive.

Either way, the implicit heartbeat model (infer liveness from session activity, not wall-clock timeout) is the key insight. It eliminates both false-positive kills on long tasks and delayed detection of actual crashes.

Full RFC document: `docs/concepts/standalone-scheduler.md`
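
For readers skimming the PR, here is a minimal TypeScript sketch of the two ideas that carry most of the proposal: the failure backoff schedule and the implicit heartbeat check. It is an illustration, not the scheduler's actual code. The names `backoffDelayMs`, `isStale`, and `RunActivity` are hypothetical, and the 90-second idle window is an assumption based on the "~90s" detection figure quoted above. Only the backoff steps come from the capability table.

```typescript
// Backoff steps quoted in the proposal: 30s, 1min, 5min, 15min, 60min.
const BACKOFF_STEPS_MS: readonly number[] = [
  30_000, 60_000, 300_000, 900_000, 3_600_000,
];

// Hypothetical helper: delay before the next attempt after N consecutive
// failures. Zero failures means no delay; the last step is the ceiling.
function backoffDelayMs(consecutiveFailures: number): number {
  if (consecutiveFailures <= 0) return 0;
  const index = Math.min(consecutiveFailures, BACKOFF_STEPS_MS.length) - 1;
  return BACKOFF_STEPS_MS[index];
}

// Hypothetical shape of what the dispatcher knows about a running job.
interface RunActivity {
  startedAtMs: number;
  // Timestamp of the most recent session activity, or null if none yet.
  lastActivityAtMs: number | null;
}

// Implicit heartbeat: a run is stale when its session has been silent for
// longer than the idle window, regardless of how long the run has lasted.
// The 90s default is an assumption for illustration.
function isStale(
  run: RunActivity,
  nowMs: number,
  idleLimitMs: number = 90_000,
): boolean {
  const lastSignalMs = run.lastActivityAtMs ?? run.startedAtMs;
  return nowMs - lastSignalMs > idleLimitMs;
}

// A job that went silent 20s in is flagged once the idle window passes.
const crashed: RunActivity = { startedAtMs: 0, lastActivityAtMs: 20_000 };
console.log(isStale(crashed, 120_000));

// A task still active after nearly 5 minutes is left alone.
const longTask: RunActivity = { startedAtMs: 0, lastActivityAtMs: 280_000 };
console.log(isStale(longTask, 300_000));
```

Keying staleness on the last activity timestamp rather than the start time is what lets a short idle window coexist with long-running jobs: the window bounds silence, not total runtime.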