
#8825: fix: prevent cron infinite retry loop with exponential backoff

by dbottme open 2026-02-04 13:03
stale
## Summary

- Fixes one-shot cron jobs (`schedule.kind="at"`) getting stuck in infinite retry loops when they fail
- Implements exponential backoff for failed jobs (30s → 60s → 2m → 4m → 8m... up to 1 hour max)
- Auto-disables one-shot jobs after 5 consecutive failures to prevent API rate limit cooldowns

## Problem

When a one-shot cron job failed, `computeJobNextRunAtMs()` returned the original scheduled time (now in the past), causing the job to immediately become due again. Each retry spawned a new isolated agent session and called the LLM API, rapidly triggering rate limit errors that froze the entire OpenClaw instance.

## Solution

- Added `consecutiveFailures` field to `CronJobState`
- Track failures in the `finish()` callback (reset on success, increment on error)
- Apply exponential backoff: `30s * 2^failures` (capped at 1 hour)
- Auto-disable after `MAX_CONSECUTIVE_FAILURES` (5) with a warning log

## Test plan

- [ ] Create a one-shot cron job that will fail
- [ ] Verify it retries with increasing delays instead of immediately
- [ ] Verify it auto-disables after 5 failures
- [ ] Verify successful jobs reset the failure counter

Fixes #8520

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR adds failure tracking for one-shot cron jobs (`schedule.kind="at"`) to prevent immediate re-runs when the scheduled `atMs` is in the past after a failure. It introduces `consecutiveFailures` in `CronJobState`, increments/resets it in `executeJob()`'s `finish()` callback, and uses that counter in `computeJobNextRunAtMs()` to apply exponential backoff (capped at 1h). It also disables one-shot jobs after `MAX_CONSECUTIVE_FAILURES` to avoid infinite retry loops that can spam isolated sessions and trigger API rate limits.

Primary concern: the backoff math currently makes the *first* retry 60s (not 30s), which doesn't match the documented sequence and likely isn't intended.

Files touched: `src/cron/service/jobs.ts` (backoff computation + next-run calculation), `src/cron/service/timer.ts` (failure counting + auto-disable), `src/cron/types.ts` (state field).

<h3>Confidence Score: 4/5</h3>

- This PR is likely safe to merge, with one correctness issue in the backoff calculation worth fixing before release.
- Changes are localized and address a real failure mode (one-shot jobs re-due immediately). The main functional risk is an off-by-one in the exponential backoff exponent causing longer-than-intended initial retry delay; the rest is straightforward state bookkeeping and gating.
- src/cron/service/jobs.ts (retry backoff math), src/cron/service/timer.ts (error/state semantics when disabling)

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
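To make the off-by-one noted in the review concrete, here is a minimal sketch of the backoff and failure bookkeeping. It is not the PR's actual code: `BASE_BACKOFF_MS`, `MAX_BACKOFF_MS`, `OneShotState`, `backoffDelayMs`, and `recordResult` are hypothetical names chosen for illustration, and only `consecutiveFailures` and `MAX_CONSECUTIVE_FAILURES` (5) come from the description. The sketch assumes the documented sequence (30s → 60s → 2m ...) is the intended one, so it uses an exponent of `failures - 1`.

```typescript
// Hypothetical constants: only MAX_CONSECUTIVE_FAILURES (5) is named in the PR.
const BASE_BACKOFF_MS = 30_000; // 30 seconds
const MAX_BACKOFF_MS = 60 * 60 * 1000; // 1 hour cap
const MAX_CONSECUTIVE_FAILURES = 5;

// Hypothetical, reduced stand-in for the relevant part of CronJobState.
interface OneShotState {
  consecutiveFailures: number;
  enabled: boolean;
}

// Delay before the next retry, given the number of failures so far.
// With exponent (failures - 1) the first retry waits 30s, matching the
// documented sequence. The formula as written in the PR, 30s * 2^failures,
// would instead yield 60s for the first retry.
function backoffDelayMs(consecutiveFailures: number): number {
  if (consecutiveFailures <= 0) return 0;
  const exponent = consecutiveFailures - 1;
  return Math.min(BASE_BACKOFF_MS * 2 ** exponent, MAX_BACKOFF_MS);
}

// Bookkeeping as described for the finish() callback: reset the counter on
// success, increment on error, and disable the job once the limit is reached.
function recordResult(state: OneShotState, ok: boolean): OneShotState {
  if (ok) {
    return { ...state, consecutiveFailures: 0 };
  }
  const failures = state.consecutiveFailures + 1;
  return {
    consecutiveFailures: failures,
    enabled: state.enabled && failures < MAX_CONSECUTIVE_FAILURES,
  };
}
```

Under these assumptions, `backoffDelayMs(1)` is 30000 ms and `backoffDelayMs(2)` is 60000 ms, and the cap takes effect from the eighth consecutive failure onward. Since one-shot jobs are disabled at the fifth failure, the cap would only matter for other job kinds if the same helper were reused there.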
