#16932: fix(cron): retry rename on EBUSY and fall back to copyFile on Windows
## Summary
- **Problem:** On Windows, `saveCronStore()` fails with `EBUSY: resource busy or locked` when multiple cron jobs complete near-simultaneously and each tries to atomically update `cron/jobs.json` via write-temp-then-rename.
- **Why it matters:** A transient EBUSY causes missed cron state updates (`lastRunAtMs`, `consecutiveErrors`), potentially leading to duplicate or skipped runs.
- **What changed:** Added `renameWithRetry()` that retries up to 3 times with exponential backoff (50/100/200ms) on EBUSY, and falls back to `copyFile` + `unlink` on EPERM/EEXIST (matching existing pattern in `config/io.ts`).
- **What did NOT change (scope boundary):** Only the cron store file persistence is affected. No changes to job scheduling, execution, or in-memory state.
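
The retry-and-fallback behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/cron/store.ts`: only `renameWithRetry` and `RENAME_MAX_RETRIES` are named in this PR, while `RENAME_BASE_DELAY_MS`, `FsOps`, `errorCode`, and the injectable `ops`/`wait` parameters are hypothetical. It also assumes that "up to 3 retries" means one initial attempt plus three retried attempts, and that the `unlink` after `copyFile` is best-effort.

```typescript
import { promises as fs } from "node:fs";

// `RENAME_MAX_RETRIES` is named in the PR; the base delay constant is an assumption.
const RENAME_MAX_RETRIES = 3;
const RENAME_BASE_DELAY_MS = 50;

// Hypothetical seam so the fs calls can be replaced in tests.
type FsOps = {
  rename(src: string, dest: string): Promise<void>;
  copyFile(src: string, dest: string): Promise<void>;
  unlink(path: string): Promise<void>;
};

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Read a Node-style `code` property off an unknown thrown value.
function errorCode(err: unknown): string | undefined {
  return typeof err === "object" && err !== null && "code" in err
    ? String((err as { code: unknown }).code)
    : undefined;
}

async function renameWithRetry(
  src: string,
  dest: string,
  ops: FsOps = fs,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await ops.rename(src, dest);
      return;
    } catch (err) {
      const code = errorCode(err);
      if (code === "EBUSY" && attempt < RENAME_MAX_RETRIES) {
        // Exponential backoff: 50, 100, then 200 ms.
        await wait(RENAME_BASE_DELAY_MS * 2 ** attempt);
        continue;
      }
      if (code === "EPERM" || code === "EEXIST") {
        // Fallback path: copy over the destination, then drop the temp file.
        await ops.copyFile(src, dest);
        await ops.unlink(src).catch(() => undefined); // assumed best-effort
        return;
      }
      // Retries exhausted or an unrelated error: surface it to the caller.
      throw err;
    }
  }
}
```

Injecting `ops` and `wait` is purely a convenience for this sketch: it lets a test drive the EBUSY and EPERM paths without touching the disk or waiting on real timers.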
## Change Type (select all)
- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra
## Scope (select all touched areas)
- [ ] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [x] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra
## Linked Issue/PR
- Closes #16842
## User-visible / Behavior Changes
None — transient EBUSY errors that previously caused state persistence failures are now transparently retried.
## Security Impact (required)
- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
## Repro + Verification
### Environment
- OS: Windows 10 (NTFS)
- Runtime: Node.js v22+
- 5+ cron jobs with overlapping schedules
### Steps
1. Configure 5+ cron jobs with overlapping schedules (e.g. `every: 90000`)
2. Wait for multiple jobs to complete within the same second
3. Before fix: `EBUSY: resource busy or locked` errors in gateway log
4. After fix: Rename retries transparently, no EBUSY propagation
### Expected
- Cron state persists reliably even under concurrent access
### Actual
- Before: EBUSY causes state persistence failure
- After: Rename succeeds on retry, within the ~350ms total backoff window
## Evidence
- [x] Failing test/log before + passing after
Three new unit tests added in `store.test.ts`:
- `persists and round-trips a store file` — happy path
- `retries rename on EBUSY then succeeds` — mocks 2x EBUSY then succeeds on 3rd attempt
- `falls back to copyFile on EPERM (Windows)` — verifies Windows fallback path
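
The "2x EBUSY then success" setup can be illustrated without any particular test framework. The sketch below is an assumption about how such a mock might look, not the code in `store.test.ts`; `errnoError`, `flakyRename`, and `demo` are hypothetical names.

```typescript
// Build an Error carrying a Node-style `code`, as fs rejections do.
function errnoError(code: string): Error & { code: string } {
  return Object.assign(new Error(`${code}: simulated failure`), { code });
}

// Hypothetical stand-in for fs.rename: rejects `failures` times, then resolves.
function flakyRename(failures: number, code: string) {
  let calls = 0;
  const rename = async (_src: string, _dest: string): Promise<void> => {
    calls += 1;
    if (calls <= failures) throw errnoError(code);
  };
  return { rename, callCount: () => calls };
}

// Drive the mock the way a retry loop would: two EBUSY rejections, then success.
async function demo(): Promise<number> {
  const mock = flakyRename(2, "EBUSY");
  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      await mock.rename("jobs.json.tmp", "jobs.json");
      break;
    } catch (err) {
      if ((err as { code?: string }).code !== "EBUSY") throw err;
    }
  }
  return mock.callCount(); // number of rename calls made
}
```

Counting calls on the mock is what lets a test assert that success came on the third attempt rather than the first.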
## Human Verification (required)
- Verified scenarios: All 6 store tests pass (3 existing + 3 new)
- Edge cases checked: EBUSY retry exhaustion (re-throws), EPERM/EEXIST fallback, clean temp file removal
- What I did **not** verify: Actual Windows NTFS behavior (tested via mocked fs.rename)
## Compatibility / Migration
- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`
## Failure Recovery (if this breaks)
- How to disable/revert this change quickly: Revert single commit
- Files/config to restore: `src/cron/store.ts`
- Known bad symptoms: If retry delays (50-200ms) cause issues under extreme load, reduce `RENAME_MAX_RETRIES`
## Risks and Mitigations
- Risk: Retry delays (up to ~350ms total) could briefly block cron operations under contention
- Mitigation: Delays are short (50/100/200ms) and are only triggered by EBUSY, which is already a blocking error
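
The ~350ms worst case follows directly from the backoff schedule. A small sketch of the arithmetic, where `backoffDelays` is a hypothetical helper and the 50ms base with 3 retries is taken from the figures above:

```typescript
// Hypothetical helper: delay before each retry, doubling from `baseMs`.
function backoffDelays(baseMs: number, retries: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

const delays = backoffDelays(50, 3); // [50, 100, 200]
const totalMs = delays.reduce((sum, ms) => sum + ms, 0); // 350
console.log(delays, totalMs);
```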
---
🤖 Generated with [Claude Code](https://claude.com/claude-code)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds `renameWithRetry()` to `saveCronStore()` to handle transient `EBUSY` errors on Windows when multiple cron jobs complete near-simultaneously and contend on `cron/jobs.json`. The function retries up to 3 times with exponential backoff (50/100/200ms) on `EBUSY`, and falls back to `copyFile` + `unlink` on `EPERM`/`EEXIST`, matching the existing pattern in `src/config/io.ts`.
- The retry logic is well-scoped — only the file persistence path is affected, with no changes to job scheduling or execution
- Three new unit tests cover the happy path, EBUSY retry, and EPERM fallback
- The EBUSY exhaustion case (all retries fail) correctly re-throws via the `throw err` at the end of the catch block
- Unlike `config/io.ts`, which cleans up the temp file on non-recoverable errors (lines 1026-1028), `renameWithRetry` does not clean up the temp file when it re-throws. This is minor since temp file names are unique, but could be improved for consistency
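
One way the cleanup gap noted above could be closed is to remove the temp file before re-throwing. The sketch below is an assumption for illustration, neither the code in `config/io.ts` nor part of this PR; `moveOrCleanUp` and its parameters are hypothetical.

```typescript
import { promises as fs } from "node:fs";

// Hypothetical wrapper: run `move`, and if it fails for good,
// delete the temp file before re-throwing the original error.
async function moveOrCleanUp(
  tmpPath: string,
  move: (tmpPath: string) => Promise<void>,
  unlink: (path: string) => Promise<void> = (p) => fs.unlink(p),
): Promise<void> {
  try {
    await move(tmpPath);
  } catch (err) {
    // Best-effort cleanup; a failed unlink must not mask the original error.
    await unlink(tmpPath).catch(() => undefined);
    throw err;
  }
}
```

Here `move` would be whatever performs the rename with retries, so the cleanup only runs once retries and fallbacks have been exhausted.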
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — it adds a well-scoped retry mechanism that only affects file persistence with no behavioral changes to cron scheduling or execution.
- Score of 4 reflects a clean, focused bug fix with correct retry logic, good test coverage for the primary paths, and consistency with existing patterns in the codebase. Deducted 1 point for the minor temp file cleanup gap and missing exhaustion test case.
- No files require special attention. The changes are isolated to `src/cron/store.ts` with matching tests.
<sub>Last reviewed commit: 8ad5a0e</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->