#20329: Fix cron.run WS blocking and harden delivery recovery
Labels: gateway · size: M
Cluster: Cron Job Stability Fixes
## Summary
This PR fixes a shared reliability issue across cron manual runs and delivery recovery: both paths could block on long/unresolvable async work and never make forward progress under certain failure conditions.
## Shared Root Cause
Both bugs stemmed from recovery/dispatch paths that awaited work synchronously, with no bounded or failsafe path to forward progress:
- `cron.run` WS handler awaited full job execution before sending a response, so long runs hit gateway client timeout (#20288).
- delivery recovery could keep deferring the same entries forever when backoff exceeded the startup recovery budget, and malformed/invalid entries were not proactively quarantined (#20037).
- this also caused `delivery.mode=announce` effects to look silent from cron runs (#20300) because announce-related queue entries could get stuck in recovery loops.
## What Changed
### 1) `cron.run` now acks immediately
- File: `src/gateway/server-methods/cron.ts`
- `cron.run` now sends an immediate WS ack payload (`accepted: true`) and runs the job in the background.
- background failures are logged through the gateway logger instead of blocking the WS response path.
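The immediate-ack pattern can be sketched as below. This is a minimal illustration, not the real gateway handler: the names `handleCronRun`, `respond`, `runJob`, and `log` are assumptions standing in for the actual WS method plumbing in `src/gateway/server-methods/cron.ts`.

```typescript
// Hypothetical sketch of the ack-then-run pattern described above.
// All names here are illustrative, not the real gateway API.
type Ack = { ok: true; accepted: true; jobId: string; mode: string };

function handleCronRun(
  jobId: string,
  mode: string,
  runJob: (jobId: string) => Promise<void>,
  respond: (ack: Ack) => void,
  log: (msg: string) => void,
): void {
  // Ack first, so the WS client never waits on job execution.
  respond({ ok: true, accepted: true, jobId, mode });
  // Fire-and-forget: failures are logged, never surfaced on the response path.
  void runJob(jobId).catch((err) =>
    log(`cron job ${jobId} failed: ${String(err)}`),
  );
}
```

The key property is that `respond` is called synchronously, so the ack latency is independent of how long the job runs.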
### 2) Delivery recovery no longer stalls indefinitely
- File: `src/infra/outbound/delivery-queue.ts`
- Hardened recovery with:
- malformed entry normalization/quarantine to `failed/`
- per-entry TTL expiry (`entryTtlMs`, default 24h)
- per-attempt timeout (`deliveryTimeoutMs`, default 15s)
- non-sticky backoff handling when backoff exceeds remaining recovery budget: record a failed attempt and increment retry count so entries eventually reach max retries instead of deferring forever
- immediate move to `failed/` once retries exceed `MAX_RETRIES`
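The recovery rules above can be sketched as a single decision function. This is a hedged sketch under assumptions: `QueuedEntry`, `decideRecovery`, and the `MAX_RETRIES` value are illustrative, not the real types or constants in `delivery-queue.ts`.

```typescript
// Illustrative recovery decision combining TTL expiry, the retry cap,
// and non-sticky backoff handling. Names and the retry cap are assumed.
const MAX_RETRIES = 5; // assumed cap; the real value lives in delivery-queue.ts
const ENTRY_TTL_MS = 24 * 60 * 60 * 1000; // 24h default per this PR

type QueuedEntry = { enqueuedAtMs: number; retries: number; backoffMs: number };
type RecoveryAction = "deliver" | "retry-later" | "fail";

function decideRecovery(
  entry: QueuedEntry,
  nowMs: number,
  budgetRemainingMs: number,
): RecoveryAction {
  // TTL expiry: entries older than entryTtlMs move to failed/.
  if (nowMs - entry.enqueuedAtMs > ENTRY_TTL_MS) return "fail";
  // Retry cap: once retries reach the max, move to failed/ immediately.
  if (entry.retries >= MAX_RETRIES) return "fail";
  // Non-sticky backoff: if the backoff outlasts the remaining recovery
  // budget, record a failed attempt instead of deferring forever, so the
  // entry still converges on the retry cap.
  if (entry.backoffMs > budgetRemainingMs) {
    entry.retries += 1;
    return entry.retries >= MAX_RETRIES ? "fail" : "retry-later";
  }
  return "deliver";
}
```

The design point is that every branch makes monotonic progress: an entry either delivers, burns a retry, or is quarantined, so no state can loop indefinitely.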
## Tests
- `src/gateway/server.cron.e2e.test.ts`
- added: `cron.run` returns a timely ack even while execution is still running
- `src/infra/outbound/outbound.test.ts`
- updated/added recovery tests for:
- backoff-over-budget increments failed attempts
- deferred retries eventually move to `failed/` at max retries
- TTL expiry handling
- malformed queue file quarantine
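The per-attempt timeout (`deliveryTimeoutMs`) mentioned earlier could be implemented with a wrapper in the spirit of the `runWithTimeout` helper the review refers to. The name and signature here are assumptions; this is a generic sketch, not the actual helper.

```typescript
// Hedged sketch of a per-attempt timeout wrapper: rejects if the wrapped
// promise does not settle within timeoutMs. Names are illustrative.
function runWithTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`delivery attempt timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```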
## Validation
- `pnpm exec vitest src/infra/outbound/outbound.test.ts`
- `pnpm exec vitest -c vitest.e2e.config.ts src/gateway/server.cron.e2e.test.ts`
Fixes #20037
Fixes #20288
Fixes #20300
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes two related reliability issues: (1) the `cron.run` WS handler previously awaited full job execution before responding, causing client timeouts on long-running jobs — it now sends an immediate ack and runs the job in the background; (2) delivery recovery could stall indefinitely when backoff exceeded the recovery budget or when malformed queue entries caused infinite retry loops.
Key changes:
- **`cron.run` immediate ack**: Handler changed from `async` to synchronous, responding with `{ ok, accepted, jobId, mode }` immediately and executing the job via fire-and-forget with proper error logging.
- **Delivery recovery hardening**: Added entry normalization/quarantine for malformed files, per-entry TTL expiry (default 24h), per-attempt timeout (default 15s via `runWithTimeout`), and non-sticky backoff handling that increments retry count when backoff exceeds budget instead of deferring forever.
- **Test coverage**: New e2e test verifies `cron.run` acks within 1.5s even with a slow-running job. New unit tests cover backoff-over-budget retry increment, deferred-to-failed transition at max retries, TTL expiry, and malformed file quarantine.
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — the changes are well-scoped, properly tested, and address real reliability issues with sensible defaults.
- Score of 4 reflects solid implementation of both fixes with good test coverage. The cron.run change is clean and correct. The delivery recovery hardening is thorough with appropriate defaults. Minor concern: the `failed` and `skipped` counters can double-count the same entry (an entry increments `failed` for the retry attempt and `skipped` when moved to the failed directory), which is intentional per the tests but makes the returned stats sum to more than the number of entries processed. Additionally, `normalizeQueuedDeliveryForRecovery` casts `payloads` as `ReplyPayload[]` without validating individual array elements, relying on downstream handling for malformed payloads.
- `src/infra/outbound/delivery-queue.ts` deserves the most attention given the complexity of the new recovery logic and counter semantics.
<sub>Last reviewed commit: 7150c4b</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
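On the reviewer's note that `normalizeQueuedDeliveryForRecovery` casts `payloads` without validating individual elements, element-level validation could look like the sketch below. `ReplyPayload`'s shape, `isReplyPayload`, and `normalizePayloads` are illustrative stand-ins, not the real types in `delivery-queue.ts`.

```typescript
// Hedged sketch: filter with a type guard instead of casting the whole
// array, so malformed elements are dropped rather than passed downstream.
type ReplyPayload = { kind: string; text: string }; // assumed shape

function isReplyPayload(value: unknown): value is ReplyPayload {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as ReplyPayload).kind === "string" &&
    typeof (value as ReplyPayload).text === "string"
  );
}

function normalizePayloads(raw: unknown): ReplyPayload[] {
  return Array.isArray(raw) ? raw.filter(isReplyPayload) : [];
}
```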
## Most Similar PRs

| PR | Author | Date | Similarity |
| --- | --- | --- | --- |
| #17064: fix(cron): prevent control-plane starvation during startup catch-up... | donggyu9208 | 2026-02-15 | 81.4% |
| #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... | taw0002 | 2026-02-16 | 80.8% |
| #13065: fix(cron): Fix "every" schedule not re-arming after gateway restart | trevorgordon981 | 2026-02-10 | 79.4% |
| #18743: Cron Tool Hardening: Normalize Gateway Params and Enforce Valid Sch... | cccat6 | 2026-02-17 | 79.2% |
| #22385: fix: improve delivery recovery logging with entry age and deferral ... | derrickburns | 2026-02-21 | 78.8% |
| #8698: fix(cron): default enabled to true for new jobs | emmick4 | 2026-02-04 | 78.7% |
| #21014: fix(cron): suppress main-session summary for HEARTBEAT_OK responses | nickjlamb | 2026-02-19 | 78.6% |
| #5179: fix(cron): recover stale running markers | thatdaveb | 2026-01-31 | 78.5% |
| #9060: Fix: Preserve scheduled cron jobs after gateway restart | vishaltandale00 | 2026-02-04 | 78.4% |
| #13055: fix: prevent cron RPC stalls with timeout and caching (#13018) | trevorgordon981 | 2026-02-10 | 78.3% |