#20329: Fix cron.run WS blocking and harden delivery recovery
Labels: gateway · size: M
Cluster: Cron Job Stability Fixes
## Summary
This PR fixes a shared reliability issue across cron manual runs and delivery recovery: both paths could block on long/unresolvable async work and never make forward progress under certain failure conditions.
## Shared Root Cause
Both bugs stemmed from recovery/dispatch paths that awaited work synchronously, with no bounded or failsafe path to forward progress:
- `cron.run` WS handler awaited full job execution before sending a response, so long runs hit gateway client timeout (#20288).
- delivery recovery could keep deferring the same entries forever when backoff exceeded the startup recovery budget, and malformed/invalid entries were not proactively quarantined (#20037).
- this also caused `delivery.mode=announce` effects to look silent from cron runs (#20300) because announce-related queue entries could get stuck in recovery loops.
## What Changed
### 1) `cron.run` now acks immediately
- File: `src/gateway/server-methods/cron.ts`
- `cron.run` now sends an immediate WS ack payload (`accepted: true`) and runs the job in the background.
- background failures are logged through the gateway logger instead of blocking the WS response path.
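The immediate-ack pattern can be sketched as below. This is a minimal illustration, not the real gateway handler: the names `handleCronRun`, `respond`, `runJob`, and `log` are assumptions standing in for the actual WS method plumbing in `src/gateway/server-methods/cron.ts`.

```typescript
// Hypothetical sketch of the ack-then-run pattern described above.
// All names here are illustrative, not the real gateway API.
type Ack = { ok: true; accepted: true; jobId: string; mode: string };

function handleCronRun(
  jobId: string,
  mode: string,
  runJob: (jobId: string) => Promise<void>,
  respond: (ack: Ack) => void,
  log: (msg: string) => void,
): void {
  // Ack first, so the WS client never waits on job execution.
  respond({ ok: true, accepted: true, jobId, mode });
  // Fire-and-forget: failures are logged, never surfaced on the response path.
  void runJob(jobId).catch((err) =>
    log(`cron job ${jobId} failed: ${String(err)}`),
  );
}
```

The key property is that `respond` is called synchronously, so the ack latency is independent of how long the job runs.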
### 2) Delivery recovery no longer stalls indefinitely
- File: `src/infra/outbound/delivery-queue.ts`
- Hardened recovery with:
- malformed entry normalization/quarantine to `failed/`
- per-entry TTL expiry (`entryTtlMs`, default 24h)
- per-attempt timeout (`deliveryTimeoutMs`, default 15s)
- non-sticky backoff handling when backoff exceeds remaining recovery budget: record a failed attempt and increment retry count so entries eventually reach max retries instead of deferring forever
- immediate move to `failed/` once retries exceed `MAX_RETRIES`
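The recovery rules above can be sketched as a single decision function. This is a hedged sketch under assumptions: `QueuedEntry`, `decideRecovery`, and the `MAX_RETRIES` value are illustrative, not the real types or constants in `delivery-queue.ts`.

```typescript
// Illustrative recovery decision combining TTL expiry, the retry cap,
// and non-sticky backoff handling. Names and the retry cap are assumed.
const MAX_RETRIES = 5; // assumed cap; the real value lives in delivery-queue.ts
const ENTRY_TTL_MS = 24 * 60 * 60 * 1000; // 24h default per this PR

type QueuedEntry = { enqueuedAtMs: number; retries: number; backoffMs: number };
type RecoveryAction = "deliver" | "retry-later" | "fail";

function decideRecovery(
  entry: QueuedEntry,
  nowMs: number,
  budgetRemainingMs: number,
): RecoveryAction {
  // TTL expiry: entries older than entryTtlMs move to failed/.
  if (nowMs - entry.enqueuedAtMs > ENTRY_TTL_MS) return "fail";
  // Retry cap: once retries reach the max, move to failed/ immediately.
  if (entry.retries >= MAX_RETRIES) return "fail";
  // Non-sticky backoff: if the backoff outlasts the remaining recovery
  // budget, record a failed attempt instead of deferring forever, so the
  // entry still converges on the retry cap.
  if (entry.backoffMs > budgetRemainingMs) {
    entry.retries += 1;
    return entry.retries >= MAX_RETRIES ? "fail" : "retry-later";
  }
  return "deliver";
}
```

The design point is that every branch makes monotonic progress: an entry either delivers, burns a retry, or is quarantined, so no state can loop indefinitely.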
## Tests
- `src/gateway/server.cron.e2e.test.ts`
- added: `cron.run` returns a timely ack even while execution is still running
- `src/infra/outbound/outbound.test.ts`
- updated/added recovery tests for:
- backoff-over-budget increments failed attempts
- deferred retries eventually move to `failed/` at max retries
- TTL expiry handling
- malformed queue file quarantine
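The per-attempt timeout (`deliveryTimeoutMs`) mentioned earlier could be implemented with a wrapper in the spirit of the `runWithTimeout` helper the review refers to. The name and signature here are assumptions; this is a generic sketch, not the actual helper.

```typescript
// Hedged sketch of a per-attempt timeout wrapper: rejects if the wrapped
// promise does not settle within timeoutMs. Names are illustrative.
function runWithTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`delivery attempt timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}
```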
## Validation
- `pnpm exec vitest src/infra/outbound/outbound.test.ts`
- `pnpm exec vitest -c vitest.e2e.config.ts src/gateway/server.cron.e2e.test.ts`
Fixes #20037
Fixes #20288
Fixes #20300
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes two related reliability issues: (1) the `cron.run` WS handler previously awaited full job execution before responding, causing client timeouts on long-running jobs — it now sends an immediate ack and runs the job in the background; (2) delivery recovery could stall indefinitely when backoff exceeded the recovery budget or when malformed queue entries caused infinite retry loops.
Key changes:
- **`cron.run` immediate ack**: Handler changed from `async` to synchronous, responding with `{ ok, accepted, jobId, mode }` immediately and executing the job via fire-and-forget with proper error logging.
- **Delivery recovery hardening**: Added entry normalization/quarantine for malformed files, per-entry TTL expiry (default 24h), per-attempt timeout (default 15s via `runWithTimeout`), and non-sticky backoff handling that increments retry count when backoff exceeds budget instead of deferring forever.
- **Test coverage**: New e2e test verifies `cron.run` acks within 1.5s even with a slow-running job. New unit tests cover backoff-over-budget retry increment, deferred-to-failed transition at max retries, TTL expiry, and malformed file quarantine.
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — the changes are well-scoped, properly tested, and address real reliability issues with sensible defaults.
- Score of 4 reflects solid implementation of both fixes with good test coverage. The cron.run change is clean and correct. The delivery recovery hardening is thorough with appropriate defaults. Minor concern: the `failed` and `skipped` counters can double-count the same entry (an entry increments `failed` for the retry attempt and `skipped` when moved to the failed directory), which is intentional per the tests but makes the returned stats sum to more than the number of entries processed. Additionally, `normalizeQueuedDeliveryForRecovery` casts `payloads` as `ReplyPayload[]` without validating individual array elements, relying on downstream handling for malformed payloads.
- `src/infra/outbound/delivery-queue.ts` deserves the most attention given the complexity of the new recovery logic and counter semantics.
<sub>Last reviewed commit: 7150c4b</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
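On the reviewer's note that `normalizeQueuedDeliveryForRecovery` casts `payloads` without validating individual elements, element-level validation could look like the sketch below. `ReplyPayload`'s shape, `isReplyPayload`, and `normalizePayloads` are illustrative stand-ins, not the real types in `delivery-queue.ts`.

```typescript
// Hedged sketch: filter with a type guard instead of casting the whole
// array, so malformed elements are dropped rather than passed downstream.
type ReplyPayload = { kind: string; text: string }; // assumed shape

function isReplyPayload(value: unknown): value is ReplyPayload {
  return (
    typeof value === "object" &&
    value !== null &&
    typeof (value as ReplyPayload).kind === "string" &&
    typeof (value as ReplyPayload).text === "string"
  );
}

function normalizePayloads(raw: unknown): ReplyPayload[] {
  return Array.isArray(raw) ? raw.filter(isReplyPayload) : [];
}
```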
## Most Similar PRs

| PR | Author | Date | Similarity |
| --- | --- | --- | --- |
| #17064: fix(cron): prevent control-plane starvation during startup catch-up... | donggyu9208 | 2026-02-15 | 81.4% |
| #18144: fix(cron): clear stuck runningAtMs after timeout and add maintenanc... | taw0002 | 2026-02-16 | 80.8% |
| #13065: fix(cron): Fix "every" schedule not re-arming after gateway restart | trevorgordon981 | 2026-02-10 | 79.4% |
| #18743: Cron Tool Hardening: Normalize Gateway Params and Enforce Valid Sch... | cccat6 | 2026-02-17 | 79.2% |
| #22385: fix: improve delivery recovery logging with entry age and deferral ... | derrickburns | 2026-02-21 | 78.8% |
| #8698: fix(cron): default enabled to true for new jobs | emmick4 | 2026-02-04 | 78.7% |
| #21014: fix(cron): suppress main-session summary for HEARTBEAT_OK responses | nickjlamb | 2026-02-19 | 78.6% |
| #5179: fix(cron): recover stale running markers | thatdaveb | 2026-01-31 | 78.5% |
| #9060: Fix: Preserve scheduled cron jobs after gateway restart | vishaltandale00 | 2026-02-04 | 78.4% |
| #13055: fix: prevent cron RPC stalls with timeout and caching (#13018) | trevorgordon981 | 2026-02-10 | 78.3% |