#11522: Fix #10904: Add hard timeout to lane tasks to prevent cron wedging

by divol89 open 2026-02-07 23:09

Labels: channel: signal, channel: telegram, app: web-ui, gateway, cli, agents, stale

## Problem

The cron scheduler lane wedges when a task hangs indefinitely. The `state.active` counter never decrements, blocking all subsequent jobs.

## Root Cause

Lane tasks execute without any timeout. If a cron job (e.g., isolated agent turn) gets stuck waiting for model response, exec completion, or network I/O, the lane remains "active" forever.

## Fix

Add a 5-minute hard timeout via `Promise.race` to ensure wedged tasks fail with an error instead of blocking the lane forever.

## Changes

- Added `TASK_TIMEOUT_MS = 300_000` constant (5 minutes)
- Wrapped `entry.task()` in `Promise.race` with timeout
- Tasks that exceed the timeout throw and decrement `state.active`

Fixes #10904

Wallet: BYCgQQpJT1odaunfvk6gtm5hVd7Xu93vYwbumFfqgHb3

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR makes cron scheduling and related subsystems more robust by (1) adding a hard timeout around lane task execution to prevent the cron lane from wedging permanently, and (2) tightening/expanding a few configuration and delivery behaviors (cron delivery fields, optional provider baseUrl defaults, per-agent heartbeat model resolution, and some UI markdown performance limits).

It also adjusts cron store/timer loading so the timer tick uses persisted `nextRunAtMs` for determining due jobs, then recomputes next runs after executing due jobs, and includes small fixes in Signal/Telegram/TTS/gateway plumbing.

Overall direction is sound, but there are a couple of correctness issues that can affect runtime behavior (timer leak in the new lane timeout wrapper; and edit message deduplication producing `"undefined"` IDs).

<h3>Confidence Score: 3/5</h3>

- This PR is close to safe to merge but has a couple of concrete runtime issues to address first.
- Most changes are straightforward and align with the stated goal, but the new lane timeout wrapper introduces an uncleared `setTimeout` per task (resource leak) and the Signal edit deduplication can emit a literal "undefined" messageId, which can break downstream dedupe. Fixing these should materially reduce risk.
- src/process/command-queue.ts, src/signal/monitor/event-handler.ts
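As a rough illustration of the timeout wrapper and of the timer leak the review flags, the sketch below races a task against a timer and clears that timer in a `finally` block. It is not the PR's actual code: `runWithTimeout`, `runLaneTask`, and the module-level `state` object are hypothetical names, and only `TASK_TIMEOUT_MS`, `Promise.race`, and `state.active` come from the description above.

```typescript
// Mirrors the constant named in the PR description (5 minutes).
const TASK_TIMEOUT_MS = 300_000;

// Hypothetical helper: race a task against a timer and always clear the timer,
// so a task that finishes early does not leave a pending setTimeout behind.
async function runWithTimeout<T>(
  task: () => Promise<T>,
  timeoutMs: number = TASK_TIMEOUT_MS,
): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`lane task timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([task(), timeout]);
  } finally {
    // Runs on success, task failure, and timeout alike.
    if (timer !== undefined) clearTimeout(timer);
  }
}

// Hypothetical stand-in for the lane state; the real structure is not shown in the PR text.
const state = { active: 0 };

// Hypothetical lane runner: the counter is decremented even when the task throws or times out.
async function runLaneTask<T>(
  task: () => Promise<T>,
  timeoutMs?: number,
): Promise<T> {
  state.active += 1;
  try {
    return await runWithTimeout(task, timeoutMs);
  } finally {
    state.active -= 1;
  }
}
```

Note that `Promise.race` only stops waiting for the hung task; it does not cancel the underlying work, which would need a separate mechanism such as an abort signal.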
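The literal "undefined" messageId mentioned in the review is the kind of value that typically appears when an optional field is interpolated into a string key. The sketch below shows that failure mode and one possible guard. The `EditEvent` shape, its field names, and both functions are assumptions made for illustration; the actual code in `src/signal/monitor/event-handler.ts` is not shown in this PR text.

```typescript
// Hypothetical event shape; the real Signal payload fields are not shown in the PR text.
interface EditEvent {
  sender: string;
  targetTimestamp?: number;
}

// Naive key: interpolating a missing field produces the literal string "undefined".
function naiveDedupeKey(ev: EditEvent): string {
  return `${ev.sender}:${ev.targetTimestamp}`;
}

// Guarded key: return null when the ID is missing so the caller can skip deduplication
// instead of collapsing unrelated events onto one shared "undefined" key.
function safeDedupeKey(ev: EditEvent): string | null {
  if (ev.targetTimestamp === undefined) return null;
  return `${ev.sender}:${ev.targetTimestamp}`;
}
```

With the naive version, every event from the same sender that lacks a timestamp maps to the same key, so a downstream dedupe cache could drop all but the first of them.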