
#8418: fix: notify user after consecutive heartbeat/cron failures

by liaosvcaf open 2026-02-04 01:08
stale
## Summary

Fixes #8414.

When heartbeat or cron runs fail repeatedly (e.g., due to provider billing issues, rate limits, or model errors), failures are logged to `gateway.err.log` but the user receives **zero notification**. This can persist for hours or days silently, defeating the purpose of heartbeat monitoring.

## Changes

### Core logic (`src/cron/service/timer.ts`)

- Track `consecutiveFailures` count in `CronJobState`
- After **3 consecutive failures**, send an alert via `enqueueSystemEvent` and trigger `requestHeartbeatNow` to deliver it
- **Throttle** subsequent notifications to once per hour to avoid spam
- Reset failure counter and notification timestamp on success

### Heartbeat runner (`src/infra/heartbeat-runner.ts`)

- Same pattern for heartbeat-specific runs: track failures in `SessionEntry`, notify via `deliverOutboundPayloads` after threshold
- Reset on success

### Types

- Added `consecutiveFailures` and `lastFailureNotificationAtMs` to `CronJobState` (`src/cron/types.ts`)
- Added same fields to `SessionEntry` (`src/config/sessions/types.ts`)

## Tests

New test file: `src/cron/service.failure-notifications.test.ts` (2 tests)

1. **Threshold + throttle**: Verifies no alert on failures 1-2, alert fires on failure 3, subsequent alerts are throttled for 1 hour, then fire again after throttle expires
2. **Reset on success**: Verifies counter and notification timestamp reset to zero after a successful run

All 54 existing cron tests continue to pass (11 test files).

## User-visible behavior

After 3 consecutive heartbeat/cron failures, the user sees:

```
Alert: Cron job "my-job" failed 3 times in a row. Last error: FailoverError: 402 ...
```

Subsequent alerts are throttled to once per hour. Counter resets on any successful run.

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR adds user-facing alerting for repeated cron/heartbeat failures.

For cron jobs (`src/cron/service/timer.ts`), it tracks `consecutiveFailures` and emits a system event after 3 consecutive errors, throttling further alerts to once per hour and resetting counters on success. For heartbeat runs (`src/infra/heartbeat-runner.ts`), it implements a similar consecutive-failure counter in `SessionEntry` and delivers an outbound "Alert:" payload after the threshold, with the same hourly throttle and reset-on-success behavior. New tests cover the cron threshold/throttle and reset logic.

<h3>Confidence Score: 3/5</h3>

- Generally safe to merge, but heartbeat failure alerting may misbehave under concurrent runs.
- Core cron logic is straightforward and tested, but the heartbeat-runner stores failure counts/notification timestamps using snapshot reads plus multiple independent writes, which can race and produce duplicate alerts or inaccurate failure counts when overlapping heartbeats occur.
- src/infra/heartbeat-runner.ts

<!-- /greptile_comment -->
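For readers who want the threshold, throttle, and reset rules in one place, here is a minimal sketch of the pattern the description outlines. It is not the code from this PR: only the field names `consecutiveFailures` and `lastFailureNotificationAtMs`, the threshold of 3, and the one-hour throttle come from the description. `FailureState`, `RunResult`, `recordRun`, and the constants are hypothetical names chosen for illustration, and the sketch returns the alert text rather than calling `enqueueSystemEvent` or `deliverOutboundPayloads`, whose signatures are not shown here.

```typescript
// Hypothetical stand-in for the two fields the PR adds to CronJobState / SessionEntry.
interface FailureState {
  consecutiveFailures?: number;
  lastFailureNotificationAtMs?: number;
}

// Hypothetical result shape; the real runner's result type is not shown in the PR text.
type RunResult = { ok: true } | { ok: false; error: string };

const FAILURE_THRESHOLD = 3; // alert after 3 consecutive failures
const THROTTLE_MS = 60 * 60 * 1000; // at most one alert per hour

// Updates `state` in place and returns the alert text when one should be sent,
// or undefined when no alert is due.
function recordRun(
  state: FailureState,
  jobName: string,
  result: RunResult,
  nowMs: number,
): string | undefined {
  if (result.ok) {
    // Any success resets both the counter and the notification timestamp to zero.
    state.consecutiveFailures = 0;
    state.lastFailureNotificationAtMs = 0;
    return undefined;
  }

  const failures = (state.consecutiveFailures ?? 0) + 1;
  state.consecutiveFailures = failures;
  if (failures < FAILURE_THRESHOLD) {
    return undefined;
  }

  // A timestamp of 0 (or a missing one) is treated as "never notified".
  const last = state.lastFailureNotificationAtMs ?? 0;
  const throttled = last > 0 && nowMs - last < THROTTLE_MS;
  if (throttled) {
    return undefined;
  }

  state.lastFailureNotificationAtMs = nowMs;
  return `Alert: Cron job "${jobName}" failed ${failures} times in a row. Last error: ${result.error}`;
}
```

Doing the read, increment, and timestamp update in one synchronous function is one way to avoid the snapshot-read-then-multiple-writes pattern the review flags for the heartbeat runner. That only helps if the state object is not shared across overlapping asynchronous runs or persisted through separate store writes, which this sketch does not model.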
