#8418: fix: notify user after consecutive heartbeat/cron failures
stale
Cluster: Heartbeat Functionality Improvements
## Summary
Fixes #8414. When heartbeat or cron runs fail repeatedly (e.g., due to provider billing issues, rate limits, or model errors), the failures are logged to `gateway.err.log` but the user receives **no notification**. The failures can continue silently for hours or days, which defeats the purpose of heartbeat monitoring.
## Changes
### Core logic (`src/cron/service/timer.ts`)
- Track `consecutiveFailures` count in `CronJobState`
- After **3 consecutive failures**, send an alert via `enqueueSystemEvent` and trigger `requestHeartbeatNow` to deliver it
- **Throttle** subsequent notifications to once per hour to avoid spam
- Reset failure counter and notification timestamp on success
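
The threshold, throttle, and reset rules above can be sketched roughly as follows. This is a minimal illustration and not the code in `src/cron/service/timer.ts`: `FailureState`, `recordRunResult`, and the two constants are hypothetical names, and treating a timestamp of `0` as "never notified" is an assumption. Only `consecutiveFailures` and `lastFailureNotificationAtMs` come from this PR.

```typescript
// Hypothetical sketch of the alert decision; not the PR's actual implementation.
interface FailureState {
  consecutiveFailures?: number;
  lastFailureNotificationAtMs?: number;
}

const FAILURE_THRESHOLD = 3; // alert after 3 consecutive failures
const THROTTLE_MS = 60 * 60 * 1000; // at most one alert per hour

/** Records one run result and returns true when an alert should be sent now. */
function recordRunResult(state: FailureState, ok: boolean, nowMs: number): boolean {
  if (ok) {
    // Success resets both the counter and the notification timestamp.
    state.consecutiveFailures = 0;
    state.lastFailureNotificationAtMs = 0;
    return false;
  }
  state.consecutiveFailures = (state.consecutiveFailures ?? 0) + 1;
  if (state.consecutiveFailures < FAILURE_THRESHOLD) return false;

  // Assumption: 0 (or a missing value) means no alert has been sent yet.
  const last = state.lastFailureNotificationAtMs ?? 0;
  if (last > 0 && nowMs - last < THROTTLE_MS) return false;

  state.lastFailureNotificationAtMs = nowMs;
  return true;
}
```

In the PR itself, a `true` result would correspond to calling `enqueueSystemEvent` followed by `requestHeartbeatNow`; those calls are left out here because their signatures are not shown in this description.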
### Heartbeat runner (`src/infra/heartbeat-runner.ts`)
- Apply the same pattern to heartbeat-specific runs: track consecutive failures in `SessionEntry` and, once the threshold is reached, notify via `deliverOutboundPayloads`
- Reset on success
### Types
- Added `consecutiveFailures` and `lastFailureNotificationAtMs` to `CronJobState` (`src/cron/types.ts`)
- Added same fields to `SessionEntry` (`src/config/sessions/types.ts`)
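
As a rough illustration of the shape of these additions, the sketch below models the two new fields as a shared interface. Everything except the two field names is hypothetical: `FailureTrackingFields`, `CronJobStateSketch`, `SessionEntrySketch`, and `readFailureTracking` are stand-ins, the real `CronJobState` and `SessionEntry` have other members not shown here, and making the fields optional is an assumption (it would let previously persisted state load without migration).

```typescript
// Hypothetical sketch; the real types live in src/cron/types.ts and
// src/config/sessions/types.ts and contain other fields.
interface FailureTrackingFields {
  /** Number of failed runs since the last success. */
  consecutiveFailures?: number;
  /** Epoch milliseconds of the last failure alert that was sent. */
  lastFailureNotificationAtMs?: number;
}

// Stand-ins for the two types that gained the fields.
interface CronJobStateSketch extends FailureTrackingFields {
  lastStatus?: "ok" | "error";
}
interface SessionEntrySketch extends FailureTrackingFields {
  sessionId: string;
}

/** Reads the tracking fields, defaulting missing values to zero. */
function readFailureTracking(entry: FailureTrackingFields): Required<FailureTrackingFields> {
  return {
    consecutiveFailures: entry.consecutiveFailures ?? 0,
    lastFailureNotificationAtMs: entry.lastFailureNotificationAtMs ?? 0,
  };
}
```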
## Tests
New test file: `src/cron/service.failure-notifications.test.ts` (2 tests)
1. **Threshold + throttle**: Verifies no alert on failures 1-2, alert fires on failure 3, subsequent alerts are throttled for 1 hour, then fire again after throttle expires
2. **Reset on success**: Verifies counter and notification timestamp reset to zero after a successful run
All 54 existing cron tests continue to pass (11 test files).
## User-visible behavior
After 3 consecutive heartbeat/cron failures, the user sees:
```
Alert: Cron job "my-job" failed 3 times in a row. Last error: FailoverError: 402 ...
```
Subsequent alerts are throttled to once per hour. Counter resets on any successful run.
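
For illustration, a message of this form could be built by a small helper like the one below. `formatFailureAlert` is a hypothetical name and not a function from this PR; the wording simply mirrors the sample alert above, and the heartbeat path may label its alerts differently.

```typescript
// Hypothetical helper; mirrors the sample alert text shown above.
function formatFailureAlert(jobName: string, failures: number, lastError: string): string {
  return `Alert: Cron job "${jobName}" failed ${failures} times in a row. Last error: ${lastError}`;
}

// Example call using the values from the sample message.
const exampleAlert = formatFailureAlert("my-job", 3, "FailoverError: 402");
```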
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adds user-facing alerting for repeated cron/heartbeat failures. For cron jobs (`src/cron/service/timer.ts`), it tracks `consecutiveFailures` and emits a system event after 3 consecutive errors, throttling further alerts to once per hour and resetting counters on success. For heartbeat runs (`src/infra/heartbeat-runner.ts`), it implements a similar consecutive-failure counter in `SessionEntry` and delivers an outbound "Alert:" payload after the threshold, with the same hourly throttle and reset-on-success behavior. New tests cover the cron threshold/throttle and reset logic.
<h3>Confidence Score: 3/5</h3>
- Generally safe to merge, but heartbeat failure alerting may misbehave under concurrent runs.
- Core cron logic is straightforward and tested, but the heartbeat-runner stores failure counts/notification timestamps using snapshot reads plus multiple independent writes, which can race and produce duplicate alerts or inaccurate failure counts when overlapping heartbeats occur.
- File needing attention: `src/infra/heartbeat-runner.ts`
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #21014: fix(cron): suppress main-session summary for HEARTBEAT_OK responses (by nickjlamb · 2026-02-19 · 81.5%)
- #5498: Cron: honor next-heartbeat (by sebslight · 2026-01-31 · 81.4%)
- #3335: Fixes cron jobs (by hkirat · 2026-01-28 · 80.8%)
- #20521: feat(heartbeat): inject active cron job summary into heartbeat prompt (by maximalmargin · 2026-02-19 · 80.8%)
- #12303: fix(cron): correct nextRunAtMs calculation and prevent timer stall (by colddonkey · 2026-02-09 · 80.4%)
- #11657: fix(cron): treat skipped heartbeat as ok for one-shot jobs (by DukeDeSouth · 2026-02-08 · 80.4%)
- #8698: fix(cron): default enabled to true for new jobs (by emmick4 · 2026-02-04 · 80.3%)
- #8578: fix(cron): add failure limit and exponential backoff for isolated t... (by Baoxd123 · 2026-02-04 · 80.1%)
- #12365: test(heartbeat): don't skip empty HEARTBEAT.md for cron wake events (by tyclaudius-ai · 2026-02-09 · 80.0%)
- #6522: fix(cron): deliver original message when agent response is heartbea... (by sidmohan0 · 2026-02-01 · 79.7%)