← Back to PRs

#20341: feat(heartbeat): add circuit breaker for consecutive failures

by npmisantosh open 2026-02-18 20:18 View on GitHub →
size: M
# Implementation: Circuit Breaker with Exponential Backoff for Heartbeat Implements a circuit breaker pattern with exponential backoff to prevent infinite retry loops when heartbeats consistently fail. The circuit opens after 5 failures and resets after a 5-minute cooldown, with backoff scaling up to 32x the base interval. **Summary** Describe the problem and fix in 2–5 bullets: * **Problem:** Heartbeat runner would infinitely retry on persistent failures (API down, auth errors, network issues), wasting resources and flooding logs. * **Why it matters:** Prevents runaway API costs, reduces noise in monitoring, and allows the system to gracefully degrade when dependencies fail. * **What changed:** Added circuit breaker state (`consecutiveFailures`, `lastFailureMs`, `circuitOpen`) to `HeartbeatAgentState` with exponential backoff scheduling. * **What did NOT change (scope boundary):** No changes to other files; heartbeat interface remains backward compatible; no changes to success-path logic. **Change Type (select all)** - [ ] Bug fix - [x] Feature - [ ] Refactor - [ ] Docs - [x] Security hardening - [ ] Chore/infra **Scope (select all touched areas)** - [x] Gateway / orchestration - [ ] Skills / tool execution - [ ] Auth / tokens - [ ] Memory / storage - [ ] Integrations - [ ] API / contracts - [ ] UI / DX - [ ] CI/CD / infra **Linked Issue/PR** - Closes #19884 - Related: N/A **User-visible / Behavior Changes** List user-visible changes (including defaults/config). If none, write `None`. * **New log messages** when circuit breaker opens/closes: `heartbeat: circuit breaker opened for {agentId} after {N} failures, next retry in {X}s` * **New skip reason** visible in logs: `circuit-open` when heartbeat is skipped due to an open circuit. * **No config changes required:** Fully automatic behavior. **Security Impact (required)** * New permissions/capabilities? (**No**) * Secrets/tokens handling changed? (**No**) * New/changed network calls? (**No**) * Command/tool execution surface changed? (**No**) * Data access scope changed? (**No**) **Repro + Verification** **Environment** * **OS:** Linux/macOS * **Runtime/container:** Node.js 20+ * **Model/provider:** Any (tested with OpenAI) * **Integration/channel (if any):** Slack, Discord * **Relevant config (redacted):** Standard heartbeat config with `every: 1m` **Steps** 1. Configure agent with heartbeat enabled. 2. Simulate failure condition (invalid API key, network block, or mock failure in `runOnce`). 3. Observe initial retries every 1 minute. 4. After 5 consecutive failures, observe circuit opens with exponential backoff (2m, 4m, 8m...). 5. After 5 minutes of no attempts, observe circuit allows retry on next scheduled interval. 6. Restore service and observe circuit closes on successful heartbeat. **Expected** * Circuit opens after 5 failures with backoff delay. * Log messages indicate circuit state transitions. * No infinite rapid retry loops. * Automatic recovery when service is restored. **Actual** * Circuit breaker correctly tracks failures and applies backoff. * Logs show clear state transitions. * Backoff caps at 30 minutes max. * Recovery works as expected. **Evidence** Attach at least one: - [x] Failing test/log before + passing after - [x] Trace/log snippets - [ ] Screenshot/recording - [ ] Perf numbers (if relevant) **Log snippets:** *Before (simulated failure):* heartbeat failed: API request failed heartbeat failed: API request failed heartbeat failed: API request failed ... (repeats every minute indefinitely) *After (with circuit breaker):* heartbeat failed: API request failed heartbeat: circuit breaker opened for agent-1 after 5 failures, next retry in 120s heartbeat: circuit breaker open for agent-1, deferring heartbeat: circuit breaker open for agent-1, deferring ... (4 minutes later) heartbeat: circuit breaker closed for agent-1 after successful run **Human Verification (required)** What you personally verified (not just CI), and how: **Verified scenarios:** * Circuit opens after exactly 5 failures. * Backoff doubles each time (2x, 4x, 8x, 16x, 32x). * Backoff caps at 30 minutes maximum. * Circuit resets after 5 minutes of cooldown. * Successful heartbeat closes circuit and resets failure count. * Targeted wake requests (with agentId/sessionKey) respect circuit breaker. * Interval-based runs respect circuit breaker. **Edge cases checked:** * Exception thrown in `runOnce` counts as failure. * Mixed success/failure pattern doesn't reset prematurely. * Circuit state persists across config updates. * Zero-downtime config reload maintains circuit state. **What you did NOT verify:** * Interaction with multiple simultaneous agents (only tested 1-2 agents). * Very long backoff periods (>30 min) in production. * Memory leak implications over days of operation. **Compatibility / Migration** * Backward compatible? (**Yes**) * Config/env changes? (**No**) * Migration needed? (**No**) **Failure Recovery (if this breaks)** * **How to disable/revert this change quickly:** Restart process or call `updateConfig()` to reset circuit states. * **Files/config to restore:** None (no config changes). * **Known bad symptoms:** * Heartbeats stop entirely and never resume (circuit stuck open). * Excessive log volume from circuit breaker messages. * Memory growth from agent state tracking. **Risks and Mitigations** * **Risk:** Circuit opens too aggressively on transient failures. * *Mitigation:* Threshold is 5 consecutive failures; most transient issues resolve within 1-2 retries. Can be tuned via `CIRCUIT_BREAKER_THRESHOLD` constant if needed. * **Risk:** Circuit stays open too long, delaying recovery detection. * *Mitigation:* 5-minute reset window allows periodic retry attempts even with backoff. Logs clearly indicate when retries are attempted. * **Risk:** State bloat from per-agent circuit tracking. * *Mitigation:* Only 3 extra number/boolean fields per agent; Map is cleared on runner stop. No unbounded growth. <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR adds a circuit breaker pattern with exponential backoff to prevent infinite retry loops when heartbeats fail consistently. The implementation tracks consecutive failures per agent and opens the circuit after 5 failures, applying exponential backoff (2x, 4x, 8x, 16x, 32x) capped at 2 hours. The circuit automatically closes on successful heartbeat execution. Key changes: - Added circuit breaker state tracking (`consecutiveFailures`, `lastFailureMs`, `circuitOpen`) to `HeartbeatAgentState` - Added error detection logic in `runHeartbeatOnce` to convert auth failures and model provider errors into exceptions - Implemented helper functions for calculating backoff delays and updating circuit state - Circuit state persists across config updates through state preservation Issues found: - The PR description states backoff is "capped at 30 minutes max" but the code uses `CIRCUIT_BREAKER_BACKOFF_MAX_MS = 120 * 60 * 1000` (2 hours) - Error detection on lines 741-758 checks `replyResult.text` for error patterns using `isAuthErrorMessage`, but this may produce false positives since `ReplyPayload.text` contains agent reply content, not error messages (already flagged in previous review) - The circuit breaker logic is repeated three times across different code paths (already flagged in previous review) <h3>Confidence Score: 3/5</h3> - This PR is safe to merge with minor issues that should be addressed - The circuit breaker implementation is sound and well-tested, but there are discrepancies between documentation and code (backoff max time), and some issues flagged in previous reviews remain unaddressed (error detection false positives, code duplication). The core logic is correct and the feature adds significant value by preventing runaway failures. - Pay close attention to src/infra/heartbeat-runner.ts lines 72 and 741-758 for the documentation/code mismatch and error detection logic <sub>Last reviewed commit: 7e9d7de</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->

Most Similar PRs