← Back to PRs

#9184: Fix: Heartbeat timer not resuming after macOS sleep/wake cycle

by vishaltandale00 open 2026-02-05 00:08 View on GitHub →
stale
## Problem The heartbeat timer in OpenClaw gateway doesn't reliably resume after the host system (macOS) goes to sleep and wakes up. This leaves heartbeats permanently stalled until the gateway is manually restarted. **Root Cause:** Node.js `setTimeout` doesn't account for system sleep. When the system sleeps, the timer deadline passes, and upon wake, the timer may be in an invalid state where `scheduleNext()` never gets called again. Fixes #9084 ## Solution Implemented a **drift detection watchdog** that checks every 60 seconds whether the main heartbeat timer has become stale: 1. **Track scheduled due time**: Store `lastScheduledDueMs` when scheduling a heartbeat 2. **Watchdog checks drift**: Every 60s, compare current time against `lastScheduledDueMs` 3. **Trigger if overdue**: If drift > 5s (grace period), immediately trigger the overdue heartbeat 4. **Continue monitoring**: Watchdog keeps running until the runner is stopped ## Implementation Details ### Added to state: - `watchdogTimer`: Secondary timer that runs every 60s - `lastScheduledDueMs`: Track when the next heartbeat should fire ### New function `startWatchdog()`: ### Updated `scheduleNext()`: - Records `lastScheduledDueMs = nextDue` when scheduling - Clears it when timer fires naturally - Watchdog can detect stale state ### Updated `cleanup()`: - Clears both main timer and watchdog timer - Proper cleanup on runner stop ## Why This Works - **setTimeout limitations**: The main timer can fail during sleep - **Watchdog resilience**: Even if the watchdog's timer also drifts, it re-schedules itself every 60s - **Cross-platform**: Works on all OSes without needing platform-specific wake listeners - **Minimal overhead**: 60s check interval, unref'd timer doesn't prevent exit - **Grace period**: 5s grace avoids false positives from normal scheduling jitter ## Testing **Manual test scenario:** 1. Configure heartbeat with `every: "15m"` 2. Start gateway, verify heartbeat running 3. Close laptop lid during heartbeat interval 4. Wake system after >15 minutes 5. **Expected**: Heartbeat fires within 60s (watchdog detects drift) 6. **Previous**: Heartbeat never fired again **Automated tests:** - Existing heartbeat tests still pass - TypeScript compilation passes - No breaking changes to public API ## Trade-offs Considered **Alternative 1: System wake listener** - ❌ Platform-specific (macOS, Windows, Linux all different) - ❌ More complex implementation - ✅ Our approach: Works everywhere **Alternative 2: setInterval instead of setTimeout** - ❌ setInterval has same sleep/wake issues - ❌ Doesn't fire missed intervals - ✅ Our approach: Detects and recovers from missed intervals **Alternative 3: Shorter main timer with drift correction** - ❌ More API calls / higher overhead - ✅ Our approach: Separate watchdog, minimal overhead ## Impact ✅ Fixes heartbeat stalling after sleep/wake on macOS ✅ Also fixes same issue on Windows/Linux laptop users ✅ No breaking changes ✅ Minimal performance impact (60s watchdog) ✅ Backward compatible ✅ No configuration changes needed --- **Related:** This issue is similar to #9178 (GatewayClient setTimeout reliability) - both address Node.js timer limitations during system sleep. <!-- greptile_comment --> <h2>Greptile Overview</h2> <h3>Greptile Summary</h3> This PR adds a drift-detection watchdog to the heartbeat runner (`src/infra/heartbeat-runner.ts`) by tracking the scheduled due time for the next heartbeat and polling every 60s to detect overdue timers (e.g., after macOS sleep/wake), then triggering an immediate heartbeat. It also adjusts tool-result delivery in the agent runner (`src/auto-reply/reply/agent-runner-execution.ts`) to route tool results through the block reply pipeline to better preserve message ordering during streaming. <h3>Confidence Score: 3/5</h3> - Moderately safe to merge after addressing two ordering/runner-lifecycle issues. - The heartbeat watchdog approach is contained, but the watchdog lifecycle isn’t tied to config updates, and the tool-result pipeline enqueue is not awaited despite prior race-condition fixes in this area. - src/auto-reply/reply/agent-runner-execution.ts, src/infra/heartbeat-runner.ts <!-- greptile_other_comments_section --> **Context used:** - Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8)) - Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=0d0c8278-ef8e-4d6c-ab21-f5527e322f13)) <!-- /greptile_comment -->

Most Similar PRs