
#22355: fix(gateway): add exponential backoff to remote node bin probes

by xinhuagu open 2026-02-21 02:24
size: S trusted-contributor
## Problem

When a macOS companion node connects with only `canvas`/`screen` capabilities (no `system` shell), the gateway still probes for shell binaries via `system.run` or `system.which`. Each probe times out after 15 seconds. If the node disconnects and reconnects frequently (e.g. healthcheck stuck → disconnect → reconnect cycle), the gateway enters a near-continuous probe loop, reaching ~84% CPU on Apple Silicon (#22266).

## Fix

Add per-node exponential backoff to `refreshRemoteNodeBins()`:

- On probe failure: delay the next attempt by 30s, doubling on each consecutive failure up to 5 minutes
- On probe success: clear backoff immediately
- On node disconnect (`removeRemoteNodeInfo`): clear backoff state

This ensures transient failures are retried with reasonable spacing while persistent failures (e.g. nodes without shell support) quickly settle to a 5-minute interval instead of a tight loop.

## Changes

- `src/infra/skills-remote.ts`: Add `probeBackoff` map, `escalateProbeBackoff()` helper, backoff check at probe entry, clear on success/removal
- `src/infra/skills-remote.test.ts`: Add test verifying second probe is skipped after failure

Closes #22266

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Added exponential backoff to remote node binary probes to prevent CPU overload when macOS companion nodes disconnect/reconnect frequently. On probe failure, the next attempt is delayed by 30s (doubling up to 5 min max); on success or node disconnect, backoff is cleared. The change prevents tight probe loops when nodes lack shell support, reducing gateway CPU from ~84% to near-zero during reconnect cycles.

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with minimal risk
- The implementation is straightforward, well-tested, and addresses a specific performance issue. The exponential backoff pattern is standard and correctly implemented with proper cleanup on success and node removal. The test coverage validates the core backoff behavior.
- No files require special attention

<sub>Last reviewed commit: cf6160a</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
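For readers who want to see the backoff policy from the Fix section in concrete form, here is a minimal sketch. It is not the code from `src/infra/skills-remote.ts`: only the names `probeBackoff` and `escalateProbeBackoff` come from the PR description, while the entry shape, the constants, and the helpers `shouldSkipProbe`, `clearProbeBackoff`, and `probeWithBackoff` are assumptions made for illustration. Under the stated policy, consecutive failures would yield delays of 30s, 60s, 120s, 240s, and then 300s from there on.

```typescript
// Illustrative sketch only. Everything except the names `probeBackoff` and
// `escalateProbeBackoff` is hypothetical and may differ from the real PR.

const INITIAL_BACKOFF_MS = 30_000; // first delay after a failed probe (30s)
const MAX_BACKOFF_MS = 5 * 60_000; // delays are capped at 5 minutes

// Assumed entry shape: the current delay and the earliest time to retry.
type BackoffEntry = { delayMs: number; nextAttemptAt: number };

// Per-node backoff state, keyed by node id.
const probeBackoff = new Map<string, BackoffEntry>();

// On failure: start at 30s, then double on each consecutive failure up to the cap.
function escalateProbeBackoff(nodeId: string, now: number): BackoffEntry {
  const prev = probeBackoff.get(nodeId);
  const delayMs = prev
    ? Math.min(prev.delayMs * 2, MAX_BACKOFF_MS)
    : INITIAL_BACKOFF_MS;
  const entry: BackoffEntry = { delayMs, nextAttemptAt: now + delayMs };
  probeBackoff.set(nodeId, entry);
  return entry;
}

// Backoff check at probe entry: skip while the node is still inside its window.
function shouldSkipProbe(nodeId: string, now: number): boolean {
  const entry = probeBackoff.get(nodeId);
  return entry !== undefined && now < entry.nextAttemptAt;
}

// On probe success or node disconnect: forget the backoff state.
function clearProbeBackoff(nodeId: string): void {
  probeBackoff.delete(nodeId);
}

// Hypothetical wrapper showing where the three rules apply around a probe.
// Returns null when the probe is skipped or fails.
async function probeWithBackoff(
  nodeId: string,
  probe: () => Promise<string[]>,
  now: number = Date.now(),
): Promise<string[] | null> {
  if (shouldSkipProbe(nodeId, now)) return null;
  try {
    const bins = await probe();
    clearProbeBackoff(nodeId); // success clears backoff immediately
    return bins;
  } catch {
    escalateProbeBackoff(nodeId, now); // failure pushes the next attempt out
    return null;
  }
}
```

Passing `now` explicitly keeps the sketch testable without fake timers. Whether the actual implementation takes a clock argument or reads `Date.now()` directly is not stated in the PR description.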

Most Similar PRs