#22355: fix(gateway): add exponential backoff to remote node bin probes
size: S
trusted-contributor
Cluster: Gateway and TLS Enhancements
## Problem
When a macOS companion node connects with only `canvas`/`screen` capabilities (no `system` shell), the gateway still probes for shell binaries via `system.run` or `system.which`. Each probe times out after 15 seconds. If the node disconnects and reconnects frequently (e.g. healthcheck stuck → disconnect → reconnect cycle), the gateway enters a near-continuous probe loop, reaching ~84% CPU on Apple Silicon (#22266).
## Fix
Add per-node exponential backoff to `refreshRemoteNodeBins()`:
- On probe failure: delay the next attempt by 30s, doubling on each consecutive failure up to 5 minutes
- On probe success: clear backoff immediately
- On node disconnect (`removeRemoteNodeInfo`): clear backoff state
This spaces out retries after transient failures, while persistent failures (e.g. nodes without shell support) settle at a 5-minute interval after five consecutive failures (30s, 1m, 2m, 4m, then 5m) instead of running in a tight loop.
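
The backoff rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `probeBackoff` and `escalateProbeBackoff` are named in the Changes section, but their signatures, the `BackoffEntry` shape, the constants, and the `shouldSkipProbe` / `clearProbeBackoff` helpers are assumptions made for the example.

```typescript
// Illustrative sketch only. Everything except the names `probeBackoff` and
// `escalateProbeBackoff` is hypothetical and may differ from the real code.
const BASE_DELAY_MS = 30_000; // 30s after the first failure
const MAX_DELAY_MS = 5 * 60_000; // capped at 5 minutes

type BackoffEntry = { delayMs: number; nextAllowedAt: number };

// Per-node backoff state, keyed by node id.
const probeBackoff = new Map<string, BackoffEntry>();

// Called on probe failure: start at the base delay, double on each
// consecutive failure, and never exceed the cap.
function escalateProbeBackoff(nodeId: string, now: number): BackoffEntry {
  const prev = probeBackoff.get(nodeId);
  const delayMs = prev
    ? Math.min(prev.delayMs * 2, MAX_DELAY_MS)
    : BASE_DELAY_MS;
  const entry: BackoffEntry = { delayMs, nextAllowedAt: now + delayMs };
  probeBackoff.set(nodeId, entry);
  return entry;
}

// Checked at probe entry: skip while the node is still inside its window.
function shouldSkipProbe(nodeId: string, now: number): boolean {
  const entry = probeBackoff.get(nodeId);
  return entry !== undefined && now < entry.nextAllowedAt;
}

// Called on probe success and on node removal.
function clearProbeBackoff(nodeId: string): void {
  probeBackoff.delete(nodeId);
}
```

Under these assumptions, consecutive failures for one node produce delays of 30s, 60s, 120s, 240s, and then 300s for every failure after that, until a success or a disconnect clears the entry.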
## Changes
- `src/infra/skills-remote.ts`: Add `probeBackoff` map, `escalateProbeBackoff()` helper, backoff check at probe entry, clear on success/removal
- `src/infra/skills-remote.test.ts`: Add test verifying second probe is skipped after failure
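
The behaviour the new test checks (a second probe is skipped after a failure) might look roughly like this. It is a self-contained sketch with an injected clock and probe function. `refreshWithBackoff`, `backoffUntil`, and `runBackoffCheck` are hypothetical stand-ins, since the real `refreshRemoteNodeBins()` signature and the project's test framework are not shown here.

```typescript
// Hypothetical stand-in for the probe entry point. The real
// refreshRemoteNodeBins() signature is not shown in this PR description.
type Probe = (nodeId: string) => Promise<string[]>;
type Outcome = "probed" | "skipped" | "failed";

// nodeId -> earliest time (ms) at which the next probe may run.
const backoffUntil = new Map<string, number>();
const FIRST_DELAY_MS = 30_000;

async function refreshWithBackoff(
  nodeId: string,
  probe: Probe,
  now: number,
): Promise<Outcome> {
  const until = backoffUntil.get(nodeId);
  if (until !== undefined && now < until) return "skipped"; // still backing off
  try {
    await probe(nodeId);
    backoffUntil.delete(nodeId); // success clears backoff immediately
    return "probed";
  } catch {
    // Doubling is omitted here; only the first 30s window matters for this check.
    backoffUntil.set(nodeId, now + FIRST_DELAY_MS);
    return "failed";
  }
}

// Runs the scenario and returns how many times the probe was actually invoked.
async function runBackoffCheck(): Promise<number> {
  let calls = 0;
  const failingProbe: Probe = async () => {
    calls += 1;
    throw new Error("probe timed out");
  };
  const first = await refreshWithBackoff("node-a", failingProbe, 0);
  const second = await refreshWithBackoff("node-a", failingProbe, 1_000);
  if (first !== "failed" || second !== "skipped") {
    throw new Error(`unexpected outcomes: ${first}, ${second}`);
  }
  return calls;
}
```

Passing `now` in explicitly keeps the check deterministic without fake timers. A real test would more likely use the test framework's timer mocks.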
Closes #22266
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Added exponential backoff to remote node binary probes to prevent CPU overload when macOS companion nodes disconnect/reconnect frequently. On probe failure, the next attempt is delayed by 30s (doubling up to 5 min max); on success or node disconnect, backoff is cleared. The change prevents tight probe loops when nodes lack shell support, reducing gateway CPU from ~84% to near-zero during reconnect cycles.
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The implementation is straightforward, well-tested, and addresses a specific performance issue. The exponential backoff pattern is standard and correctly implemented with proper cleanup on success and node removal. The test coverage validates the core backoff behavior.
- No files require special attention
<sub>Last reviewed commit: cf6160a</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #21529: Gateway: allow node health and throttle repeated unauthorized role ... (by doomsday616 · 2026-02-20 · 74.4%)
- #9879: Improve gateway probe diagnostics for slow channels (by wrgrant · 2026-02-05 · 72.8%)
- #8260: fix(macOS): gateway readiness detection + reversible Configure later (by xksteven · 2026-02-03 · 72.3%)
- #23413: Skills: gate remote eligibility expansion behind explicit opt-in (by bmendonca3 · 2026-02-22 · 71.4%)
- #22110: fix(tools): prefer loopback for internal tool-to-gateway RPC calls (by pierreeurope · 2026-02-20 · 71.1%)
- #10636: fix: setTimeout integer overflow causing server crash (by devmangel · 2026-02-06 · 70.9%)
- #4086: Test/add backoff tests (by TechWizard9999 · 2026-01-29 · 70.5%)
- #9178: Fix: GatewayClient queueConnect() setTimeout never fires (by vishaltandale00 · 2026-02-04 · 70.4%)
- #22804: fix: prioritize loopback for internal gateway calls (issue #22706) (by ambicuity · 2026-02-21 · 70.3%)
- #18112: fix(daemon): gateway install on macOS ignores fnm/nvm node (#18090) (by yinghaosang · 2026-02-16 · 70.3%)