
#21529: Gateway: allow node health and throttle repeated unauthorized role retries

by doomsday616 open 2026-02-20 02:21
gateway size: S
## Summary

- I tracked the auth loop down to post-connect method authorization: node clients were connecting successfully, then getting `unauthorized role: node` for `health` and retrying forever.
- This change allows node role clients to call `health`, which removes the main failure path behind the retry storm.
- I also added a connection-scoped guardrail: repeated unauthorized-role errors from node clients are rate-limited and return a retryable `UNAVAILABLE` with `retryAfterMs`.

## Why this matters

- The macOS app can end up in a fixed-interval retry loop that drives gateway/app CPU very high.
- Even if clients misbehave, gateway should degrade safely instead of processing endless unauthorized requests.

## What changed

- `src/gateway/server-methods.ts`
  - Added `NODE_ROLE_ALLOWED_METHODS` and included `health`.
  - Added per-connection unauthorized-role budget (`3` failures in `30s`, `30s` lockout).
  - Added test-only reset hook for rate-limit state.
- `src/gateway/server-methods.control-plane-rate-limit.test.ts`
  - Added tests for node `health` allow behavior.
  - Added tests for unauthorized-role throttling behavior.

## Verification

- `npm exec --yes pnpm -- vitest run src/gateway/server-methods.control-plane-rate-limit.test.ts`
- `npm exec --yes pnpm -- vitest run src/gateway/method-scopes.test.ts`

Fixes #21137
Related #21009

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR fixes a retry storm issue where node clients repeatedly failed authorization when calling `health`, causing high CPU usage. The solution adds `health` to the allowlist for node role clients and implements connection-scoped rate limiting for repeated unauthorized role errors (3 failures in 30s triggers a 30s lockout with `UNAVAILABLE` + `retryAfterMs`).

<h3>Confidence Score: 4/5</h3>

- Safe to merge with minor considerations around rate limit tuning
- Well-tested implementation with comprehensive test coverage. The authorization logic is sound and the rate limiting correctly prevents retry storms. The 3-failure threshold may trigger quickly but includes proper retry signaling. Only minor concern is lack of metrics/observability for monitoring rate limit hits in production.
- No files require special attention

<sub>Last reviewed commit: e905410</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
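For readers without the diff at hand, the allowlist plus per-connection budget described under "What changed" could look roughly like the sketch below. It is not the code from `src/gateway/server-methods.ts`: only `NODE_ROLE_ALLOWED_METHODS`, the `health` entry, the `UNAVAILABLE` / `retryAfterMs` response, and the 3-failures / 30s / 30s numbers come from the PR description. The function names, the `Decision` type, the connection-id keying, and the choice to keep allowlisted methods usable during a lockout are all assumptions made for illustration.

```typescript
// Illustrative sketch only. Names other than NODE_ROLE_ALLOWED_METHODS are hypothetical.
const NODE_ROLE_ALLOWED_METHODS: ReadonlySet<string> = new Set(["health"]);

// Budget values taken from the PR description: 3 failures in 30s, 30s lockout.
const MAX_FAILURES = 3;
const WINDOW_MS = 30_000;
const LOCKOUT_MS = 30_000;

type Decision =
  | { ok: true }
  | { ok: false; code: "UNAUTHORIZED" | "UNAVAILABLE"; message: string; retryAfterMs?: number };

interface BudgetState {
  failures: number[]; // timestamps (ms) of recent unauthorized-role errors
  lockedUntil: number; // 0 when the connection is not locked out
}

// Assumption: state is keyed by a per-connection id and dropped when the connection closes.
const budgets = new Map<string, BudgetState>();

function authorizeNodeCall(connId: string, method: string, now: number = Date.now()): Decision {
  // Allowlisted methods always pass; assumed here to bypass the lockout as well.
  if (NODE_ROLE_ALLOWED_METHODS.has(method)) return { ok: true };

  let state = budgets.get(connId);
  if (!state) {
    state = { failures: [], lockedUntil: 0 };
    budgets.set(connId, state);
  }

  // During a lockout, answer with a retryable error instead of re-running authorization.
  if (now < state.lockedUntil) {
    return {
      ok: false,
      code: "UNAVAILABLE",
      message: "too many unauthorized requests",
      retryAfterMs: state.lockedUntil - now,
    };
  }

  // Keep only failures inside the sliding window, then record this one.
  state.failures = state.failures.filter((t) => now - t < WINDOW_MS);
  state.failures.push(now);

  // Assumption: the failure that exhausts the budget still reports the role error
  // and starts the lockout; later calls receive UNAVAILABLE.
  if (state.failures.length >= MAX_FAILURES) {
    state.failures = [];
    state.lockedUntil = now + LOCKOUT_MS;
  }
  return { ok: false, code: "UNAUTHORIZED", message: "unauthorized role: node" };
}

// Test-only reset hook, mirroring the one mentioned in the PR description.
function resetUnauthorizedBudgetsForTest(): void {
  budgets.clear();
}
```

Passing `now` explicitly keeps the sketch deterministic under test. A client that honors `retryAfterMs` backs off for the remainder of the lockout instead of retrying at a fixed interval, which is the behavior the PR is after.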
