#21529: Gateway: allow node health and throttle repeated unauthorized role retries
gateway
size: S
Cluster: Security Enhancements and Fixes
## Summary
- I tracked the auth loop down to post-connect method authorization: node clients were connecting successfully, then getting `unauthorized role: node` for `health` and retrying forever.
- This change allows node role clients to call `health`, which removes the main failure path behind the retry storm.
- I also added a connection-scoped guardrail: repeated unauthorized-role errors from node clients are rate-limited and return a retryable `UNAVAILABLE` with `retryAfterMs`.
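The authorization change in the first two bullets can be sketched roughly as follows. Only `NODE_ROLE_ALLOWED_METHODS`, the `health` method, and the `unauthorized role: node` message come from this PR; the `authorizeNodeMethod` helper and the `AuthResult` shape are hypothetical names used for illustration, not the gateway's actual code.

```typescript
// Hypothetical sketch: only NODE_ROLE_ALLOWED_METHODS and "health" are named in the PR.
const NODE_ROLE_ALLOWED_METHODS: ReadonlySet<string> = new Set(["health"]);

// Assumed result shape for illustration only.
type AuthResult =
  | { ok: true }
  | { ok: false; message: string };

// Allows allowlisted methods for node clients; anything else yields the
// error that node clients were previously seeing for `health`.
function authorizeNodeMethod(method: string): AuthResult {
  if (NODE_ROLE_ALLOWED_METHODS.has(method)) {
    return { ok: true };
  }
  return { ok: false, message: "unauthorized role: node" };
}
```

With an allowlist like this, a node client's `health` call passes authorization, so the request that fed the retry loop no longer fails.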
## Why this matters
- The macOS app can end up in a fixed-interval retry loop that drives gateway and app CPU usage very high.
- Even if clients misbehave, the gateway should degrade safely instead of processing endless unauthorized requests.
## What changed
- `src/gateway/server-methods.ts`
  - Added `NODE_ROLE_ALLOWED_METHODS` and included `health`.
  - Added per-connection unauthorized-role budget (`3` failures in `30s`, `30s` lockout).
  - Added test-only reset hook for rate-limit state.
- `src/gateway/server-methods.control-plane-rate-limit.test.ts`
  - Added tests for node `health` allow behavior.
  - Added tests for unauthorized-role throttling behavior.
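The per-connection budget described above might look something like the sketch below. The numbers (`3` failures in `30s`, `30s` lockout) and the `UNAVAILABLE` plus `retryAfterMs` response come from this PR; the class, method, and type names are hypothetical. The sketch also assumes that the third failure is the first throttled response, which the PR description does not specify.

```typescript
// Hypothetical sketch of a per-connection unauthorized-role budget.
// Thresholds are taken from the PR; every identifier here is an assumption.
const MAX_FAILURES = 3;
const WINDOW_MS = 30_000;
const LOCKOUT_MS = 30_000;

type Verdict =
  | { throttled: false }
  | { throttled: true; code: "UNAVAILABLE"; retryAfterMs: number };

class UnauthorizedRoleBudget {
  private failures: number[] = []; // timestamps (ms) of recent failures
  private lockedUntil = 0;

  // Call once per unauthorized-role error on this connection.
  onUnauthorized(now: number): Verdict {
    // While locked out, report how long the client should wait.
    if (now < this.lockedUntil) {
      return { throttled: true, code: "UNAVAILABLE", retryAfterMs: this.lockedUntil - now };
    }
    // Drop failures that fell out of the sliding window, then record this one.
    this.failures = this.failures.filter((t) => now - t < WINDOW_MS);
    this.failures.push(now);
    if (this.failures.length >= MAX_FAILURES) {
      this.failures = [];
      this.lockedUntil = now + LOCKOUT_MS;
      return { throttled: true, code: "UNAVAILABLE", retryAfterMs: LOCKOUT_MS };
    }
    return { throttled: false };
  }

  // Test-only style reset, mirroring the reset hook mentioned in the PR.
  reset(): void {
    this.failures = [];
    this.lockedUntil = 0;
  }
}
```

Keeping one such object per connection means a single misbehaving client is throttled without affecting other connections, and passing `now` explicitly keeps the logic easy to test without fake timers.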
## Verification
- `npm exec --yes pnpm -- vitest run src/gateway/server-methods.control-plane-rate-limit.test.ts`
- `npm exec --yes pnpm -- vitest run src/gateway/method-scopes.test.ts`
Fixes #21137
Related #21009
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes a retry storm issue where node clients repeatedly failed authorization when calling `health`, causing high CPU usage. The solution adds `health` to the allowlist for node role clients and implements connection-scoped rate limiting for repeated unauthorized role errors (3 failures in 30s triggers a 30s lockout with `UNAVAILABLE` + `retryAfterMs`).
<h3>Confidence Score: 4/5</h3>
- Safe to merge with minor considerations around rate limit tuning
- Well-tested implementation with comprehensive test coverage. The authorization logic is sound and the rate limiting correctly prevents retry storms. The 3-failure threshold may trigger quickly, but the response includes proper retry signaling. The only minor concern is the lack of metrics or observability for monitoring rate-limit hits in production.
- No files require special attention
<sub>Last reviewed commit: e905410</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #19887: fix: allow node role to call health RPC (by apple-techie · 2026-02-18 · 84.7%)
- #23355: Gateway: fail closed on untrusted proxy headers (by bmendonca3 · 2026-02-22 · 76.7%)
- #16963: fix: enable auth rate limiting by default (by StressTestor · 2026-02-15 · 76.6%)
- #23361: Gateway: reject scope assertions without identity binding (by bmendonca3 · 2026-02-22 · 76.2%)
- #22766: fix(security): enable gateway auth rate limiting by default (CWE-307) (by brandonwise · 2026-02-21 · 76.1%)
- #19437: Gateway: respect custom bind host for local health/RPC target resol... (by frudas24 · 2026-02-17 · 75.4%)
- #8513: Gateway: require auth for plugin HTTP (by coygeek · 2026-02-04 · 75.3%)
- #2530: fix(gateway): improve auth error for native apps (by Episkey-G · 2026-01-27 · 75.1%)
- #8260: fix(macOS): gateway readiness detection + reversible Configure later (by xksteven · 2026-02-03 · 75.1%)
- #21944: feat(gateway): crash-loop protection with escalating backoff (by Protocol-zero-0 · 2026-02-20 · 75.1%)