← Back to PRs

#18185: feat(slack): add configurable rateLimitPolicy for Slack API calls

by pkrmf open 2026-02-16 15:45 View on GitHub →
docs channel: slack gateway size: M
Slack users.list (Tier 2, ~20 req/min) rate limiting can cause cascading failures: 429 retries starve the WebSocket connection, killing pong replies and eventually DNS resolution. When all allowlist entries are user IDs or emails, we can resolve them with users.info (Tier 4, ~100+ req/min) or users.lookupByEmail (Tier 3) in a single call each. Add `rateLimitPolicy?: "retry" | "fail-fast"` to SlackAccountConfig: - "retry" (default): original behavior — always paginate users.list, retry 429s in directory operations. Zero behavioral change. - "fail-fast": use targeted APIs (users.info, users.lookupByEmail) for ID/email entries, reject rate-limited calls immediately in bulk directory operations. Only falls back to users.list for name entries. Threaded through all callers: monitor/provider, commands-allowlist, and onboarding wizard. ## Summary - **Problem:** `resolveSlackUserAllowlist()` always paginates ALL workspace users via `users.list` (Tier 2, ~20 req/min) even when every entry is a user ID resolvable with a single `users.info` call (Tier 4,~100+ req/min). - **Why it matters:** On large workspaces (i.e enteprise workspace), 429 retry loops starve the WebSocket connection: pong replies time out, then DNS resolution fails, cascading into full disconnection. - **What changed:** Added `rateLimitPolicy?: "retry" | "fail-fast"` config option. When `"fail-fast"`: IDs resolve via `users.info`, emails via `users.lookupByEmail`, directory ops use `rejectRateLimitedCalls:true`. Names still fall back to `users.list`. Threaded the config through all 5 callers. - **What did NOT change (scope boundary):** Default behavior is identical: no config means `"retry"`, which is the existing `users.list` + retry path. No changes to Slack event handling, message processing, or WebSocket management. Issue Example: ``` 15:56:25 [slack] [default] starting provider 09:56:25 [WARN] bolt-app http request failed getaddrinfo ENOTFOUND slack.com 09:56:25 [WARN] bolt-app http request failed getaddrinfo ENOTFOUND slack.com 09:56:26 [WARN] bolt-app http request failed getaddrinfo ENOTFOUND slack.com 09:56:26 [WARN] bolt-app http request failed getaddrinfo ENOTFOUND slack.com 09:56:26 [WARN] web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com 09:56:27 [WARN] web-api:WebClient:0 http request failed getaddrinfo ENOTFOUND slack.com 09:56:27 [WARN] bolt-app http request failed getaddrinfo ENOTFOUND slack.com 09:56:27 [openclaw] Unhandled promise rejection: Error: A request error occurred: getaddrinfo ENOTFOUND slack.com at requestErrorWithOriginal (/Users/mterns/Documents/AI/openclaw/node_modules/.pnpm/@slack+web-api@7.14.0/node_modules/@slack/web-api/src/errors.ts:82:5) at WebClient.<anonymous> (/Users/mterns/Documents/AI/openclaw/node_modules/.pnpm/@slack+web-api@7.14.0/node_modules/@slack/web-api/src/WebClient.ts:767:43) at Generator.throw (<anonymous>) at rejected (/Users/mterns/Documents/AI/openclaw/node_modules/.pnpm/@slack+web-api@7.14.0/node_modules/@slack/web-api/dist/WebClient.js:39:65) at processTicksAndRejections (node:internal/process/task_queues:105:5)  ELIFECYCLE  Command failed with exit code 1. ``` ## Change Type (select all) - [ ] Bug fix - [ x] Feature - [ ] Refactor - [ ] Docs - [ ] Security hardening - [ ] Chore/infra ## Scope (select all touched areas) - [ ] Gateway / orchestration - [ ] Skills / tool execution - [ ] Auth / tokens - [ ] Memory / storage - [ x] Integrations - [x ] API / contracts - [ ] UI / DX - [ ] CI/CD / infra ## Linked Issue/PR - Related: rate-limit cascading failure observed after rebasing with main ## User-visible / Behavior Changes - New optional config field: `channels.slack.rateLimitPolicy` (`"retry"` | `"fail-fast"`) - Default (`"retry"` / unset): no change whatsoever - `"fail-fast"`: allowlist resolution uses targeted APIs instead of full workspace pagination; directory lookups reject 429s immediately instead of retrying ## Security Impact (required) - New permissions/capabilities? `No` - Secrets/tokens handling changed? `No` - New/changed network calls? `Yes` — when `"fail-fast"`, calls `users.info` and `users.lookupByEmail` instead of `users.list`. Same token, same Slack API, narrower scope (single user vs. all users). - Command/tool execution surface changed? `No` - Data access scope changed? `No` — targeted APIs return the same user fields as `users.list`, just for one user at a time. ## Repro + Verification ### Environment - OS: macOS Darwin 25.1.0 - Runtime: Node 22 / pnpm - Integration: Slack (socket mode) - Relevant config: `{ "channels": { "slack": { "rateLimitPolicy": "fail-fast" } } }` ### Steps 1. Set `rateLimitPolicy: "fail-fast"` in Slack config with ID-only allowlist entries 2. Start the bot — observe allowlist resolution uses `users.info` per ID (no `users.list` pagination) 3. Remove `rateLimitPolicy` — observe original `users.list` behavior unchanged ### Expected - `"fail-fast"`: resolves IDs via `users.info`, no `users.list` calls when all entries are IDs - Default: identical to current main branch behavior ### Actual - Build passes, tests pass (no new failures) ## Evidence - [x] `pnpm build` — clean - [x] `pnpm check` — pre-commit hooks: 0 warnings, 0 errors on all changed files - [x] `pnpm test` — no new failures (13 failures are pre-existing memory tests on main) ## Human Verification (required) - **Verified scenarios:** Build, lint, format, full test suite - **Edge cases checked:** Mixed allowlist (IDs + names) falls back to `users.list` only for the name entries; empty/unset `rateLimitPolicy` preserves exact original code path ## Compatibility / Migration - Backward compatible? `Yes` — default is `"retry"`, identical to current behavior - Config/env changes? `Yes` — new optional field `rateLimitPolicy` in `SlackAccountConfig` - Migration needed? `No` — existing configs work without changes ## Failure Recovery (if this breaks) - How to disable/revert: remove `rateLimitPolicy` from config (or set to `"retry"`) — reverts to original code path immediately, no restart needed beyond config reload - Known bad symptoms: if `"fail-fast"` is set and the bot needs name-based resolution on a rate-limited workspace, those name lookups still hit `users.list` and will fail fast instead of retrying ## Risks and Mitigations - **Risk:** `"fail-fast"` mode rejects 429s in directory operations, so large-workspace directory lookups that previously would have eventually succeeded will now fail. - **Mitigation:** Opt-in only — users must explicitly set `rateLimitPolicy: "fail-fast"`. Default behavior is unchanged. Removing the config key instantly reverts. <!-- greptile_comment --> <h3>Greptile Summary</h3> Adds optional `rateLimitPolicy` config field to prevent Slack rate-limit cascading failures. When set to `"fail-fast"`, resolves allowlist entries using targeted APIs (`users.info`, `users.lookupByEmail`) instead of paginating `users.list`, and rejects 429s immediately in bulk directory operations. **Key changes:** - New config field `channels.slack.rateLimitPolicy` (`"retry"` | `"fail-fast"`, default `"retry"`) - `createSlackWebClientBulk` helper creates clients with `rejectRateLimitedCalls: true` - `resolveViaTargetedApis` resolves IDs/emails via Tier 3/4 APIs, only falls back to `users.list` for name entries - Threaded through allowlist resolution, directory lookups, and onboarding **Critical issue:** `resolve-users.ts:274` doesn't use `createSlackWebClientBulk` for fail-fast mode, so targeted API calls and the `listSlackUsers` fallback will still retry on 429s instead of failing immediately. This undermines the entire purpose of the fail-fast policy. <h3>Confidence Score: 2/5</h3> - This PR contains a critical logic bug that breaks the core "fail-fast" functionality - The implementation has a significant bug in `resolve-users.ts` where the fail-fast policy doesn't actually use a client configured to reject rate-limited calls. This means the feature won't work as intended when `rateLimitPolicy: "fail-fast"` is set - it will still retry on 429s instead of failing fast. The fix is straightforward (use `createSlackWebClientBulk` when fail-fast is enabled), but the bug must be addressed before merging. - `src/slack/resolve-users.ts` requires immediate fix - the fail-fast client implementation is broken <sub>Last reviewed commit: 59add71</sub> <!-- greptile_other_comments_section --> <sub>(5/5) You can turn off certain types of comments like style [here](https://app.greptile.com/review/github)!</sub> <!-- /greptile_comment -->

Most Similar PRs