#18733: feat(infra): add LLM endpoint concurrency limiting (mutex)
# Add LLM Endpoint Concurrency Limiting (Mutex)
## Problem
When multiple isolated agents run concurrently and share the same LLM endpoint (especially local LLMs like `llamacpp` or `vllm`), they compete for resources. This often leads to significant slowdowns, timeouts, and resource exhaustion. For example, the `heartbeat-agent` frequently times out when multiple agent sessions overlap on a single-concurrency local model.
## Solution
This PR introduces a **Provider Concurrency Limiter**, a mutex/queuing mechanism at the provider endpoint level. It ensures that requests to the same endpoint are serialized or limited to a configurable concurrency level.
### Key Features
- **Per-Endpoint Queuing:** Requests are queued based on the normalized provider ID (host/port for local URLs, or provider name for cloud services).
- **Configurable Limits:** Users can set `maxConcurrent` and `queueTimeoutMs` globally or per-provider.
- **Hot Reload Support:** Concurrency settings are applied instantly during gateway config hot-reloads.
- **Priority Support:** Internal support for priority-based dequeuing, useful for future time-sensitive agent needs.
- **Safe Integration:** Wraps agent execution attempts without modifying the underlying provider API implementations.
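To make the queuing behavior concrete, here is a minimal sketch of the per-endpoint semaphore idea described above. Class and method names (`EndpointSemaphore`, `acquire`) are illustrative, not the PR's actual API:

```typescript
// Minimal per-endpoint semaphore sketch: admits up to maxConcurrent callers
// and rejects queued callers that wait longer than queueTimeoutMs.
type Release = () => void;

class EndpointSemaphore {
  private active = 0;
  private waiters: Array<(r: Release) => void> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly queueTimeoutMs: number,
  ) {}

  // Resolves with a release callback once a slot is free; rejects on timeout.
  acquire(): Promise<Release> {
    if (this.active < this.maxConcurrent) {
      this.active++;
      return Promise.resolve(() => this.release());
    }
    return new Promise<Release>((resolve, reject) => {
      let timer: ReturnType<typeof setTimeout>;
      const grant = (r: Release) => {
        clearTimeout(timer);
        resolve(r);
      };
      timer = setTimeout(() => {
        const i = this.waiters.indexOf(grant);
        if (i >= 0) {
          this.waiters.splice(i, 1);
          reject(new Error(`queue timeout after ${this.queueTimeoutMs}ms`));
        }
      }, this.queueTimeoutMs);
      this.waiters.push(grant);
    });
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      // Hand the freed slot directly to the next waiter (FIFO).
      next(() => this.release());
    } else {
      this.active--;
    }
  }
}
```

A caller would wrap each request as `const release = await sem.acquire(); try { ... } finally { release(); }`, which is the shape of the "safe integration" around agent execution.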
## New Configuration Options
Add the following to your `config.yaml` under the `models` section:
```yaml
models:
  # Global default concurrency limit
  defaultConcurrency:
    maxConcurrent: 2
    queueTimeoutMs: 60000 # 1 minute
    verbose: true
  providers:
    llamacpp:
      baseUrl: "http://localhost:8000"
      concurrency:
        maxConcurrent: 1 # Strict serialization for local model
        queueTimeoutMs: 30000
```
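The per-provider settings key on a normalized provider ID (host/port for local URLs, provider name for cloud services). A sketch of how such normalization might look — `normalizeProviderId` is a hypothetical helper, not the PR's actual function:

```typescript
// Hypothetical normalization: local endpoints are keyed by host:port so that
// different provider names pointing at the same local server share one queue;
// cloud providers fall back to the provider name.
function normalizeProviderId(provider: string, baseUrl?: string): string {
  if (baseUrl) {
    try {
      const url = new URL(baseUrl);
      const isLocal =
        url.hostname === "localhost" || url.hostname === "127.0.0.1";
      if (isLocal) {
        const port = url.port || (url.protocol === "https:" ? "443" : "80");
        return `${url.hostname}:${port}`;
      }
    } catch {
      // Unparseable URL: fall through to the provider name.
    }
  }
  return provider;
}
```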
## Implementation Details
- `src/infra/provider-concurrency-limiter.ts`: Core registry and semaphore logic.
- `src/infra/provider-concurrency-loader.ts`: Translates OpenClaw config to limiter settings.
- `src/gateway/server.impl.ts` & `src/gateway/server-reload-handlers.ts`: Initialization and hot-reload logic.
- `src/commands/agent.ts`: Wrapped `runAgentAttempt` to enforce limits during agent execution.
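The loader/hot-reload side can be pictured as a registry that stores per-provider settings and is re-applied whenever the gateway config reloads. This is a sketch under assumed names (`LimiterRegistry`, `applyConfig`), not the actual code in `provider-concurrency-loader.ts`:

```typescript
// Hypothetical registry: holds default and per-provider concurrency settings
// and answers lookups by provider ID. applyConfig runs at gateway startup and
// again on every config hot-reload, so new limits take effect immediately.
interface ConcurrencySettings {
  maxConcurrent: number;
  queueTimeoutMs: number;
}

class LimiterRegistry {
  private defaults: ConcurrencySettings = {
    maxConcurrent: 2,
    queueTimeoutMs: 60_000,
  };
  private perProvider = new Map<string, ConcurrencySettings>();

  applyConfig(
    defaults: ConcurrencySettings,
    providers: Record<string, ConcurrencySettings>,
  ): void {
    this.defaults = defaults;
    this.perProvider = new Map(Object.entries(providers));
  }

  settingsFor(providerId: string): ConcurrencySettings {
    return this.perProvider.get(providerId) ?? this.defaults;
  }
}
```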
## Testing
- Added comprehensive unit tests in `src/infra/provider-concurrency-limiter.test.ts`.
- Verified hot-reload behavior in a live gateway environment.
- Confirmed that concurrent agent requests queue up correctly when `maxConcurrent: 1` is set for a specific provider.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Introduces a provider-level concurrency limiter (mutex/semaphore) to prevent resource contention when multiple agents share the same LLM endpoint. The implementation adds a global singleton registry with per-provider queuing, priority support, and configurable timeouts, wired into both gateway startup and hot-reload paths.
- The concurrency limiter wraps the **entire agent execution** (`runCliAgent`/`runEmbeddedPiAgent`), not individual LLM API calls. For `maxConcurrent: 1`, this means only one agent session can run at a time per provider — even during tool execution when the LLM is idle. This is a significant design choice worth documenting explicitly.
- No validation on `maxConcurrent` value — setting `maxConcurrent: 0` silently causes all requests to queue indefinitely until timeout, with no warning.
- Previously flagged: provider ID resolution uses `modelOverride` instead of `providerOverride`, which can cause the limiter to be bypassed for per-provider configs.
- Previously flagged: `ProviderConcurrencyConfig` type is duplicated between `types.models.ts` and `provider-concurrency-limiter.ts`.
- Test coverage is solid, covering core semaphore semantics, priority ordering, timeout behavior, and error handling.
<h3>Confidence Score: 3/5</h3>
- This PR is functional but has design concerns around lock granularity and a configuration edge case that should be addressed before merging.
- The core semaphore logic is correct and well-tested. However, the concurrency slot wraps the entire agent run (not individual LLM calls), which may cause unexpected blocking. The `maxConcurrent: 0` edge case silently deadlocks all requests. Previously flagged issues (provider ID mismatch, duplicate type) also remain relevant.
- Pay close attention to `src/commands/agent.ts` (lock scope and provider ID resolution) and `src/infra/provider-concurrency-limiter.ts` (missing maxConcurrent validation).
<sub>Last reviewed commit: 0a407e2</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->