#18733: feat(infra): add LLM endpoint concurrency limiting (mutex)
# Add LLM Endpoint Concurrency Limiting (Mutex)
## Problem
When multiple isolated agents run concurrently and share the same LLM endpoint (especially local LLMs like `llamacpp` or `vllm`), they compete for resources. This often leads to significant slowdowns, timeouts, and resource exhaustion. For example, the `heartbeat-agent` frequently times out when multiple agent sessions overlap on a single-concurrency local model.
## Solution
This PR introduces a **Provider Concurrency Limiter**, a mutex/queuing mechanism at the provider endpoint level. It ensures that requests to the same endpoint are serialized or limited to a configurable concurrency level.
### Key Features
- **Per-Endpoint Queuing:** Requests are queued based on the normalized provider ID (host/port for local URLs, or provider name for cloud services).
- **Configurable Limits:** Users can set `maxConcurrent` and `queueTimeoutMs` globally or per-provider.
- **Hot Reload Support:** Concurrency settings are applied instantly during gateway config hot-reloads.
- **Priority Support:** Internal support for priority-based dequeuing, useful for future time-sensitive agent needs.
- **Safe Integration:** Wraps agent execution attempts without modifying the underlying provider API implementations.
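To make the queuing behavior concrete, here is a minimal sketch of the per-endpoint semaphore idea described above. Class and method names (`EndpointSemaphore`, `acquire`) are illustrative, not the PR's actual API:

```typescript
// Minimal per-endpoint semaphore sketch: admits up to maxConcurrent callers
// and rejects queued callers that wait longer than queueTimeoutMs.
type Release = () => void;

class EndpointSemaphore {
  private active = 0;
  private waiters: Array<(r: Release) => void> = [];

  constructor(
    private readonly maxConcurrent: number,
    private readonly queueTimeoutMs: number,
  ) {}

  // Resolves with a release callback once a slot is free; rejects on timeout.
  acquire(): Promise<Release> {
    if (this.active < this.maxConcurrent) {
      this.active++;
      return Promise.resolve(() => this.release());
    }
    return new Promise<Release>((resolve, reject) => {
      let timer: ReturnType<typeof setTimeout>;
      const grant = (r: Release) => {
        clearTimeout(timer);
        resolve(r);
      };
      timer = setTimeout(() => {
        const i = this.waiters.indexOf(grant);
        if (i >= 0) {
          this.waiters.splice(i, 1);
          reject(new Error(`queue timeout after ${this.queueTimeoutMs}ms`));
        }
      }, this.queueTimeoutMs);
      this.waiters.push(grant);
    });
  }

  private release(): void {
    const next = this.waiters.shift();
    if (next) {
      // Hand the freed slot directly to the next waiter (FIFO).
      next(() => this.release());
    } else {
      this.active--;
    }
  }
}
```

A caller would wrap each request as `const release = await sem.acquire(); try { ... } finally { release(); }`, which is the shape of the "safe integration" around agent execution.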
## New Configuration Options
Add the following to your `config.yaml` under the `models` section:
```yaml
models:
  # Global default concurrency limit
  defaultConcurrency:
    maxConcurrent: 2
    queueTimeoutMs: 60000 # 1 minute
    verbose: true
  providers:
    llamacpp:
      baseUrl: "http://localhost:8000"
      concurrency:
        maxConcurrent: 1 # Strict serialization for local model
        queueTimeoutMs: 30000
```
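The per-provider settings key on a normalized provider ID (host/port for local URLs, provider name for cloud services). A sketch of how such normalization might look — `normalizeProviderId` is a hypothetical helper, not the PR's actual function:

```typescript
// Hypothetical normalization: local endpoints are keyed by host:port so that
// different provider names pointing at the same local server share one queue;
// cloud providers fall back to the provider name.
function normalizeProviderId(provider: string, baseUrl?: string): string {
  if (baseUrl) {
    try {
      const url = new URL(baseUrl);
      const isLocal =
        url.hostname === "localhost" || url.hostname === "127.0.0.1";
      if (isLocal) {
        const port = url.port || (url.protocol === "https:" ? "443" : "80");
        return `${url.hostname}:${port}`;
      }
    } catch {
      // Unparseable URL: fall through to the provider name.
    }
  }
  return provider;
}
```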
## Implementation Details
- `src/infra/provider-concurrency-limiter.ts`: Core registry and semaphore logic.
- `src/infra/provider-concurrency-loader.ts`: Translates OpenClaw config to limiter settings.
- `src/gateway/server.impl.ts` & `src/gateway/server-reload-handlers.ts`: Initialization and hot-reload logic.
- `src/commands/agent.ts`: Wrapped `runAgentAttempt` to enforce limits during agent execution.
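The loader/hot-reload side can be pictured as a registry that stores per-provider settings and is re-applied whenever the gateway config reloads. This is a sketch under assumed names (`LimiterRegistry`, `applyConfig`), not the actual code in `provider-concurrency-loader.ts`:

```typescript
// Hypothetical registry: holds default and per-provider concurrency settings
// and answers lookups by provider ID. applyConfig runs at gateway startup and
// again on every config hot-reload, so new limits take effect immediately.
interface ConcurrencySettings {
  maxConcurrent: number;
  queueTimeoutMs: number;
}

class LimiterRegistry {
  private defaults: ConcurrencySettings = {
    maxConcurrent: 2,
    queueTimeoutMs: 60_000,
  };
  private perProvider = new Map<string, ConcurrencySettings>();

  applyConfig(
    defaults: ConcurrencySettings,
    providers: Record<string, ConcurrencySettings>,
  ): void {
    this.defaults = defaults;
    this.perProvider = new Map(Object.entries(providers));
  }

  settingsFor(providerId: string): ConcurrencySettings {
    return this.perProvider.get(providerId) ?? this.defaults;
  }
}
```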
## Testing
- Added comprehensive unit tests in `src/infra/provider-concurrency-limiter.test.ts`.
- Verified hot-reload behavior in a live gateway environment.
- Confirmed that concurrent agent requests queue up correctly when `maxConcurrent: 1` is set for a specific provider.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Introduces a provider-level concurrency limiter (mutex/semaphore) to prevent resource contention when multiple agents share the same LLM endpoint. The implementation adds a global singleton registry with per-provider queuing, priority support, and configurable timeouts, wired into both gateway startup and hot-reload paths.
- The concurrency limiter wraps the **entire agent execution** (`runCliAgent`/`runEmbeddedPiAgent`), not individual LLM API calls. For `maxConcurrent: 1`, this means only one agent session can run at a time per provider — even during tool execution when the LLM is idle. This is a significant design choice worth documenting explicitly.
- No validation on `maxConcurrent` value — setting `maxConcurrent: 0` silently causes all requests to queue indefinitely until timeout, with no warning.
- Previously flagged: provider ID resolution uses `modelOverride` instead of `providerOverride`, which can cause the limiter to be bypassed for per-provider configs.
- Previously flagged: `ProviderConcurrencyConfig` type is duplicated between `types.models.ts` and `provider-concurrency-limiter.ts`.
- Test coverage is solid, covering core semaphore semantics, priority ordering, timeout behavior, and error handling.
<h3>Confidence Score: 3/5</h3>
- This PR is functional but has design concerns around lock granularity and a configuration edge case that should be addressed before merging.
- The core semaphore logic is correct and well-tested. However, the concurrency slot wraps the entire agent run (not individual LLM calls), which may cause unexpected blocking. The `maxConcurrent: 0` edge case silently deadlocks all requests. Previously flagged issues (provider ID mismatch, duplicate type) also remain relevant.
- Pay close attention to `src/commands/agent.ts` (lock scope and provider ID resolution) and `src/infra/provider-concurrency-limiter.ts` (missing maxConcurrent validation).
<sub>Last reviewed commit: 0a407e2</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->