#22368: fix: first-token timeout + provider-level skip for model fallback
## Summary
- **Problem:** When an LLM provider is unresponsive (TCP connects but inference endpoint hangs), the full 300s timeout is consumed per model before fallback. With N models on the same dead provider, users wait N×300s with no response.
- **Why it matters:** A single provider outage (e.g., NIM going down) makes the entire system unresponsive for 5-15+ minutes despite having fallback models configured on other providers (e.g., Ollama).
- **What changed:** Added a 30s first-token timeout that detects dead endpoints early, and provider-level skip logic that avoids retrying models on a provider that already timed out.
- **What did NOT change:** No changes to config schema, API contracts, or existing timeout behavior for working providers. The full `timeoutMs` still applies for slow-but-responsive models.
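As a rough illustration of the mechanism (not the actual `attempt.ts` code; `prompt`, the hook name, and the error type are assumptions for this sketch), the first-token timeout amounts to racing the prompt against a timer that is cancelled on the first assistant token:

```typescript
// Hedged sketch: illustrative names only, not the repo's real API.
// The real implementation additionally aborts the run via abortRun(true).
class FirstTokenTimeoutError extends Error {}

function withFirstTokenTimeout<T>(
  prompt: (hooks: { onAssistantMessageStart: () => void }) => Promise<T>,
  firstTokenTimeoutMs = 30_000,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    if (firstTokenTimeoutMs > 0) {
      // Fires only if the model never starts responding.
      timer = setTimeout(
        () => reject(new FirstTokenTimeoutError("no first token within window")),
        firstTokenTimeoutMs,
      );
    }
    prompt({
      // Clearing here means the full timeoutMs still governs
      // slow-but-streaming models once the first token arrives.
      onAssistantMessageStart: () => clearTimeout(timer),
    })
      .then(resolve, reject)
      .finally(() => clearTimeout(timer)); // safety-net cleanup
  });
}
```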
## Change Type (select all)
- [x] Bug fix
## Scope (select all touched areas)
- [x] Gateway / orchestration
## Linked Issue/PR
- Related #22364
- Related #5980
- Related #8724
- Related #11715
## User-visible / Behavior Changes
- Dead LLM provider endpoints are now detected within ~30s instead of 300s
- Fallback models on different providers are reached much faster during provider outages
- New optional parameter `firstTokenTimeoutMs` (default: `30000`) can be passed to customize the detection window; setting it to `0` disables the check.
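For example (illustrative params object; every field name other than `firstTokenTimeoutMs` is an assumption, not the repo's exact run-params type):

```typescript
// Illustrative only: surrounding field names are assumed for this sketch.
const runParams = {
  timeoutMs: 300_000,          // unchanged: overall per-attempt budget
  firstTokenTimeoutMs: 30_000, // new: abort early if no first token arrives
  // firstTokenTimeoutMs: 0,   // opt out of first-token detection entirely
};
```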
## Security Impact (required)
- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
## Repro + Verification
### Environment
- OS: Linux (Docker AIO)
- Runtime/container: Node 22, Docker
- Model/provider: NIM (DeepSeek V3.2, Kimi K2, Nemotron Super) + Ollama fallback
- Relevant config: Multiple NIM models as primary/fallbacks, Ollama as last resort
### Steps
1. Configure primary model + fallbacks on same provider (e.g., 3 NIM models + 1 Ollama)
2. Provider inference endpoint becomes unresponsive (TCP connects, no SSE data)
3. Send a message
### Expected
- Dead endpoint detected within 30s, fallback to Ollama within ~35s
### Actual (before fix)
- Each of the three NIM models hangs for the full 300s in sequence (~900s total)
- Ollama never reached (run timeout exceeded)
## Evidence
- [x] Failing test/log before + passing after
- [x] Trace/log snippets
New test `src/agents/model-fallback.test.ts` verifies:
1. Provider skip after timeout (only 2 actual calls instead of 3)
2. Non-timeout failures do NOT trigger provider skip
Log evidence from production: `LLM request timed out after 300061ms` with all NIM models failing sequentially.
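The skip behavior those tests exercise can be sketched roughly like this (hypothetical types and names, not the actual `model-fallback.ts` code):

```typescript
// Hedged sketch of provider-level skip; Candidate/AttemptRecord are
// illustrative shapes, not the repo's real types.
interface Candidate { provider: string; modelId: string }
type AttemptStatus = "ok" | "timeout" | "error";
interface AttemptRecord { modelId: string; status: AttemptStatus | "skipped" }

async function runWithFallback(
  candidates: Candidate[],
  attempt: (c: Candidate) => Promise<AttemptStatus>,
): Promise<AttemptRecord[]> {
  const timedOutProviders = new Set<string>();
  const records: AttemptRecord[] = [];
  for (const c of candidates) {
    if (timedOutProviders.has(c.provider)) {
      // Synthetic record: don't re-dial a provider that already timed out.
      records.push({ modelId: c.modelId, status: "skipped" });
      continue;
    }
    const status = await attempt(c);
    records.push({ modelId: c.modelId, status });
    if (status === "ok") break;
    // Only timeouts poison the provider; rate limits / auth errors do not.
    if (status === "timeout") timedOutProviders.add(c.provider);
  }
  return records;
}
```

With three models on a dead provider plus one healthy fallback, only the first dead model is actually dialed; the rest are skipped synthetically.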
## Human Verification (required)
- Verified scenarios: Provider timeout triggers skip, rate-limit does NOT trigger skip, first-token timer clears on response
- Edge cases checked: Timer cleanup in finally block, probe sessions suppressed, 0 disables first-token timeout
- What you did **not** verify: Live integration test with actual dead NIM endpoint (reproduced via unit test mocks)
## Compatibility / Migration
- Backward compatible? `Yes`
- Config/env changes? `No` (new param is optional with sensible default)
- Migration needed? `No`
## Failure Recovery (if this breaks)
- How to disable/revert: Set `firstTokenTimeoutMs: 0` in run params to disable first-token timeout
- Known bad symptoms: If a legitimate model takes >30s to emit first token, it would be incorrectly aborted. This is unlikely for responsive providers but possible for very slow local models. Users can increase the timeout.
## Risks and Mitigations
- Risk: Slow but working models aborted by 30s first-token timeout
- Mitigation: 30s is generous for first token (most models respond in <5s). Configurable via `firstTokenTimeoutMs`. Local Ollama models that are slow would typically be the last fallback candidate anyway.
---
🤖 AI-assisted (Claude). Lightly tested via unit tests. Code reviewed and understood.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds two complementary optimizations to the model fallback system for faster recovery from unresponsive LLM providers:
- **First-token timeout** (`attempt.ts`): A configurable timer (default 30s) is started before each `prompt()` call and cleared when the model begins responding (`onAssistantMessageStart`). If no response arrives within the window, the attempt is aborted early via `abortRun(true)`, producing a proper `TimeoutError` that the fallback system recognizes. The timer is also cleaned up in the `finally` block as a safety net. Probe sessions correctly suppress warning logs.
- **Provider-level skip** (`model-fallback.ts`): When a model times out, its provider is recorded in a `timedOutProviders` Set. Subsequent fallback candidates on the same provider are skipped immediately (with a synthetic "skipped" attempt record), avoiding redundant waits against a dead endpoint. Non-timeout failures (rate limits, auth errors) do not trigger provider skip.
- **Parameter plumbing** (`params.ts`, `run.ts`): The new optional `firstTokenTimeoutMs` parameter is threaded through the params type and passthrough layer.
- **Tests** (`model-fallback.test.ts`): Two tests verify provider skip after a timeout and no skip for rate-limit errors. A minor unused import of `_probeThrottleInternals` remains.
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — it adds an opt-out-able optimization with correct timeout classification and cleanup, and does not change behavior for responsive providers.
- The implementation is well-structured with proper timer cleanup, correct abort classification (TimeoutError flows through to FailoverError detection), and defensive handling (finally block, probe session suppression). The provider-skip logic is sound and correctly scoped to timeout failures only. The only minor issue is an unused import in the test file. Score is 4 rather than 5 because the test coverage is limited to the provider-skip logic in model-fallback.ts and does not include integration-level tests for the first-token timeout in attempt.ts.
- No files require special attention. `src/agents/pi-embedded-runner/run/attempt.ts` has the most impactful change (first-token timeout mechanism) but the implementation is clean.
<sub>Last reviewed commit: fb17761</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->