#22368: fix: first-token timeout + provider-level skip for model fallback
## Summary
- **Problem:** When an LLM provider is unresponsive (TCP connects but inference endpoint hangs), the full 300s timeout is consumed per model before fallback. With N models on the same dead provider, users wait N×300s with no response.
- **Why it matters:** A single provider outage (e.g., NIM going down) makes the entire system unresponsive for 5-15+ minutes despite having fallback models configured on other providers (e.g., Ollama).
- **What changed:** Added a 30s first-token timeout that detects dead endpoints early, and provider-level skip logic that avoids retrying models on a provider that already timed out.
- **What did NOT change:** No changes to config schema, API contracts, or existing timeout behavior for working providers. The full `timeoutMs` still applies for slow-but-responsive models.
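As a rough illustration of the mechanism (not the actual `attempt.ts` code; `prompt`, the hook name, and the error type are assumptions for this sketch), the first-token timeout amounts to racing the prompt against a timer that is cancelled on the first assistant token:

```typescript
// Hedged sketch: illustrative names only, not the repo's real API.
// The real implementation additionally aborts the run via abortRun(true).
class FirstTokenTimeoutError extends Error {}

function withFirstTokenTimeout<T>(
  prompt: (hooks: { onAssistantMessageStart: () => void }) => Promise<T>,
  firstTokenTimeoutMs = 30_000,
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    if (firstTokenTimeoutMs > 0) {
      // Fires only if the model never starts responding.
      timer = setTimeout(
        () => reject(new FirstTokenTimeoutError("no first token within window")),
        firstTokenTimeoutMs,
      );
    }
    prompt({
      // Clearing here means the full timeoutMs still governs
      // slow-but-streaming models once the first token arrives.
      onAssistantMessageStart: () => clearTimeout(timer),
    })
      .then(resolve, reject)
      .finally(() => clearTimeout(timer)); // safety-net cleanup
  });
}
```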
## Change Type (select all)
- [x] Bug fix
## Scope (select all touched areas)
- [x] Gateway / orchestration
## Linked Issue/PR
- Related #22364
- Related #5980
- Related #8724
- Related #11715
## User-visible / Behavior Changes
- Dead LLM provider endpoints are now detected within ~30s instead of 300s
- Fallback models on different providers are reached much faster during provider outages
- New optional parameter `firstTokenTimeoutMs` (default: `30000`) can be passed to customize the detection window; setting it to `0` disables the check.
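For example (illustrative params object; every field name other than `firstTokenTimeoutMs` is an assumption, not the repo's exact run-params type):

```typescript
// Illustrative only: surrounding field names are assumed for this sketch.
const runParams = {
  timeoutMs: 300_000,          // unchanged: overall per-attempt budget
  firstTokenTimeoutMs: 30_000, // new: abort early if no first token arrives
  // firstTokenTimeoutMs: 0,   // opt out of first-token detection entirely
};
```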
## Security Impact (required)
- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
## Repro + Verification
### Environment
- OS: Linux (Docker AIO)
- Runtime/container: Node 22, Docker
- Model/provider: NIM (DeepSeek V3.2, Kimi K2, Nemotron Super) + Ollama fallback
- Relevant config: Multiple NIM models as primary/fallbacks, Ollama as last resort
### Steps
1. Configure primary model + fallbacks on same provider (e.g., 3 NIM models + 1 Ollama)
2. Provider inference endpoint becomes unresponsive (TCP connects, no SSE data)
3. Send a message
### Expected
- Dead endpoint detected within 30s, fallback to Ollama within ~35s
### Actual (before fix)
- Each of the three NIM models hangs for the full 300s in sequence (~900s total)
- Ollama never reached (run timeout exceeded)
## Evidence
- [x] Failing test/log before + passing after
- [x] Trace/log snippets
New test `src/agents/model-fallback.test.ts` verifies:
1. Provider skip after timeout (only 2 actual calls instead of 3)
2. Non-timeout failures do NOT trigger provider skip
Log evidence from production: `LLM request timed out after 300061ms` with all NIM models failing sequentially.
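The skip behavior those tests exercise can be sketched roughly like this (hypothetical types and names, not the actual `model-fallback.ts` code):

```typescript
// Hedged sketch of provider-level skip; Candidate/AttemptRecord are
// illustrative shapes, not the repo's real types.
interface Candidate { provider: string; modelId: string }
type AttemptStatus = "ok" | "timeout" | "error";
interface AttemptRecord { modelId: string; status: AttemptStatus | "skipped" }

async function runWithFallback(
  candidates: Candidate[],
  attempt: (c: Candidate) => Promise<AttemptStatus>,
): Promise<AttemptRecord[]> {
  const timedOutProviders = new Set<string>();
  const records: AttemptRecord[] = [];
  for (const c of candidates) {
    if (timedOutProviders.has(c.provider)) {
      // Synthetic record: don't re-dial a provider that already timed out.
      records.push({ modelId: c.modelId, status: "skipped" });
      continue;
    }
    const status = await attempt(c);
    records.push({ modelId: c.modelId, status });
    if (status === "ok") break;
    // Only timeouts poison the provider; rate limits / auth errors do not.
    if (status === "timeout") timedOutProviders.add(c.provider);
  }
  return records;
}
```

With three models on a dead provider plus one healthy fallback, only the first dead model is actually dialed; the rest are skipped synthetically.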
## Human Verification (required)
- Verified scenarios: Provider timeout triggers skip, rate-limit does NOT trigger skip, first-token timer clears on response
- Edge cases checked: Timer cleanup in finally block, probe sessions suppressed, 0 disables first-token timeout
- What you did **not** verify: Live integration test with actual dead NIM endpoint (reproduced via unit test mocks)
## Compatibility / Migration
- Backward compatible? `Yes`
- Config/env changes? `No` (new param is optional with sensible default)
- Migration needed? `No`
## Failure Recovery (if this breaks)
- How to disable/revert: Set `firstTokenTimeoutMs: 0` in run params to disable first-token timeout
- Known bad symptoms: If a legitimate model takes >30s to emit first token, it would be incorrectly aborted. This is unlikely for responsive providers but possible for very slow local models. Users can increase the timeout.
## Risks and Mitigations
- Risk: Slow but working models aborted by 30s first-token timeout
- Mitigation: 30s is generous for first token (most models respond in <5s). Configurable via `firstTokenTimeoutMs`. Local Ollama models that are slow would typically be the last fallback candidate anyway.
---
🤖 AI-assisted (Claude). Lightly tested via unit tests. Code reviewed and understood.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds two complementary optimizations to the model fallback system for faster recovery from unresponsive LLM providers:
- **First-token timeout** (`attempt.ts`): A configurable timer (default 30s) is started before each `prompt()` call and cleared when the model begins responding (`onAssistantMessageStart`). If no response arrives within the window, the attempt is aborted early via `abortRun(true)`, producing a proper `TimeoutError` that the fallback system recognizes. The timer is also cleaned up in the `finally` block as a safety net. Probe sessions correctly suppress warning logs.
- **Provider-level skip** (`model-fallback.ts`): When a model times out, its provider is recorded in a `timedOutProviders` Set. Subsequent fallback candidates on the same provider are skipped immediately (with a synthetic "skipped" attempt record), avoiding redundant waits against a dead endpoint. Non-timeout failures (rate limits, auth errors) do not trigger provider skip.
- **Parameter plumbing** (`params.ts`, `run.ts`): The new optional `firstTokenTimeoutMs` parameter is threaded through the params type and passthrough layer.
- **Tests** (`model-fallback.test.ts`): Two tests verify provider skip after a timeout and no skip for rate-limit errors. A minor unused import of `_probeThrottleInternals` remains.
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — it adds an opt-out-able optimization with correct timeout classification and cleanup, and does not change behavior for responsive providers.
- The implementation is well-structured with proper timer cleanup, correct abort classification (TimeoutError flows through to FailoverError detection), and defensive handling (finally block, probe session suppression). The provider-skip logic is sound and correctly scoped to timeout failures only. The only minor issue is an unused import in the test file. Score is 4 rather than 5 because the test coverage is limited to the provider-skip logic in model-fallback.ts and does not include integration-level tests for the first-token timeout in attempt.ts.
- No files require special attention. `src/agents/pi-embedded-runner/run/attempt.ts` has the most impactful change (first-token timeout mechanism) but the implementation is clean.
<sub>Last reviewed commit: fb17761</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->