#19267: fix: derive failover reason from timedOut flag to prevent unknown cooldown cascade
Labels: agents · Cluster: Rate Limit Management Enhancements
## Summary
- **Problem:** When an LLM request times out but the raw assistant error message is empty or lacks timeout keywords, `assistantFailoverReason` resolves to `null` and the FailoverError reason falls through to `"unknown"` instead of `"timeout"`.
- **Why it matters:** An `"unknown"` reason triggers a profile cooldown (unlike `"timeout"`, which correctly skips cooldown). Since all models on `github-copilot` share the same auth profile, a single timeout on one model locks out **all** models on the provider — a full cascade failure.
- **What changed:** The FailoverError reason derivation now checks the `timedOut` flag and re-classifies from the formatted message before falling back to `"unknown"`. The chain is: `assistantFailoverReason` → `timedOut` flag → `classifyFailoverReason(message)` → `"unknown"`.
- **What did NOT change:** The `markAuthProfileFailure` call above already correctly checks `timedOut` — only the FailoverError thrown for model-fallback was affected. Prompt-error path is unaffected (it classifies from `errorText` directly).
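The derivation chain above can be sketched as follows. This is an illustrative reconstruction, not the actual `run.ts` code: `classifyFailoverReason` is named in this PR, but its keyword list and the `deriveFailoverReason` wrapper shown here are assumptions.

```typescript
// Hypothetical sketch of the fixed reason-derivation chain.
type FailoverReason = "timeout" | "rate_limit" | "unknown";

// Illustrative stand-in for the classifier named in the PR; the real
// keyword list in the codebase may differ.
function classifyFailoverReason(message: string): FailoverReason | null {
  const lower = message.toLowerCase();
  if (lower.includes("timed out") || lower.includes("timeout")) return "timeout";
  if (lower.includes("rate limit") || lower.includes("429")) return "rate_limit";
  return null;
}

function deriveFailoverReason(params: {
  assistantFailoverReason: FailoverReason | null;
  timedOut: boolean;
  message: string;
}): FailoverReason {
  // 1. Trust an explicit assistant-provided reason when present.
  if (params.assistantFailoverReason) return params.assistantFailoverReason;
  // 2. The fix: the timedOut flag wins before any message sniffing,
  //    so an empty error message can no longer yield "unknown".
  if (params.timedOut) return "timeout";
  // 3. Otherwise try to classify the formatted message, then give up.
  return classifyFailoverReason(params.message) ?? "unknown";
}
```

Step 2 is the change: previously an empty message skipped straight from step 1 to step 4, producing `"unknown"` and a cooldown.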
## Change Type (select all)
- [x] Bug fix
## Scope (select all touched areas)
- [x] Gateway / orchestration
- [x] Auth / tokens
## User-visible / Behavior Changes
- Timeouts no longer trigger provider-wide cooldown lockouts. Before: a single timeout on `claude-opus-4.6-1m` would lock out all remaining fallback models on the shared profile. After: the timeout is correctly classified, cooldown is skipped, and the other models remain available.
- FailoverError reason now shows `(timeout)` instead of `(unknown)` for timed-out requests.
## Security Impact (required)
- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
## Repro + Verification
### Environment
- OS: macOS
- Provider: github-copilot (all models)
### Steps
1. Send a message that causes `claude-opus-4.6-1m` to time out
2. Observe the error output
### Expected
- Timeout classified as `(timeout)`, no cooldown cascade
- Other models (`claude-opus-4.6-fast`, `gpt-5.3`, etc.) still available
### Actual (before fix)
- Timeout classified as `(unknown)`, triggers profile cooldown
- All remaining models show `"No available auth profile for github-copilot (all in cooldown or unavailable). (rate_limit)"`
## Evidence
Error logs before fix:
```
⚠️ Agent failed before reply: All models failed (5):
github-copilot/claude-opus-4.6-1m: LLM request timed out. (unknown)
github-copilot/claude-opus-4.6-fast: No available auth profile for github-copilot (all in cooldown or unavailable). (rate_limit)
github-copilot/claude-opus-4.6: No available auth profile ... (rate_limit)
github-copilot/gpt-5.3-codex: No available auth profile ... (rate_limit)
github-copilot/gpt-5.3: No available auth profile ... (rate_limit)
```
## Human Verification (required)
- Verified: code-level trace of `assistantFailoverReason` → `null` → `"unknown"` path
- Verified: `markAuthProfileFailure` already correctly handles `timedOut` (line ~907) — only the FailoverError throw was missing the check
- Edge cases: `classifyFailoverReason(message)` fallback handles cases where the formatted message contains other classifiable reasons
- Not verified: live end-to-end (no way to deterministically trigger a timeout)
## Compatibility / Migration
- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`
## Failure Recovery (if this breaks)
- Revert this single commit
- No config changes needed
## Risks and Mitigations
- Risk: Timeouts that were previously (incorrectly) triggering cooldowns may now retry more aggressively on the same profile.
- Mitigation: This is the correct behavior — the timeout-skip-cooldown logic was already intentional (see the `params.reason !== "timeout"` guard in `usage.ts`). The FailoverError was simply not propagating the correct reason.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Large PR that bundles the titled bug fix (failover reason derivation for timeouts) with significant new functionality: Copilot SDK integration (`copilot-cli` backend, SDK-based auth, model discovery), per-model cooldown scoping, reasoning-signature stripping for Copilot→Claude proxy, and message-delete dequeue for Discord/Slack.
**Core fix (failover reason):** The derivation chain `assistantFailoverReason → timedOut flag → classifyFailoverReason(message) → "unknown"` in `run.ts` correctly prevents timeouts from being misclassified as `"unknown"`, which would have triggered provider-wide cooldowns.
**Critical issue — per-model cooldown write path is incomplete:**
- `computeNextProfileUsageStats` in `usage.ts` accepts a `modelId` parameter but **never references it** in the function body. The `modelCooldowns` map on `ProfileUsageStats` is never written to.
- `rate_limit` failures with `modelId` still set profile-level `cooldownUntil` (the cascade the PR aims to prevent). `timeout` failures skip all cooldown writes.
- The read path (`isProfileInCooldownForModel`) correctly checks `modelCooldowns`, but since nothing populates it, per-model scoping will not function as intended.
- The tests in `auth-profiles.per-model-cooldown.test.ts` assert that `cooldownUntil` is `undefined` and `modelCooldowns` is set — these expectations are inconsistent with the current implementation and will likely fail.
- `clearExpiredCooldowns` also doesn't clean up expired `modelCooldowns` entries.
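One way the missing write path could be filled in is sketched below. This is a hedged illustration of the `modelId`-based branching the review describes, not the actual `computeNextProfileUsageStats`; the type shapes are inferred from the review text and the cooldown duration is a placeholder.

```typescript
// Illustrative sketch of the per-model cooldown write path the review
// says is missing: scope the cooldown to the failing model when a
// modelId is provided, instead of writing profile-level cooldownUntil.
interface ProfileUsageStats {
  cooldownUntil?: number;
  modelCooldowns?: Record<string, number>; // shape inferred from review
}

function computeNextProfileUsageStats(
  stats: ProfileUsageStats,
  params: { reason: string; modelId?: string },
  now: number = Date.now(),
): ProfileUsageStats {
  if (params.reason === "timeout") return { ...stats }; // no cooldown writes
  const cooldownUntil = now + 60_000; // placeholder duration
  if (params.modelId) {
    // Per-model scoping: leave profile-level cooldownUntil untouched.
    return {
      ...stats,
      modelCooldowns: { ...stats.modelCooldowns, [params.modelId]: cooldownUntil },
    };
  }
  return { ...stats, cooldownUntil };
}
```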
<h3>Confidence Score: 2/5</h3>
- The core failover-reason fix is correct, but the per-model cooldown write path appears incomplete — `computeNextProfileUsageStats` never writes to `modelCooldowns` despite accepting `modelId`, so rate-limit cooldowns will still be profile-wide.
- Score of 2 reflects that while the primary bug fix (timeout → "unknown" misclassification) is sound, the per-model cooldown feature has a significant gap: the write path in `computeNextProfileUsageStats` doesn't implement the `modelId`-based branching needed to populate `modelCooldowns`. This means rate-limit and timeout cooldowns will still apply at the profile level, partially negating the PR's goal of preventing cascade failures. The associated tests appear to have expectations that don't match the implementation.
- `src/agents/auth-profiles/usage.ts` — `computeNextProfileUsageStats` needs the `modelId` branching logic to actually write per-model cooldowns. `src/agents/auth-profiles.per-model-cooldown.test.ts` — test expectations likely don't match current implementation.
<sub>Last reviewed commit: 8aa2cb5</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #14574: fix: gentler rate-limit cooldown backoff + clear stale cooldowns on... (JamesEBall · 2026-02-12 · 84.4%)
- #14824: fix: do not trigger provider cooldown on LLM request timeouts (CyberSinister · 2026-02-12 · 83.8%)
- #23816: fix(agents): model fallback skipped during session overrides and pr... (ramezgaberiel · 2026-02-22 · 83.2%)
- #23210: fix: avoid cooldown on timeout/unknown failovers (nydamon · 2026-02-22 · 83.0%)
- #16797: fix(auth-profiles): implement per-model rate limit cooldown tracking (mulhamna · 2026-02-15 · 82.1%)
- #14914: fix: resolve actual failure reason for cooldown-skipped providers (mcaxtr · 2026-02-12 · 81.5%)
- #20388: fix(failover): don't skip same-provider fallback models when cooldo... (Limitless2023 · 2026-02-18 · 81.1%)
- #4462: fix: prevent gateway crash when all auth profiles are in cooldown (garnetlyx · 2026-01-30 · 80.5%)
- #16838: fix: include configured fallbacks in model allowlist (taw0002 · 2026-02-15 · 80.4%)
- #18902: fix: exempt format errors from auth profile cooldown (tag-assistant · 2026-02-17 · 80.2%)