#19252: fix(agents): continue model fallback on failover text payloads
commands
agents
size: M
Cluster: Model Fallbacks and Rate Limiting
## Summary
- detect failover-shaped error payloads returned as successful run results in `runWithModelFallback`
- convert those payload-only failures into fallback retries so the chain advances instead of stopping on OpenRouter-style `402` text
- keep guardrails to avoid false positives for normal instructional text mentioning rate limits
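
A minimal sketch of the control flow described above, not the actual `runWithModelFallback` code: a result that comes back "successfully" but carries a failover-shaped payload is recorded as a failed attempt, and the loop moves on to the next model. The names `runWithFallbackSketch`, `RunResult`, `Attempt`, and the `detectPayloadFailure` callback are hypothetical, and the loop is synchronous only to keep the example short.

```typescript
// Hypothetical shapes for illustration; the real types in model-fallback.ts may differ.
type RunResult = { text: string; isError?: boolean };
type Attempt = { model: string; reason: string };

function runWithFallbackSketch(
  models: string[],
  run: (model: string) => RunResult,
  // Assumed detector: returns a failure message when a "successful" result
  // is really a provider error, otherwise undefined.
  detectPayloadFailure: (result: RunResult) => string | undefined,
): { model: string; result: RunResult; attempts: Attempt[] } {
  const attempts: Attempt[] = [];
  for (const model of models) {
    let reason = "unknown failure";
    try {
      const result = run(model);
      const failure = detectPayloadFailure(result);
      if (failure === undefined) {
        // Genuine success: stop here and report which model answered.
        return { model, result, attempts };
      }
      // Payload-only failure: treat it like a thrown error and keep going.
      reason = failure;
    } catch (err) {
      reason = err instanceof Error ? err.message : String(err);
    }
    attempts.push({ model, reason });
  }
  const summary = attempts.map((a) => `${a.model}: ${a.reason}`).join("; ");
  throw new Error(`All models failed (${summary})`);
}
```

With a primary that returns a `402` text payload and a healthy backup, the sketch returns the backup's result and lists the primary under `attempts`.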
## Testing
- `pnpm vitest run --config vitest.e2e.config.ts src/agents/model-fallback.e2e.test.ts`
- `pnpm oxlint --type-aware src/agents/model-fallback.ts src/agents/model-fallback.e2e.test.ts`
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Extends the model fallback system to detect and retry when providers return failover-shaped error payloads as "successful" run results. The PR adds:
- **Payload-level failover detection** (`resolveFailoverPayloadMessage`) in `model-fallback.ts:89-134` that inspects successful run results for error text payloads and converts them to fallback retries
- **New billing error patterns** (`requires more credits`, `can only afford`) to catch OpenRouter-style 402 messages
- **Context parameter** (`ModelFallbackRunContext`) passed to all run callbacks, enabling callers to know when fallback chains are active
- **`probePrimaryDuringCooldown` configuration** set to `"always"` across auto-reply, followup, memory, and CLI flows so primary models are always attempted first (falling back if rate-limited)
- **Cron agent model merge fix** preserving default `fallbacks` when agent configs only override `primary`
- **User-facing fallback notices** shown when billing/rate-limit causes model switching
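
To illustrate the cron agent model merge item, here is a small sketch of a merge that keeps the default `fallbacks` when an agent config overrides only `primary`. The `ModelConfig` shape, the function name `mergeModelConfigSketch`, and the rule that an explicit `fallbacks` array (even an empty one) replaces the defaults are assumptions made for this example, not details confirmed by the PR.

```typescript
// Hypothetical config shape; the real cron agent config may differ.
type ModelConfig = { primary?: string; fallbacks?: string[] };

function mergeModelConfigSketch(
  defaults: ModelConfig,
  override: ModelConfig | undefined,
): ModelConfig {
  if (override === undefined) {
    return { ...defaults };
  }
  return {
    // Each field falls back to the default only when the override omits it,
    // so overriding `primary` alone no longer drops the default fallbacks.
    primary: override.primary ?? defaults.primary,
    fallbacks: override.fallbacks ?? defaults.fallbacks,
  };
}
```

Under these assumptions, merging `{ primary: "x" }` over defaults with two fallbacks keeps both fallbacks, while `{ primary: "x", fallbacks: [] }` clears them on purpose.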
The detection logic guards against false positives by requiring an error-like signal (a payload marked `isError`, a stopReason of `"error"`, or a regex match for HTTP codes or error keywords) before treating a payload as a failure, so instructional text that merely mentions rate limits is not retried.
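
The guard can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `model-fallback.ts`: the `Payload` and `RunMeta` shapes, the function name `resolveFailoverMessageSketch`, and both regular expressions are hypothetical, and only the phrases `requires more credits` and `can only afford` come from the summary above.

```typescript
// Hypothetical shapes for illustration only.
type Payload = { text?: string; isError?: boolean };
type RunMeta = { stopReason?: string };

// Phrases that suggest a failover-worthy condition (billing or rate limiting).
const FAILOVER_PHRASES = /rate limit|requires more credits|can only afford/i;
// Error-like signals in the text itself: an HTTP status code or an error keyword.
// The exact codes and keywords here are illustrative.
const ERROR_SIGNALS = /\b(?:402|429)\b|\berror\b/i;

function resolveFailoverMessageSketch(
  payloads: Payload[],
  meta: RunMeta = {},
): string | undefined {
  for (const payload of payloads) {
    const text = payload.text?.trim();
    if (!text || !FAILOVER_PHRASES.test(text)) {
      continue; // No failover wording at all: nothing to inspect.
    }
    // Failover wording alone is not enough; require at least one error-like signal.
    const errorLike =
      payload.isError === true ||
      meta.stopReason === "error" ||
      ERROR_SIGNALS.test(text);
    if (errorLike) {
      return text;
    }
  }
  return undefined;
}
```

In this sketch, a payload reading "402 This request requires more credits" is reported as a failure, while advice such as "To avoid hitting a rate limit, batch your requests." passes through unless the payload is flagged `isError` or the run's stop reason is `"error"`.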
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- The implementation is well-tested with comprehensive e2e tests covering both positive cases (detecting real failover payloads) and negative cases (not treating instructional text as errors). The detection logic includes multiple safeguards against false positives, all integration points are updated consistently, and the cron model merge fix has dedicated unit tests. The changes follow established patterns in the codebase.
- No files require special attention
<sub>Last reviewed commit: 29d6606</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #13658: fix: silent model failover with fallback notification (by taw0002 · 2026-02-10 · 84.4%)
- #22064: fix(failover): bypass models allowlist for configured fallback models (by winston-bepresent · 2026-02-20 · 84.0%)
- #15815: Fallback LLM doesn't trigger if primary model is local (by shihanqu · 2026-02-13 · 83.6%)
- #9427: fix: trigger model fallback on all 4xx HTTP errors (by dbottme · 2026-02-05 · 83.0%)
- #21152: fix(agents): throw FailoverError for unknown model so fallback chai... (by Mellowambience · 2026-02-19 · 82.9%)
- #8390: feat: notify user when fallback model is used (#8182) (by Glucksberg · 2026-02-04 · 82.5%)
- #19077: fix(agents): trigger model failover on connection-refused and netwo... (by ayanesakura · 2026-02-17 · 81.8%)
- #23738: feat(fallback): first-class transition visibility + low-noise autom... (by SmithLabsLLC · 2026-02-22 · 81.2%)
- #23816: fix(agents): model fallback skipped during session overrides and pr... (by ramezgaberiel · 2026-02-22 · 80.6%)
- #21049: fix(failover): treat HTTP 5xx as rate-limit for model fallback (by maximalmargin · 2026-02-19 · 80.2%)