#22660: feat(agents): prioritize fallback-chain recovery and configurable re-probe interval

by sauerdaniel open 2026-02-21 13:30

commands agents size: M

Cluster: Model Fallbacks and Rate Limiting

## Summary

This PR improves model failover recovery behavior and makes recovery probing configurable.

When an agent is currently running on a lower-priority fallback model, OpenClaw now re-evaluates the configured priority chain and promotes to the highest-priority currently available model:

- `primary`
- `fallback[0]`
- `fallback[1]`
- ...

## Why

Agents could remain on lower fallbacks (for example `f3`) longer than necessary after upstream limits recovered, even when a higher-priority model (`primary` or `f1`) was available again.

## What Changed

### Runtime behavior

- Added fallback-chain re-promotion in `runWithModelFallback`.
- If current model is a configured fallback (and no explicit override chain is supplied), candidate order is rebuilt to configured priority order (`primary -> fallbacks...`).
- Kept short probe throttle (`30s`) and near-expiry probing.
- Added periodic probing during cooldown windows to avoid stale cooldown metadata keeping agents on lower-priority models.

### Config

Added a configurable periodic probe interval:

- `agents.defaults.model.primaryRecoveryProbeEvery`
- `agents.list[].model.primaryRecoveryProbeEvery` (per-agent override)

Duration strings are validated (default unit: minutes), e.g. `45s`, `3m`, `1h`.

Default remains `5m` when unset.

### Integration wiring

Plumbed resolved probe interval through all relevant call paths:

- command agent runs
- followup/auto-reply runs
- isolated cron agent runs

### Schema/docs/tests

- Updated config types and zod schemas.
- Added schema help + labels for new config fields.
- Added/updated unit and e2e coverage for re-promotion/probing behavior.
- Fixed affected cron skill-filter test mock to include the new resolver export.

## Backward Compatibility

- No breaking config changes.
- Explicit `fallbacksOverride` behavior remains preserved.
- Existing model primary/fallback definitions continue to work unchanged.
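For reviewers, here is a rough sketch of the two ideas described under "Runtime behavior" and "Config": rebuilding the candidate order when the current model is a configured fallback, and resolving a probe-interval duration string. It is not the code from this PR. The names `ModelChain`, `parseProbeInterval`, `buildCandidateOrder`, and `pickModel` are hypothetical, the set of accepted units (`s`, `m`, `h`) is assumed from the examples above, and the handling of an explicit override chain is only a guess at what "preserved" means.

```typescript
// Illustrative sketch only; every name here is hypothetical, not OpenClaw's API.

type ModelChain = { primary: string; fallbacks: string[] };

const UNIT_MS: Record<string, number> = { s: 1_000, m: 60_000, h: 3_600_000 };
const DEFAULT_PROBE_INTERVAL_MS = 5 * 60_000; // "5m" default when unset

// Parse strings such as "45s", "3m", "1h"; a bare number is read as minutes.
// Assumption: only s/m/h suffixes are accepted.
function parseProbeInterval(raw?: string): number {
  if (raw === undefined) return DEFAULT_PROBE_INTERVAL_MS;
  const match = /^(\d+(?:\.\d+)?)\s*([smh])?$/.exec(raw.trim());
  if (!match) throw new Error(`invalid duration: ${raw}`);
  const unit = match[2] ?? "m";
  return Math.round(Number(match[1]) * UNIT_MS[unit]);
}

// Remove duplicates while keeping the first occurrence of each model.
function dedupe(items: string[]): string[] {
  return items.filter((item, index) => items.indexOf(item) === index);
}

// Build the ordered list of models to try for the next run.
function buildCandidateOrder(
  chain: ModelChain,
  currentModel: string,
  fallbacksOverride?: string[],
): string[] {
  // Assumption: an explicit override chain is used as given, with no re-promotion.
  if (fallbacksOverride) return dedupe([currentModel, ...fallbacksOverride]);

  const configured = dedupe([chain.primary, ...chain.fallbacks]);

  // Current model is a configured fallback: rebuild to priority order
  // (primary -> fallbacks...) so higher-priority models are tried first.
  if (chain.fallbacks.includes(currentModel)) return configured;

  // Otherwise keep the current model first, followed by the configured chain.
  return dedupe([currentModel, ...configured]);
}

// Pick the highest-priority candidate that is currently available.
function pickModel(
  candidates: string[],
  isAvailable: (model: string) => boolean,
): string | undefined {
  return candidates.find(isAvailable);
}
```

With a chain of `primary`, `f1`, `f2`, `f3` and an agent currently on `f3`, this sketch tries `primary` first and settles on `f1` if `primary` is still in cooldown, which is the recovery behavior the "Why" section asks for.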
## AI Assistance

- AI-assisted: yes (Codex)
- Degree of testing: fully tested locally (build/check/test)
- I understand and verified the implemented behavior and changed call paths.
- Session logs/prompts can be shared on request.

## Local Validation

Ran and passed:

- `pnpm build`
- `pnpm check`
- `env -u OPENCLAW_HOME -u OPENCLAW_STATE_DIR HOME="$(mktemp -d)" pnpm test`

Also ran targeted suites during implementation:

- `pnpm vitest run --config vitest.unit.config.ts src/agents/model-fallback.probe.test.ts`
- `pnpm vitest run --config vitest.e2e.config.ts src/agents/model-fallback.e2e.test.ts`
- `pnpm vitest run --config vitest.e2e.config.ts src/agents/agent-scope.e2e.test.ts src/commands/agent.e2e.test.ts`
- `pnpm vitest run --config vitest.unit.config.ts src/auto-reply/reply/agent-runner-utils.test.ts src/config/config.schema-regressions.test.ts src/auto-reply/reply/followup-runner.test.ts`
- `pnpm vitest run --config vitest.unit.config.ts src/cron/isolated-agent/run.skill-filter.test.ts`
- `pnpm -s tsc --noEmit`