#9025: Fix/automatic exponential backoff for LLM rate limits

by fotorpics open 2026-02-04 19:32

cli scripts docker agents stale

Cluster: Error Resilience and Retry Logic

This PR implements automatic retries with exponential backoff for LLM API calls, improving resilience against transient rate limits (429) and service overloads.

**Changes:**
- **Retry Logic:** The agent runner now catches rate limit and overload errors. Instead of failing immediately or rotating keys instantly, it retries the request using the current profile up to a configured limit.
- **Exponential Backoff:** Retries use a base delay of 1s, doubling with each attempt (1s, 2s, 4s...), plus 20% random jitter to prevent thundering herds.
- **Configuration:** Added `agents.defaults.model.maxRetries` to `openclaw.json` schema (default: 2 retries per auth profile).
- **Error Handling:** Enhanced error classification to strictly identify 429/Overloaded states versus fatal context errors.

**Fixes:**
- Prevents aggressive API key exhaustion when a provider is simply temporarily overloaded.
- Smoother recovery from short burst rate limits.

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR adds automatic retry/backoff behavior around LLM calls (rate limits/overload) and introduces a new `agents.defaults.model.maxRetries` config knob. It also includes a new MCP client manager for stdio-based MCP tool discovery/execution.

The core behavior change is in the embedded agent runner: on errors classified as rate limit/overload it now sleeps with exponential backoff + jitter and retries with the same auth profile before rotating keys/profiles. Config/schema/types are updated to expose `maxRetries`.

Non-core changes also modify local Docker defaults (`docker-compose.yml`, `.gitignore`, etc.), which appear unrelated to the retry feature and may impact existing workflows.

<h3>Confidence Score: 3/5</h3>

- This PR is close to mergeable but has a few behavior/config and lifecycle issues to address first.
- Retry/backoff logic is generally straightforward, but `maxRetries` is currently unbounded (risking effectively-infinite retries and preventing profile rotation under misconfiguration), the new MCP client manager lacks guaranteed cleanup semantics, and docker-compose defaults were changed in a way that can break existing setups.
- src/agents/pi-embedded-runner/run.ts, src/config/zod-schema.agent-defaults.ts, src/mcp/client.ts, docker-compose.yml
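
For reference, the retry behavior described in this PR (1s base delay, doubling per attempt, 20% jitter, a per-profile retry limit) might look roughly like the sketch below. This is not the code from `run.ts`: every function name here (`resolveMaxRetries`, `backoffDelayMs`, `isRetryable`, `withRetries`) is hypothetical, the cap of 10 is an assumed bound added only to illustrate the "unbounded `maxRetries`" concern, and treating jitter as "up to 20% added on top of the base delay" is one possible reading of the description.

```typescript
// Hypothetical sketch; names are illustrative and not taken from the PR.

const BASE_DELAY_MS = 1000; // base delay of 1s, per the PR description
const JITTER_RATIO = 0.2; // 20% random jitter, per the PR description
const MAX_RETRIES_CAP = 10; // ASSUMED upper bound; the PR does not define one

// Clamp a configured maxRetries into [0, MAX_RETRIES_CAP].
// Missing or non-finite values fall back to the documented default of 2.
function resolveMaxRetries(configured: unknown, fallback = 2): number {
  if (typeof configured !== "number" || !Number.isFinite(configured)) {
    return fallback;
  }
  return Math.min(MAX_RETRIES_CAP, Math.max(0, Math.floor(configured)));
}

// Delay before retry `attempt` (0-based): 1s, 2s, 4s, ... plus up to 20% jitter.
// `random` is injectable so the schedule can be checked deterministically.
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  const base = BASE_DELAY_MS * 2 ** attempt;
  return base + base * JITTER_RATIO * random();
}

// ASSUMED classification: HTTP 429 or an "overloaded" message is retryable.
function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as { status?: number; message?: string };
  return e.status === 429 || /overloaded/i.test(e.message ?? "");
}

// Retry `call` with the same profile until it succeeds, a non-retryable
// error occurs, or the retry budget is spent. The final error is rethrown
// so the caller can decide to rotate keys/profiles.
async function withRetries<T>(
  call: () => Promise<T>,
  maxRetries: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  const limit = resolveMaxRetries(maxRetries);
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      if (!isRetryable(err) || attempt >= limit) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With the default of 2 retries, a call that keeps returning 429 would be attempted three times in total, waiting roughly 1s and then 2s, before the error propagates. Clamping the configured value keeps a misconfigured `maxRetries` from blocking profile rotation indefinitely.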