#9025: Fix/automatic exponential backoff for LLM rate limits
cli
scripts
docker
agents
stale
Cluster:
Error Resilience and Retry Logic
This PR implements automatic retries with exponential backoff for LLM API calls, improving resilience against transient rate limits (429) and service overloads.

**Changes:**
- **Retry Logic:** The agent runner now catches rate limit and overload errors. Instead of failing immediately or rotating keys instantly, it retries the request using the current profile up to a configured limit.
- **Exponential Backoff:** Retries use a base delay of 1s, doubling with each attempt (1s, 2s, 4s...), plus 20% random jitter to prevent thundering herds.
- **Configuration:** Added `agents.defaults.model.maxRetries` to `openclaw.json` schema (default: 2 retries per auth profile).
- **Error Handling:** Enhanced error classification to strictly identify 429/Overloaded states versus fatal context errors.

**Fixes:**
- Prevents aggressive API key exhaustion when a provider is simply temporarily overloaded.
- Smoother recovery from short burst rate limits.
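
The delay schedule described under **Exponential Backoff** could be computed roughly as follows. This is a minimal sketch, not the PR's actual code: `backoffDelayMs`, the constant names, and the injectable `random` parameter are hypothetical, and it assumes the 20% jitter is additive (a random extra of up to 20% of the exponential delay), which the description does not spell out.

```typescript
// Hypothetical helper; names and signature are illustrative, not taken from the PR.
const BASE_DELAY_MS = 1000; // 1s base delay, per the PR description
const JITTER_RATIO = 0.2; // "20% random jitter"; assumed additive, in [0, 20%)

/**
 * Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... plus jitter.
 * `random` is injectable so the schedule can be checked deterministically.
 */
function backoffDelayMs(
  attempt: number,
  random: () => number = Math.random,
): number {
  const exponential = BASE_DELAY_MS * 2 ** attempt;
  const jitter = exponential * JITTER_RATIO * random();
  return Math.round(exponential + jitter);
}
```

With the default of 2 retries, the waits would be about 1s and 2s (each up to 20% longer) before the runner gives up on the current profile.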
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adds automatic retry/backoff behavior around LLM calls (rate limits/overload) and introduces a new `agents.defaults.model.maxRetries` config knob. It also includes a new MCP client manager for stdio-based MCP tool discovery/execution.
The core behavior change is in the embedded agent runner: on errors classified as rate limit/overload it now sleeps with exponential backoff + jitter and retries with the same auth profile before rotating keys/profiles. Config/schema/types are updated to expose `maxRetries`.
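
The retry-then-rotate ordering described above might look like the sketch below. Everything here is hypothetical (`CallResult`, `runWithProfiles`, and the `delayMs` and `sleep` parameters are not from the PR), jitter is omitted for brevity, and it assumes that non-retryable errors abort immediately rather than triggering rotation, which the summary does not confirm.

```typescript
// All names are hypothetical; the actual runner in run.ts is not shown in this PR page.
type CallResult<T> =
  | { ok: true; value: T }
  | { ok: false; retryable: boolean; error: Error };

async function runWithProfiles<T>(
  profiles: string[],
  maxRetries: number,
  call: (profile: string) => Promise<CallResult<T>>,
  // Default schedule: 1s, 2s, 4s, ... (no jitter in this sketch).
  delayMs: (attempt: number) => number = (attempt) => 1000 * 2 ** attempt,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: Error = new Error("no auth profiles configured");
  for (const profile of profiles) {
    // One initial attempt plus up to `maxRetries` retries on the same profile.
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      const result = await call(profile);
      if (result.ok) return result.value;
      lastError = result.error;
      // Assumption: fatal (non-retryable) errors are surfaced immediately.
      if (!result.retryable) throw result.error;
      if (attempt < maxRetries) await sleep(delayMs(attempt));
    }
    // Retries exhausted for this profile: rotate to the next one.
  }
  throw lastError;
}
```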
Non-core changes also modify local Docker defaults (`docker-compose.yml`, `.gitignore`, etc.), which appear unrelated to the retry feature and may impact existing workflows.
<h3>Confidence Score: 3/5</h3>
- This PR is close to mergeable but has a few behavior/config and lifecycle issues to address first.
- Retry/backoff logic is generally straightforward, but `maxRetries` is currently unbounded (risking effectively-infinite retries and preventing profile rotation under misconfiguration), the new MCP client manager lacks guaranteed cleanup semantics, and docker-compose defaults were changed in a way that can break existing setups.
- src/agents/pi-embedded-runner/run.ts, src/config/zod-schema.agent-defaults.ts, src/mcp/client.ts, docker-compose.yml
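
One way to address the unbounded `maxRetries` concern is to clamp the configured value before the runner uses it. The sketch below is an assumption-laden illustration: `resolveMaxRetries` and the ceiling of 10 are hypothetical (neither the PR nor the review proposes a specific limit), and in the real code the bound would more likely live in the schema in `src/config/zod-schema.agent-defaults.ts`.

```typescript
// Hypothetical bounds; neither the PR nor the review specifies an upper limit.
const DEFAULT_MAX_RETRIES = 2; // default stated in the PR description
const MAX_RETRIES_CEILING = 10; // assumed cap, chosen only for illustration

/**
 * Normalizes a raw config value into a safe retry count.
 * Missing, non-numeric, or non-finite values fall back to the default;
 * everything else is floored and clamped to [0, MAX_RETRIES_CEILING].
 */
function resolveMaxRetries(raw: unknown): number {
  if (typeof raw !== "number" || !Number.isFinite(raw)) {
    return DEFAULT_MAX_RETRIES;
  }
  return Math.min(MAX_RETRIES_CEILING, Math.max(0, Math.floor(raw)));
}
```

With a bound like this, a misconfigured value such as `1e9` could no longer hold the runner on a single profile indefinitely, so profile rotation would still occur.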
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#16239: fix: retry on transient API errors (overloaded, rate-limit, timeout)
by zerone0x · 2026-02-14
78.2%
#8256: feat: Add rate limit strategy configuration
by revenuestack · 2026-02-03
78.1%
#23152: feat(plugin): add retry-backoff extension
by cintia09 · 2026-02-22
78.0%
#16913: fix(agent): increase transient HTTP retry from 1 to 3 with escalati...
by hou-rong · 2026-02-15
77.7%
#14574: fix: gentler rate-limit cooldown backoff + clear stale cooldowns on...
by JamesEBall · 2026-02-12
77.6%
#13686: Add opt-in rate limiting and token-based budgets for external API c...
by ShresthSamyak · 2026-02-10
77.0%
#8677: fix: add retry logic to OAuth token refresh
by skyblue-will · 2026-02-04
76.2%
#9482: feat: add cloud code assist retry logic and parsing for rate limit ...
by mrcha033 · 2026-02-05
75.8%
#9232: Fix: Add automatic retry for network errors in message runs
by vishaltandale00 · 2026-02-05
75.4%
#17001: fix: retry sub-agent announcements with backoff instead of silently...
by luisecab · 2026-02-15
75.3%