#14368: fix: skip auth profile cooldown on format errors to prevent provider-wide cascade
agents
stale
Cluster: Error Handling in Agent Tools
## Summary
Format errors (HTTP 400 with `tool_use_id` mismatch) after session compaction cause **all auth profiles to enter cooldown simultaneously**, making the gateway completely unresponsive until the cooldowns expire.
## Root Cause
When a transcript becomes corrupted (e.g., orphaned `tool_result` blocks after compaction), every auth profile receives the same broken payload and returns the same 400 error. The current code treats format errors identically to rate limits, marking each profile as failed with exponential backoff. The whole cascade completes in ~45ms and locks out the provider.
## Fix
**1. Skip cooldown for format errors** (`run.ts`)
- Format errors are payload problems, not profile problems, so rotating profiles cannot fix them
- `markAuthProfileFailure()` is now skipped when `cloudCodeAssistFormatError` is true
- The existing transcript sanitization on retry (`sanitizeToolUseResultPairing`) still runs
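The effect of this change can be sketched with a simplified model. This is not the code in `run.ts`: `markFailure` and `inCooldown` are hypothetical stand-ins for `markAuthProfileFailure()` and `isProfileInCooldown`, and the base cooldown value, the `ProfileState` shape, and the `simulateCascade` helper are assumptions made purely for illustration.

```typescript
// Hypothetical, simplified model of the failover loop. Nothing here is
// copied from run.ts; all names and values are illustrative assumptions.
type FailureReason = "rate_limit" | "format";

interface ProfileState {
  failureCount: number;
  cooldownUntil: number; // epoch ms; 0 means "not in cooldown"
}

const BASE_COOLDOWN_MS = 60_000; // assumed value, not the real constant

// Stand-in for markAuthProfileFailure(): exponential backoff per failure.
function markFailure(state: ProfileState, now: number): void {
  state.failureCount += 1;
  state.cooldownUntil = now + BASE_COOLDOWN_MS * 2 ** (state.failureCount - 1);
}

// Stand-in for isProfileInCooldown.
function inCooldown(state: ProfileState, now: number): boolean {
  return state.cooldownUntil > now;
}

function recordFailure(
  state: ProfileState,
  reason: FailureReason,
  now: number,
  skipCooldownOnFormatError: boolean,
): void {
  // A format error is a payload problem, so the profile is not penalized.
  if (reason === "format" && skipCooldownOnFormatError) return;
  markFailure(state, now);
}

// Sends the same failing request through every profile and returns how many
// profiles end up in cooldown afterwards.
function simulateCascade(
  profileCount: number,
  reason: FailureReason,
  skipCooldownOnFormatError: boolean,
): number {
  const now = 0;
  const profiles: ProfileState[] = Array.from({ length: profileCount }, () => ({
    failureCount: 0,
    cooldownUntil: 0,
  }));
  for (const profile of profiles) {
    recordFailure(profile, reason, now, skipCooldownOnFormatError);
  }
  return profiles.filter((profile) => inCooldown(profile, now)).length;
}
```

In this model, a format error with the skip disabled puts every profile into cooldown, while enabling the skip leaves all of them available; rate-limit failures are penalized either way.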
**2. Fix false-positive `read tool called without path` warnings** (`pi-embedded-subscribe.handlers.tools.ts`)
- Some providers (Antigravity proxy with `toolu_vrtx_` prefixed IDs) send `file_path` instead of `path`
- `normalizeToolParams()` already handles this conversion, but the diagnostic check ran before normalization
- The diagnostic now checks both `path` and `file_path`, which avoids the false warnings (317 occurrences over 6 days in our logs)
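A minimal sketch of the pre-normalization check described above, assuming the tool arguments arrive as a plain object. The helper names `hasReadPath` and `shouldWarnMissingPath` are hypothetical and do not come from `pi-embedded-subscribe.handlers.tools.ts`; only the `path` and `file_path` keys and the `read` tool name are taken from this PR.

```typescript
// Hypothetical helpers for illustration only; not the handler's actual code.
type ToolArgs = Record<string, unknown>;

function isNonEmptyString(value: unknown): value is string {
  return typeof value === "string" && value.trim().length > 0;
}

// Accepts either the canonical `path` key or the `file_path` alias that
// some providers send before normalization has run.
function hasReadPath(args: ToolArgs): boolean {
  return isNonEmptyString(args["path"]) || isNonEmptyString(args["file_path"]);
}

// Warn only for the read tool, and only when neither key carries a path.
function shouldWarnMissingPath(toolName: string, args: ToolArgs): boolean {
  return toolName === "read" && !hasReadPath(args);
}
```

Checking each key separately, rather than falling back with `??`, means an empty `path` alongside a valid `file_path` still counts as having a path.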
## Testing
- All existing tests pass (`vitest run` on related test files — 16/16 ✅)
- Lint (`oxlint`) and format (`oxfmt`) pass with 0 warnings
- Verified against real-world logs with 317 `read without path` warnings and multiple 400 cascade incidents
## Impact
- Prevents provider-wide lockout from a single format error
- Eliminates ~50 false warnings per day in gateway logs
- No behavioral change for rate limit, billing, auth, or timeout errors
## AI Disclosure
🤖 This PR was authored by an AI agent (OpenClaw + Claude Opus 4.6) running on a real user's instance that experienced these bugs firsthand. The human owner reviewed and approved the changes.
Fixes #8434
Related: #7184, #6016
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR makes two targeted changes to the embedded PI agent flow:
- In `src/agents/pi-embedded-runner/run.ts`, it avoids putting auth profiles into exponential-backoff cooldown when the failure is classified as a Cloud Code Assist format error (e.g., `tool_use_id` mismatch after session compaction). Rotation and transcript sanitization on retry still occur, but the profile isn’t penalized for a payload-format issue.
- In `src/agents/pi-embedded-subscribe.handlers.tools.ts`, it fixes a noisy diagnostic warning by treating `file_path` as an acceptable alias for `path` for the `read` tool’s pre-normalization check.
These changes fit the existing architecture: profile cooldown is managed via `markAuthProfileFailure`/`isProfileInCooldown`, and tool arg normalization already supports `file_path`→`path`; this PR aligns the pre-normalization diagnostic with that behavior.
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk.
- Changes are small, localized, and consistent with existing error classification and tool parameter normalization behavior; no functional regressions were identified in the modified control flow.
- No files require special attention
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #15050: fix: transcript corruption resilience — strip aborted tool_use bloc... (by yashchitneni · 2026-02-12 · 87.2%)
- #12487: fix(agents): strip orphaned tool_result when tool_use is sanitized ... (by skylarkoo7 · 2026-02-09 · 83.2%)
- #14328: fix: strip incomplete tool_use blocks from errored/aborted messages... (by Kropiunig · 2026-02-12 · 82.1%)
- #9861: fix(agents): re-run tool_use/tool_result repair after limitHistoryT... (by CyberSinister · 2026-02-05 · 81.3%)
- #18902: fix: exempt format errors from auth profile cooldown (by tag-assistant · 2026-02-17 · 81.2%)
- #14574: fix: gentler rate-limit cooldown backoff + clear stale cooldowns on... (by JamesEBall · 2026-02-12 · 81.0%)
- #21195: fix: suppress orphaned tool_use/tool_result errors after session co... (by ruslansychov-git · 2026-02-19 · 81.0%)
- #14824: fix: do not trigger provider cooldown on LLM request timeouts (by CyberSinister · 2026-02-12 · 80.5%)
- #8270: fix: support snake_case 'tool_use' in transcript repair (#8264) (by heliosarchitect · 2026-02-03 · 80.1%)
- #23210: fix: avoid cooldown on timeout/unknown failovers (by nydamon · 2026-02-22 · 79.9%)