
#14368: fix: skip auth profile cooldown on format errors to prevent provider-wide cascade

by koatora20 open 2026-02-12 01:14
agents stale
## Summary

Format errors (HTTP 400 with `tool_use_id` mismatch) after session compaction cause **all auth profiles to enter cooldown simultaneously**, making the gateway completely unresponsive until cooldown expires.

## Root Cause

When a transcript becomes corrupted (e.g., orphaned `tool_result` blocks after compaction), every auth profile receives the same broken payload and returns the same 400 error. The current code treats format errors identically to rate limits, marking each profile as failed with exponential backoff. The entire cascade completes in ~45ms, locking out the provider.

## Fix

**1. Skip cooldown for format errors** (`run.ts`)

- Format errors are payload problems, not profile problems; rotating profiles cannot fix them
- `markAuthProfileFailure()` is now skipped when `cloudCodeAssistFormatError` is true
- The existing transcript sanitization on retry (`sanitizeToolUseResultPairing`) still runs

**2. Fix false-positive `read tool called without path` warnings** (`pi-embedded-subscribe.handlers.tools.ts`)

- Some providers (Antigravity proxy with `toolu_vrtx_` prefixed IDs) send `file_path` instead of `path`
- `normalizeToolParams()` already handles this conversion, but the diagnostic check ran before normalization
- Now checks both `path` and `file_path` to avoid false warnings (317 occurrences over 6 days in our logs)

## Testing

- All existing tests pass (`vitest run` on related test files: 16/16 ✅)
- Lint (`oxlint`) and format (`oxfmt`) pass with 0 warnings
- Verified against real-world logs with 317 `read without path` warnings and multiple 400 cascade incidents

## Impact

- Prevents provider-wide lockout from a single format error
- Eliminates ~50 false warnings per day in gateway logs
- No behavioral change for rate limit, billing, auth, or timeout errors

## AI Disclosure

🤖 This PR was authored by an AI agent (OpenClaw + Claude Opus 4.6) running on a real user's instance that experienced these bugs firsthand. The human owner reviewed and approved the changes.

Fixes #8434
Related: #7184, #6016

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR makes two targeted changes to the embedded PI agent flow:

- In `src/agents/pi-embedded-runner/run.ts`, it avoids putting auth profiles into exponential-backoff cooldown when the failure is classified as a Cloud Code Assist format error (e.g., tool_use_id mismatch after session compaction). Rotation and transcript sanitization on retry still occur, but the profile isn’t penalized for a payload-format issue.
- In `src/agents/pi-embedded-subscribe.handlers.tools.ts`, it fixes a noisy diagnostic warning by treating `file_path` as an acceptable alias for `path` for the `read` tool’s pre-normalization check.

These changes fit the existing architecture: profile cooldown is managed via `markAuthProfileFailure`/`isProfileInCooldown`, and tool arg normalization already supports `file_path`→`path`; this PR aligns the pre-normalization diagnostic with that behavior.

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with minimal risk.
- Changes are small, localized, and consistent with existing error classification and tool parameter normalization behavior; no functional regressions were identified in the modified control flow.
- No files require special attention

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
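For readers without the diff at hand, the two changes described above might look roughly like the sketch below. This is an illustration under stated assumptions, not the code in this PR: only `markAuthProfileFailure`, `cloudCodeAssistFormatError`, `path`, and `file_path` come from the description, while `AttemptFailure`, `handleAttemptFailure`, `readToolHasPath`, and the in-memory `cooledDownProfiles` set are hypothetical stand-ins.

```typescript
// Hypothetical shape of a failed attempt; the real run.ts types may differ.
interface AttemptFailure {
  profileId: string;
  // Mirrors the flag named in the PR description.
  cloudCodeAssistFormatError: boolean;
}

// Stand-in for the real cooldown store: records which profiles were penalized.
const cooledDownProfiles = new Set<string>();

// Simplified stand-in for the real markAuthProfileFailure (no backoff logic here).
function markAuthProfileFailure(profileId: string): void {
  cooledDownProfiles.add(profileId);
}

// Penalize a profile only when the failure is attributable to that profile.
// A format error comes from the payload, so every profile would fail the same way.
function handleAttemptFailure(failure: AttemptFailure): void {
  if (failure.cloudCodeAssistFormatError) {
    return; // skip cooldown; the retry path is expected to sanitize the transcript
  }
  markAuthProfileFailure(failure.profileId);
}

// Pre-normalization diagnostic: accept `path` or its `file_path` alias.
function readToolHasPath(args: Record<string, unknown>): boolean {
  return [args["path"], args["file_path"]].some(
    (value) => typeof value === "string" && value.trim().length > 0,
  );
}

// Usage: a format error leaves the profile untouched, a rate limit does not.
handleAttemptFailure({ profileId: "profile-a", cloudCodeAssistFormatError: true });
handleAttemptFailure({ profileId: "profile-b", cloudCodeAssistFormatError: false });
```

Under these assumptions, only `profile-b` ends up in `cooledDownProfiles`, and a `read` call carrying just `file_path` no longer counts as missing its path.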
