← Back to PRs

#17559: fix: don't rotate auth profile on timeout during active tool use

by jg-noncelogic open 2026-02-15 22:18 View on GitHub →
agents size: S
Fixes #17555 Related: #14228, #11715 ## Problem When an embedded run times out while the agent is **actively executing tool calls** (e.g., long-running `exec`, sub-agents), the timeout is treated as an API rate limit stall. This triggers `markAuthProfileFailure()` with exponential backoff cooldown (1min → 5min → 25min → 1hr) and auth profile rotation — even though the API responded successfully and the agent was doing legitimate work. On single-account setups this causes a crash loop: 1. Agent spawns long-running exec (>timeoutSeconds) 2. Run timer fires → `timedOut = true` 3. `shouldRotate` evaluates true → profile marked failed → rotation attempted → no fallback 4. Session stuck in `processing` → gateway self-restarts → repeat ## Fix Check for tool activity before deciding to rotate: ```typescript const hadToolActivity = timedOut && (attempt.toolMetas.length > 0 || attempt.assistantTexts.length > 0); const shouldRotate = (!aborted && failoverFailure) || (timedOut && !timedOutDuringCompaction && !hadToolActivity); ``` **Rationale:** If `toolMetas` or `assistantTexts` are populated, the API responded and the agent was actively working. The timeout is from long tool execution, not an API stall. The existing streaming inactivity watchdog (90s silence) already catches genuine API hangs separately. This follows the same pattern as the existing `timedOutDuringCompaction` guard — both exclude cases where the timeout fired during legitimate post-response work. ## Testing - Verified on self-hosted instance running Google Antigravity / Claude Opus 4.6 - Workload: agent spawning Claude Code sub-agents via exec (turns regularly 5–15 minutes) - Before: 6 gateway restarts in 2 hours from false-positive profile cooldowns - After: stable, no false cooldowns, genuine API stalls still caught by inactivity watchdog <!-- greptile_comment --> <h3>Greptile Summary</h3> This PR prevents false-positive auth profile rotation when an embedded run times out during active tool execution (e.g., long-running `exec` or sub-agents). Previously, any timeout triggered `markAuthProfileFailure()` with exponential backoff cooldown, even when the API had responded successfully and the agent was doing legitimate work. On single-account setups, this caused a crash loop. The fix introduces a `hadToolActivity` guard that checks whether `toolMetas` or `assistantTexts` are populated before deciding to rotate. This follows the existing `timedOutDuringCompaction` pattern — both exclude cases where the timeout fired during legitimate post-response work, while the streaming inactivity watchdog (90s silence) still catches genuine API hangs. - Adds `hadToolActivity` check to the `shouldRotate` condition in `src/agents/pi-embedded-runner/run.ts` - Adds an informational log message when rotation is skipped due to tool activity (gated on `!timedOutDuringCompaction` to avoid overlap with compaction timeout logging) - Adds two well-structured e2e tests: one verifying rotation is suppressed with tool activity, one verifying rotation still occurs without tool activity <h3>Confidence Score: 5/5</h3> - This PR is safe to merge — the change is narrowly scoped, logically sound, and well-tested. - The fix is a minimal, focused guard added to an existing condition. The logic correctly distinguishes between API stalls (no tool activity → rotate) and legitimate long-running work (tool activity present → skip rotation). It follows the existing `timedOutDuringCompaction` pattern. Both positive and negative cases are covered by new e2e tests. The log gating was correctly refined per prior review feedback. No regressions are introduced to existing behavior paths. - No files require special attention <sub>Last reviewed commit: e8312be</sub> <!-- greptile_other_comments_section --> <!-- /greptile_comment -->

Most Similar PRs