#16554: fix(agent): prevent session deadlock on timeout during tool execution

by mverrilli open 2026-02-14 21:20 View on GitHub →

agents size: L

## Summary Prevents incorrect auth profile rotation when timeouts occur during long-running tool execution (e.g., SSH commands, file operations >600s). This PR implements tool execution timeout classification using the same proven patterns from the compaction timeout fix (PR #9855). Refs #14228 ## Problem When an agent run exceeds the 600s timeout **during tool execution** (not compaction), the system incorrectly rotates auth profiles as if the timeout was caused by a provider issue. This wastes auth profiles and can lead to exhaustion. Additionally, tool execution state may not be properly cleaned up, leading to resource leaks. ### Environment - Observed in production on long-running tools (SSH exec, file transfers, process monitoring) - Default 600s timeout insufficient for legitimate operations - Auth profile rotation triggered incorrectly (tool timeout ≠ provider issue) ## Root Cause Tool execution events flow through async handlers (`handleToolExecutionStart`, `handleToolExecutionEnd`) that: 1. May not check the abort/unsubscribed flag at all async boundaries 2. Don't distinguish tool execution timeouts from provider timeouts 3. Can leave session state inconsistent with concurrent tools 4. Cause unnecessary auth profile rotation (tool timeout ≠ provider issue) ## Solution Following the proven pattern from PR #9855 (compaction timeout fix): ### 1. Reference-Counted State Tracking - **Reference counting** (`toolExecutionCount`) to properly handle concurrent tools - **Snapshot state** (`activeToolName`, `activeToolCallId`, `activeToolStartTime`) tracks most recent tool for timeout classification - `toolExecutionInFlight` flag derived from reference count (`count > 0`) - `toolStartData` map moved from module-level to context state (prevents cross-session corruption) ### 2. Timeout Classification - Add `timedOutDuringToolExecution` flag (mirrors `timedOutDuringCompaction` from PR #9855) - Classify timeouts based on execution stage: - During tool execution → `timedOutDuringToolExecution = true` - During compaction → `timedOutDuringCompaction = true` (from PR #9855) - During LLM call → neither flag set - Capture tool execution snapshot in **both timeout handlers** (internal + external abort): - Internal timeout handler captures snapshot before classification - External abort handler captures snapshot conditionally (if not already set) - Prevents spurious warnings about missing tool state - Validate snapshot quality before trusting classification (string lengths, duration bounds) ### 3. Race Condition Protection - **Three `unsubscribed` checks** at async boundaries in `handleToolExecutionStart`: 1. **Before any work** - early return if already unsubscribed 2. **After flush operations** - early return if unsubscribed during flush 3. **After all state mutations** - cleanup all state and return if unsubscribed after setting state - Cleans up: toolExecutionCount, toolExecutionInFlight, snapshot state, toolStartData, toolMetaById, toolSummaryById, pending messaging state (texts, targets, mediaUrls) - **Conditional snapshot clearing** - only clear `activeToolName/CallId/StartTime` if `toolCallId` matches current - **Finally block** in `handleToolExecutionEnd` always cleans up to prevent memory leaks - Early return in `handleToolExecutionEnd` if unsubscribed - **Complete pending state cleanup** in unsubscribe - clears all pending messaging maps (texts, targets, mediaUrls) ### 4. Concurrent Tool Support - **Reference counting** increments on start, decrements on end - `Math.max(0, count - 1)` prevents negative counts on duplicate calls - `toolExecutionInFlight` consistently derived from `count > 0` everywhere - Snapshot fields track only the most recent tool (for timeout debugging context) - Reference count tracks all active tools (for accurate timeout classification) ### 5. Skip Auth Profile Rotation - Tool timeouts are infrastructure issues, not provider issues - Updated rotation logic: `shouldRotate = (!aborted && failoverFailure) || (timedOut && !timedOutDuringCompaction && !timedOutDuringToolExecution)` ## Changes ### Modified Files (11 files) **Type Definitions:** - `src/agents/pi-embedded-subscribe.handlers.types.ts` - Add reference count, snapshot state, toolStartData map, unsubscribed flag to ToolHandlerState - `src/agents/pi-embedded-subscribe.types.ts` - Add `ActiveToolExecutionState` public API type - `src/agents/pi-embedded-runner/run/types.ts` - Add `timedOutDuringToolExecution` and `ToolExecutionSnapshot` with validation docs **Core Logic:** - `src/agents/pi-embedded-subscribe.handlers.tools.ts` - Reference counting, three unsubscribed checks (including third check with full cleanup), conditional clearing, finally blocks - `src/agents/pi-embedded-runner/run/attempt.ts` - Timeout classification with snapshot capture in both internal and external abort handlers, validation - `src/agents/pi-embedded-subscribe.ts` - Initialize state, cleanup in unsubscribe/reset (including pendingMessagingMediaUrls), expose `isToolExecutionInFlight()` and ` getActiveToolExecutionState()` getters - `src/agents/pi-embedded-runner/run.ts` - Skip rotation on tool timeouts **Test Files & Fixtures:** - `src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.e2e.test.ts` - Updated for new timeout behavior - `src/agents/pi-embedded-runner/run.overflow-compaction.fixture.ts` - Add timedOutDuringToolExecution field - `src/agents/pi-embedded-subscribe.handlers.tools.test.ts` - Add tool execution state fields to mocks - `src/agents/pi-embedded-subscribe.handlers.tools.media.test.ts` - Add tool execution state fields to mocks ### Key Code Patterns **Reference counting for concurrent tools:** ```typescript // Increment on start ctx.state.toolExecutionCount++; ctx.state.toolExecutionInFlight = ctx.state.toolExecutionCount > 0; // Decrement on end (in finally block) ctx.state.toolExecutionCount = Math.max(0, ctx.state.toolExecutionCount - 1); ctx.state.toolExecutionInFlight = ctx.state.toolExecutionCount > 0; ``` **Three unsubscribed checks in handleToolExecutionStart:** ```typescript // Check 1: Before any work if (ctx.state.unsubscribed) { ctx.log.debug(`tool_execution_start skipped (unsubscribed): tool=${String(evt.toolName)}`); return; } // ... flush operations ... // Check 2: After async flush if (ctx.state.unsubscribed) { ctx.log.debug(`tool_execution_start skipped (unsubscribed after flush): tool=${toolName}`); return; } // ... set state (toolExecutionCount, snapshot, toolStartData, etc.) ... // Check 3: After all state mutations with full cleanup if (ctx.state.unsubscribed) { ctx.log.debug(`tool_execution_start cleanup (unsubscribed after state set): tool=${toolName}`); // Clean up all state we just set ctx.state.toolExecutionCount = Math.max(0, ctx.state.toolExecutionCount - 1); ctx.state.toolExecutionInFlight = ctx.state.toolExecutionCount > 0; if (ctx.state.activeToolCallId === toolCallId) { ctx.state.activeToolName = undefined; ctx.state.activeToolCallId = undefined; ctx.state.activeToolStartTime = undefined; } ctx.state.toolStartData.delete(toolCallId); ctx.state.toolMetaById.delete(toolCallId); ctx.state.toolSummaryById.delete(toolCallId); ctx.state.pendingMessagingTexts.delete(toolCallId); ctx.state.pendingMessagingTargets.delete(toolCallId); ctx.state.pendingMessagingMediaUrls.delete(toolCallId); return; } ``` **Snapshot capture in both timeout handlers:** ```typescript // Internal timeout handler const abortTimer = setTimeout(() => { // Capture tool execution state right before classification toolStateSnapshot = subscription.getActiveToolExecutionState(); if (subscription.isToolExecutionInFlight()) { timedOutDuringToolExecution = true; } abortRun(true); }, timeoutMs); // External abort handler const onAbort = () => { const timeout = reason ? isTimeoutError(reason) : false; if (timeout && subscription.isToolExecutionInFlight()) { // Capture snapshot if not already captured if (!toolStateSnapshot) { toolStateSnapshot = subscription.getActiveToolExecutionState(); } timedOutDuringToolExecution = true; } abortRun(timeout, reason); }; ``` ## Testing Strategy ### Code Review - ✅ Multiple rounds of code review - ✅ Verified reference counting handles concurrent tools correctly - ✅ Verified three unsubscribed checks prevent race conditions (including third check with complete cleanup) - ✅ Verified conditional clearing prevents premature snapshot loss - ✅ Verified finally block prevents memory leaks - ✅ Verified consistent derivation of toolExecutionInFlight from count - ✅ Verified snapshot capture in both internal and external abort handlers - ✅ Verified complete cleanup of all pending messaging state (texts, targets, mediaUrls) - ✅ Verified ToolHandlerState type includes all new fields ### Integration Tests - E2E tests updated to expect no auth profile rotation on tool timeouts - Test fixtures updated with timedOutDuringToolExecution field - Test mocks updated with tool execution state fields (toolStartData, toolExecutionCount, etc.) - Existing timeout tests verify classification logic ## Deployment Notes - **Based on:** upstream/main (includes PR #9855) - **Risk:** Low - extends existing pattern with careful race condition handling - **Rollback:** Revert to previous deployment if issues arise ## Recent Fixes (Post-Rebase) After rebasing onto upstream/main, the following additional fixes were applied: 1. **Third unsubscribed check** - Added complete cleanup in third check to prevent dirty state if unsubscribe occurs after state is set but before function completes 2. **Snapshot capture in external abort handler** - Prevents spurious warning "timeout flag set but no v...