#16554: fix(agent): prevent session deadlock on timeout during tool execution
agents
size: L
Cluster:
Cron Job Stability Fixes
## Summary
Prevents incorrect auth profile rotation when timeouts occur during long-running tool execution (e.g., SSH commands, file operations >600s). This PR implements tool execution timeout classification using the same proven patterns from the compaction timeout fix (PR #9855).
Refs #14228
## Problem
When an agent run exceeds the 600s timeout **during tool execution** (not compaction), the system incorrectly rotates auth profiles as if the timeout was caused by a provider issue. This wastes auth profiles and can lead to exhaustion. Additionally, tool execution state may not be properly cleaned up, leading to resource leaks.
### Environment
- Observed in production on long-running tools (SSH exec, file transfers, process monitoring)
- Default 600s timeout insufficient for legitimate operations
- Auth profile rotation triggered incorrectly (tool timeout ≠ provider issue)
## Root Cause
Tool execution events flow through async handlers (`handleToolExecutionStart`, `handleToolExecutionEnd`) that:
1. May not check the abort/unsubscribed flag at all async boundaries
2. Don't distinguish tool execution timeouts from provider timeouts
3. Can leave session state inconsistent with concurrent tools
4. Cause unnecessary auth profile rotation (tool timeout ≠ provider issue)
## Solution
Following the proven pattern from PR #9855 (compaction timeout fix):
### 1. Reference-Counted State Tracking
- **Reference counting** (`toolExecutionCount`) to properly handle concurrent tools
- **Snapshot state** (`activeToolName`, `activeToolCallId`, `activeToolStartTime`) tracks most recent tool for timeout classification
- `toolExecutionInFlight` flag derived from reference count (`count > 0`)
- `toolStartData` map moved from module-level to context state (prevents cross-session corruption)
### 2. Timeout Classification
- Add `timedOutDuringToolExecution` flag (mirrors `timedOutDuringCompaction` from PR #9855)
- Classify timeouts based on execution stage:
- During tool execution → `timedOutDuringToolExecution = true`
- During compaction → `timedOutDuringCompaction = true` (from PR #9855)
- During LLM call → neither flag set
- Capture tool execution snapshot in **both timeout handlers** (internal + external abort):
- Internal timeout handler captures snapshot before classification
- External abort handler captures snapshot conditionally (if not already set)
- Prevents spurious warnings about missing tool state
- Validate snapshot quality before trusting classification (string lengths, duration bounds)
### 3. Race Condition Protection
- **Three `unsubscribed` checks** at async boundaries in `handleToolExecutionStart`:
1. **Before any work** - early return if already unsubscribed
2. **After flush operations** - early return if unsubscribed during flush
3. **After all state mutations** - cleanup all state and return if unsubscribed after setting state
- Cleans up: toolExecutionCount, toolExecutionInFlight, snapshot state, toolStartData, toolMetaById, toolSummaryById, pending messaging state (texts, targets, mediaUrls)
- **Conditional snapshot clearing** - only clear `activeToolName/CallId/StartTime` if `toolCallId` matches current
- **Finally block** in `handleToolExecutionEnd` always cleans up to prevent memory leaks
- Early return in `handleToolExecutionEnd` if unsubscribed
- **Complete pending state cleanup** in unsubscribe - clears all pending messaging maps (texts, targets, mediaUrls)
### 4. Concurrent Tool Support
- **Reference counting** increments on start, decrements on end
- `Math.max(0, count - 1)` prevents negative counts on duplicate calls
- `toolExecutionInFlight` consistently derived from `count > 0` everywhere
- Snapshot fields track only the most recent tool (for timeout debugging context)
- Reference count tracks all active tools (for accurate timeout classification)
### 5. Skip Auth Profile Rotation
- Tool timeouts are infrastructure issues, not provider issues
- Updated rotation logic: `shouldRotate = (!aborted && failoverFailure) || (timedOut && !timedOutDuringCompaction && !timedOutDuringToolExecution)`
## Changes
### Modified Files (11 files)
**Type Definitions:**
- `src/agents/pi-embedded-subscribe.handlers.types.ts` - Add reference count, snapshot state, toolStartData map, unsubscribed flag to ToolHandlerState
- `src/agents/pi-embedded-subscribe.types.ts` - Add `ActiveToolExecutionState` public API type
- `src/agents/pi-embedded-runner/run/types.ts` - Add `timedOutDuringToolExecution` and `ToolExecutionSnapshot` with validation docs
**Core Logic:**
- `src/agents/pi-embedded-subscribe.handlers.tools.ts` - Reference counting, three unsubscribed checks (including third check with full cleanup), conditional clearing,
finally blocks
- `src/agents/pi-embedded-runner/run/attempt.ts` - Timeout classification with snapshot capture in both internal and external abort handlers, validation
- `src/agents/pi-embedded-subscribe.ts` - Initialize state, cleanup in unsubscribe/reset (including pendingMessagingMediaUrls), expose `isToolExecutionInFlight()` and `
getActiveToolExecutionState()` getters
- `src/agents/pi-embedded-runner/run.ts` - Skip rotation on tool timeouts
**Test Files & Fixtures:**
- `src/agents/pi-embedded-runner.run-embedded-pi-agent.auth-profile-rotation.e2e.test.ts` - Updated for new timeout behavior
- `src/agents/pi-embedded-runner/run.overflow-compaction.fixture.ts` - Add timedOutDuringToolExecution field
- `src/agents/pi-embedded-subscribe.handlers.tools.test.ts` - Add tool execution state fields to mocks
- `src/agents/pi-embedded-subscribe.handlers.tools.media.test.ts` - Add tool execution state fields to mocks
### Key Code Patterns
**Reference counting for concurrent tools:**
```typescript
// Increment on start
ctx.state.toolExecutionCount++;
ctx.state.toolExecutionInFlight = ctx.state.toolExecutionCount > 0;
// Decrement on end (in finally block)
ctx.state.toolExecutionCount = Math.max(0, ctx.state.toolExecutionCount - 1);
ctx.state.toolExecutionInFlight = ctx.state.toolExecutionCount > 0;
```
**Three unsubscribed checks in handleToolExecutionStart:**
```typescript
// Check 1: Before any work
if (ctx.state.unsubscribed) {
ctx.log.debug(`tool_execution_start skipped (unsubscribed): tool=${String(evt.toolName)}`);
return;
}
// ... flush operations ...
// Check 2: After async flush
if (ctx.state.unsubscribed) {
ctx.log.debug(`tool_execution_start skipped (unsubscribed after flush): tool=${toolName}`);
return;
}
// ... set state (toolExecutionCount, snapshot, toolStartData, etc.) ...
// Check 3: After all state mutations with full cleanup
if (ctx.state.unsubscribed) {
ctx.log.debug(`tool_execution_start cleanup (unsubscribed after state set): tool=${toolName}`);
// Clean up all state we just set
ctx.state.toolExecutionCount = Math.max(0, ctx.state.toolExecutionCount - 1);
ctx.state.toolExecutionInFlight = ctx.state.toolExecutionCount > 0;
if (ctx.state.activeToolCallId === toolCallId) {
ctx.state.activeToolName = undefined;
ctx.state.activeToolCallId = undefined;
ctx.state.activeToolStartTime = undefined;
}
ctx.state.toolStartData.delete(toolCallId);
ctx.state.toolMetaById.delete(toolCallId);
ctx.state.toolSummaryById.delete(toolCallId);
ctx.state.pendingMessagingTexts.delete(toolCallId);
ctx.state.pendingMessagingTargets.delete(toolCallId);
ctx.state.pendingMessagingMediaUrls.delete(toolCallId);
return;
}
```
**Snapshot capture in both timeout handlers:**
```typescript
// Internal timeout handler
const abortTimer = setTimeout(() => {
// Capture tool execution state right before classification
toolStateSnapshot = subscription.getActiveToolExecutionState();
if (subscription.isToolExecutionInFlight()) {
timedOutDuringToolExecution = true;
}
abortRun(true);
}, timeoutMs);
// External abort handler
const onAbort = () => {
const timeout = reason ? isTimeoutError(reason) : false;
if (timeout && subscription.isToolExecutionInFlight()) {
// Capture snapshot if not already captured
if (!toolStateSnapshot) {
toolStateSnapshot = subscription.getActiveToolExecutionState();
}
timedOutDuringToolExecution = true;
}
abortRun(timeout, reason);
};
```
## Testing Strategy
### Code Review
- ✅ Multiple rounds of code review
- ✅ Verified reference counting handles concurrent tools correctly
- ✅ Verified three unsubscribed checks prevent race conditions (including third check with complete cleanup)
- ✅ Verified conditional clearing prevents premature snapshot loss
- ✅ Verified finally block prevents memory leaks
- ✅ Verified consistent derivation of toolExecutionInFlight from count
- ✅ Verified snapshot capture in both internal and external abort handlers
- ✅ Verified complete cleanup of all pending messaging state (texts, targets, mediaUrls)
- ✅ Verified ToolHandlerState type includes all new fields
### Integration Tests
- E2E tests updated to expect no auth profile rotation on tool timeouts
- Test fixtures updated with timedOutDuringToolExecution field
- Test mocks updated with tool execution state fields (toolStartData, toolExecutionCount, etc.)
- Existing timeout tests verify classification logic
## Deployment Notes
- **Based on:** upstream/main (includes PR #9855)
- **Risk:** Low - extends existing pattern with careful race condition handling
- **Rollback:** Revert to previous deployment if issues arise
## Recent Fixes (Post-Rebase)
After rebasing onto upstream/main, the following additional fixes were applied:
1. **Third unsubscribed check** - Added complete cleanup in third check to prevent dirty state if unsubscribe occurs after state is set but before function completes
2. **Snapshot capture in external abort handler** - Prevents spurious warning "timeout flag set but no v...
Most Similar PRs
#17559: fix: don't rotate auth profile on timeout during active tool use
by jg-noncelogic · 2026-02-15
74.7%
#8332: fix: add per-tool-call timeout to prevent agent hangs (v2 - fixes m...
by vishaltandale00 · 2026-02-03
71.0%
#23309: fix: remove 30-minute timeout for background exec sessions
by vksvikas · 2026-02-22
64.6%
#19422: fix: pass session context to plugin tool hooks in toToolDefinitions
by namabile · 2026-02-17
64.3%
#14824: fix: do not trigger provider cooldown on LLM request timeouts
by CyberSinister · 2026-02-12
64.0%
#14368: fix: skip auth profile cooldown on format errors to prevent provide...
by koatora20 · 2026-02-12
63.6%
#19024: fix: Fix normalise toolid
by chetaniitbhilai · 2026-02-17
63.6%
#15050: fix: transcript corruption resilience — strip aborted tool_use bloc...
by yashchitneni · 2026-02-12
63.5%
#17730: feat: per-tool softTrim overrides for context pruning
by IrriVisionTechnologies · 2026-02-16
62.9%
#9085: fix: improve stability for terminated responses and telegram retries
by vladdick88 · 2026-02-04
62.8%