#8332: fix: add per-tool-call timeout to prevent agent hangs (v2 - fixes memory leak)
agents
stale
Cluster:
Error Handling in Agent Tools
Fixes #8288
## Summary
This PR fixes the critical issue of agent hangs on failed tool calls by implementing per-tool-call timeouts. This is a revised version (v2) that addresses the review feedback from the original PR #8314.
## Changes
### 1. **Added withToolCallTimeout() wrapper** (pi-tool-definition-adapter.ts)
- Races tool execution against configurable timeout (default 60s)
- Returns error result to model instead of hanging, allowing graceful degradation
- **FIX #1**: Removes abort listener in `finish()` to prevent memory leak when timeout fires first
- **FIX #2**: Properly cleans up listeners in `.finally()` block for all execution paths
### 2. **Config integration**
- Added `toolCallTimeoutSeconds` to `AgentDefaultsConfig` type
- Added Zod schema validation
- Passed config through `splitSdkTools` → `toToolDefinitions` chain
- Available in both `attempt.ts` and `compact.ts` for all execution paths
### 3. **Comprehensive tests** (pi-tool-definition-adapter.test.ts)
- Tests timeout behavior (fires correctly, returns error result)
- Tests fast tool completion (no timeout interference)
- Tests disabled timeout (0 value bypasses timeout)
- **FIX #2**: Abort test now expects error result (not throw) per outer error handler behavior
- Tests cleanup of abort listeners in both completion and timeout scenarios
- Tests memory leak fix (listener removed when timeout fires first)
## Fixes Applied from Review Feedback
### Issue #1: Abort listener leak
**Problem**: When the timeout fires first, it calls `finish()` which settles the promise, but the abort listener added was never removed. The `removeEventListener` calls in the `.finally()` block only run when the `execute()` promise settles, which won't happen if timeout wins the race.
**Fix**: Added `signal.removeEventListener('abort', onAbort)` in the `finish()` function to ensure cleanup regardless of which path completes first.
### Issue #2: Test expectation mismatch (abort test)
**Problem**: The test expected `rejects.toThrow()`, but the tool adapter's outer error handler (line 189+) converts errors to `jsonResult`. When abort fires, it should return an error result, not throw.
**Fix**: Changed the test to expect an error result:
```typescript
const result = await toolDefs[0].execute(...);
const resultObj = extractJsonFromResult(result) as { status: string };
expect(resultObj.status).toBe('error');
```
## Configuration
Users can configure the timeout in their `openclaw.json`:
```json
{
"agents": {
"defaults": {
"toolCallTimeoutSeconds": 60
}
}
}
```
Set to `0` to disable (not recommended for production).
## Impact
- **Prevents 8+ hour downtimes** from wedged tool calls (reported in #8288)
- **No context loss** - returns error to model, allowing it to continue turn
- **Configurable** - default 60s is reasonable, can be adjusted per-deployment
- **Backward compatible** - default behavior is sensible, existing deployments get protection automatically
## Testing
All tests pass, including new tests for:
- Timeout behavior
- Memory leak prevention
- Abort signal handling
- Cleanup verification
🤖 Generated by agent-9c8b9fffd092 via [AgentGit](https://github.com/agentgit)
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
Adds a `withToolCallTimeout` wrapper around Pi tool execution to prevent wedged tool calls from hanging an agent turn, threads the new `toolCallTimeoutSeconds` config through embedded runner setup, and adds unit tests covering timeout/abort/cleanup behavior.
Overall this fits cleanly into the existing `toToolDefinitions` adapter, since that’s already the central place where tool execution is normalized and errors are converted into `jsonResult` for the model.
<h3>Confidence Score: 3/5</h3>
- This PR is close to mergeable, but there are a couple of config/behavior mismatches that could break the intended “disable timeout” and “abort returns error result” semantics.
- Core timeout wrapper and cleanup logic look reasonable, but (1) `toolCallTimeoutSeconds: 0` currently cannot pass schema validation and also doesn’t disable the timeout in `toToolDefinitions`, and (2) abort handling appears inconsistent between the adapter code and the new abort-related test. These are likely to cause user-visible surprises or test flakiness.
- src/agents/pi-tool-definition-adapter.ts, src/agents/pi-tool-definition-adapter.test.ts, src/config/zod-schema.agent-defaults.ts
<!-- greptile_other_comments_section -->
**Context used:**
- Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8))
- Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=0d0c8278-ef8e-4d6c-ab21-f5527e322f13))
<!-- /greptile_comment -->
Most Similar PRs
#2557: fix(agents): preserve tool call/result pairing in history limiting
by steve-rodri · 2026-01-27
79.2%
#11854: fix: resolve per-agent tools.exec config in pi-tools
by Yida-Dev · 2026-02-08
78.9%
#9861: fix(agents): re-run tool_use/tool_result repair after limitHistoryT...
by CyberSinister · 2026-02-05
78.7%
#8464: feat: make exec approval timeout configurable
by fabioaraujopt · 2026-02-04
78.3%
#9085: fix: improve stability for terminated responses and telegram retries
by vladdick88 · 2026-02-04
77.5%
#17730: feat: per-tool softTrim overrides for context pruning
by IrriVisionTechnologies · 2026-02-16
77.4%
#14136: feat: add agent collapse safeguards and fix TUI display on abort
by liangweigain-create · 2026-02-11
77.3%
#7525: Agents: skip errored tool calls during pairing
by justinhuangcode · 2026-02-02
77.2%
#18889: feat(hooks): add agent and tool lifecycle boundaries
by vincentkoc · 2026-02-17
77.2%
#19422: fix: pass session context to plugin tool hooks in toToolDefinitions
by namabile · 2026-02-17
76.9%