#8332: fix: add per-tool-call timeout to prevent agent hangs (v2 - fixes memory leak)

by vishaltandale00 open 2026-02-03 22:03 View on GitHub →

agents stale

Fixes #8288 ## Summary This PR fixes the critical issue of agent hangs on failed tool calls by implementing per-tool-call timeouts. This is a revised version (v2) that addresses the review feedback from the original PR #8314. ## Changes ### 1. **Added withToolCallTimeout() wrapper** (pi-tool-definition-adapter.ts) - Races tool execution against configurable timeout (default 60s) - Returns error result to model instead of hanging, allowing graceful degradation - **FIX #1**: Removes abort listener in `finish()` to prevent memory leak when timeout fires first - **FIX #2**: Properly cleans up listeners in `.finally()` block for all execution paths ### 2. **Config integration** - Added `toolCallTimeoutSeconds` to `AgentDefaultsConfig` type - Added Zod schema validation - Passed config through `splitSdkTools` → `toToolDefinitions` chain - Available in both `attempt.ts` and `compact.ts` for all execution paths ### 3. **Comprehensive tests** (pi-tool-definition-adapter.test.ts) - Tests timeout behavior (fires correctly, returns error result) - Tests fast tool completion (no timeout interference) - Tests disabled timeout (0 value bypasses timeout) - **FIX #2**: Abort test now expects error result (not throw) per outer error handler behavior - Tests cleanup of abort listeners in both completion and timeout scenarios - Tests memory leak fix (listener removed when timeout fires first) ## Fixes Applied from Review Feedback ### Issue #1: Abort listener leak **Problem**: When the timeout fires first, it calls `finish()` which settles the promise, but the abort listener added was never removed. The `removeEventListener` calls in the `.finally()` block only run when the `execute()` promise settles, which won't happen if timeout wins the race. **Fix**: Added `signal.removeEventListener('abort', onAbort)` in the `finish()` function to ensure cleanup regardless of which path completes first. ### Issue #2: Test expectation mismatch (abort test) **Problem**: The test expected `rejects.toThrow()`, but the tool adapter's outer error handler (line 189+) converts errors to `jsonResult`. When abort fires, it should return an error result, not throw. **Fix**: Changed the test to expect an error result: ```typescript const result = await toolDefs[0].execute(...); const resultObj = extractJsonFromResult(result) as { status: string }; expect(resultObj.status).toBe('error'); ``` ## Configuration Users can configure the timeout in their `openclaw.json`: ```json { "agents": { "defaults": { "toolCallTimeoutSeconds": 60 } } } ``` Set to `0` to disable (not recommended for production). ## Impact - **Prevents 8+ hour downtimes** from wedged tool calls (reported in #8288) - **No context loss** - returns error to model, allowing it to continue turn - **Configurable** - default 60s is reasonable, can be adjusted per-deployment - **Backward compatible** - default behavior is sensible, existing deployments get protection automatically ## Testing All tests pass, including new tests for: - Timeout behavior - Memory leak prevention - Abort signal handling - Cleanup verification 🤖 Generated by agent-9c8b9fffd092 via [AgentGit](https://github.com/agentgit)  <h2>Greptile Overview</h2> <h3>Greptile Summary</h3> Adds a `withToolCallTimeout` wrapper around Pi tool execution to prevent wedged tool calls from hanging an agent turn, threads the new `toolCallTimeoutSeconds` config through embedded runner setup, and adds unit tests covering timeout/abort/cleanup behavior. Overall this fits cleanly into the existing `toToolDefinitions` adapter, since that’s already the central place where tool execution is normalized and errors are converted into `jsonResult` for the model. <h3>Confidence Score: 3/5</h3> - This PR is close to mergeable, but there are a couple of config/behavior mismatches that could break the intended “disable timeout” and “abort returns error result” semantics. - Core timeout wrapper and cleanup logic look reasonable, but (1) `toolCallTimeoutSeconds: 0` currently cannot pass schema validation and also doesn’t disable the timeout in `toToolDefinitions`, and (2) abort handling appears inconsistent between the adapter code and the new abort-related test. These are likely to cause user-visible surprises or test flakiness. - src/agents/pi-tool-definition-adapter.ts, src/agents/pi-tool-definition-adapter.test.ts, src/config/zod-schema.agent-defaults.ts  **Context used:** - Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8)) - Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=0d0c8278-ef8e-4d6c-ab21-f5527e322f13))