← Back to PRs

#19181: feat(gateway): tool pause/resume interrupts with durable approvals

by evgenyyy open 2026-02-17 13:48 View on GitHub →
docs app: macos app: web-ui gateway agents size: XL
## Summary Describe the problem and fix in 2–5 bullets: - Problem: paused tool executions had no first-class gateway interrupt primitive, so approvals could not pause/resume across process restarts. - Why it matters: approval-gated tool calls need durable, bindable, resumable state so operator decisions are reliable and replay-safe. - What changed: added persistent tool interrupt state + RPC (`tool.interrupt.emit` / `tool.interrupt.resume`), pause-for-approval tool wrapper (`wrapToolWithPauseForApproval`), and resume wait flow bound to `runId + sessionKey + toolCallId`. - What changed: resume tokens are unguessable, only token hashes are persisted, resume enforces expiry + binding + timing-safe hash compare. - What did NOT change (scope boundary): no UI workflow redesign; this PR adds protocol/runtime plumbing only. ## Change Type (select all) - [x] Bug fix - [x] Feature - [ ] Refactor - [ ] Docs - [x] Security hardening - [ ] Chore/infra ## Scope (select all touched areas) - [x] Gateway / orchestration - [x] Skills / tool execution - [x] Auth / tokens - [x] Memory / storage - [ ] Integrations - [x] API / contracts - [ ] UI / DX - [ ] CI/CD / infra ## Linked Issue/PR - Closes #19072 - Related #19072 ## User-visible / Behavior Changes - Tools can now return `status: "paused_for_approval"` and block until resumed. - Gateway now exposes `tool.interrupt.emit` / `tool.interrupt.resume` for scoped operator approval flows. - Gateway now broadcasts `tool.interrupt.requested` / `tool.interrupt.resumed` events to `operator.approvals` clients. ## Security Impact (required) - New permissions/capabilities? (`Yes`) - Secrets/tokens handling changed? (`Yes`) - New/changed network calls? (`Yes`) - Command/tool execution surface changed? (`Yes`) - Data access scope changed? (`No`) - If any `Yes`, explain risk + mitigation: - Risk: approval resume tokens could be abused if guessable or leaked. - Mitigation: tokens are minted from cryptographically strong randomness, only SHA-256 hashes are persisted, compare is timing-safe, and resume requires strict binding (`approvalRequestId + runId + sessionKey + toolCallId`) plus expiry. ## Repro + Verification ### Environment - OS: Linux (dev workspace) - Runtime/container: Node 22 + pnpm workspace - Model/provider: N/A - Integration/channel (if any): Gateway RPC + agent tool runtime - Relevant config (redacted): default gateway state dir via `resolveStateDir()` ### Steps 1. Start gateway and emit a tool interrupt via `tool.interrupt.emit` with `approvalRequestId`, binding fields, and interrupt payload. 2. Observe `tool.interrupt.requested` event and use `resumeToken` with matching binding in `tool.interrupt.resume`. 3. Verify waiting emitter resolves with resumed result; restart gateway and confirm pending/expired state survives from `gateway/tool-interrupts.json`. ### Expected - Interrupt requests persist in gateway state dir. - Resume succeeds only for valid token + correct run/session/tool binding before expiry. - Paused tool wrapper resumes and returns final tool result. ### Actual - Implemented in code and covered by new targeted tests. - Full verification commands are blocked in this environment by npm registry DNS failures (`EAI_AGAIN`). ## Evidence Attach at least one: - [ ] Failing test/log before + passing after - [x] Trace/log snippets - [ ] Screenshot/recording - [ ] Perf numbers (if relevant) Trace snippets captured in this branch: - `pnpm install` fails in workspace with `getaddrinfo EAI_AGAIN registry.npmjs.org`. - `pnpm check` fails early with `oxfmt: not found` (dependency install incomplete). - targeted test command fails with `Command "vitest" not found` (dependency install incomplete). ## Human Verification (required) What you personally verified (not just CI), and how: - Verified scenarios: - Manual code-path review of pause extraction, emit/wait/resume flow, binding enforcement, expiry handling, persistence load/save path, and gateway method/event registration. - New unit tests authored for manager persistence/binding/expiry, method handlers, wrapper behavior, and broadcast scope gating. - Edge cases checked: - Existing `approvalRequestId` with mismatched binding is rejected. - Resume after expiry returns explicit expired error and resolves waiter as expired. - Resume token raw value is not persisted. - What you did **not** verify: - End-to-end runtime execution in this clone due blocked dependency install. ## Compatibility / Migration - Backward compatible? (`Yes`) - Config/env changes? (`No`) - Migration needed? (`No`) - If yes, exact upgrade steps: ## Failure Recovery (if this breaks) - How to disable/revert this change quickly: - Revert this PR commit(s) to remove pause wrapper + tool interrupt RPC path. - Files/config to restore: - `src/agents/pi-tools.pause-for-approval.ts` - `src/gateway/tool-interrupt-manager.ts` - related gateway method/protocol registrations. - Known bad symptoms reviewers should watch for: - Paused tools never resume. - Resume rejected despite correct operator action (binding/token mismatch). - Interrupt events visible outside `operator.approvals` scope. ## Risks and Mitigations List only real risks for this PR. Add/remove entries as needed. If none, write `None`. - Risk: pending interrupt records could accumulate if not pruned. - Mitigation: retention windows + prune on load/emit/resume/expire. - Risk: two-phase emit callers and final-response callers may interpret response modes differently. - Mitigation: handler preserves single-response default; only emits immediate accepted response when `twoPhase=true`. - Risk: pause wrapper may throw if runtime context is missing binding fields. - Mitigation: wrapper fails fast with explicit error requiring `runId`, `sessionKey`, `toolCallId` for paused flows. ## AI Assistance Disclosure - [x] AI-assisted PR - Testing degree: lightly tested in this environment (full build/check/tests blocked by dependency install DNS failures). - I understand and can explain the code paths changed in this PR. ## Prompts / Session Notes Primary prompt context used for implementation: - "Re-implement the paused tool execution approvals PR (issue #19072) in this fresh clone/branch..." - Required items included generic paused result state, durable gateway interrupt persistence, secure resume tokens, strict run/session/tool binding, wait-for-resume semantics, new gateway RPC methods/events, tool wrapper wiring, and validation commands. <!-- greptile_comment --> <h3>Greptile Summary</h3> Adds persistent, durable tool interrupt infrastructure for pause/resume approval flows across gateway restarts. Core primitives: `ToolInterruptManager` with SHA-256 token hashing + timing-safe comparison, strict run/session/tool binding enforcement, and RPC methods `tool.interrupt.emit` / `tool.interrupt.resume`. Tools returning `status: "paused_for_approval"` are wrapped to block until resumed via `wrapToolWithPauseForApproval`. All state persists to `gateway/tool-interrupts.json` with retention policies for resumed/expired records. **Key changes:** - New `ToolInterruptManager` with token hash persistence, expiry timers, and pruning - Gateway RPC methods emit/resume with broadcast to `operator.approvals` scope - Pause-for-approval wrapper intercepts paused tool results and waits for resume - Protocol schemas define binding + interrupt payload structures - Tests cover token security, binding enforcement, expiry, and restart persistence **Security:** - Resume tokens minted from `randomBytes(32)`, only SHA-256 hashes persisted - Timing-safe comparison via `timingSafeEqual` - Strict binding validation (`approvalRequestId` + `runId` + `sessionKey` + `toolCallId`) - Scoped events prevent unauthorized operators from seeing interrupts <h3>Confidence Score: 4/5</h3> - This PR is safe to merge with minor attention to runtime context wiring - Score reflects solid security primitives (cryptographic token handling, timing-safe comparison, strict binding enforcement) and comprehensive test coverage. Implementation follows existing patterns for durable gateway state and RPC methods. One point deducted because end-to-end runtime verification was blocked by dependency install failures, and there's a small risk around ensuring `runId`/`sessionKey` are correctly threaded through all tool execution paths - Pay close attention to `src/agents/pi-tools.ts` and `src/agents/pi-embedded-runner/run/attempt.ts` to verify `runId` is correctly passed to tool context in all execution paths <sub>Last reviewed commit: f2e5a05</sub> <!-- greptile_other_comments_section --> <sub>(2/5) Greptile learns from your feedback when you react with thumbs up/down!</sub> <!-- /greptile_comment -->

Most Similar PRs