#19181: feat(gateway): tool pause/resume interrupts with durable approvals
Labels: docs, app: macos, app: web-ui, gateway, agents, size: XL
Cluster: Windows Gateway Enhancements
## Summary
Describe the problem and fix in 2–5 bullets:
- Problem: paused tool executions had no first-class gateway interrupt primitive, so approvals could not pause/resume across process restarts.
- Why it matters: approval-gated tool calls need durable, bindable, resumable state so operator decisions are reliable and replay-safe.
- What changed: added persistent tool interrupt state + RPC (`tool.interrupt.emit` / `tool.interrupt.resume`), pause-for-approval tool wrapper (`wrapToolWithPauseForApproval`), and resume wait flow bound to `runId + sessionKey + toolCallId`.
- What changed: resume tokens are unguessable and only their hashes are persisted; resume enforces expiry, binding, and a timing-safe hash compare.
- What did NOT change (scope boundary): no UI workflow redesign; this PR adds protocol/runtime plumbing only.
## Change Type (select all)
- [x] Bug fix
- [x] Feature
- [ ] Refactor
- [ ] Docs
- [x] Security hardening
- [ ] Chore/infra
## Scope (select all touched areas)
- [x] Gateway / orchestration
- [x] Skills / tool execution
- [x] Auth / tokens
- [x] Memory / storage
- [ ] Integrations
- [x] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra
## Linked Issue/PR
- Closes #19072
- Related #19072
## User-visible / Behavior Changes
- Tools can now return `status: "paused_for_approval"` and block until resumed.
- Gateway now exposes `tool.interrupt.emit` / `tool.interrupt.resume` for scoped operator approval flows.
- Gateway now broadcasts `tool.interrupt.requested` / `tool.interrupt.resumed` events to `operator.approvals` clients.
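The scope gating on the broadcast events can be pictured roughly as follows. This is a minimal sketch, not the gateway's actual broadcast code: the `Client` shape, the `broadcastScoped` helper, and the way scopes are stored are all assumptions made for illustration; only the event names and the `operator.approvals` scope come from this PR.

```typescript
// Hypothetical client shape; the real gateway connection type is not shown in this PR text.
interface Client {
  id: string;
  scopes: ReadonlySet<string>;
  send: (event: string, payload: unknown) => void;
}

const APPROVALS_SCOPE = "operator.approvals";

// Deliver an event only to clients holding the required scope.
// Returns the number of clients that received it.
function broadcastScoped(
  clients: Iterable<Client>,
  event: string,
  payload: unknown,
  requiredScope: string = APPROVALS_SCOPE,
): number {
  let delivered = 0;
  for (const client of clients) {
    if (!client.scopes.has(requiredScope)) continue; // out-of-scope clients never see the event
    client.send(event, payload);
    delivered += 1;
  }
  return delivered;
}
```

Under these assumptions, `tool.interrupt.requested` and `tool.interrupt.resumed` would both go through the same helper, so a client without the scope sees neither.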
## Security Impact (required)
- New permissions/capabilities? (`Yes`)
- Secrets/tokens handling changed? (`Yes`)
- New/changed network calls? (`Yes`)
- Command/tool execution surface changed? (`Yes`)
- Data access scope changed? (`No`)
- If any `Yes`, explain risk + mitigation:
  - Risk: approval resume tokens could be abused if guessable or leaked.
  - Mitigation: tokens are minted from cryptographically strong randomness, only SHA-256 hashes are persisted, compare is timing-safe, and resume requires strict binding (`approvalRequestId + runId + sessionKey + toolCallId`) plus expiry.
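To make the mitigation concrete, here is a minimal sketch of minting and checking a resume token with Node's `node:crypto`. The record shape, the function names (`mintResumeToken`, `checkResume`), the base64url token encoding, and the order of the checks are assumptions for illustration and may differ from `ToolInterruptManager`.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// Hypothetical stored record: the raw token is never part of it, only its hash.
interface StoredInterrupt {
  approvalRequestId: string;
  runId: string;
  sessionKey: string;
  toolCallId: string;
  tokenHashHex: string;
  expiresAtMs: number;
}

type Binding = Pick<StoredInterrupt, "approvalRequestId" | "runId" | "sessionKey" | "toolCallId">;
type ResumeCheck = "ok" | "expired" | "binding_mismatch" | "invalid_token";

function hashToken(token: string): Buffer {
  return createHash("sha256").update(token, "utf8").digest();
}

// Mint 32 random bytes; return the raw token to the caller once and keep only the hash.
function mintResumeToken(): { token: string; tokenHashHex: string } {
  const token = randomBytes(32).toString("base64url");
  return { token, tokenHashHex: hashToken(token).toString("hex") };
}

function checkResume(record: StoredInterrupt, binding: Binding, token: string, nowMs: number): ResumeCheck {
  if (nowMs >= record.expiresAtMs) return "expired";
  const bound =
    record.approvalRequestId === binding.approvalRequestId &&
    record.runId === binding.runId &&
    record.sessionKey === binding.sessionKey &&
    record.toolCallId === binding.toolCallId;
  if (!bound) return "binding_mismatch";
  const expected = Buffer.from(record.tokenHashHex, "hex");
  const actual = hashToken(token);
  // timingSafeEqual throws on unequal lengths, so guard against a malformed stored hash first.
  if (expected.length !== actual.length || !timingSafeEqual(expected, actual)) return "invalid_token";
  return "ok";
}
```

Comparing hashes rather than raw tokens means a leaked state file does not by itself yield a usable token, and the constant-time compare avoids leaking how many leading bytes matched.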
## Repro + Verification
### Environment
- OS: Linux (dev workspace)
- Runtime/container: Node 22 + pnpm workspace
- Model/provider: N/A
- Integration/channel (if any): Gateway RPC + agent tool runtime
- Relevant config (redacted): default gateway state dir via `resolveStateDir()`
### Steps
1. Start gateway and emit a tool interrupt via `tool.interrupt.emit` with `approvalRequestId`, binding fields, and interrupt payload.
2. Observe `tool.interrupt.requested` event and use `resumeToken` with matching binding in `tool.interrupt.resume`.
3. Verify waiting emitter resolves with resumed result; restart gateway and confirm pending/expired state survives from `gateway/tool-interrupts.json`.
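The restart check in step 3 depends on interrupt state being written to and reloaded from `gateway/tool-interrupts.json` under the state dir. The sketch below shows one way such a round trip could work. The record fields, the `{ records: [...] }` file layout, the write-then-rename strategy, and the helper names are assumptions, not the PR's actual format.

```typescript
import { mkdirSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

// Hypothetical on-disk record; only the token hash is stored, never the raw token.
interface PersistedInterrupt {
  approvalRequestId: string;
  status: "pending" | "resumed" | "expired";
  tokenHashHex: string;
  expiresAtMs: number;
}

function interruptsFile(stateDir: string): string {
  return join(stateDir, "gateway", "tool-interrupts.json");
}

// Write to a temp file and rename it into place so readers never see a half-written file.
function saveInterrupts(stateDir: string, records: PersistedInterrupt[]): void {
  const file = interruptsFile(stateDir);
  mkdirSync(dirname(file), { recursive: true });
  const tmp = `${file}.tmp`;
  writeFileSync(tmp, JSON.stringify({ records }, null, 2), { mode: 0o600 });
  renameSync(tmp, file);
}

// Reload after a restart; pending records whose deadline passed while down become expired.
function loadInterrupts(stateDir: string, nowMs: number): PersistedInterrupt[] {
  let raw: string;
  try {
    raw = readFileSync(interruptsFile(stateDir), "utf8");
  } catch {
    return []; // simplification: a missing or unreadable file means no interrupts
  }
  const parsed = JSON.parse(raw) as { records?: PersistedInterrupt[] };
  return (parsed.records ?? []).map((r): PersistedInterrupt =>
    r.status === "pending" && nowMs >= r.expiresAtMs ? { ...r, status: "expired" } : r,
  );
}

// Usage: simulate a restart by saving, then loading at a later time.
const stateDir = mkdtempSync(join(tmpdir(), "gateway-state-"));
saveInterrupts(stateDir, [
  { approvalRequestId: "apr-live", status: "pending", tokenHashHex: "ab".repeat(32), expiresAtMs: 2_000 },
  { approvalRequestId: "apr-stale", status: "pending", tokenHashHex: "cd".repeat(32), expiresAtMs: 500 },
]);
const reloaded = loadInterrupts(stateDir, 1_000);
```

Here `apr-live` should reload as still pending and `apr-stale` as expired, which matches the "pending/expired state survives" check in step 3.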
### Expected
- Interrupt requests persist in gateway state dir.
- Resume succeeds only for valid token + correct run/session/tool binding before expiry.
- Paused tool wrapper resumes and returns final tool result.
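The last expectation can be illustrated with a simplified wrapper. This is a sketch under stated assumptions, not the signature of `wrapToolWithPauseForApproval`: the `Tool`, `ToolContext`, and `ToolResult` types, the `wrapWithPause` name, and the injected `waitForResume` callback (standing in for emitting `tool.interrupt.emit` and awaiting the resume) are all hypothetical.

```typescript
type ToolResult =
  | { status: "ok"; output: unknown }
  | { status: "paused_for_approval"; approvalRequestId: string; interrupt: unknown };

// Hypothetical runtime context; binding fields may be absent on some execution paths.
interface ToolContext {
  runId?: string;
  sessionKey?: string;
  toolCallId?: string;
}

interface PauseBinding {
  approvalRequestId: string;
  runId: string;
  sessionKey: string;
  toolCallId: string;
}

type Tool = (args: unknown, ctx: ToolContext) => Promise<ToolResult>;

// Stand-in for "emit the interrupt and wait until an operator resumes it".
type WaitForResume = (binding: PauseBinding, interrupt: unknown) => Promise<unknown>;

function wrapWithPause(tool: Tool, waitForResume: WaitForResume): Tool {
  return async (args, ctx) => {
    const result = await tool(args, ctx);
    if (result.status !== "paused_for_approval") return result; // non-paused results pass through

    // Fail fast: a paused flow cannot be bound without all three identifiers.
    const { runId, sessionKey, toolCallId } = ctx;
    if (!runId || !sessionKey || !toolCallId) {
      throw new Error("paused tool result requires runId, sessionKey, and toolCallId");
    }

    const output = await waitForResume(
      { approvalRequestId: result.approvalRequestId, runId, sessionKey, toolCallId },
      result.interrupt,
    );
    return { status: "ok", output };
  };
}
```

Injecting `waitForResume` keeps the wrapper independent of the transport, so the same logic could be exercised in unit tests without a running gateway.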
### Actual
- Implemented in code and covered by new targeted tests.
- Full verification commands are blocked in this environment by npm registry DNS failures (`EAI_AGAIN`).
## Evidence
Attach at least one:
- [ ] Failing test/log before + passing after
- [x] Trace/log snippets
- [ ] Screenshot/recording
- [ ] Perf numbers (if relevant)
Trace snippets captured in this branch:
- `pnpm install` fails in workspace with `getaddrinfo EAI_AGAIN registry.npmjs.org`.
- `pnpm check` fails early with `oxfmt: not found` (dependency install incomplete).
- targeted test command fails with `Command "vitest" not found` (dependency install incomplete).
## Human Verification (required)
What you personally verified (not just CI), and how:
- Verified scenarios:
  - Manual code-path review of pause extraction, emit/wait/resume flow, binding enforcement, expiry handling, persistence load/save path, and gateway method/event registration.
  - New unit tests authored for manager persistence/binding/expiry, method handlers, wrapper behavior, and broadcast scope gating.
- Edge cases checked:
  - Existing `approvalRequestId` with mismatched binding is rejected.
  - Resume after expiry returns explicit expired error and resolves waiter as expired.
  - Resume token raw value is not persisted.
- What you did **not** verify:
  - End-to-end runtime execution in this clone, due to the blocked dependency install.
## Compatibility / Migration
- Backward compatible? (`Yes`)
- Config/env changes? (`No`)
- Migration needed? (`No`)
- If yes, exact upgrade steps:
## Failure Recovery (if this breaks)
- How to disable/revert this change quickly:
  - Revert this PR commit(s) to remove pause wrapper + tool interrupt RPC path.
- Files/config to restore:
  - `src/agents/pi-tools.pause-for-approval.ts`
  - `src/gateway/tool-interrupt-manager.ts`
  - related gateway method/protocol registrations.
- Known bad symptoms reviewers should watch for:
  - Paused tools never resume.
  - Resume rejected despite correct operator action (binding/token mismatch).
  - Interrupt events visible outside `operator.approvals` scope.
## Risks and Mitigations
List only real risks for this PR. Add/remove entries as needed. If none, write `None`.
- Risk: pending interrupt records could accumulate if not pruned.
  - Mitigation: retention windows + prune on load/emit/resume/expire.
- Risk: two-phase emit callers and final-response callers may interpret response modes differently.
  - Mitigation: handler preserves single-response default; only emits immediate accepted response when `twoPhase=true`.
- Risk: pause wrapper may throw if runtime context is missing binding fields.
  - Mitigation: wrapper fails fast with explicit error requiring `runId`, `sessionKey`, `toolCallId` for paused flows.
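The first mitigation above (retention windows plus pruning) could look roughly like the following pure function, intended to be called on load, emit, resume, and expire. The record shape, the `settledAtMs` field, and both retention values are placeholders chosen for illustration; the PR text does not state the real windows.

```typescript
type InterruptStatus = "pending" | "resumed" | "expired";

// Hypothetical record; settledAtMs marks when it left the pending state.
interface InterruptRecord {
  id: string;
  status: InterruptStatus;
  expiresAtMs: number;
  settledAtMs?: number;
}

// Placeholder retention windows, not the values used by the PR.
const RESUMED_RETENTION_MS = 60_000;
const EXPIRED_RETENTION_MS = 300_000;

// Move overdue pending records to expired, stamping the deadline as the settle time.
function expireOverdue(record: InterruptRecord, nowMs: number): InterruptRecord {
  if (record.status === "pending" && nowMs >= record.expiresAtMs) {
    return { ...record, status: "expired", settledAtMs: record.expiresAtMs };
  }
  return record;
}

// Keep pending records always; keep settled ones only inside their retention window.
function shouldKeep(record: InterruptRecord, nowMs: number): boolean {
  if (record.status === "pending") return true;
  const windowMs = record.status === "resumed" ? RESUMED_RETENTION_MS : EXPIRED_RETENTION_MS;
  const settledAt = record.settledAtMs ?? record.expiresAtMs;
  return nowMs - settledAt < windowMs;
}

function pruneInterrupts(records: InterruptRecord[], nowMs: number): InterruptRecord[] {
  return records.map((r) => expireOverdue(r, nowMs)).filter((r) => shouldKeep(r, nowMs));
}
```

Keeping settled records for a short window, rather than deleting them immediately, lets a late or repeated resume attempt receive an explicit "expired" or "already resumed" answer instead of "not found".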
## AI Assistance Disclosure
- [x] AI-assisted PR
- Testing degree: lightly tested in this environment (full build/check/tests blocked by dependency install DNS failures).
- I understand and can explain the code paths changed in this PR.
## Prompts / Session Notes
Primary prompt context used for implementation:
- "Re-implement the paused tool execution approvals PR (issue #19072) in this fresh clone/branch..."
- Required items included generic paused result state, durable gateway interrupt persistence, secure resume tokens, strict run/session/tool binding, wait-for-resume semantics, new gateway RPC methods/events, tool wrapper wiring, and validation commands.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Adds persistent, durable tool interrupt infrastructure for pause/resume approval flows across gateway restarts. Core primitives: `ToolInterruptManager` with SHA-256 token hashing + timing-safe comparison, strict run/session/tool binding enforcement, and RPC methods `tool.interrupt.emit` / `tool.interrupt.resume`. Tools returning `status: "paused_for_approval"` are wrapped to block until resumed via `wrapToolWithPauseForApproval`. All state persists to `gateway/tool-interrupts.json` with retention policies for resumed/expired records.
**Key changes:**
- New `ToolInterruptManager` with token hash persistence, expiry timers, and pruning
- Gateway RPC methods emit/resume with broadcast to `operator.approvals` scope
- Pause-for-approval wrapper intercepts paused tool results and waits for resume
- Protocol schemas define binding + interrupt payload structures
- Tests cover token security, binding enforcement, expiry, and restart persistence
**Security:**
- Resume tokens minted from `randomBytes(32)`, only SHA-256 hashes persisted
- Timing-safe comparison via `timingSafeEqual`
- Strict binding validation (`approvalRequestId` + `runId` + `sessionKey` + `toolCallId`)
- Scoped events prevent unauthorized operators from seeing interrupts
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge with minor attention to runtime context wiring
- Score reflects solid security primitives (cryptographic token handling, timing-safe comparison, strict binding enforcement) and comprehensive test coverage. Implementation follows existing patterns for durable gateway state and RPC methods. One point deducted because end-to-end runtime verification was blocked by dependency install failures, and there is a small risk that `runId`/`sessionKey` are not threaded correctly through every tool execution path.
- Pay close attention to `src/agents/pi-tools.ts` and `src/agents/pi-embedded-runner/run/attempt.ts` to verify `runId` is correctly passed to tool context in all execution paths
<sub>Last reviewed commit: f2e5a05</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->