#6095: feat(gateway): support modular guardrails extensions for securing against indirect prompt injections and other agentic threats

by Reapor-Yurnero open 2026-02-01 08:30

docs gateway agents size: XL

Cluster: Security Enhancements and Guardrails

# Modular Guardrail / Validators / Interceptors via Plugin Hooks

## Summary

Introduces a configurable pre- and post-message guardrail plugin system that monitors all LLM traffic, so users can plug in their guardrail of choice to block indirect prompt injection attacks and other policy violations. The initial selections are the open-weight gpt-oss-safeguard and the cloud-based Gray Swan Cygnal, but any guardrail model can be configured as a plugin in the same way. In addition to these model-based guardrails, the plugin interface also supports rule-based validators and monitors. Documentation, tests, and the onboarding workflow are updated to make configuration easy.

## Why

OpenClaw is an agent with deep access to tools, files, networks, and external accounts. That makes **prompt-level attacks (especially indirect prompt injection / IPI)** uniquely dangerous: a single malicious message or web page can steer an agent into data exfiltration, unsafe tool use, or policy bypass. The broader community has been paying increasing attention to these risks as more systems move from chatbots to tool-enabled agents (see the related PRs and issues listed below).

Critically, OpenClaw needs **defense-in-depth** that can:

- inspect inputs before the model sees them,
- validate tool calls/results,
- and scrutinize final outputs for risky behavior.

More importantly, all of these should be **fully customizable** according to each user’s **needs** and desired **policies**. This PR adds the minimal core hooks required for this protection and shows four diverse model-based and non-model guardrails via plugins.

## Example effects:

- Policy-violating request being blocked in Slack by GPT-OSS-20B
  <img width="600" height="220" alt="slack example 1(a policy violation)" src="https://github.com/user-attachments/assets/31137032-5f0b-4198-8a57-797e40dbf652" />
- Prompt injection attempt being blocked in Slack by Gray Swan Cygnal
  <img width="600" height="330" alt="slack example 2 (a prompt injection)" src="https://github.com/user-attachments/assets/8926636c-8462-45ca-906d-00557c132f73" />
- Unsafe tool call being rejected
  <img width="600" height="380" alt="an example that a tool call being rejected due to policy violation" src="https://github.com/user-attachments/assets/e3b41b6d-c426-4fde-88cd-5c0db4d6d172" />
- Tool responses containing indirect prompt injections are marked
  <img width="600" alt="image" src="https://github.com/user-attachments/assets/479ddc72-965e-4a7a-a864-4ed030e26d6d" />

## What this PR does

- Adds **minimal core wiring** so guardrails can run at the right lifecycle stages via the existing plugin hook system.
- Provides a **generic, extensible, and flexible guardrail interface** that supports both model-based guardrails and non-model validators or rule checkers.
- Demonstrates the approach with **four guardrail plugins**:
  - `extensions/grayswan-cygnal-guardrail` (API-based model guardrail)
  - `extensions/gpt-oss-safeguard` (open-weight model guardrail)
  - `extensions/command-safety-guard` (rule-based command validator for `exec`)
  - `extensions/security-audit` (rule-based tool-call audit/monitoring)

  (The latter two were proposed by @pauloportella in #6569)

## Core changes (kept minimal yet essential)

- New hook stages for non-tool guardrails: `before_request`, `after_response`
  `src/plugins/types.ts`, `src/plugins/hooks.ts`
- Guardrail hook execution + block handling in the agent loop
  `src/agents/pi-embedded-runner/run/attempt.ts`, `src/agents/pi-embedded-runner/run.ts`
- Tool hook context wiring for guardrails
  `src/agents/pi-tool-definition-adapter.ts`, `src/agents/pi-embedded-runner/tool-split.ts`, `src/agents/pi-tools.before-tool-call.ts`
- Guardrail helper/factory utilities for consistent plugin behavior
  `src/plugins/guardrails-utils.ts`, `src/plugins/guardrails-utils.test.ts`, `src/plugin-sdk/index.ts`
- Docs: guardrail usage + examples
  `docs/gateway/guardrails.md`

## Rationale for maintainers

This PR keeps core changes narrowly scoped to hook types and wiring; most of the guardrail logic lives in extensions. The result is a flexible guardrail surface with minimal risk to existing behavior. I am also happy to split the extensions into sub-PRs if needed; they are included here mainly for demonstration purposes.
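
As a rough illustration of how a rule-based plugin could use these stages, the sketch below registers handlers for `before_request` and `before_tool_call` on a toy hook runner that stops at the first blocking decision. Only the stage names come from this PR. `HookRunner`, `GuardrailEvent`, `GuardrailDecision`, `on`, and `run` are hypothetical names invented for the example and are not the types in `src/plugins/types.ts`. Handlers are synchronous here for brevity, whereas a model-based guardrail would presumably be asynchronous.

```typescript
// Hypothetical sketch: these types are illustrative, not the PR's real API.
type Stage = "before_request" | "before_tool_call" | "after_tool_call" | "after_response";

interface GuardrailEvent {
  stage: Stage;
  text: string; // prompt, tool arguments, tool result, or final reply
  toolName?: string; // set only for tool stages
}

interface GuardrailDecision {
  block: boolean;
  reason?: string;
}

type Handler = (event: GuardrailEvent) => GuardrailDecision;

class HookRunner {
  private handlers: Record<Stage, Handler[]> = {
    before_request: [],
    before_tool_call: [],
    after_tool_call: [],
    after_response: [],
  };

  on(stage: Stage, handler: Handler): void {
    this.handlers[stage].push(handler);
  }

  // Runs handlers in registration order and returns the first blocking decision.
  run(event: GuardrailEvent): GuardrailDecision {
    for (const handler of this.handlers[event.stage]) {
      const decision = handler(event);
      if (decision.block) return decision;
    }
    return { block: false };
  }
}

// Toy denylist in the spirit of a rule-based command validator for `exec`.
const DENYLIST: RegExp[] = [/\brm\s+-rf\s+\//, /\bcurl\b.*\|\s*(sh|bash)\b/];

const runner = new HookRunner();

runner.on("before_tool_call", (event) => {
  if (event.toolName !== "exec") return { block: false };
  const matched = DENYLIST.some((pattern) => pattern.test(event.text));
  return matched ? { block: true, reason: "command matches denylist" } : { block: false };
});

// Toy input check; a real guardrail would call a model or a richer rule set.
runner.on("before_request", (event) =>
  /ignore (all )?previous instructions/i.test(event.text)
    ? { block: true, reason: "possible prompt injection" }
    : { block: false },
);

const blocked = runner.run({ stage: "before_tool_call", toolName: "exec", text: "rm -rf /" });
const allowed = runner.run({ stage: "before_tool_call", toolName: "exec", text: "ls -la" });
console.log(blocked.block, allowed.block);
```

Returning on the first block keeps the sketch short. A real runner would also need to decide how to combine non-blocking results from several plugins.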

## Testing

- `pnpm lint`
- `pnpm format`
- `pnpm test`
- `pnpm build`

## AI assistance

- **AI-assisted:** Yes (Codex CLI)
- **Testing:** `pnpm lint`, `pnpm build`
- **Prompts/logs:** available on request
- **Understanding:** I’ve reviewed the changes and understand the code

# Issues that this would close

- #4011
- #4840
- #5155
- #5513
- #5943
- #6459
- #6613
- #6823
- #6535
- #7597
- #7604
- #8093
- #7705
- #7829

# Ongoing PRs that this would replace / complement

- Static prompt injection checks, PII filters, dangerous command denylists, and other middleware can now be _supported as extensions/hooks with ours_
  - #5923
  - #5924
  - #6486
  - #6592
  - #7346
  - #8086
  - #8023
  - #8238
  - #11681
  - #11787
  - #11119: as pointed out by its author, it also introduces a new mechanism, but in general it can be extended more easily on top of this plugin-based hook structure.
  - #12050
  - #11700
  - #10559
  - #8818
  - #9030
  - #9748
- Docs/system prompts around prompt injection security that most likely shouldn’t be baked into core
  - #4278
  - #5922: base system prompts can have extra security guidance added via an extension if desired; it shouldn’t be baked in to this degree
  - #10514
- Fixing `before_tool_call` and `after_tool_call` plugin hooks to actually be called
  - #2340
  - #6264
  - Note that #6660 fixed `before_tool_call` but not yet `after_tool_call`
- Updates to the hooks system to allow for modular security guardrail plugins or other security policies
  - #6405: this introduces a new parallel HTTP hooks system that can likely be handled better within the existing hooks system
  - #6569: this introduces a parallel “interceptor” system instead of using existing hooks
  - #11071
  - #10539

# PRs that depend on this one

- #8448
- https://github.com/grayswansecurity/openclaw/pull/6

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR wires a modular “guardrails” plugin system into the agent lifecycle by adding new hook stages (`before_request`, `after_response`) and expanding the existing tool hooks to support
inspection, mutation, and blocking (including returning synthetic tool results). The embedded runner now executes these hooks around model calls, and the tool definition adapter invokes `before_tool_call`/`after_tool_call` with richer context (messages/system prompt). New guardrail utilities and example extensions demonstrate model-based and rule-based guardrails. Key review focus areas were correctness of hook result merging and the stability of event contracts (IDs/context) across call sites, since plugins will depend heavily on these semantics.

<h3>Confidence Score: 3/5</h3>

- This PR is reasonably safe to merge, but there are a couple of behavioral edge cases in hook/guardrail semantics that could surprise plugin authors.
- Core wiring looks coherent and tests exist, but the tool hook result-merging logic can leak prior handlers’ synthetic results, and the toolCallId handling has a type/behavior mismatch that could hide real correlation issues. These are fixable without redesigning the feature.
- src/plugins/hooks.ts, src/agents/pi-tool-definition-adapter.ts, src/agents/pi-embedded-runner/run/attempt.ts
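
The result-merging concern can be made concrete with a small sketch. The review summary does not describe the actual merge code, so everything below is an assumed shape: a hypothetical `ToolHookResult` with `block`, `params`, and `syntheticResult` fields, a `naiveMerge` that shows one way a stale synthetic result could survive, and a `safeMerge` that keeps a synthetic result only together with the block that produced it. Treating a block as sticky (a later handler cannot undo it) is a design choice made for this example, not something the PR states.

```typescript
// Hypothetical result shape; the PR's real types may differ.
interface ToolHookResult {
  block?: boolean;
  params?: Record<string, unknown>; // rewritten tool arguments
  syntheticResult?: string; // stand-in output returned instead of running the tool
}

// Naive merge: later fields overwrite earlier ones, but omitted fields survive.
// A synthetic result from an earlier handler can outlive the block it belonged to.
function naiveMerge(results: ToolHookResult[]): ToolHookResult {
  return results.reduce<ToolHookResult>((acc, next) => ({ ...acc, ...next }), {});
}

// Safer merge (assumed policy): the first block wins and stays in force, and a
// synthetic result is accepted only from the handler that issued that block.
function safeMerge(results: ToolHookResult[]): ToolHookResult {
  const merged: ToolHookResult = { block: false };
  for (const next of results) {
    if (next.params !== undefined) merged.params = next.params;
    if (next.block && !merged.block) {
      merged.block = true;
      if (next.syntheticResult !== undefined) merged.syntheticResult = next.syntheticResult;
    }
  }
  return merged;
}

const results: ToolHookResult[] = [
  { block: true, syntheticResult: "Blocked by policy A" },
  { block: false },
];

const naive = naiveMerge(results); // not blocked, yet still carries policy A's synthetic result
const safe = safeMerge(results); // blocked, with the synthetic result from the blocking handler
console.log(naive, safe);
```

Under these assumptions the naive merge ends up with `block: false` while still carrying the earlier handler's synthetic result, which is the kind of mismatch a caller could mishandle.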