
#13817: feat(agents): configurable prompt injection monitor for tool results

by ElleNajt · open · 2026-02-11 02:30
agents stale
## Summary

- Adds configurable prompt injection monitoring for tool results
- **Off by default** - opt-in for users who want this protection
- Configurable actions: `block` (redact), `warn` (include warning), or `log` (audit only)
- Incident logging to file for audit trail
- Human-in-the-loop flow for reviewing redacted content

## Motivation

Addresses #7705 - users wanted a way to configure prompt injection scanning.

This is a defense-in-depth measure. **It's definitely trickable** - sophisticated attacks can evade LLM-based detection. But it catches many common injection patterns and is probably better than nothing.

## Why tool results?

In CLI usage, external data (emails, APIs, databases, Slack messages, etc.) typically enters via Bash-executed scripts rather than dedicated tools. The agent writes a Python script to fetch emails, runs it, and the email contents come back as a Bash tool result.

This makes **tool result monitoring an effective chokepoint** - it catches external content regardless of how it was fetched.

### Comparison with other prompt injection PRs

| PR | Scope | Approach | Entry point |
|---|---|---|---|
| #8086 | Chat messages (Telegram, Discord, etc.) | Pattern-based regex detection | `finalizeInboundContext()` |
| #8238 | Chat messages | External API (glitchward.com) | Plugin hooks |
| #13042 | External content (emails, web, tools) | Guard model paraphrasing | Not yet integrated |
| **This PR** | Tool results | LLM scoring + redaction | `wrapToolWithPromptInjectionMonitor()` |

These are complementary, not competing. #8086/#8238 protect against malicious chat users in bot mode. This PR protects against poisoned content the agent reads in CLI mode (files, websites, script output).

If #13042 is merged, this could integrate with the guard model to offer a `sanitize` action as an alternative to blocking - we provide the detection/scoring layer, they provide the sanitization layer.
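To make the chokepoint idea concrete, here is a minimal sketch of what wrapping a tool with a monitor could look like. Only the name `wrapToolWithPromptInjectionMonitor` comes from the PR; the `ToolResult` shape, the injected `score` function, and the threshold constant are hypothetical, and the scorer is injected so the example runs without a model call:

```typescript
// Hypothetical sketch: wrap a tool so its result is scored before returning.
type ToolResult = { content: string };
type Scorer = (content: string) => Promise<number>; // 0-100 injection likelihood
type Action = "block" | "warn" | "log";

const BLOCK_THRESHOLD = 20; // per the PR: scores >= 20 are redacted in block mode

function wrapToolWithPromptInjectionMonitor(
  tool: (input: string) => Promise<ToolResult>,
  score: Scorer,
  action: Action,
): (input: string) => Promise<ToolResult> {
  return async (input) => {
    const result = await tool(input);
    let s: number;
    try {
      s = await score(result.content);
    } catch {
      s = 100; // fail closed: a scoring error is treated as maximum risk
    }
    if (s < BLOCK_THRESHOLD) return result;
    if (action === "block") {
      return {
        content:
          "[REDACTED: possible prompt injection. Ask the user to review " +
          "before using disable_pi_monitor to bypass.]",
      };
    }
    if (action === "warn") {
      return { content: `[WARNING: injection score ${s}]\n${result.content}` };
    }
    return result; // log-only: pass through; incident recording happens elsewhere
  };
}
```

Because the scorer is a parameter, the same wrapper serves production (real model call) and tests (deterministic stub) without branching on environment.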
## Configuration

```yaml
# In settings.yaml
security:
  promptInjection:
    enabled: true
    scanModel: "openai/gpt-4o-mini" # optional, falls back to default model
    action: block # block | warn | log
    logIncidents: true # write to audit log
    logPath: "~/.openclaw/security/prompt-injection.log"
```

## How it works

1. Tool results are scored 0-100 by a small fast model for prompt injection likelihood
2. Based on the `action` setting:
   - `block`: Results scoring ≥20 are redacted with a warning message
   - `warn`: Results include a warning but pass through
   - `log`: Results pass through, incident is logged only
3. For `block` mode, the message instructs the agent to ask the user for review before using `disable_pi_monitor` to bypass
4. `disable_pi_monitor` is single-use - it only bypasses the next tool call
5. All incidents are logged to the audit file (if enabled), including bypasses

### Audit log format

```json
{"timestamp":"2026-02-11T03:35:00.000Z","tool":"Bash","score":75,"reasoning":"Contains instruction override patterns","action":"block","bypassed":false}
```

The `bypassed` field indicates whether the user reviewed and allowed the content through via `disable_pi_monitor`.

## Limitations

- **Trickable**: Sophisticated attacks can evade LLM-based detection
- **Token cost**: Marginal compared to long conversations, though large tool outputs increase cost
- **Fail-closed**: Errors in scoring cause redaction when `action=block` (safe default)

## Future ideas

- **Paraphrase mode**: Paraphrasing tool outputs tends to reduce attack effectiveness. Could be interesting to explore whether this makes things safer in practice.

## Test plan

- [x] Existing tests pass
- [x] Calibration tests added (skip when no API key available)
- [ ] Manual testing with prompt injection examples

---

*Written by Claude*
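As a postscript, producing the audit-log format shown above can be sketched as a small formatter. The field names and order mirror the documented entry; the `Incident` type and `formatIncident` helper are hypothetical illustrations, not code from this PR:

```typescript
// Hypothetical sketch: build one JSONL audit-log line in the documented format.
interface Incident {
  timestamp: string;
  tool: string;
  score: number;
  reasoning: string;
  action: "block" | "warn" | "log";
  bypassed: boolean;
}

function formatIncident(
  tool: string,
  score: number,
  reasoning: string,
  action: Incident["action"],
  bypassed = false,
  now: Date = new Date(), // injectable clock keeps the output deterministic in tests
): string {
  const record: Incident = {
    timestamp: now.toISOString(),
    tool,
    score,
    reasoning,
    action,
    bypassed,
  };
  // JSON.stringify preserves key insertion order, so the line matches the
  // documented field order; one object per line (JSONL) keeps the log appendable.
  return JSON.stringify(record);
}
```

Appending each line to the configured `logPath` (with `bypassed: true` written when the user allows content through via `disable_pi_monitor`) is left to the caller.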
