#13817: feat(agents): configurable prompt injection monitor for tool results
## Summary
- Adds configurable prompt injection monitoring for tool results
- **Off by default** - opt-in for users who want this protection
- Configurable actions: `block` (redact), `warn` (include warning), or `log` (audit only)
- Incident logging to file for audit trail
- Human-in-the-loop flow for reviewing redacted content
## Motivation
Addresses #7705 - users wanted a way to configure prompt injection scanning.
This is a defense-in-depth measure. **It's definitely trickable** - sophisticated attacks can evade LLM-based detection. But it catches many common injection patterns and is probably better than nothing.
## Why tool results?
In CLI usage, external data (emails, APIs, databases, Slack messages, etc.) typically enters via Bash-executed scripts rather than dedicated tools. The agent writes a Python script to fetch emails, runs it, and the email contents come back as a Bash tool result.
This makes **tool result monitoring an effective chokepoint** - it catches external content regardless of how it was fetched.
### Comparison with other prompt injection PRs
| PR | Scope | Approach | Entry point |
|---|---|---|---|
| #8086 | Chat messages (Telegram, Discord, etc.) | Pattern-based regex detection | `finalizeInboundContext()` |
| #8238 | Chat messages | External API (glitchward.com) | Plugin hooks |
| #13042 | External content (emails, web, tools) | Guard model paraphrasing | Not yet integrated |
| **This PR** | Tool results | LLM scoring + redaction | `wrapToolWithPromptInjectionMonitor()` |
These are complementary, not competing. #8086/#8238 protect against malicious chat users in bot mode. This PR protects against poisoned content the agent reads in CLI mode (files, websites, script output).
If #13042 is merged, this could integrate with the guard model to offer a `sanitize` action as an alternative to blocking - we provide the detection/scoring layer, they provide the sanitization layer.
## Configuration
```yaml
# In settings.yaml
security:
  promptInjection:
    enabled: true
    scanModel: "openai/gpt-4o-mini" # optional, falls back to default model
    action: block # block | warn | log
    logIncidents: true # write to audit log
    logPath: "~/.openclaw/security/prompt-injection.log"
```
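For reference, the settings block above could map to a TypeScript shape like this. Field names mirror the YAML; the type names and the `defaults` object are illustrative assumptions, not the PR's exact code:

```typescript
// Hypothetical shape for the security.promptInjection settings block.
// Field names mirror the YAML above; type names are assumptions.
type PromptInjectionAction = "block" | "warn" | "log";

interface PromptInjectionConfig {
  enabled: boolean;          // off by default (opt-in)
  scanModel?: string;        // optional; falls back to the default model
  action: PromptInjectionAction;
  logIncidents: boolean;     // write incidents to the audit log
  logPath: string;           // e.g. "~/.openclaw/security/prompt-injection.log"
}

// Assumed defaults when the block is absent; enabled: false keeps the
// feature opt-in, matching the "off by default" note in the summary.
const defaults: PromptInjectionConfig = {
  enabled: false,
  action: "block",
  logIncidents: true,
  logPath: "~/.openclaw/security/prompt-injection.log",
};
```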
## How it works
1. Tool results are scored 0-100 by a small fast model for prompt injection likelihood
2. Based on the `action` setting:
- `block`: Results scoring ≥20 are redacted with a warning message
- `warn`: Results include a warning but pass through
- `log`: Results pass through, incident is logged only
3. For `block` mode, the message instructs the agent to ask the user for review before using `disable_pi_monitor` to bypass
4. `disable_pi_monitor` is single-use - it only bypasses the next tool call
5. All incidents are logged to the audit file (if enabled), including bypasses
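The action-handling in steps 1-2 can be sketched as follows. The threshold constant matches the ≥20 cutoff above, but the function and message wording are illustrative assumptions rather than the PR's actual implementation (only `disable_pi_monitor` and the three action names come from this PR):

```typescript
// Sketch of how a scored tool result is transformed per the action setting.
// The helper name, message wording, and ScanResult shape are assumptions.
type Action = "block" | "warn" | "log";

interface ScanResult {
  score: number;     // 0-100 injection likelihood from the scan model
  reasoning: string; // short explanation from the scan model
}

const BLOCK_THRESHOLD = 20; // results scoring >= 20 trigger the action

function applyAction(action: Action, result: string, scan: ScanResult): string {
  if (scan.score < BLOCK_THRESHOLD) return result; // below threshold: untouched
  switch (action) {
    case "block":
      // Redact, and instruct the agent to get user review before bypassing.
      return (
        "[REDACTED: possible prompt injection (score " + scan.score + "). " +
        "Ask the user to review before using disable_pi_monitor to bypass.]"
      );
    case "warn":
      // Content passes through, prefixed with a warning.
      return (
        "[WARNING: possible prompt injection (score " + scan.score + ")]\n" +
        result
      );
    case "log":
      // Audit-only: content passes through unchanged; incident still logged.
      return result;
  }
}
```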
### Audit log format
```json
{"timestamp":"2026-02-11T03:35:00.000Z","tool":"Bash","score":75,"reasoning":"Contains instruction override patterns","action":"block","bypassed":false}
```
The `bypassed` field indicates whether the user reviewed and allowed the content through via `disable_pi_monitor`.
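Writing one JSON object per line (as in the sample entry above) keeps the log greppable and append-only. A minimal sketch, assuming Node's `fs` module; the helper names are hypothetical:

```typescript
import { appendFileSync } from "node:fs";

// One audit entry, matching the JSON-lines format shown above.
interface Incident {
  timestamp: string;
  tool: string;
  score: number;
  reasoning: string;
  action: "block" | "warn" | "log";
  bypassed: boolean; // true if the user allowed it via disable_pi_monitor
}

// Serialize one incident as a single JSON line.
function formatIncident(incident: Incident): string {
  return JSON.stringify(incident);
}

// Append the entry to the audit log (hypothetical helper name).
function logIncident(logPath: string, incident: Incident): void {
  appendFileSync(logPath, formatIncident(incident) + "\n");
}
```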
## Limitations
- **Trickable**: Sophisticated attacks can evade LLM-based detection
- **Token cost**: Marginal compared to long conversations, though large tool outputs increase cost
- **Fail-closed**: Errors in scoring cause redaction when action=block (safe default)
## Future ideas
- **Paraphrase mode**: Paraphrasing tool outputs tends to reduce attack effectiveness. Could be interesting to explore whether this makes things safer in practice.
## Test plan
- [x] Existing tests pass
- [x] Calibration tests added (skip when no API key available)
- [ ] Manual testing with prompt injection examples
---
*Written by Claude*