#10559: feat(security): add plugin output scanner for prompt injection detection
stale
size: L
## Human View
### Summary
Plugins return untrusted text that gets fed back into the LLM context. If a plugin response contains injection patterns, the model can be manipulated into ignoring its guidelines.
OpenClaw already has:
- `skill-scanner.ts` — scans plugin **code** for dangerous APIs (eval, exec, etc.)
- `external-content.ts` — wraps untrusted input with boundary markers + `detectSuspiciousPatterns()`
**This PR adds the missing piece**: `output-scanner.ts` — scanning the **text returned by plugins** for prompt injection patterns before it enters the LLM context.
#### 15 OWASP LLM01-aligned patterns
| Severity | Count | Examples |
|----------|-------|---------|
| Critical | 5 | Instruction override, role hijack, guideline disregard, forget instructions |
| High | 5 | Prompt extraction, hidden markers (`[SYSTEM]`, `<\|im_start\|>`), data exfil, tool invocation |
| Medium | 3 | Zero-width chars, ANSI escapes, base64 payload execution |
| Low | 2 | Jailbreak keywords (DAN), persona override |
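
To make the medium-severity row concrete, here is a minimal sketch of how zero-width characters and ANSI escape sequences could be detected. The pattern names and the `hasHiddenControlChars` helper are hypothetical and chosen for illustration; the rules actually shipped in `output-scanner.ts` may differ.

```typescript
// Illustrative patterns only; not taken from the PR's rule set.
// Zero-width space, non-joiner, joiner, word joiner, and byte order mark.
const ZERO_WIDTH = /[\u200B-\u200D\u2060\uFEFF]/;
// ANSI CSI sequences such as "\u001B[31m" (set red foreground).
const ANSI_ESCAPE = /\u001B\[[0-9;]*[A-Za-z]/;

// Neither pattern has the `g` flag, so test() keeps no state between calls.
function hasHiddenControlChars(text: string): boolean {
  return ZERO_WIDTH.test(text) || ANSI_ESCAPE.test(text);
}
```

Both character classes are invisible in most renderers, which is why they are worth flagging even though they are not instructions by themselves.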
#### Key features
- **Prefilter**: fast keyword `indexOf` check before running any regex, so clean output costs one linear pass and no regex evaluation
- **Code block gating**: ignores matches inside `` ``` `` fenced blocks (reduces false positives on docs/examples)
- **Configurable maxChars**: default 64 KB, prevents unbounded scan time
- **Structured findings**: `{ ruleId, name, severity, evidence, position }`
- **`hasInjection(text)`**: boolean guard for pipeline use
- **`listScanRules()`**: introspection for documentation/admin UI
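
The features above can be sketched together as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the two rules, the `keywords` field, and the names `SKETCH_RULES`, `maskFencedBlocks`, and `sketchScan` are hypothetical, and the 64 KB default is assumed to mean 65,536 characters.

```typescript
// Hypothetical sketch: rule contents and helper names are illustrative, not the PR's code.
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  ruleId: string;
  name: string;
  severity: Severity;
  evidence: string;
  position: number;
}

interface SketchRule {
  ruleId: string;
  name: string;
  severity: Severity;
  keywords: string[]; // lowercase substrings used by the prefilter
  pattern: RegExp; // no `g` flag, so exec() carries no state between calls
}

const SKETCH_RULES: SketchRule[] = [
  {
    ruleId: "instruction-override",
    name: "Instruction override",
    severity: "critical",
    keywords: ["ignore"],
    pattern: /ignore\s+(?:all\s+)?previous\s+instructions/i,
  },
  {
    ruleId: "hidden-marker",
    name: "Hidden system marker",
    severity: "high",
    keywords: ["[system]"],
    pattern: /\[SYSTEM\]/i,
  },
];

// Blank out ``` fenced blocks while keeping every character offset unchanged.
// An unclosed fence is treated as running to the end of the text.
function maskFencedBlocks(text: string): string {
  return text.replace(/```[\s\S]*?(?:```|$)/g, (block) => block.replace(/[^\n]/g, " "));
}

function sketchScan(
  input: string,
  options: { maxChars?: number; ignoreCodeBlocks?: boolean } = {},
): Finding[] {
  const { maxChars = 64 * 1024, ignoreCodeBlocks = true } = options;
  const capped = input.slice(0, maxChars); // bound the scan time
  const text = ignoreCodeBlocks ? maskFencedBlocks(capped) : capped;
  const lower = text.toLowerCase();
  const findings: Finding[] = [];
  for (const rule of SKETCH_RULES) {
    // Prefilter: skip the regex unless at least one keyword is present.
    if (!rule.keywords.some((k) => lower.indexOf(k) !== -1)) continue;
    const m = rule.pattern.exec(text);
    if (m === null) continue;
    findings.push({
      ruleId: rule.ruleId,
      name: rule.name,
      severity: rule.severity,
      evidence: m[0].slice(0, 80), // truncate long evidence
      position: m.index,
    });
  }
  return findings.sort((a, b) => a.position - b.position);
}
```

Masking fenced blocks with spaces, rather than cutting them out, keeps reported positions valid against the original text.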
#### Usage
```ts
import { scanPluginOutput, hasInjection } from "./output-scanner.js";

// Structured scan
const result = scanPluginOutput(pluginResponse);
if (!result.clean) {
  console.warn(`${result.findings.length} threats (max: ${result.maxSeverity})`);
  // block, sanitize, or flag
}

// Quick guard
if (hasInjection(pluginResponse)) {
  throw new Error("Plugin output contains injection");
}
```
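
One possible way to act on the "block, sanitize, or flag" comment is to map `maxSeverity` to a policy decision. The sketch below is an assumption-laden illustration: `ScanResultLike`, `Action`, and `decideAction` are hypothetical names, and the lowercase severity strings and nullable `maxSeverity` are guesses at the result shape, not something this PR confirms.

```typescript
// Hypothetical policy helper; the result shape is assumed, not taken from the PR.
type Severity = "low" | "medium" | "high" | "critical";

interface ScanResultLike {
  clean: boolean;
  maxSeverity: Severity | null; // assumed to be null when nothing was found
  findings: { ruleId: string; severity: Severity }[];
}

type Action = "allow" | "flag" | "block";

function decideAction(result: ScanResultLike): Action {
  if (result.clean || result.maxSeverity === null) return "allow";
  // Block the most dangerous classes; flag the rest for logging or review.
  return result.maxSeverity === "critical" || result.maxSeverity === "high"
    ? "block"
    : "flag";
}
```

Keeping the policy separate from the scanner lets callers tune thresholds without touching the rule set.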
#### What this does NOT change
- No modifications to existing files (`skill-scanner.ts`, `external-content.ts`)
- No breaking changes — purely additive new file + tests
- Complements existing security infrastructure
### Test plan
- [x] 35 vitest tests in `output-scanner.test.ts`
- [x] Clean output (normal text, code, JSON, empty string)
- [x] All 15 rules tested individually by severity
- [x] Multiple simultaneous threats + position sorting
- [x] Code block gating (injection inside `` ``` `` fenced blocks ignored)
- [x] `ignoreCodeBlocks: false` option
- [x] `maxChars` truncation
- [x] Edge cases: very long input, case insensitivity, evidence truncation
- [x] `hasInjection()` helper
- [x] `listScanRules()` introspection
---
## AI View (DCCE Protocol v1.0)
### Metadata
- **Generator**: Claude (Anthropic) via Cursor IDE
- **Methodology**: AI-assisted development with human oversight and review
### AI Contribution Summary
- Solution design and implementation
- Test development (35 test cases)
### Verification Steps Performed
1. Analyzed existing codebase patterns
2. Implemented feature with comprehensive tests
3. Ran test suite (35 tests passing)
### Human Review Guidance
- Core changes are in the new files `output-scanner.ts` and `output-scanner.test.ts`; `skill-scanner.ts` and `external-content.ts` are unchanged
- Verify test coverage matches the described scenarios
Made with [Cursor](https://cursor.com)
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
- Adds a new `src/security/output-scanner.ts` module that scans untrusted plugin-returned text for OWASP LLM01-aligned prompt injection patterns, with optional code-block gating and a max-length cap.
- Exposes a structured scan API (`scanPluginOutput`) plus convenience helpers (`hasInjection`, `listScanRules`).
- Introduces a dedicated vitest suite (`src/security/output-scanner.test.ts`) covering rule detection, options, and a few edge cases.
- Fits alongside existing security tooling by targeting plugin *output* (vs. `skill-scanner` for plugin code and `external-content` for boundary-wrapping).
<h3>Confidence Score: 3/5</h3>
- This PR is directionally safe, but a few scanner logic edge cases can cause missed or incomplete findings.
- Core idea and tests are straightforward and additive, but `scanPluginOutput` currently collects only a single match per rule, relies on shared RegExp objects (future `g`-flag changes could cause stateful false negatives), and the `maxChars` cap can be bypassed by passing `NaN`. These are fixable but should be addressed before relying on the scanner for security gating.
- File needing the most attention: `src/security/output-scanner.ts`
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
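
The three edge cases raised in the Greptile review (one match per rule, shared RegExp state, and a `NaN` cap) could be addressed along these lines. This is a sketch of one possible fix, not the change the author made: `normalizeMaxChars` and `findAllMatches` are hypothetical helpers, and the 65,536-character fallback is an assumed reading of the 64 KB default.

```typescript
// Hypothetical hardening helpers; names and defaults are assumptions.

// A comparison such as `text.length > maxChars` is always false when maxChars
// is NaN, so non-finite or non-positive values fall back to the default.
function normalizeMaxChars(value: unknown, fallback: number = 64 * 1024): number {
  if (typeof value !== "number" || !isFinite(value) || value <= 0) return fallback;
  return Math.floor(value);
}

interface Match {
  evidence: string;
  position: number;
}

// Collect every match of a rule. A fresh RegExp is built on each call, so no
// `lastIndex` state leaks between scans even if the rule has the `g` flag.
function findAllMatches(pattern: RegExp, text: string): Match[] {
  const flags = pattern.flags.indexOf("g") === -1 ? pattern.flags + "g" : pattern.flags;
  const re = new RegExp(pattern.source, flags);
  const out: Match[] = [];
  let m: RegExpExecArray | null;
  while ((m = re.exec(text)) !== null) {
    out.push({ evidence: m[0], position: m.index });
    if (m[0].length === 0) re.lastIndex++; // avoid looping forever on empty matches
  }
  return out;
}
```

Reporting every occurrence matters for sanitizing: removing only the first match of a rule would leave later copies of the same payload in the text.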