← Back to PRs

#10559: feat(security): add plugin output scanner for prompt injection detection

by DukeDeSouth open 2026-02-06 17:27 View on GitHub →
stale size: L
## Human View ### Summary Plugins return untrusted text that gets fed back into the LLM context. If a plugin response contains injection patterns, the model can be manipulated into ignoring its guidelines. OpenClaw already has: - `skill-scanner.ts` — scans plugin **code** for dangerous APIs (eval, exec, etc.) - `external-content.ts` — wraps untrusted input with boundary markers + `detectSuspiciousPatterns()` **This PR adds the missing piece**: `output-scanner.ts` — scanning the **text returned by plugins** for prompt injection patterns before it enters the LLM context. #### 15 OWASP LLM01-aligned patterns | Severity | Count | Examples | |----------|-------|---------| | Critical | 5 | Instruction override, role hijack, guideline disregard, forget instructions | | High | 5 | Prompt extraction, hidden markers (`[SYSTEM]`, `<\|im_start\|>`), data exfil, tool invocation | | Medium | 3 | Zero-width chars, ANSI escapes, base64 payload execution | | Low | 2 | Jailbreak keywords (DAN), persona override | #### Key features - **Prefilter**: fast keyword indexOf check before running regex — O(1) for clean output - **Code block gating**: ignores matches inside `` ``` `` fenced blocks (reduces false positives on docs/examples) - **Configurable maxChars**: default 64 KB, prevents unbounded scan time - **Structured findings**: `{ ruleId, name, severity, evidence, position }` - **`hasInjection(text)`**: boolean guard for pipeline use - **`listScanRules()`**: introspection for documentation/admin UI #### Usage ```ts import { scanPluginOutput, hasInjection } from "./output-scanner.js"; // Structured scan const result = scanPluginOutput(pluginResponse); if (!result.clean) { console.warn(`${result.findings.length} threats (max: ${result.maxSeverity})`); // block, sanitize, or flag } // Quick guard if (hasInjection(pluginResponse)) { throw new Error("Plugin output contains injection"); } ``` #### What this does NOT change - No modifications to existing files (`skill-scanner.ts`, `external-content.ts`) - No breaking changes — purely additive new file + tests - Complements existing security infrastructure ### Test plan - [x] 35 vitest tests in `output-scanner.test.ts` - [x] Clean output (normal text, code, JSON, empty string) - [x] All 15 rules tested individually by severity - [x] Multiple simultaneous threats + position sorting - [x] Code block gating (injection inside ``` blocks ignored) - [x] `ignoreCodeBlocks: false` option - [x] `maxChars` truncation - [x] Edge cases: very long input, case insensitivity, evidence truncation - [x] `hasInjection()` helper - [x] `listScanRules()` introspection --- ## AI View (DCCE Protocol v1.0) ### Metadata - **Generator**: Claude (Anthropic) via Cursor IDE - **Methodology**: AI-assisted development with human oversight and review ### AI Contribution Summary - Solution design and implementation - Test development (35 test cases) ### Verification Steps Performed 1. Analyzed existing codebase patterns 2. Implemented feature with comprehensive tests 3. Ran test suite (35 tests passing) ### Human Review Guidance - Core changes are in: `skill-scanner.ts`, `external-content.ts`, `output-scanner.ts` - Verify test coverage matches the described scenarios Made with M7 [Cursor](https://cursor.com) <!-- greptile_comment --> <h2>Greptile Overview</h2> <h3>Greptile Summary</h3> - Adds a new `src/security/output-scanner.ts` module that scans untrusted plugin-returned text for OWASP LLM01-aligned prompt injection patterns, with optional code-block gating and a max-length cap. - Exposes a structured scan API (`scanPluginOutput`) plus convenience helpers (`hasInjection`, `listScanRules`). - Introduces a dedicated vitest suite (`src/security/output-scanner.test.ts`) covering rule detection, options, and a few edge cases. - Fits alongside existing security tooling by targeting plugin *output* (vs. `skill-scanner` for plugin code and `external-content` for boundary-wrapping). <h3>Confidence Score: 3/5</h3> - This PR is directionally safe, but a few scanner logic edge cases can cause missed or incomplete findings. - Core idea and tests are straightforward and additive, but `scanPluginOutput` currently collects only a single match per rule, relies on shared RegExp objects (future `g`-flag changes could cause stateful false negatives), and the `maxChars` cap can be bypassed by passing `NaN`. These are fixable but should be addressed before relying on the scanner for security gating. - src/security/output-scanner.ts <!-- greptile_other_comments_section --> <sub>(4/5) You can add custom instructions or style guidelines for the agent [here](https://app.greptile.com/review/github)!</sub> <!-- /greptile_comment -->

Most Similar PRs