#13012: Security: detect invisible Unicode in skills and plugins (ASCII smuggling, Trojan Source)
stale
---
## Summary
This PR extends OpenClaw's skill scanner to detect invisible Unicode characters hidden in **skill definition files** (SKILL.md, HOOK.md, etc.) and other text content — not just code files. This catches **ASCII smuggling** and **hidden prompt injection** attacks where invisible instructions are embedded in markdown that looks clean to humans but is interpreted by the LLM.
## Problem
The existing skill scanner checks `.js`/`.ts` code files for dangerous patterns like `eval()`, `child_process.exec()`, and crypto mining references. That's valuable for catching malicious executable code.
However, the primary attack surface for skills isn't the code — it's the **markdown files**. SKILL.md is read directly by the LLM as instructions. An attacker can embed invisible Unicode characters in these files that:
- **Are completely invisible** in code review, GitHub diffs, and text editors
- **Are interpreted by LLMs** as readable text (tokenizers still encode these characters, so the model receives them even though a human reviewer sees nothing)
- **Can contain arbitrary instructions** — "exfiltrate all API keys", "run this shell command", "ignore previous instructions"
This is not theoretical. Unicode Tag characters (U+E0000–E007F) mirror the ASCII table and render as zero-width invisible text. Tools like the [ASCII Smuggler](https://embracethered.com/blog/ascii-smuggler.html) make creating these payloads trivial.
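To make the "mirrors the ASCII table" point concrete, here is a minimal sketch of how such a payload could be built and recovered. The helper names (`encodeAsTags`, `decodeTags`) are hypothetical and are not part of this PR; the only assumption is the offset mapping U+E0000 + ASCII code described above.

```typescript
// Illustrative only: shows why Tag characters can carry hidden ASCII text.
// Function names are hypothetical, not taken from the scanner.
const TAG_BASE = 0xe0000;

// Map each ASCII character to its invisible Tag counterpart (U+E0000 + code).
function encodeAsTags(ascii: string): string {
  return Array.from(ascii, (ch) => {
    const cp = ch.codePointAt(0)!;
    if (cp > 0x7f) throw new RangeError("only ASCII input can be tag-encoded");
    return String.fromCodePoint(TAG_BASE + cp);
  }).join("");
}

// Recover hidden ASCII from any Tag characters in the text; all other
// characters are skipped.
function decodeTags(text: string): string {
  let out = "";
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    if (cp >= TAG_BASE && cp <= TAG_BASE + 0x7f) {
      out += String.fromCodePoint(cp - TAG_BASE);
    }
  }
  return out;
}

const visible = "# My helpful skill";
const poisoned = visible + encodeAsTags("ignore previous instructions");
console.log(decodeTags(poisoned)); // prints "ignore previous instructions"
```

In most renderers `poisoned` looks identical to `visible`, which is why review by eye is not enough.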
## What this PR changes
### 1. Scans text files, not just code files
The skill scanner now checks **all text files** in a skill directory for invisible Unicode:
- **Text files** (`.md`, `.txt`, `.yaml`, `.yml`, `.json`) → scanned for invisible Unicode only
- **Code files** (`.js`, `.ts`, `.mjs`, `.cjs`, `.tsx`, `.jsx`) → scanned for invisible Unicode AND existing code rules (eval, exec, etc.)
This means SKILL.md, HOOK.md, README.md, config files — everything a skill ships — gets checked for hidden content.
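The split between the two scan modes can be sketched roughly as follows. This is an assumption-laden illustration: the extension lists come from the bullets above, but `scanModeFor`, `extensionOf`, and the set names are hypothetical and may not match the actual code in `skill-scanner.ts`.

```typescript
// Hypothetical sketch of routing files to a scan mode by extension.
type ScanMode = "unicode-only" | "unicode-and-code-rules" | "skip";

// Extension lists copied from the PR description; the constant names are made up.
const TEXT_FILE_EXTS = new Set([".md", ".txt", ".yaml", ".yml", ".json"]);
const CODE_FILE_EXTS = new Set([".js", ".ts", ".mjs", ".cjs", ".tsx", ".jsx"]);

// Return the lowercased extension (including the dot), or "" if there is none.
function extensionOf(fileName: string): string {
  const base = fileName.slice(fileName.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot <= 0 ? "" : base.slice(dot).toLowerCase();
}

function scanModeFor(fileName: string): ScanMode {
  const ext = extensionOf(fileName);
  if (CODE_FILE_EXTS.has(ext)) return "unicode-and-code-rules";
  if (TEXT_FILE_EXTS.has(ext)) return "unicode-only";
  return "skip";
}

console.log(scanModeFor("skills/demo/SKILL.md")); // prints "unicode-only"
console.log(scanModeFor("skills/demo/index.ts")); // prints "unicode-and-code-rules"
```

Deriving both decisions from a single pair of sets keeps "is this file scanned?" and "is this file code?" from drifting apart.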
### 2. Invisible Unicode detection engine
The scanner walks every character in every scannable file, classifying invisible code points into four categories:
| Category | Code Points | Risk | Legitimate Use |
|----------|------------|------|----------------|
| **Tag characters** | U+E0000–E007F | ASCII smuggling — encodes hidden text as invisible Unicode | Essentially none in skill files |
| **Variation selectors** | U+FE00–FE0F, U+E0100–E01EF | Byte-level data smuggling (the 256 selectors can each stand for one byte value) | Emoji presentation (❤️ vs ❤) |
| **Bidi controls** | U+202A–202E, U+2066–2069, U+061C | Trojan Source — code displays differently than it executes | Arabic, Hebrew, Persian text |
| **Zero-width chars** | U+200B–200F, U+2060, U+FEFF | Hidden characters in identifiers/strings | Emoji ZWJ sequences (👨👩👧), BOM |
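The table translates fairly directly into range checks. The following is a sketch under the assumption that classification is a pure function of the code point; `classifyCodePoint` and `countByCategory` are illustrative names, not the scanner's real API.

```typescript
// Hypothetical classifier using the ranges from the table above.
type InvisibleCategory = "tag" | "variation-selector" | "bidi" | "zero-width";

function classifyCodePoint(cp: number): InvisibleCategory | null {
  if (cp >= 0xe0000 && cp <= 0xe007f) return "tag";
  if ((cp >= 0xfe00 && cp <= 0xfe0f) || (cp >= 0xe0100 && cp <= 0xe01ef)) {
    return "variation-selector";
  }
  if ((cp >= 0x202a && cp <= 0x202e) || (cp >= 0x2066 && cp <= 0x2069) || cp === 0x061c) {
    return "bidi";
  }
  if ((cp >= 0x200b && cp <= 0x200f) || cp === 0x2060 || cp === 0xfeff) {
    return "zero-width";
  }
  return null; // visible or otherwise uninteresting
}

// Tally invisible characters per category, iterating by code point
// (for...of on a string yields whole code points, not UTF-16 units).
function countByCategory(text: string): Record<InvisibleCategory, number> {
  const counts: Record<InvisibleCategory, number> = {
    tag: 0,
    "variation-selector": 0,
    bidi: 0,
    "zero-width": 0,
  };
  for (const ch of text) {
    const category = classifyCodePoint(ch.codePointAt(0)!);
    if (category) counts[category] += 1;
  }
  return counts;
}
```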
### 3. Threshold-based reporting (minimizing false positives)
Not every invisible character is suspicious. Emoji commonly contain variation selectors and zero-width joiners. To avoid noise, the scanner only reports a finding when:
- **10+ consecutive invisible code points** → likely an encoded payload, not normal text
- **Any bidirectional control characters** → worth flagging even in small numbers (Trojan Source risk)
Files with only scattered, isolated invisible characters (emoji, BOM markers) are silently ignored.
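A minimal sketch of the reporting rule, assuming the threshold of 10 and the code point ranges listed earlier. The names `measure`, `shouldReport`, and `RUN_THRESHOLD` are hypothetical; the real implementation may track more state.

```typescript
// Hypothetical sketch of threshold-based reporting.
const RUN_THRESHOLD = 10; // value taken from the PR description

function isBidiControl(cp: number): boolean {
  return (cp >= 0x202a && cp <= 0x202e) || (cp >= 0x2066 && cp <= 0x2069) || cp === 0x061c;
}

function isInvisible(cp: number): boolean {
  return (
    isBidiControl(cp) ||
    (cp >= 0xe0000 && cp <= 0xe007f) || // tag characters
    (cp >= 0xfe00 && cp <= 0xfe0f) || // variation selectors
    (cp >= 0xe0100 && cp <= 0xe01ef) || // variation selectors supplement
    (cp >= 0x200b && cp <= 0x200f) || // zero-width characters
    cp === 0x2060 ||
    cp === 0xfeff
  );
}

interface RunStats {
  total: number;
  longestRun: number;
  bidi: number;
}

// Single pass: count invisible code points and track the longest unbroken run.
function measure(text: string): RunStats {
  let total = 0;
  let longestRun = 0;
  let currentRun = 0;
  let bidi = 0;
  for (const ch of text) {
    const cp = ch.codePointAt(0)!;
    if (isInvisible(cp)) {
      total += 1;
      currentRun += 1;
      if (currentRun > longestRun) longestRun = currentRun;
      if (isBidiControl(cp)) bidi += 1;
    } else {
      currentRun = 0; // any visible character breaks the run
    }
  }
  return { total, longestRun, bidi };
}

// Report on a long run, or on any bidi control at all.
function shouldReport(text: string): boolean {
  const stats = measure(text);
  return stats.longestRun >= RUN_THRESHOLD || stats.bidi > 0;
}
```

Keying on the longest run rather than the total means a file with many emoji spread throughout stays quiet, while a short file with one dense payload is flagged.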
### 4. Actionable warning messages
Warnings include enough context to assess the risk:
```
Invisible Unicode formatting/tag characters detected
(31 invisible char(s); longest consecutive run: 31; tags=31;
U+E0054 TAG CHARACTER, U+E0072 TAG CHARACTER (+2 more)
— long consecutive sequences are suspicious and may indicate
ASCII smuggling or hidden prompt injection)
```
Each warning shows:
- **Total invisible characters** found in the file
- **Longest consecutive run**: the key heuristic (1-2 is probably emoji; 10 or more triggers a finding; 20+ is strongly suspicious)
- **Breakdown by category** (tags, variation selectors, bidi)
- **Contextual hint** explaining the specific risk
- **Evidence line** with invisible chars rendered as readable labels like `<U+202E RIGHT-TO-LEFT OVERRIDE>`
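The evidence rendering could look something like the sketch below. It is an illustration only: the name table is deliberately partial, and `labelFor` and `renderEvidence` are hypothetical names. Only the `<U+XXXX NAME>` label shape is taken from the description above.

```typescript
// Hypothetical sketch: replace invisible characters with readable labels.
// The name table is intentionally incomplete; a real scanner would cover
// every code point it classifies.
const KNOWN_NAMES = new Map<number, string>([
  [0x200b, "ZERO WIDTH SPACE"],
  [0x200d, "ZERO WIDTH JOINER"],
  [0x202e, "RIGHT-TO-LEFT OVERRIDE"],
  [0x2066, "LEFT-TO-RIGHT ISOLATE"],
  [0xfeff, "ZERO WIDTH NO-BREAK SPACE"],
]);

// Return a label for an invisible code point, or null to keep the character.
function labelFor(cp: number): string | null {
  const hex = "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
  if (cp >= 0xe0000 && cp <= 0xe007f) return `<${hex} TAG CHARACTER>`;
  if ((cp >= 0xfe00 && cp <= 0xfe0f) || (cp >= 0xe0100 && cp <= 0xe01ef)) {
    return `<${hex} VARIATION SELECTOR>`;
  }
  const name = KNOWN_NAMES.get(cp);
  return name ? `<${hex} ${name}>` : null;
}

// Walk the line by code point so characters above U+FFFF are labelled as one unit.
function renderEvidence(line: string): string {
  let out = "";
  for (const ch of line) {
    out += labelFor(ch.codePointAt(0)!) ?? ch;
  }
  return out;
}

console.log(renderEvidence("if (isAdmin\u202E) {"));
// prints "if (isAdmin<U+202E RIGHT-TO-LEFT OVERRIDE>) {"
```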
### 5. Non-blocking (warn only)
All findings are reported as **warnings**. Installation always continues. Users can investigate flagged files with `openclaw security audit --deep`.
## Changes
| File | Change |
|------|--------|
| `src/security/skill-scanner.ts` | Invisible Unicode detection engine, text file scanning, consecutive-run tracking, character classification, human-readable evidence rendering |
| `src/security/skill-scanner.test.ts` | 17 new tests covering all character categories, smuggling techniques, threshold behavior, markdown scanning, and false-positive suppression |
## Testing
```
36/36 tests pass (17 new + 19 existing)
Build clean, 0 lint errors
```
**Validated against real skills:**
- 52 bundled OpenClaw skills (59 files scanned) — all clean, zero false positives
### Key test cases
**Markdown scanning:**
- ASCII smuggling hidden in SKILL.md → detected
- Bidi controls in README.md → detected
- Code patterns in .md files (eval, exec) → correctly ignored (not code)
**False positive suppression:**
- Isolated emoji variation selectors → ignored
- Scattered zero-width joiners → ignored
- Clean code and markdown → no findings
**Detection:**
- ASCII smuggling via Unicode tags → detected with consecutive run count
- ASCII smuggling via variation selectors → detected
- Bidi overrides → detected with Trojan Source hint
- Large tag payloads → detected
## References
- [Trojan Source: Invisible Vulnerabilities (CVE-2021-42574)](https://trojansource.codes/)
- [ASCII Smuggler Tool](https://embracethered.com/blog/ascii-smuggler.html) — encode/decode invisible Unicode Tag payloads
- [ASCII Smuggler: Hiding and Finding Text with Unicode Tags](https://embracethered.com/blog/posts/2024/hiding-and-finding-text-with-unicode-tags/) — background on the technique, implications for prompt injection and data exfiltration
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR extends the existing skill/plugin static scanner to also scan non-code “text” files (e.g. `SKILL.md`, `README.md`, `.json`, `.yaml`) for invisible Unicode that can be used for ASCII-smuggling / Trojan Source style obfuscation. It adds a Unicode classification + reporting helper and wires it into `scanSource()` so markdown gets Unicode-only checks while code files still get the existing dangerous code pattern rules.
Integration-wise, this scanner is used by plugin install (`src/plugins/install.ts`), skill install warnings (`src/agents/skills-install.ts`), and the deep security audit (`src/security/audit-extra.ts`), so broadening the scanned extensions directly affects those user-facing warnings.
<h3>Confidence Score: 3/5</h3>
- This PR is close, but has a real regression that will skip code-rule scanning for some TypeScript module extensions.
- The Unicode detection approach is self-contained and covered by tests, but `isCodeFile()` omits `.mts`/`.cts` even though they’re included in `CODE_EXTENSIONS`, causing those files to be treated as non-code and bypass existing critical code-pattern rules. There’s also a correctness issue in code point formatting for non-BMP characters that will produce misleading evidence labels.
- File needing attention: `src/security/skill-scanner.ts`
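The review does not quote the offending code, so the following is only a sketch of the general class of non-BMP formatting bug it describes, not the PR's actual implementation. Function names are hypothetical. Walking a string by UTF-16 code unit splits a character such as U+E0054 into two surrogates, whereas iterating by code point keeps it whole.

```typescript
// Illustrative only: code-unit iteration vs code-point iteration.
function formatCodePoint(cp: number): string {
  return "U+" + cp.toString(16).toUpperCase().padStart(4, "0");
}

// Misleading: charCodeAt returns UTF-16 code units, so one character above
// U+FFFF is reported as two surrogate values.
function labelsByCodeUnit(text: string): string[] {
  const labels: string[] = [];
  for (let i = 0; i < text.length; i++) {
    labels.push(formatCodePoint(text.charCodeAt(i)));
  }
  return labels;
}

// Correct: Array.from iterates a string by whole code points.
function labelsByCodePoint(text: string): string[] {
  return Array.from(text, (ch) => formatCodePoint(ch.codePointAt(0)!));
}

const tagT = "\u{E0054}"; // TAG LATIN CAPITAL LETTER T
console.log(labelsByCodeUnit(tagT)); // two labels: U+DB40 and U+DC54
console.log(labelsByCodePoint(tagT)); // one label: U+E0054
```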
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->