#21074: security(web_fetch): strip hidden content to prevent indirect prompt injection

by hydro13 open 2026-02-19 16:32

agents size: M

Cluster: Web Search Provider Enhancements

## Problem

`web_fetch` extracts content from HTML pages into the agent's context. Hidden elements (invisible to humans but present in extracted text) create an indirect prompt injection vector. See #8027 for the full description.

I found several gaps while [reviewing PR #8114](https://github.com/openclaw/openclaw/pull/8114#pullrequestreview-2702612959) which addresses the same issue. This PR takes a standalone approach with broader coverage of real-world hiding techniques.

## What this PR adds

A sanitization layer that strips human-invisible content before Readability processes the HTML.

### Detection vectors

**CSS inline styles:**
- `display:none`, `visibility:hidden`, `opacity:0`, `font-size:0`
- `text-indent:-9999px`, offscreen positioning (`left/top:-9999px`)
- `color:transparent`, `color:rgba(r,g,b,0)`, `color:hsla(h,s,l,0)`
- `transform:scale(0)`, `transform:translateX/Y(-9999px)`
- `clip-path:inset(100%)`, `width:0;height:0;overflow:hidden`

**CSS class-based hiding** (the most common real-world pattern):
- `.sr-only`, `.visually-hidden`, `.d-none`, `.hidden`, `.invisible`, `.screen-reader-only`, `.offscreen`
- Uses Set-based token matching (split on whitespace) to avoid false positives on compound class names like `un-hidden`

**HTML attributes:** `aria-hidden="true"`, `hidden`, `input[type=hidden]`

**Non-content tags:** `meta`, `template`, `svg`, `canvas`, `iframe`, `object`, `embed`

**Invisible Unicode:** zero-width characters (U+200B-U+200F), directional overrides (U+202A-U+202E), formatting chars (U+2060-U+2064, U+206A-U+206F), BOM (U+FEFF), Unicode tag block (U+E0000-U+E007F)

**HTML comments**

### Differences from #8114

- **Class-based hiding**: #8114 only checks inline styles. This PR detects common CSS framework classes (Bootstrap, Tailwind, accessibility utilities)
- **`color:transparent` / `rgba(r,g,b,0)`**: not covered in #8114
- **`transform:translateX/Y(-9999px)`**: offscreen via transform, not just `position:absolute` + `left`
- **`<meta>` tag stripping**: prevents injection via meta content attributes
- **Lazy linkedom import**: uses `await import("linkedom")` consistent with the existing lazy-loading pattern in `web-fetch-utils.ts`, avoiding eager double-imports
- **Set-based class matching**: avoids regex word-boundary false positives

### Files changed

- **`web-fetch-visibility.ts`** (new): `sanitizeHtml()` and `stripInvisibleUnicode()`
- **`web-fetch-utils.ts`** (modified): integrates sanitization before Readability, unicode stripping on text output
- **`web-fetch-visibility.test.ts`** (new): 35 tests covering all detection vectors

### Design decisions

- Uses linkedom (existing dependency) for DOM parsing, so no new deps
- `sanitizeHtml` is async with lazy import, matching codebase conventions
- Bottom-up DOM traversal to avoid re-walking removed subtrees
- Class-based detection uses `Set` with whitespace-split tokens (no regex word boundary issues)
- `stripInvisibleUnicode` runs on final text output to catch anything that survives HTML processing

Closes #8027
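
### Illustrative sketch

For reviewers who want a concrete picture of two of the points above (Set-based class matching and invisible-Unicode stripping), here is a minimal sketch. It is not the code from `web-fetch-visibility.ts`: the names `HIDDEN_CLASS_TOKENS`, `hasHiddenClass`, `INVISIBLE_UNICODE`, and `stripInvisible` are hypothetical, and the class list and code point ranges are simply copied from the description.

```typescript
// Hypothetical sketch; names and structure are assumptions, not the PR's actual code.

// Class tokens listed under "CSS class-based hiding" in the description.
const HIDDEN_CLASS_TOKENS: ReadonlySet<string> = new Set([
  "sr-only",
  "visually-hidden",
  "d-none",
  "hidden",
  "invisible",
  "screen-reader-only",
  "offscreen",
]);

// Split the class attribute on whitespace and look up each whole token.
// A compound name such as "un-hidden" is a single token and does not match.
function hasHiddenClass(classAttr: string): boolean {
  return classAttr.split(/\s+/).some((token) => HIDDEN_CLASS_TOKENS.has(token));
}

// For contrast: a word-boundary regex treats "-" as a boundary,
// so it reports a match inside "un-hidden".
const naiveRegexMatch = /\bhidden\b/.test("un-hidden"); // true (false positive)

// Character class built from the ranges listed under "Invisible Unicode".
// The `u` flag is required for the \u{...} escapes of the Unicode tag block.
const INVISIBLE_UNICODE =
  /[\u200B-\u200F\u202A-\u202E\u2060-\u2064\u206A-\u206F\uFEFF\u{E0000}-\u{E007F}]/gu;

// Remove every invisible code point from already-extracted text.
function stripInvisible(text: string): string {
  return text.replace(INVISIBLE_UNICODE, "");
}

console.log(hasHiddenClass("btn sr-only"));    // true
console.log(hasHiddenClass("un-hidden"));      // false
console.log(naiveRegexMatch);                  // true
console.log(stripInvisible("ig\u200Bnore"));   // "ignore"
```

The Set lookup compares whole tokens only, which is why it sidesteps the word-boundary problem without any extra escaping or anchoring logic.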