#19726: Fix HTML entity decoding for astral code points and surrogate-safe truncation
agents
size: S
trusted-contributor
Cluster: Surrogate Pair Handling Fixes
## Problem
Two bugs in `web-fetch-utils.ts`:
**1. `decodeEntities` corrupts emoji HTML entities**
`String.fromCharCode` truncates its argument to 16 bits, so it silently produces garbage for code points above U+FFFF. For example, `&#128512;` (😀) or `&#x1F600;` decode to the wrong character instead of the actual emoji. This affects any fetched page that uses numeric HTML entities for emoji or CJK Extension B characters.
**Fix:** Replace with `String.fromCodePoint` and add a bounds check for valid Unicode range (0–0x10FFFF).
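As a rough illustration of the fix described above, the following sketch decodes only numeric entities with `String.fromCodePoint` and a 0–0x10FFFF bounds check. The helper name `decodeNumericEntities`, the regular expressions, and the choice to leave out-of-range entities untouched are assumptions made for this example; the actual `decodeEntities` in the PR may differ.

```typescript
// Hypothetical helper, not the PR's actual code: handles numeric entities only.
function decodeNumericEntities(input: string): string {
  const toChar = (match: string, digits: string, radix: number): string => {
    const cp = Number.parseInt(digits, radix);
    // String.fromCodePoint throws a RangeError outside 0..0x10FFFF,
    // so out-of-range entities are returned unchanged (an assumption of this sketch).
    if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) return match;
    return String.fromCodePoint(cp);
  };
  return input
    .replace(/&#x([0-9a-f]+);/gi, (m: string, hex: string) => toChar(m, hex, 16))
    .replace(/&#(\d+);/g, (m: string, dec: string) => toChar(m, dec, 10));
}
```

With this sketch, both `&#128512;` and `&#x1F600;` decode to 😀, while an out-of-range entity such as `&#x110000;` is passed through as literal text.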
**2. `truncateText` splits UTF-16 surrogate pairs**
`.slice(0, maxChars)` can cut between a high surrogate (0xD800–0xDBFF) and its low surrogate, producing an invalid string with a lone surrogate. Downstream consumers (JSON serialization, logging, LLM APIs) may choke on or silently corrupt the result.
**Fix:** Detect when the last character before the cut is a high surrogate and step back one position.
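The step-back logic can be sketched as follows. This is a minimal version under stated assumptions: the function name `truncateSurrogateSafe` is hypothetical, and the real `truncateText` may return extra metadata or handle other cases that are not shown here.

```typescript
// Hypothetical helper illustrating the described fix; the real truncateText may differ.
function truncateSurrogateSafe(text: string, maxChars: number): string {
  if (maxChars <= 0) return "";
  if (text.length <= maxChars) return text;
  let end = maxChars;
  const last = text.charCodeAt(end - 1);
  // A high surrogate (0xD800-0xDBFF) just before the cut would lose its low half,
  // so step back one code unit and drop the whole pair instead.
  if (last >= 0xd800 && last <= 0xdbff) end -= 1;
  return text.slice(0, end);
}
```

For the input `"a😀b"` (four UTF-16 code units), a limit of 2 yields `"a"` rather than `"a"` followed by a lone high surrogate, and a limit of 3 keeps the full emoji.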
## Tests
Added `web-fetch-utils.test.ts` with 14 tests covering:
- Basic HTML entity decoding (`&amp;`, `&lt;`, etc.)
- Numeric entities for BMP and astral code points
- Invalid code point handling
- Title extraction and script/style stripping
- Surrogate pair preservation during truncation
- Edge cases (empty string, zero maxChars)
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR fixes two real bugs in `web-fetch-utils.ts`: (1) `decodeEntities` now uses `String.fromCodePoint` instead of `String.fromCharCode` to correctly decode astral code points (emoji, CJK Extension B, etc.), and (2) `truncateText` now avoids splitting UTF-16 surrogate pairs when truncating.
- Both fixes are well-targeted and correct for the common cases
- A new test file with 14 tests provides good coverage of the changes
- **Issue found**: The `decodeEntities` bounds check (`cp >= 0 && cp <= 0x10ffff`) allows surrogate code points (0xD800–0xDFFF) through, which will cause `String.fromCodePoint` to throw a `RangeError`, crashing the function on malformed HTML entities like `&#xD800;`
<h3>Confidence Score: 3/5</h3>
- PR improves correctness for common cases but introduces a crash path on surrogate code point entities that should be fixed before merge.
- The core changes (fromCharCode → fromCodePoint, surrogate-safe truncation) are correct and well-tested. However, the missing exclusion of surrogate code points (0xD800–0xDFFF) from the bounds check means `String.fromCodePoint` can throw on malformed HTML, which is a regression in robustness since the old `String.fromCharCode` would silently produce garbage rather than crash.
- `src/agents/tools/web-fetch-utils.ts` — the `decodeEntities` function needs surrogate code point exclusion in both regex replacements
<sub>Last reviewed commit: 95a78df</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
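The review asks for surrogate code points to be excluded from the `decodeEntities` bounds check. One possible way to express that, sketched here with a hypothetical helper name `isScalarValue`, is a predicate that accepts only Unicode scalar values, so that a decoder never emits a lone surrogate. What the decoder should do with a rejected entity (leave it as text, or substitute U+FFFD) is left open in this sketch.

```typescript
// Hypothetical predicate: true only for Unicode scalar values,
// i.e. code points in 0..0x10FFFF excluding the surrogate range 0xD800..0xDFFF.
function isScalarValue(cp: number): boolean {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) return false;
  return cp < 0xd800 || cp > 0xdfff;
}
```

A decoder could call this check in both the decimal and hexadecimal replacement callbacks before passing the value to `String.fromCodePoint`.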
Most Similar PRs
- #16894: Fix text truncation splitting surrogate pairs in web-fetch, subagen... (by Clawborn · 2026-02-15 · 83.0%)
- #20023: Fix surrogate pair splitting in channel metadata truncation (by Clawborn · 2026-02-18 · 78.4%)
- #20423: fix(web-fetch): cap htmlToMarkdown input size to prevent catastroph... (by Limitless2023 · 2026-02-18 · 77.0%)
- #11880: fix: guard decodeURIComponent against malformed percent-encoding in... (by Yida-Dev · 2026-02-08 · 74.4%)
- #3921: fix: sanitize fetch headers to prevent ByteString crash on Unicode ... (by nexiouscaliver · 2026-01-29 · 74.0%)
- #17686: fix(memory): support non-ASCII characters in FTS query tokenization (by Phineas1500 · 2026-02-16 · 73.6%)
- #11101: fix: handle AbortError and WebSocket 1006 in unhandled rejection ha... (by Nipurn123 · 2026-02-07 · 72.7%)
- #20496: test(utils): add comprehensive unit tests for utility functions (by masifislamm · 2026-02-19 · 72.6%)
- #5923: fix(security): add input encoding detection and obfuscation decoder (by dan-redcupit · 2026-02-01 · 72.5%)
- #19675: fix(security): prevent zero-width Unicode chars from bypassing boun... (by williamzujkowski · 2026-02-18 · 72.4%)