#19726: Fix HTML entity decoding for astral code points and surrogate-safe truncation

by Clawborn open 2026-02-18 03:54
agents size: S trusted-contributor
## Problem

Two bugs in `web-fetch-utils.ts`:

**1. `decodeEntities` corrupts emoji HTML entities**

`String.fromCharCode` silently produces garbage for code points above U+FFFF. For example, `&#x1F600;` (😀) or `&#128512;` get decoded into lone surrogates instead of the actual emoji. This affects any fetched page that uses numeric HTML entities for emoji or CJK Extension B characters.

**Fix:** Replace with `String.fromCodePoint` and add a bounds check for valid Unicode range (0–0x10FFFF).

**2. `truncateText` splits UTF-16 surrogate pairs**

`.slice(0, maxChars)` can cut between a high surrogate (0xD800–0xDBFF) and its low surrogate, producing an invalid string with a lone surrogate. Downstream consumers (JSON serialization, logging, LLM APIs) may choke on or silently corrupt the result.

**Fix:** Detect when the last character before the cut is a high surrogate and step back one position.

## Tests

Added `web-fetch-utils.test.ts` with 14 tests covering:

- Basic HTML entity decoding (`&amp;`, `&lt;`, etc.)
- Numeric entities for BMP and astral code points
- Invalid code point handling
- Title extraction and script/style stripping
- Surrogate pair preservation during truncation
- Edge cases (empty string, zero maxChars)

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR fixes two real bugs in `web-fetch-utils.ts`: (1) `decodeEntities` now uses `String.fromCodePoint` instead of `String.fromCharCode` to correctly decode astral code points (emoji, CJK Extension B, etc.), and (2) `truncateText` now avoids splitting UTF-16 surrogate pairs when truncating.

- Both fixes are well-targeted and correct for the common cases
- A new test file with 14 tests provides good coverage of the changes
- **Issue found**: The `decodeEntities` bounds check (`cp >= 0 && cp <= 0x10ffff`) allows surrogate code points (0xD800–0xDFFF) through, which will cause `String.fromCodePoint` to throw a `RangeError`, crashing the function on malformed HTML entities like `&#xD800;`

<h3>Confidence Score: 3/5</h3>

- PR improves correctness for common cases but introduces a crash path on surrogate code point entities that should be fixed before merge.
- The core changes (fromCharCode → fromCodePoint, surrogate-safe truncation) are correct and well-tested. However, the missing exclusion of surrogate code points (0xD800–0xDFFF) from the bounds check means `String.fromCodePoint` can throw on malformed HTML, which is a regression in robustness since the old `String.fromCharCode` would silently produce garbage rather than crash.
- `src/agents/tools/web-fetch-utils.ts`: the `decodeEntities` function needs surrogate code point exclusion in both regex replacements

<sub>Last reviewed commit: 95a78df</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
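For illustration, the two fixes and the requested surrogate exclusion could look roughly like the sketch below. This is not the code from the PR: the names `isScalarValue`, `decodeNumericEntities`, and `truncateSurrogateSafe` are hypothetical, the regexes are assumed, and leaving an invalid entity as literal text is only one possible policy. One detail worth checking against the review: per the ECMAScript specification, `String.fromCodePoint` throws a `RangeError` for non-integers and for values outside 0–0x10FFFF, while a surrogate value such as 0xD800 returns a lone surrogate string. Excluding the surrogate range is useful in either case, because it keeps lone surrogates out of the decoded output.

```typescript
// Hypothetical sketch; names and the "leave invalid entities as-is" policy
// are assumptions, not taken from web-fetch-utils.ts.

/** True when cp is a Unicode scalar value: in range and not a surrogate. */
function isScalarValue(cp: number): boolean {
  return (
    Number.isInteger(cp) &&
    cp >= 0 &&
    cp <= 0x10ffff &&
    !(cp >= 0xd800 && cp <= 0xdfff)
  );
}

/** Decode hex and decimal numeric entities; invalid ones stay as literal text. */
function decodeNumericEntities(input: string): string {
  return input
    .replace(/&#x([0-9a-f]+);/gi, (match, hex: string) => {
      const cp = Number.parseInt(hex, 16);
      return isScalarValue(cp) ? String.fromCodePoint(cp) : match;
    })
    .replace(/&#(\d+);/g, (match, dec: string) => {
      const cp = Number.parseInt(dec, 10);
      return isScalarValue(cp) ? String.fromCodePoint(cp) : match;
    });
}

/** Truncate to at most maxChars UTF-16 code units without splitting a pair. */
function truncateSurrogateSafe(text: string, maxChars: number): string {
  if (maxChars <= 0) return "";
  if (text.length <= maxChars) return text;
  let end = maxChars;
  const last = text.charCodeAt(end - 1);
  // If the cut would land right after a high surrogate, step back one unit.
  if (last >= 0xd800 && last <= 0xdbff) end -= 1;
  return text.slice(0, end);
}
```

With this sketch, `decodeNumericEntities("&#x1F600;")` and `decodeNumericEntities("&#128512;")` both yield 😀, `decodeNumericEntities("&#xD800;")` returns the input unchanged, and `truncateSurrogateSafe("a😀b", 2)` returns `"a"` rather than `"a"` followed by a lone high surrogate.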
