#20423: fix(web-fetch): cap htmlToMarkdown input size to prevent catastrophic backtracking on large HTML
app: web-ui
agents
size: XS
## Problem
`web_fetch` hangs indefinitely on pages larger than about 1 MB and never returns, even when
`maxChars: 5000` is set. The subagent timeout (5+ minutes) never fires either; the log shows a tool start with no matching end:
```
21:03:14.411 tool start: web_fetch toolCallId=toolu_01NMg...
[NO TOOL END — hung indefinitely until manual kill]
```
## Root Cause
`htmlToMarkdown()` in `web-fetch-utils.ts` runs several non-greedy `[\s\S]*?` regexes
on the **full HTML body** including:
```ts
.replace(/<script[\s\S]*?<\/script>/gi, "") // can backtrack on unclosed tags
.replace(/<a\s+[^>]*href=["']...[\s\S]*?<\/a>/gi, ...) // complex, high risk
.replace(/<h([1-6])[^>]*>([\s\S]*?)<\/h\1>/gi, ...)
.replace(/<li[^>]*>([\s\S]*?)<\/li>/gi, ...)
```
When the HTML exceeds `READABILITY_MAX_HTML_CHARS` (1 MB), `extractReadableContent()`
skips the Readability path and calls `fallback()` → `htmlToMarkdown(html)` on the
**entire 1MB+ input**. For pages with unclosed or deeply nested tags (common in
Next.js/React-rendered HTML), each unmatched opening tag makes the lazy quantifier
scan to the end of the input before failing, so these patterns can exhibit **O(n²)
backtracking**. The regexes run synchronously, so the event loop stalls for as long
as they run, which would also explain why the timeout never fires.
The Anthropic docs page used as a repro returns 1,090,956 bytes (> 1MB threshold),
which triggers the fallback path and the hang.
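
The scaling described above can be observed with a small timing sketch. It assumes a hypothetical worst-case input made only of unclosed `<script>` tags; `buildUnclosedScripts` and `timeStrip` are illustrative names, not functions from `web-fetch-utils.ts`, and absolute timings will vary by machine and engine.

```ts
// Same shape as the script-stripping pattern quoted in the Root Cause section.
const SCRIPT_RE = /<script[\s\S]*?<\/script>/gi;

// Hypothetical worst case: `count` opening tags and no closing tag at all.
function buildUnclosedScripts(count: number): string {
  return "<script>x".repeat(count);
}

// Runs the replace once and reports how long it took (millisecond resolution).
function timeStrip(html: string): { ms: number; output: string } {
  const start = Date.now();
  const output = html.replace(SCRIPT_RE, "");
  return { ms: Date.now() - start, output };
}

// Every unmatched "<script" forces a scan to the end of the string, so the
// work grows with (number of openers) x (input length). Doubling the input
// is therefore expected to cost clearly more than double the time.
for (const count of [1000, 2000, 4000]) {
  const html = buildUnclosedScripts(count);
  const { ms } = timeStrip(html);
  console.log(`${html.length} chars: ${ms} ms`);
}
```

The inputs here are kept to a few tens of kilobytes so the sketch finishes quickly; the same pattern on a 1MB+ body does proportionally far more work.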
## Fix
Capture the `<title>` tag first (near the start of `<head>`), then **truncate HTML
to 500KB** before running any regex-based passes. For a 1MB page this roughly halves
the processing surface; for the reported repro URL (1,090,956 bytes), it reduces the
at-risk input by ~54%.
Partial extraction on large pages is the intended tradeoff — pages with content
past the 500KB mark may see truncated output, but the tool will always complete.
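
A minimal sketch of the capture-then-truncate order, not the PR's actual diff. The constant value (500,000 characters) is an assumption based on the "500KB" figure, and `capHtmlForMarkdown` and `CappedHtml` are hypothetical names introduced for illustration.

```ts
// Assumed value: the PR says "500KB"; the exact constant may differ.
const HTML_TO_MARKDOWN_MAX_CHARS = 500 * 1000;

interface CappedHtml {
  title: string | undefined;
  html: string;
  truncated: boolean;
}

function capHtmlForMarkdown(html: string): CappedHtml {
  // Read the title from the full input so it is captured before any cut.
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  const title = match?.[1]?.trim();

  // Bound the input that the later regex passes will see.
  const truncated = html.length > HTML_TO_MARKDOWN_MAX_CHARS;
  return {
    title,
    html: truncated ? html.slice(0, HTML_TO_MARKDOWN_MAX_CHARS) : html,
    truncated,
  };
}

// Usage: a synthetic page roughly the size of the repro response.
const page = "<head><title> Docs </title></head>" + "x".repeat(1090956);
const capped = capHtmlForMarkdown(page);
console.log(capped.title, capped.html.length, capped.truncated);
```

With this assumed cap, the reduction for the repro page works out to 1 - 500000/1090956 ≈ 0.54, which matches the ~54% figure. Note that `String.prototype.length` counts UTF-16 code units rather than bytes, so the byte-based figure is approximate.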
## Alternative considered
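A more complete fix would replace the non-greedy `[\s\S]*?` patterns with
non-backtracking equivalents or a real HTML parser. JavaScript's `RegExp` has no
atomic groups or possessive quantifiers, so the former would have to be emulated or
hand-written. That would be a larger refactor; this PR
provides a minimal, safe stopgap that prevents hangs in production.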
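
As an illustration of what a non-backtracking replacement could look like for one of the patterns, the sketch below strips `<script>` blocks with two simple literal searches instead of a lazy quantifier. `stripScriptBlocks` is a hypothetical helper, not part of this PR; it is meant to produce the same output as `/<script[\s\S]*?<\/script>/gi` while stopping at the first opener that has no closing tag.

```ts
function stripScriptBlocks(html: string): string {
  const openRe = /<script/gi;
  const closeRe = /<\/script>/gi;
  let out = "";
  let pos = 0; // start of the text not yet copied to `out`

  for (;;) {
    // Find the next opening tag at or after `pos`.
    openRe.lastIndex = pos;
    const open = openRe.exec(html);
    if (open === null) break;

    // Find the first closing tag after that opener.
    closeRe.lastIndex = open.index + open[0].length;
    const close = closeRe.exec(html);
    // No closing tag remains, so no later opener can match either: stop
    // here instead of rescanning the tail once per opener.
    if (close === null) break;

    out += html.slice(pos, open.index);
    pos = close.index + close[0].length;
  }
  return out + html.slice(pos);
}

console.log(stripScriptBlocks("a<script>x</script>b<SCRIPT src=1></SCRIPT>c"));
```

The early `break` is the design point: once one search for a closing tag fails, the remaining input is copied through untouched, so unclosed tags cost a single extra scan rather than one scan each.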
Fixes #20385
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR addresses two issues: (1) `htmlToMarkdown()` hanging indefinitely on large (1MB+) HTML pages due to catastrophic backtracking in `[\s\S]*?` regex patterns, and (2) `marked.parse()` silently losing `gfm`/`breaks` options when an options object is passed in marked v17+ (which creates an isolated context that doesn't inherit `setOptions()` globals).
- **`web-fetch-utils.ts`**: Adds a 500KB cap (`HTML_TO_MARKDOWN_MAX_CHARS`) on the HTML input to `htmlToMarkdown()` before the expensive regex passes. The `<title>` extraction runs on the full input first (safe since `<title>` is early in `<head>` and the regex is O(n) for this simple pattern). This is an effective stopgap — a proper fix would replace the regex-based HTML conversion with a parser, but this prevents production hangs.
- **`ui/src/ui/markdown.ts`**: Explicitly passes `gfm: true, breaks: true` to the `marked.parse()` call that already receives a `renderer` option. With `marked` v17.0.3+, passing any options object creates an isolated context, so the global `setOptions()` values were silently dropped. This restores the intended GFM and line-break rendering behavior.
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — both changes are minimal, well-scoped, and address real production issues.
- Score of 4 reflects that both fixes are correct and low-risk. The HTML truncation cap is a pragmatic stopgap (not a complete fix) that may truncate content on very large pages, but this is an intentional and documented tradeoff. The marked options fix is straightforward and correct for v17+. No new tests were added, but the changes are defensive in nature and unlikely to introduce regressions.
- No files require special attention. Both changes are minimal and well-commented.
<sub>Last reviewed commit: 56510b2</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
#20419: fix(webchat): explicitly pass gfm and breaks options to marked.parse()
by Limitless2023 · 2026-02-18
79.0%
#15251: feat(web-fetch): send Accept: text/markdown header for Cloudflare M...
by wujieli0207 · 2026-02-13
77.9%
#15530: docs(web_fetch): document markdown-first Accept header and cf-markd...
by novavale · 2026-02-13
77.3%
#19726: Fix HTML entity decoding for astral code points and surrogate-safe ...
by Clawborn · 2026-02-18
77.0%
#9710: fix(ui): prevent CPU spike when opening large tool outputs (#9700)
by divol89 · 2026-02-05
76.8%
#16590: fix(web-fetch): use bot UA for markdown to enable Cloudflare LLM co...
by Imccccc · 2026-02-14
75.9%
#15414: feat(web-fetch): add Accept: text/markdown header for Cloudflare Ma...
by aldoeliacim · 2026-02-13
75.4%
#16733: fix(ui): avoid injected newlines when tool output is hidden
by jp117 · 2026-02-15
73.0%
#19675: fix(security): prevent zero-width Unicode chars from bypassing boun...
by williamzujkowski · 2026-02-18
72.9%
#6260: fix(tui): prevent width overflow crashes from nested ANSI escape codes
by 0xktn · 2026-02-01
72.8%