
#20423: fix(web-fetch): cap htmlToMarkdown input size to prevent catastrophic backtracking on large HTML

by Limitless2023 · open · 2026-02-18 22:14
Labels: app: web-ui, agents, size: XS
## Problem

`web_fetch` hangs indefinitely on ~1MB+ pages and never returns, even when `maxChars: 5000` is set. The 5+ minute subagent timeout does not fire:

```
21:03:14.411 tool start: web_fetch toolCallId=toolu_01NMg...
[NO TOOL END — hung indefinitely until manual kill]
```

## Root Cause

`htmlToMarkdown()` in `web-fetch-utils.ts` runs several non-greedy `[\s\S]*?` regexes on the **full HTML body**, including:

```ts
.replace(/<script[\s\S]*?<\/script>/gi, "")            // can backtrack on unclosed tags
.replace(/<a\s+[^>]*href=["']...[\s\S]*?<\/a>/gi, ...) // complex, high risk
.replace(/<h([1-6])[^>]*>([\s\S]*?)<\/h\1>/gi, ...)
.replace(/<li[^>]*>([\s\S]*?)<\/li>/gi, ...)
```

When the HTML exceeds `READABILITY_MAX_HTML_CHARS` (1 MB), `extractReadableContent()` skips the Readability path and calls `fallback()` → `htmlToMarkdown(html)` on the **entire 1MB+ input**. For pages with unclosed or deeply nested tags (common in Next.js/React-rendered HTML), these patterns can exhibit **O(n²) catastrophic backtracking**, stalling the event loop indefinitely. The Anthropic docs page used as a repro returns 1,090,956 bytes (just over the 1MB threshold), which triggers the fallback path and the hang.

## Fix

Capture the `<title>` tag first (it sits near the start of `<head>`), then **truncate the HTML to 500KB** before running any regex-based passes. For a 1MB page this halves the processing surface; for the reported repro (1,090,956 bytes), it cuts the at-risk input by ~54%. Partial extraction on large pages is the intended tradeoff: pages with content past the 500KB mark may see truncated output, but the tool always completes.

## Alternative considered

A proper fix would replace the non-greedy `[\s\S]*?` patterns with atomic/possessive alternatives or a real HTML parser. That would be a larger refactor; this PR provides a minimal, safe stopgap that prevents hangs in production.
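The cap described above can be sketched as follows. `HTML_TO_MARKDOWN_MAX_CHARS` is the constant named in the PR; the helper name `capHtmlForMarkdown` and the exact `<title>` regex are illustrative, not the actual implementation:

```typescript
// Sketch of the stopgap: grab <title> from the full input (a simple
// linear-time match), then truncate before the backtracking-prone passes.
const HTML_TO_MARKDOWN_MAX_CHARS = 500_000; // 500KB cap (name from the PR)

function capHtmlForMarkdown(html: string): { title: string | null; body: string } {
  // <title> sits near the start of <head>; [^<]* cannot backtrack
  // catastrophically, so matching on the uncapped input is safe.
  const titleMatch = html.match(/<title[^>]*>([^<]*)<\/title>/i);
  const title = titleMatch ? titleMatch[1].trim() : null;

  // Truncate before any [\s\S]*? regex passes run on the body.
  const body =
    html.length > HTML_TO_MARKDOWN_MAX_CHARS
      ? html.slice(0, HTML_TO_MARKDOWN_MAX_CHARS)
      : html;

  return { title, body };
}
```

Because the title is captured before truncation, even a page whose `<body>` is cut off still reports its correct title in the markdown output.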
Fixes #20385

### Greptile Summary

This PR addresses two issues: (1) `htmlToMarkdown()` hanging indefinitely on large (1MB+) HTML pages due to catastrophic backtracking in `[\s\S]*?` regex patterns, and (2) `marked.parse()` silently losing `gfm`/`breaks` options when an options object is passed in marked v7+ (which creates an isolated context that does not inherit `setOptions()` globals).

- **`web-fetch-utils.ts`**: Adds a 500KB cap (`HTML_TO_MARKDOWN_MAX_CHARS`) on the HTML input to `htmlToMarkdown()` before the expensive regex passes. The `<title>` extraction runs on the full input first (safe, since `<title>` appears early in `<head>` and this simple pattern matches in linear time). This is an effective stopgap; a proper fix would replace the regex-based HTML conversion with a parser, but this prevents production hangs.
- **`ui/src/ui/markdown.ts`**: Explicitly passes `gfm: true, breaks: true` to the `marked.parse()` call that already receives a `renderer` option. With marked v17.0.3+, passing any options object creates an isolated context, so the global `setOptions()` values were silently dropped. This restores the intended GFM and line-break rendering behavior.

### Confidence Score: 4/5

- This PR is safe to merge: both changes are minimal, well scoped, and address real production issues.
- The score of 4 reflects that both fixes are correct and low risk. The HTML truncation cap is a pragmatic stopgap (not a complete fix) that may truncate content on very large pages, but this is an intentional and documented tradeoff. The marked options fix is straightforward and correct for v17+. No new tests were added, but the changes are defensive in nature and unlikely to introduce regressions.
- No files require special attention. Both changes are minimal and well commented.

Last reviewed commit: 56510b2
