
#7454: fix: skip UTF-16 heuristic for audio/video/image MIME types (#7444)

by gavinbmoore open 2026-02-02 21:15
## Problem

Voice messages (Opus/Ogg) have ~39% null bytes, which triggers the UTF-16 detection heuristic in `resolveUtf16Charset()`. This causes binary garbage (200KB+) to be dumped into conversation context.

## Solution

Add an early return in `resolveUtf16Charset()` to skip the null-byte heuristic for known binary MIME types:

- `audio/*`
- `video/*`
- `image/*`

BOM detection is preserved since it's always reliable (explicit markers).

## Changes

1. Added `mimeHint` parameter to `resolveUtf16Charset()`
2. Added early return for binary MIME types before the null-byte heuristic
3. Updated `decodeTextSample()` to pass through the `mimeHint`
4. Updated call sites to pass `rawMime`

## Testing

- `pnpm tsgo` ✅ passes
- `pnpm lint` ✅ passes (0 warnings, 0 errors)

Fixes #7444

## Greptile Overview

### Greptile Summary

This PR adjusts text/binary detection in `src/media-understanding/apply.ts` to avoid misclassifying binary attachments (notably voice messages like Opus/Ogg) as UTF-16 based on null-byte density.

`resolveUtf16Charset()` now accepts a `mimeHint` and skips the null-byte heuristic for `audio/*`, `video/*`, and `image/*` while preserving BOM-based UTF-16 detection. Call sites were updated to pass through the detected/declared MIME (`rawMime`) so the heuristic can be gated appropriately, preventing large binary garbage from being included in conversation context.

### Confidence Score: 5/5

- This PR is safe to merge with minimal risk.
- Changes are localized to MIME-gated UTF-16 detection and only relax a heuristic for clearly binary media types while preserving BOM detection; call-site updates are straightforward and covered by existing lint/typecheck per PR notes.
- No files require special attention
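The gating described in this PR can be illustrated with a minimal sketch. This is not the actual code from `src/media-understanding/apply.ts`: the return type, the `isBinaryMime` helper, the `NULL_RATIO_THRESHOLD` value, and the even/odd null-position rule for choosing endianness are all assumptions made for illustration. Only the function name `resolveUtf16Charset()`, the `mimeHint` parameter, the three MIME prefixes, and the "BOM first, heuristic second" ordering come from the description above.

```typescript
// Hypothetical sketch; the real resolveUtf16Charset() may differ in signature and details.
type Utf16Charset = "utf-16le" | "utf-16be";

// Assumed threshold: the PR does not state the value the real heuristic uses.
const NULL_RATIO_THRESHOLD = 0.2;

// Hypothetical helper: true for the MIME families the PR treats as binary.
function isBinaryMime(mimeHint?: string): boolean {
  if (!mimeHint) return false;
  const mime = mimeHint.trim().toLowerCase();
  return (
    mime.startsWith("audio/") ||
    mime.startsWith("video/") ||
    mime.startsWith("image/")
  );
}

function resolveUtf16Charset(
  sample: Uint8Array,
  mimeHint?: string,
): Utf16Charset | undefined {
  // 1. BOM detection always runs: an explicit marker is reliable for any MIME type.
  if (sample.length >= 2) {
    if (sample[0] === 0xff && sample[1] === 0xfe) return "utf-16le";
    if (sample[0] === 0xfe && sample[1] === 0xff) return "utf-16be";
  }

  // 2. Early return: skip the null-byte heuristic for known binary MIME types.
  if (isBinaryMime(mimeHint)) return undefined;

  // 3. Null-byte heuristic (assumed form): many nulls suggests UTF-16 text.
  if (sample.length < 2) return undefined;
  let evenNulls = 0;
  let oddNulls = 0;
  for (let i = 0; i < sample.length; i++) {
    if (sample[i] === 0) {
      if (i % 2 === 0) evenNulls++;
      else oddNulls++;
    }
  }
  const nullRatio = (evenNulls + oddNulls) / sample.length;
  if (nullRatio < NULL_RATIO_THRESHOLD) return undefined;

  // ASCII-range text in UTF-16LE puts the zero byte second (odd offsets);
  // in UTF-16BE it comes first (even offsets).
  return oddNulls >= evenNulls ? "utf-16le" : "utf-16be";
}
```

Under these assumptions, a null-heavy buffer passed with a hint such as `audio/ogg` yields `undefined`, while the same buffer with no hint would still be classified by the heuristic. Placing the MIME check after the BOM check mirrors the PR's stated intent of keeping BOM detection intact.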
