#7454: fix: skip UTF-16 heuristic for audio/video/image MIME types (#7444)
Cluster: Media Handling Improvements
## Problem
Voice messages (Opus/Ogg) consist of roughly 39% null bytes, which trips the UTF-16 detection heuristic in `resolveUtf16Charset()`. The attachment is then treated as UTF-16 text, and 200KB+ of binary garbage is dumped into the conversation context.
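To illustrate why null-byte density is a weak signal for binary media, here is a minimal sketch of a density-based check. The helper names (`nullByteRatio`, `looksLikeUtf16`), the 0.3 threshold, and the sample bytes are assumptions made for illustration; they are not taken from `resolveUtf16Charset()` itself.

```typescript
// Hypothetical illustration only: names, threshold, and sample data are assumptions.

// Fraction of bytes in the buffer that are 0x00.
function nullByteRatio(buf: Uint8Array): number {
  if (buf.length === 0) return 0;
  let zeros = 0;
  for (let i = 0; i < buf.length; i++) {
    if (buf[i] === 0) zeros++;
  }
  return zeros / buf.length;
}

// Naive check: ASCII text stored as UTF-16 has a zero in every other byte,
// so a high share of null bytes "looks like" UTF-16.
function looksLikeUtf16(buf: Uint8Array, threshold = 0.3): boolean {
  return nullByteRatio(buf) >= threshold;
}

// Genuine UTF-16LE text: "hi" is 68 00 69 00 (50% null bytes).
const utf16Text = new Uint8Array([0x68, 0x00, 0x69, 0x00]);

// Synthetic stand-in for a binary audio payload: 4 of 10 bytes are null.
const fakeAudio = new Uint8Array([
  0x4f, 0x67, 0x67, 0x53, 0x00, 0x02, 0x00, 0x00, 0x00, 0xa1,
]);

console.log(looksLikeUtf16(utf16Text)); // true (correct)
console.log(looksLikeUtf16(fakeAudio)); // true (false positive)
```

Both buffers clear the threshold, so density alone cannot separate real UTF-16 text from binary data that happens to contain many zeros.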
## Solution
Add an early return in `resolveUtf16Charset()` to skip the null-byte heuristic for known binary MIME types:
- `audio/*`
- `video/*`
- `image/*`
BOM detection is preserved since it's always reliable (explicit markers).
## Changes
1. Added `mimeHint` parameter to `resolveUtf16Charset()`
2. Added early return for binary MIME types before the null-byte heuristic
3. Updated `decodeTextSample()` to pass through the `mimeHint`
4. Updated call sites to pass `rawMime`
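The changes above can be sketched roughly as follows. This is not the code from the PR: the function names `resolveUtf16Charset`, `decodeTextSample`, and the `mimeHint` parameter come from the description, but the return types, the `isBinaryMediaMime` helper, the 2048-byte sample size, the 0.3 threshold, and the byte-order guess are all assumptions.

```typescript
type Utf16Charset = "utf-16le" | "utf-16be";

// Hypothetical helper: true for MIME families that are binary by definition.
function isBinaryMediaMime(mime?: string): boolean {
  if (!mime) return false;
  const m = mime.trim().toLowerCase();
  return m.startsWith("audio/") || m.startsWith("video/") || m.startsWith("image/");
}

function resolveUtf16Charset(buf: Uint8Array, mimeHint?: string): Utf16Charset | undefined {
  // 1. BOM detection always runs, regardless of MIME type.
  if (buf.length >= 2) {
    if (buf[0] === 0xff && buf[1] === 0xfe) return "utf-16le";
    if (buf[0] === 0xfe && buf[1] === 0xff) return "utf-16be";
  }

  // 2. Early return: skip the null-byte heuristic for audio/video/image.
  if (isBinaryMediaMime(mimeHint)) return undefined;

  // 3. Null-byte heuristic (assumed form): count zeros at even and odd offsets.
  const sampleLen = Math.min(buf.length, 2048) & ~1; // round down to an even length
  if (sampleLen < 2) return undefined;
  let zeroEven = 0;
  let zeroOdd = 0;
  for (let i = 0; i < sampleLen; i += 2) {
    if (buf[i] === 0) zeroEven++;
    if (buf[i + 1] === 0) zeroOdd++;
  }
  if ((zeroEven + zeroOdd) / sampleLen < 0.3) return undefined;
  // ASCII in UTF-16LE puts the zero in the odd (high) byte of each pair.
  return zeroOdd >= zeroEven ? "utf-16le" : "utf-16be";
}

// Passes mimeHint through; this sketch only handles the UTF-16 case.
function decodeTextSample(buf: Uint8Array, mimeHint?: string): string | undefined {
  const charset = resolveUtf16Charset(buf, mimeHint);
  if (!charset) return undefined;
  const little = charset === "utf-16le";
  let out = "";
  for (let i = 0; i + 1 < buf.length; i += 2) {
    const unit = little ? buf[i] | (buf[i + 1] << 8) : (buf[i] << 8) | buf[i + 1];
    out += String.fromCharCode(unit);
  }
  return out.replace(/^\uFEFF/, ""); // drop a leading BOM if present
}

// Synthetic binary payload with 40% null bytes (not real Ogg data).
const payload = new Uint8Array([0x4f, 0x67, 0x67, 0x53, 0x00, 0x02, 0x00, 0x00, 0x00, 0xa1]);
const rawMime = "audio/ogg";

console.log(resolveUtf16Charset(payload));          // "utf-16be" (misdetected without a hint)
console.log(resolveUtf16Charset(payload, rawMime)); // undefined (heuristic skipped)
```

A call site would pass its `rawMime` as the second argument, as in the last line. Placing the MIME gate after the BOM check means a file that carries an explicit BOM is still detected as UTF-16 even when it is labelled as media.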
## Testing
- `pnpm tsgo` ✅ passes
- `pnpm lint` ✅ passes (0 warnings, 0 errors)
Fixes #7444
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adjusts text/binary detection in `src/media-understanding/apply.ts` to avoid misclassifying binary attachments (notably voice messages like Opus/Ogg) as UTF‑16 based on null-byte density. `resolveUtf16Charset()` now accepts a `mimeHint` and skips the null-byte heuristic for `audio/*`, `video/*`, and `image/*` while preserving BOM-based UTF‑16 detection. Call sites were updated to pass through the detected/declared MIME (`rawMime`) so the heuristic can be gated appropriately, preventing large binary garbage from being included in conversation context.
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk.
- Changes are localized to MIME-gated UTF‑16 detection and only relax a heuristic for clearly binary media types while preserving BOM detection; call-site updates are straightforward and covered by existing lint/typecheck per PR notes.
- No files require special attention
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #4235: fix(media): skip audio in extractFileBlocks + hasBinaryAudioMagic d... (by null-runner · 2026-01-29 · 81.8% similar)
- #10257: fix(security): anchor MIME sanitization regex and block fullwidth b... (by nu-gui · 2026-02-06 · 77.6% similar)
- #11160: Media: add missing audio MIME-to-extension mappings (aac, flac, opu... (by lailoo · 2026-02-07 · 77.2% similar)
- #17286: fix(media): PDF attachments embedded as raw binary instead of extra... (by yinghaosang · 2026-02-15 · 77.0% similar)
- #5401: fix(media-understanding): detect audio binary by magic bytes to pre... (by RiadJamal07 · 2026-01-31 · 76.5% similar)
- #8048: Media: add regression test for audio text blocks (#7970) (by Abhishek-B-R · 2026-02-03 · 76.4% similar)
- #11443: LINE: fix buffer guards in detectContentType + add tests (by MdRahmatUllah · 2026-02-07 · 75.4% similar)
- #8388: fix(media): auto-skip tiny/empty audio files before transcription (... (by Glucksberg · 2026-02-04 · 75.0% similar)
- #18811: fix(media): require file extension for ambiguous MEDIA: path detection (by aldoeliacim · 2026-02-17 · 73.9% similar)
- #15770: fix: prevent phantom <media:unknown> messages from Signal protocol ... (by joetomasone · 2026-02-13 · 73.8% similar)