#5588: fix(media): skip binary audio in file extraction to prevent false UTF-16 detection
Cluster: Media Handling Improvements
## Summary
Audio files should be handled by the transcription pipeline, not file extraction. This prevents binary audio data from being falsely detected as UTF-16 text and injected as garbage into the context window.
**Simplified fix:** Skip audio files in `extractFileBlocks()` unless they have a text extension such as `.txt`. The complex BOM/heuristic detection is removed in favor of a simple `kind === "audio"` check.
Closes #5552
Closes #5590
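
The skip rule described in the summary can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `Attachment` shape, the `TEXT_EXTENSIONS` set, and the helper names `extensionOf` and `shouldSkipFileExtraction` are all assumptions made for the example. Only `extractFileBlocks()` and the `kind === "audio"` check come from the PR description.

```typescript
// Hypothetical attachment shape; the real type in the codebase may differ.
type MediaKind = "audio" | "image" | "video" | "document" | "unknown";

interface Attachment {
  path: string;
  kind: MediaKind;
}

// Assumed set of extensions treated as text regardless of the detected kind.
const TEXT_EXTENSIONS = new Set([".txt", ".md", ".csv", ".tsv", ".json"]);

// Return the lowercased extension (including the dot), or "" if there is none.
function extensionOf(path: string): string {
  const base = path.slice(path.lastIndexOf("/") + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// Decide whether file extraction should skip this attachment.
// Audio is skipped unless the file carries a text extension, so binary
// audio never reaches any text-encoding detection.
function shouldSkipFileExtraction(att: Attachment): boolean {
  if (att.kind !== "audio") return false;
  return !TEXT_EXTENSIONS.has(extensionOf(att.path));
}
```

Under these assumptions, a caller inside `extractFileBlocks()` would check `shouldSkipFileExtraction(att)` before reading the file and leave skipped attachments to the transcription pipeline. Keying the decision on the media kind rather than on byte-level heuristics keeps the check cheap and avoids misclassifying binary payloads.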
## Test plan
- [x] Added tests for binary OGG/Opus files that would trigger false UTF-16 detection
- [x] Updated CSV/TSV tests to use appropriate file extensions
- [x] All 18 tests pass
- [x] Lint and build pass
🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Most Similar PRs

| PR | Author | Date | Similarity |
| --- | --- | --- | --- |
| #7454: fix: skip UTF-16 heuristic for audio/video/image MIME types (#7444) | gavinbmoore | 2026-02-02 | 73.7% |
| #4235: fix(media): skip audio in extractFileBlocks + hasBinaryAudioMagic d... | null-runner | 2026-01-29 | 73.6% |
| #5401: fix(media-understanding): detect audio binary by magic bytes to pre... | RiadJamal07 | 2026-01-31 | 72.1% |
| #8048: Media: add regression test for audio text blocks (#7970) | Abhishek-B-R | 2026-02-03 | 71.2% |
| #8388: fix(media): auto-skip tiny/empty audio files before transcription (... | Glucksberg | 2026-02-04 | 69.6% |
| #18811: fix(media): require file extension for ambiguous MEDIA: path detection | aldoeliacim | 2026-02-17 | 68.4% |
| #11160: Media: add missing audio MIME-to-extension mappings (aac, flac, opu... | lailoo | 2026-02-07 | 67.8% |
| #17286: fix(media): PDF attachments embedded as raw binary instead of extra... | yinghaosang | 2026-02-15 | 67.7% |
| #14794: fix: parse inline MEDIA: tokens in agent replies | explainanalyze | 2026-02-12 | 66.3% |
| #19868: fix: prevent media token regex from matching markdown bold text | sanketgautam | 2026-02-18 | 65.7% |