#4235: fix(media): skip audio in extractFileBlocks + hasBinaryAudioMagic defense-in-depth
## Summary
Fixes #1989 — complementary to #3904.
OGG audio files (Telegram voice messages) were misidentified as `text/tab-separated-values` because `looksLikeUtf8Text()` accepts OGG headers (more than 85% printable ASCII) and `guessDelimitedMime()` then finds tab characters in the metadata.
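To illustrate the failure mode, here is a minimal sketch of how a printable-ratio heuristic can accept an audio header. `printableRatio`, `countTabs`, and the synthetic buffer are hypothetical stand-ins for illustration only; they are not the project's `looksLikeUtf8Text()` or `guessDelimitedMime()`, and the buffer is not a real OGG page.

```typescript
// Hypothetical stand-in for a "mostly printable ASCII" heuristic.
function printableRatio(buf: Uint8Array): number {
  if (buf.length === 0) return 0;
  let printable = 0;
  for (let i = 0; i < buf.length; i++) {
    const b = buf[i];
    // Count tab, LF, CR, and the visible ASCII range as printable.
    if (b === 0x09 || b === 0x0a || b === 0x0d || (b >= 0x20 && b <= 0x7e)) {
      printable++;
    }
  }
  return printable / buf.length;
}

// Hypothetical stand-in for a delimiter sniffer: counts tab bytes.
function countTabs(buf: Uint8Array): number {
  let tabs = 0;
  for (let i = 0; i < buf.length; i++) {
    if (buf[i] === 0x09) tabs++;
  }
  return tabs;
}

function bytesOf(text: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < text.length; i++) out.push(text.charCodeAt(i));
  return out;
}

// Synthetic, OGG-like header: magic, a few binary bytes, then ASCII metadata with tabs.
const syntheticHeader = new Uint8Array(
  bytesOf("OggS")
    .concat([0x00, 0x02, 0x00, 0x00])
    .concat(bytesOf("OpusTags\tencoder=example\ttitle=voice message")),
);

// Under these assumptions the buffer looks both "text-like" and tab-delimited.
const looksTextLike = printableRatio(syntheticHeader) > 0.85;
const looksTabDelimited = countTabs(syntheticHeader) > 0;
```

Both flags come out `true` for this buffer, which is the combination that leads to a TSV guess.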
This PR combines two defenses:
### 1. Early skip for `kind === "audio"` in `extractFileBlocks()`
Adds `"audio"` to the existing `image`/`video` skip list, so audio attachments bypass text extraction entirely — no buffer read needed.
```typescript
if (!forcedTextMime && (kind === "image" || kind === "audio" || kind === "video")) {
  continue;
}
```
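The effect of the skip can be sketched in isolation. The `AttachmentSketch` type and `candidatesForTextExtraction` function below are hypothetical and exist only for illustration; the real `extractFileBlocks()` loop works on the project's own attachment objects and does more than filter.

```typescript
// Hypothetical attachment shape; the real type in apply.ts may differ.
type AttachmentKindSketch = "image" | "audio" | "video" | "document" | "unknown";

interface AttachmentSketch {
  name: string;
  kind: AttachmentKindSketch;
  // Set when the filename forces a text MIME type (assumption for this sketch).
  forcedTextMime?: string;
}

// Returns the attachments that would still proceed to text extraction.
function candidatesForTextExtraction(attachments: AttachmentSketch[]): AttachmentSketch[] {
  const candidates: AttachmentSketch[] = [];
  for (const attachment of attachments) {
    const { kind, forcedTextMime } = attachment;
    // Same condition as the PR: media kinds are skipped unless a text MIME is forced.
    if (!forcedTextMime && (kind === "image" || kind === "audio" || kind === "video")) {
      continue;
    }
    candidates.push(attachment);
  }
  return candidates;
}

const remaining = candidatesForTextExtraction([
  { name: "voice.ogg", kind: "audio" },
  { name: "notes.txt", kind: "document" },
  { name: "export.csv", kind: "audio", forcedTextMime: "text/csv" },
]).map((a) => a.name);
```

In this sketch `voice.ogg` is dropped before any buffer is read, while the entry with a forced text MIME still goes through, matching the `!forcedTextMime` guard.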
### 2. `hasBinaryAudioMagic()` buffer check (same approach as #3904)
Detects OGG (`OggS`) and MP3-with-ID3 (`ID3`) by magic bytes, so even if an audio file slips past the kind check, `textLike` will be `false`:
```typescript
const textLike =
  (Boolean(utf16Charset) || looksLikeUtf8Text(bufferResult?.buffer)) &&
  !hasBinaryAudioMagic(bufferResult?.buffer);
```
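For reference, a magic-byte check of this kind might look roughly like the sketch below. It is not the PR's actual helper: the name `hasBinaryAudioMagicSketch`, the `Uint8Array` parameter type, and the `startsWithBytes` utility are assumptions. Only the two signatures named above (`OggS` and `ID3`) are covered.

```typescript
// File signatures named in the PR description.
const OGG_MAGIC = [0x4f, 0x67, 0x67, 0x53]; // "OggS"
const ID3_MAGIC = [0x49, 0x44, 0x33]; // "ID3"

// Hypothetical utility: true when `buffer` begins with the given bytes.
function startsWithBytes(buffer: Uint8Array, magic: number[]): boolean {
  if (buffer.length < magic.length) return false;
  for (let i = 0; i < magic.length; i++) {
    if (buffer[i] !== magic[i]) return false;
  }
  return true;
}

// Sketch of the guard: a missing buffer is treated as "no audio magic found".
function hasBinaryAudioMagicSketch(buffer?: Uint8Array): boolean {
  if (!buffer) return false;
  return startsWithBytes(buffer, OGG_MAGIC) || startsWithBytes(buffer, ID3_MAGIC);
}

const oggSample = new Uint8Array([0x4f, 0x67, 0x67, 0x53, 0x00, 0x02]);
const id3Sample = new Uint8Array([0x49, 0x44, 0x33, 0x04, 0x00]);
const tsvSample = new Uint8Array([0x61, 0x09, 0x62, 0x0a]); // "a\tb\n"
```

Because the check only looks at the first few bytes, it stays cheap even for large attachments, and genuine tab-separated text such as `tsvSample` is unaffected.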
### Why both?
- The kind skip handles the common path efficiently (no I/O)
- The magic bytes check catches edge cases where `resolveAttachmentKind()` fails or returns something unexpected
- Defense-in-depth: two independent checks for the same class of bug
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR tightens file-text extraction heuristics in `src/media-understanding/apply.ts` to prevent binary audio (notably Telegram OGG voice messages) from being misclassified as text (e.g., TSV) and passed through `extractFileContentFromSource`.
Key changes:
- Adds `audio` to the early skip list in `extractFileBlocks()` when no text MIME is forced by filename.
- Introduces `hasBinaryAudioMagic()` and incorporates it into the `textLike` heuristic so OGG (`OggS`) and MP3-with-ID3 (`ID3`) won’t be treated as text-like even if they have ASCII-heavy headers.
This fits into the media-understanding pipeline by ensuring only genuinely text-like file attachments proceed to the text/PDF extraction step, while audio is handled via the separate audio capability path.
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge with low risk; changes are localized and defensive.
- Edits are confined to `src/media-understanding/apply.ts` and primarily add early skipping and a magic-byte guard to prevent audio being treated as text. This should reduce misclassification without affecting normal text extraction. The main nit is the helper’s docstring scope (mentions video) vs actual checks (audio-only), which is maintenance-related rather than a runtime risk.
- src/media-understanding/apply.ts
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
## Most Similar PRs
- #5401: fix(media-understanding): detect audio binary by magic bytes to pre... (by RiadJamal07 · 2026-01-31, similarity 86.5%)
- #8048: Media: add regression test for audio text blocks (#7970) (by Abhishek-B-R · 2026-02-03, similarity 83.3%)
- #7454: fix: skip UTF-16 heuristic for audio/video/image MIME types (#7444) (by gavinbmoore · 2026-02-02, similarity 81.8%)
- #17286: fix(media): PDF attachments embedded as raw binary instead of extra... (by yinghaosang · 2026-02-15, similarity 78.1%)
- #8388: fix(media): auto-skip tiny/empty audio files before transcription (... (by Glucksberg · 2026-02-04, similarity 77.7%)
- #11443: LINE: fix buffer guards in detectContentType + add tests (by MdRahmatUllah · 2026-02-07, similarity 77.2%)
- #21110: fix(tts): deliver audio via structured mediaUrl instead of MEDIA: t... (by hydro13 · 2026-02-19, similarity 76.8%)
- #14794: fix: parse inline MEDIA: tokens in agent replies (by explainanalyze · 2026-02-12, similarity 76.4%)
- #11160: Media: add missing audio MIME-to-extension mappings (aac, flac, opu... (by lailoo · 2026-02-07, similarity 75.9%)
- #19868: fix: prevent media token regex from matching markdown bold text (by sanketgautam · 2026-02-18, similarity 75.7%)