#4235: fix(media): skip audio in extractFileBlocks + hasBinaryAudioMagic defense-in-depth

by null-runner open 2026-01-29 22:50
## Summary

Fixes #1989; complementary to #3904.

OGG audio files (Telegram voice messages) were misidentified as `text/tab-separated-values` because `looksLikeUtf8Text()` passes on OGG headers (>85% printable ASCII) and `guessDelimitedMime()` finds tabs in the metadata.

This PR combines two defenses:

### 1. Early skip for `kind === "audio"` in `extractFileBlocks()`

Adds `"audio"` to the existing `image`/`video` skip list, so audio attachments bypass text extraction entirely, with no buffer read needed.

```typescript
if (!forcedTextMime && (kind === "image" || kind === "audio" || kind === "video")) {
  continue;
}
```

### 2. `hasBinaryAudioMagic()` buffer check (same approach as #3904)

Detects OGG (`OggS`) and MP3-with-ID3 (`ID3`) by magic bytes, so even if an audio file slips past the kind check, `textLike` will be `false`:

```typescript
const textLike =
  (Boolean(utf16Charset) || looksLikeUtf8Text(bufferResult?.buffer)) &&
  !hasBinaryAudioMagic(bufferResult?.buffer);
```

### Why both?

- The kind skip handles the common path efficiently (no I/O)
- The magic bytes check catches edge cases where `resolveAttachmentKind()` fails or returns something unexpected
- Defense-in-depth: two independent checks for the same class of bug

<!-- greptile_comment -->

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR tightens file-text extraction heuristics in `src/media-understanding/apply.ts` to prevent binary audio (notably Telegram OGG voice messages) from being misclassified as text (e.g., TSV) and passed through `extractFileContentFromSource`.

Key changes:

- Adds `audio` to the early skip list in `extractFileBlocks()` when no text MIME is forced by filename.
- Introduces `hasBinaryAudioMagic()` and incorporates it into the `textLike` heuristic so OGG (`OggS`) and MP3-with-ID3 (`ID3`) won’t be treated as text-like even if they have ASCII-heavy headers.

This fits into the media-understanding pipeline by ensuring only genuinely text-like file attachments proceed to the text/PDF extraction step, while audio is handled via the separate audio capability path.

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge with low risk; changes are localized and defensive.
- Edits are confined to `src/media-understanding/apply.ts` and primarily add early skipping and a magic-byte guard to prevent audio being treated as text. This should reduce misclassification without affecting normal text extraction. The main nit is the helper’s docstring scope (mentions video) vs actual checks (audio-only), which is maintenance-related rather than a runtime risk.
- src/media-understanding/apply.ts

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
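
For readers who want a concrete picture of the magic-byte guard, here is a minimal sketch of what a check like `hasBinaryAudioMagic()` could look like. The PR description names the function and the two signatures (`OggS` and `ID3`) but does not show the body, so the parameter type, the `startsWithBytes` helper, and the constant names below are assumptions for illustration, not the code in `src/media-understanding/apply.ts`.

```typescript
// Hypothetical sketch: the real hasBinaryAudioMagic() body is not shown in the
// PR description, so the signature and helper names here are assumptions.

// ASCII bytes for the Ogg capture pattern "OggS".
const OGG_MAGIC: readonly number[] = [0x4f, 0x67, 0x67, 0x53];
// ASCII bytes for the ID3v2 tag identifier "ID3" that can prefix an MP3 file.
const ID3_MAGIC: readonly number[] = [0x49, 0x44, 0x33];

// Returns true when `buffer` begins with exactly the bytes in `magic`.
function startsWithBytes(buffer: Uint8Array, magic: readonly number[]): boolean {
  if (buffer.length < magic.length) {
    return false;
  }
  return magic.every((byte, i) => buffer[i] === byte);
}

// Returns true when the buffer starts with a known binary-audio signature.
// A missing buffer is treated as "no magic found" so callers can pass
// `bufferResult?.buffer` directly, as in the `textLike` expression above.
function hasBinaryAudioMagic(buffer?: Uint8Array): boolean {
  if (!buffer) {
    return false;
  }
  return startsWithBytes(buffer, OGG_MAGIC) || startsWithBytes(buffer, ID3_MAGIC);
}
```

Because the check only inspects the first few bytes, it stays cheap even on large attachments. This sketch covers only the two signatures the PR names; an MP3 without an ID3 tag would not be matched by it.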
