#17286: fix(media): PDF attachments embedded as raw binary instead of extracted text (#17198)
stale
size: M
trusted-contributor
Cluster: Media Handling Improvements
## Summary
PDF attachments are dumped into the agent context as ~200KB of raw binary instead of going through `extractPdfContent()` for proper text extraction. This happens because `looksLikeUtf8Text()` treats the `%PDF-1.7` header as valid ASCII text, which causes the MIME type to be overridden to `text/plain`.
Closes #17198
lobster-biscuit
## Root Cause
In `extractFileBlocks()` (`src/media-understanding/apply.ts`), the MIME resolution lets `textHint` override `rawMime` unconditionally. PDF headers are mostly printable ASCII, so `looksLikeUtf8Text()` returns `true`, `textHint` becomes `"text/plain"`, and `extractFileContentFromSource()` never hits the `application/pdf` branch.
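To illustrate why a PDF can pass a text-sniffing check, here is a minimal sketch of a printable-ASCII heuristic. It is not the project's `looksLikeUtf8Text()`; the function name `looksMostlyPrintable`, the 512-byte sample size, and the 0.85 threshold are all assumptions chosen for illustration.

```typescript
// Hypothetical stand-in for a text-sniffing heuristic.
// NOT the real looksLikeUtf8Text(); sample size and threshold are assumed values.
function looksMostlyPrintable(
  bytes: Uint8Array,
  sampleSize: number = 512,
  threshold: number = 0.85,
): boolean {
  const n = Math.min(bytes.length, sampleSize);
  if (n === 0) return false;
  let printable = 0;
  for (let i = 0; i < n; i++) {
    const b = bytes[i] ?? 0;
    // Count tab, LF, CR, and the printable ASCII range 0x20..0x7e.
    if (b === 0x09 || b === 0x0a || b === 0x0d || (b >= 0x20 && b <= 0x7e)) {
      printable++;
    }
  }
  return printable / n >= threshold;
}

// Helper: encode an ASCII string as bytes without relying on Buffer or TextEncoder.
function asciiBytes(s: string): Uint8Array {
  const out = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) out[i] = s.charCodeAt(i) & 0xff;
  return out;
}

// The start of a typical PDF is plain ASCII, so a heuristic like this accepts it.
const pdfHead = asciiBytes(
  "%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n",
);

// A PNG-like header with NUL and high bytes is rejected by the same check.
const binaryHead = new Uint8Array([
  0x89, 0x50, 0x4e, 0x47, 0x00, 0x00, 0x00, 0x0d, 0x00, 0x01, 0x02, 0x03,
]);

const pdfLooksLikeText = looksMostlyPrintable(pdfHead);
const binaryLooksLikeText = looksMostlyPrintable(binaryHead);
```

Any heuristic of this shape inspects only the leading bytes, so a PDF whose header and first objects are ASCII is classified as text even though compressed binary streams follow later in the file.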
## Changes
- Before: `mimeType` always prefers `textHint` over `rawMime`, so PDFs get `text/plain` and the raw binary is inlined
- After: when `rawMime` is specifically `application/pdf` and there's no extension-based override (`forcedTextMimeResolved`), the original MIME is preserved so the PDF extraction path is used
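The before/after behavior can be sketched roughly as follows. This is a simplified model, not the code in `apply.ts`: the function names `resolveMimeBefore` and `resolveMimeAfter`, the `MimeInputs` shape, the normalization step, and the precedence given to `forcedTextMimeResolved` are assumptions based on the description above.

```typescript
// Simplified model of the MIME resolution; names and shapes are hypothetical.
interface MimeInputs {
  rawMime?: string; // MIME type reported for the attachment
  textHint?: string; // e.g. "text/plain" when content sniffing says "looks like text"
  forcedTextMimeResolved?: string; // extension-based override, e.g. derived from ".txt"
}

// Assumed normalization: drop parameters, trim, lowercase.
function normalizeMime(mime?: string): string | undefined {
  return mime?.split(";")[0]?.trim().toLowerCase() || undefined;
}

// Before: the text hint always wins over the reported MIME type.
function resolveMimeBefore(inputs: MimeInputs): string | undefined {
  return inputs.textHint ?? normalizeMime(inputs.rawMime);
}

// After: an exact application/pdf match is preserved unless an
// extension-based override is present.
function resolveMimeAfter(inputs: MimeInputs): string | undefined {
  const normalizedRawMime = normalizeMime(inputs.rawMime);
  if (inputs.forcedTextMimeResolved) return inputs.forcedTextMimeResolved;
  if (normalizedRawMime === "application/pdf") return normalizedRawMime;
  return inputs.textHint ?? normalizedRawMime;
}

const sniffedPdf: MimeInputs = { rawMime: "application/pdf", textHint: "text/plain" };
const before = resolveMimeBefore(sniffedPdf); // "text/plain"
const after = resolveMimeAfter(sniffedPdf); // "application/pdf"
```

Keeping the check as an exact match on `application/pdf` means every other MIME type still follows the text-hint path, which is why the change stays narrowly scoped.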
## Tests
`apply.test.ts` — 4 tests covering the fix:
1. **MIME preservation**: writes a minimal valid PDF with `MediaType: "application/pdf"`, confirms the output block has `mime="application/pdf"` (not `text/plain`). Fails before fix, passes after.
2. **Text extraction end-to-end**: writes a PDF with embedded text content (`Hello World` via BT/Tj operators), confirms the extracted text appears in the output block and raw `%PDF` binary does not — verifying the full PDF extraction pipeline runs.
3. **Extension-based override still works**: a `.txt` file with `MediaType: "application/pdf"` correctly resolves to `text/plain` via the extension-based path (`forcedTextMimeResolved`), ensuring the fix doesn't break the extension override logic.
4. **Non-PDF types unaffected**: `application/octet-stream` is still correctly skipped by `isBinaryMediaMime`, confirming the fix is scoped to `application/pdf` only.
All 28 tests in `src/media-understanding/` pass. `pnpm build` and `pnpm lint` pass.
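For readers who want to reproduce a fixture like the one in test 2, the sketch below builds a small single-page PDF whose content stream draws text with the BT/Tj operators. It is not the fixture used in `apply.test.ts`; `buildMinimalPdf` is a hypothetical helper, and it assumes the text is plain ASCII with no unescaped parentheses or backslashes.

```typescript
// Hypothetical fixture builder; not taken from apply.test.ts.
// Assumes `text` is ASCII without parentheses or backslashes, so that
// string length equals byte length and no PDF string escaping is needed.
function buildMinimalPdf(text: string): string {
  const content = `BT /F1 24 Tf 72 720 Td (${text}) Tj ET`;
  const objects: string[] = [
    "<< /Type /Catalog /Pages 2 0 R >>",
    "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] " +
      "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    `<< /Length ${content.length} >>\nstream\n${content}\nendstream`,
    "<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
  ];

  let pdf = "%PDF-1.7\n";
  const offsets: number[] = [];
  objects.forEach((body, i) => {
    offsets.push(pdf.length); // byte offset of "N 0 obj" for the xref table
    pdf += `${i + 1} 0 obj\n${body}\nendobj\n`;
  });

  const xrefStart = pdf.length;
  pdf += `xref\n0 ${objects.length + 1}\n`;
  pdf += "0000000000 65535 f \n"; // entry for the reserved free object 0
  for (const off of offsets) {
    // Each xref entry is exactly 20 bytes: 10-digit offset, generation, flag, EOL.
    pdf += `${("0000000000" + off).slice(-10)} 00000 n \n`;
  }
  pdf += `trailer\n<< /Size ${objects.length + 1} /Root 1 0 R >>\n`;
  pdf += `startxref\n${xrefStart}\n%%EOF\n`;
  return pdf;
}

const helloPdf = buildMinimalPdf("Hello World");
```

Because the content stream is left uncompressed, the string `Hello World` appears literally in the file, so an assertion should check for the extracted text together with the absence of the `%PDF` header rather than for the text alone.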
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Fixes a bug where PDF attachments were dumped as raw binary into the agent context instead of going through the `extractPdfContent()` text extraction pipeline. The root cause was that `looksLikeUtf8Text()` misclassified PDF headers (which are mostly printable ASCII) as valid text, causing the MIME type to be overridden from `application/pdf` to `text/plain`.
- The fix in `extractFileBlocks()` preserves the original `application/pdf` MIME type when there's no extension-based override, ensuring the PDF extraction path in `extractFileContentFromSource()` is correctly invoked
- The condition is narrowly scoped to `normalizedRawMime === "application/pdf"` (exact match), avoiding unintended side effects on other MIME types
- The logging condition is tightened to only fire when a MIME override actually takes effect
- Four new tests cover: MIME preservation, end-to-end text extraction, extension-based override compatibility, and non-PDF type behavior
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — the fix is narrowly scoped to a specific MIME type and well-covered by tests.
- The logic change is minimal and targeted: an exact-match condition for `application/pdf` that preserves the original MIME instead of letting the text heuristic override it. The extension-based override path (`forcedTextMimeResolved`) is explicitly preserved. Four new tests cover the fix, the extension override path, and non-PDF behavior. The downstream `extractFileContentFromSource` already has proper `application/pdf` handling, and failure cases are caught by the existing try/catch. Score is 4 rather than 5 because the new test file doesn't clean up temporary directories (though this follows existing patterns in the codebase).
- No files require special attention.
<sub>Last reviewed commit: 1cdbf5d</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #4235: fix(media): skip audio in extractFileBlocks + hasBinaryAudioMagic d... (by null-runner · 2026-01-29 · 78.1%)
- #18811: fix(media): require file extension for ambiguous MEDIA: path detection (by aldoeliacim · 2026-02-17 · 77.5%)
- #7454: fix: skip UTF-16 heuristic for audio/video/image MIME types (#7444) (by gavinbmoore · 2026-02-02 · 77.0%)
- #14794: fix: parse inline MEDIA: tokens in agent replies (by explainanalyze · 2026-02-12 · 75.9%)
- #19868: fix: prevent media token regex from matching markdown bold text (by sanketgautam · 2026-02-18 · 75.9%)
- #11443: LINE: fix buffer guards in detectContentType + add tests (by MdRahmatUllah · 2026-02-07 · 75.8%)
- #23262: fix(media): prevent PDF binary injection by detecting magic bytes (by SidQin-cyber · 2026-02-22 · 75.4%)
- #8048: Media: add regression test for audio text blocks (#7970) (by Abhishek-B-R · 2026-02-03 · 75.3%)
- #10257: fix(security): anchor MIME sanitization regex and block fullwidth b... (by nu-gui · 2026-02-06 · 74.8%)
- #11160: Media: add missing audio MIME-to-extension mappings (aac, flac, opu... (by lailoo · 2026-02-07 · 74.8%)