#17286: fix(media): PDF attachments embedded as raw binary instead of extracted text (#17198)

by yinghaosang open 2026-02-15 16:12

stale size: M trusted-contributor

## Summary

PDF attachments get dumped as ~200KB of raw binary into the agent context instead of going through `extractPdfContent()` for proper text extraction. This happens because `looksLikeUtf8Text()` sees the `%PDF-1.7` header as valid ASCII and overrides the MIME type to `text/plain`.

Closes #17198

lobster-biscuit

## Root Cause

In `extractFileBlocks()` (`src/media-understanding/apply.ts`), the MIME resolution lets `textHint` override `rawMime` unconditionally. PDF headers are mostly printable ASCII, so `looksLikeUtf8Text()` returns `true`, `textHint` becomes `"text/plain"`, and `extractFileContentFromSource()` never hits the `application/pdf` branch.

## Changes

- Before: `mimeType` always prefers `textHint` over `rawMime`, so PDFs get `text/plain` and the raw binary is inlined
- After: when `rawMime` is specifically `application/pdf` and there's no extension-based override (`forcedTextMimeResolved`), the original MIME is preserved so the PDF extraction path is used

## Tests

`apply.test.ts`: 4 tests covering the fix:

1. **MIME preservation**: writes a minimal valid PDF with `MediaType: "application/pdf"`, confirms the output block has `mime="application/pdf"` (not `text/plain`). Fails before fix, passes after.
2. **Text extraction end-to-end**: writes a PDF with embedded text content (`Hello World` via BT/Tj operators), confirms the extracted text appears in the output block and raw `%PDF` binary does not, verifying that the full PDF extraction pipeline runs.
3. **Extension-based override still works**: a `.txt` file with `MediaType: "application/pdf"` correctly resolves to `text/plain` via the extension-based path (`forcedTextMimeResolved`), ensuring the fix doesn't break the extension override logic.
4. **Non-PDF types unaffected**: `application/octet-stream` is still correctly skipped by `isBinaryMediaMime`, confirming the fix is scoped to `application/pdf` only.

All 28 tests in `src/media-understanding/` pass. `pnpm build` and `pnpm lint` pass.
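The before/after behavior listed under Changes can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual code in `apply.ts`: the `MimeInputs` shape and the helpers `normalizeMime`, `resolveMimeBefore`, and `resolveMimeAfter` are names invented for illustration, and the precedence given to the extension-based override is an assumption drawn from the description above.

```typescript
// Hypothetical, simplified model of the MIME resolution in extractFileBlocks().
// Field and function names are illustrative; the real code in apply.ts may differ.
interface MimeInputs {
  rawMime?: string; // MIME type reported for the attachment
  textHint?: string; // "text/plain" when the content heuristic fires
  forcedTextMime?: string; // set when the file extension forces a text type
}

// Strip parameters such as "; charset=utf-8" and lowercase the base type.
function normalizeMime(mime?: string): string | undefined {
  if (!mime) return undefined;
  const base = mime.split(";")[0].trim().toLowerCase();
  return base.length > 0 ? base : undefined;
}

// Before the fix: the text hint always wins over the declared MIME type.
function resolveMimeBefore(input: MimeInputs): string | undefined {
  return input.forcedTextMime ?? input.textHint ?? normalizeMime(input.rawMime);
}

// After the fix: a declared application/pdf survives the text heuristic,
// unless an extension-based override is present.
function resolveMimeAfter(input: MimeInputs): string | undefined {
  const normalizedRawMime = normalizeMime(input.rawMime);
  if (input.forcedTextMime) return input.forcedTextMime;
  if (normalizedRawMime === "application/pdf") return normalizedRawMime;
  return input.textHint ?? normalizedRawMime;
}

const pdf: MimeInputs = { rawMime: "application/pdf", textHint: "text/plain" };
console.log(resolveMimeBefore(pdf)); // "text/plain"
console.log(resolveMimeAfter(pdf)); // "application/pdf"
```

In this model, a `.txt` file declared as `application/pdf` still resolves to `text/plain` because `forcedTextMime` is checked first, which corresponds to test 3 above.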
<h3>Greptile Summary</h3>

Fixes a bug where PDF attachments were dumped as raw binary into the agent context instead of going through the `extractPdfContent()` text extraction pipeline. The root cause was that `looksLikeUtf8Text()` misclassified PDF headers (which are mostly printable ASCII) as valid text, causing the MIME type to be overridden from `application/pdf` to `text/plain`.

- The fix in `extractFileBlocks()` preserves the original `application/pdf` MIME type when there's no extension-based override, ensuring the PDF extraction path in `extractFileContentFromSource()` is correctly invoked
- The condition is narrowly scoped to `normalizedRawMime === "application/pdf"` (exact match), avoiding unintended side effects on other MIME types
- The logging condition is tightened to only fire when a MIME override actually takes effect
- Four new tests cover: MIME preservation, end-to-end text extraction, extension-based override compatibility, and non-PDF type behavior

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge: the fix is narrowly scoped to a specific MIME type and well-covered by tests.
- The logic change is minimal and targeted: an exact-match condition for `application/pdf` that preserves the original MIME instead of letting the text heuristic override it. The extension-based override path (`forcedTextMimeResolved`) is explicitly preserved. Four new tests cover the fix, the extension override path, and non-PDF behavior. The downstream `extractFileContentFromSource` already has proper `application/pdf` handling, and failure cases are caught by the existing try/catch. Score is 4 rather than 5 because the new test file doesn't clean up temporary directories (though this follows existing patterns in the codebase).
- No files require special attention.

<sub>Last reviewed commit: 1cdbf5d</sub>
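To illustrate why a content heuristic can mistake the start of a PDF for text, here is a minimal sketch of a printable-byte check. It is only a hypothetical stand-in for `looksLikeUtf8Text()`: the function name `looksLikePrintableText`, the 64-byte sample size, and the 90% threshold are assumptions for illustration, and the real heuristic may work differently.

```typescript
// Hypothetical stand-in for a text-detection heuristic; not the real looksLikeUtf8Text().
// Sample size and threshold are assumed values chosen for illustration.
function looksLikePrintableText(bytes: Uint8Array, sampleSize = 64): boolean {
  const length = Math.min(bytes.length, sampleSize);
  if (length === 0) return false;
  let printable = 0;
  for (let i = 0; i < length; i++) {
    const byte = bytes[i];
    const isWhitespace = byte === 0x09 || byte === 0x0a || byte === 0x0d;
    const isPrintableAscii = byte >= 0x20 && byte <= 0x7e;
    if (isWhitespace || isPrintableAscii) printable++;
  }
  return printable / length > 0.9;
}

// Helper to build a byte array from an ASCII string.
function asciiBytes(text: string): Uint8Array {
  const out = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) out[i] = text.charCodeAt(i) & 0xff;
  return out;
}

// The opening bytes of a PDF are plain ASCII, so a sample-based check passes.
const pdfStart = asciiBytes("%PDF-1.7\n1 0 obj\n<< /Type /Catalog >>\nendobj\n");
console.log(looksLikePrintableText(pdfStart)); // true

// Bytes dominated by non-printable values are rejected.
const binary = new Uint8Array([0x00, 0xff, 0x89, 0x01, 0x02, 0x03, 0x80, 0x90]);
console.log(looksLikePrintableText(binary)); // false
```

Because a check of this kind only sees the leading bytes, it cannot tell a PDF from a text file, which is why the fix trusts the declared `application/pdf` MIME type over the heuristic.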