#23262: fix(media): prevent PDF binary injection by detecting magic bytes

by SidQin-cyber open 2026-02-22 04:48

size: XS

## Summary

- **Problem:** The `looksLikeUtf8Text` heuristic in `src/media-understanding/apply.ts` could misclassify PDF (and ZIP/OOXML) files as UTF-8 text when the file content happened to pass the text detection threshold. This caused raw binary content to be injected into the LLM context as text.
- **Why it matters:** Binary injection corrupts the conversation context, wastes tokens, and can cause model errors. PDF files are common attachments.
- **What changed:** Added a `BINARY_MAGIC_BYTES` constant for common binary format signatures (PDF `%PDF-`, ZIP/OOXML `PK`) and a `hasBinaryMagicBytes` helper. `looksLikeUtf8Text` now returns `false` immediately if magic bytes are detected, before running the text heuristic.
- **What did NOT change:** The existing UTF-8 text detection heuristic is unchanged. Files without matching magic bytes are processed as before.

## Change Type (select all)

- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [ ] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #23191

## User-visible / Behavior Changes

- PDF, ZIP, and OOXML files are no longer misclassified as text
- These files are now correctly routed to their dedicated handlers (e.g., PDF.js for PDFs)

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`

## Repro + Verification

### Environment

- OS: macOS 15.3 (arm64)
- Runtime: Node v22+
- Model/provider: Any

### Steps

1. Attach a PDF file to a message
2. Send the message
3. Check that the PDF content is extracted via PDF.js, not injected as raw binary text

### Expected

- PDF is processed by the dedicated PDF handler

### Actual

- Before fix: PDF binary content passed `looksLikeUtf8Text` check and was injected as text
- After fix: Magic bytes detection short-circuits, PDF is correctly routed

## Evidence

Magic bytes covered:

| Format | Signature | Bytes |
|--------|-----------|-------|
| PDF | `%PDF-` | `25 50 44 46 2D` |
| ZIP/OOXML | `PK..` | `50 4B 03 04` |
| ZIP (empty) | `PK..` | `50 4B 05 06` |
| ZIP (spanned) | `PK..` | `50 4B 07 08` |

## Human Verification (required)

- Verified scenarios: Confirmed `hasBinaryMagicBytes` correctly identifies PDF and ZIP headers; confirmed `looksLikeUtf8Text` returns `false` for buffers starting with these signatures
- Edge cases checked: Empty buffer returns `false`; buffer shorter than magic bytes is handled safely; text files starting with `%P` or `PK` (unlikely but possible) are only blocked if the full signature matches
- What I did **not** verify: Live PDF attachment processing (no running gateway)

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly: Remove the `hasBinaryMagicBytes` check from `looksLikeUtf8Text`
- Files/config to restore: `src/media-understanding/apply.ts`
- Known bad symptoms: If a legitimate text file starts with `%PDF-` bytes (extremely unlikely), it would be misrouted to binary handling

## Risks and Mitigations

- Risk: Extremely rare text files with coincidental magic byte prefixes
- Mitigation: The magic byte sequences are standardized format signatures; false positives are practically impossible
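
## Illustrative Sketch

For reviewers who want to see the shape of the check without opening the diff, here is a minimal sketch of the short-circuit described above. It is not the code from this PR: the names `BINARY_MAGIC_BYTES`, `hasBinaryMagicBytes`, and `looksLikeUtf8Text` come from the description, but their bodies here are assumptions, and `placeholderTextHeuristic` is a hypothetical stand-in for the existing heuristic, which this PR leaves unchanged and does not show.

```typescript
// Signatures taken from the Evidence table. The array layout is an assumption.
const BINARY_MAGIC_BYTES: ReadonlyArray<ReadonlyArray<number>> = [
  [0x25, 0x50, 0x44, 0x46, 0x2d], // PDF: %PDF-
  [0x50, 0x4b, 0x03, 0x04], // ZIP/OOXML
  [0x50, 0x4b, 0x05, 0x06], // ZIP (empty)
  [0x50, 0x4b, 0x07, 0x08], // ZIP (spanned)
];

// True only when the buffer starts with one complete signature.
// Empty buffers and buffers shorter than a signature never match it.
function hasBinaryMagicBytes(buffer: Uint8Array): boolean {
  return BINARY_MAGIC_BYTES.some(
    (signature) =>
      buffer.length >= signature.length &&
      signature.every((byte, index) => buffer[index] === byte),
  );
}

// Hypothetical stand-in for the real heuristic: treats a buffer as text
// when at least 90% of its bytes are printable ASCII or common whitespace.
function placeholderTextHeuristic(buffer: Uint8Array): boolean {
  if (buffer.length === 0) return false;
  let printable = 0;
  for (let i = 0; i < buffer.length; i++) {
    const byte = buffer[i];
    const isWhitespace = byte === 0x09 || byte === 0x0a || byte === 0x0d;
    if (isWhitespace || (byte >= 0x20 && byte < 0x7f)) printable++;
  }
  return printable / buffer.length >= 0.9;
}

// Magic bytes are checked first, so a mostly printable PDF header
// can no longer pass as text.
function looksLikeUtf8Text(buffer: Uint8Array): boolean {
  if (hasBinaryMagicBytes(buffer)) return false;
  return placeholderTextHeuristic(buffer);
}
```

In this sketch, a buffer holding the bytes of `%PDF-1.7` would pass the placeholder heuristic on its own, since every byte is printable ASCII, but it is rejected by the signature check that runs first. A two-byte buffer containing only `PK` does not match any signature and falls through to the heuristic.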