#23262: fix(media): prevent PDF binary injection by detecting magic bytes
size: XS
Cluster:
Media Handling Improvements
## Summary
- **Problem:** The `looksLikeUtf8Text` heuristic in `src/media-understanding/apply.ts` could misclassify PDF (and ZIP/OOXML) files as UTF-8 text when the file content happened to pass the text detection threshold. This caused raw binary content to be injected into the LLM context as text.
- **Why it matters:** Binary injection corrupts the conversation context, wastes tokens, and can cause model errors. PDF files are common attachments.
- **What changed:** Added a `BINARY_MAGIC_BYTES` constant for common binary format signatures (PDF `%PDF-`, ZIP/OOXML `PK`) and a `hasBinaryMagicBytes` helper. `looksLikeUtf8Text` now returns `false` immediately if magic bytes are detected, before running the text heuristic.
- **What did NOT change:** The existing UTF-8 text detection heuristic is unchanged. Files without matching magic bytes are processed as before.
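
The short-circuit described above can be sketched roughly as follows. This is a minimal illustration, not the code from the diff: the shape of `BINARY_MAGIC_BYTES` is assumed, and `passesTextHeuristic` is a hypothetical stand-in for the existing heuristic (its real logic and threshold are not shown in this description; the 0.85 ratio is purely illustrative).

```typescript
// Assumed shape; the actual constant in src/media-understanding/apply.ts may differ.
const BINARY_MAGIC_BYTES: ReadonlyArray<ReadonlyArray<number>> = [
  [0x25, 0x50, 0x44, 0x46, 0x2d], // %PDF-
  [0x50, 0x4b, 0x03, 0x04], // ZIP/OOXML local file header
  [0x50, 0x4b, 0x05, 0x06], // ZIP end of central directory (empty archive)
  [0x50, 0x4b, 0x07, 0x08], // ZIP spanned archive marker
];

// True only when the buffer begins with one complete signature.
function hasBinaryMagicBytes(buffer: Uint8Array): boolean {
  return BINARY_MAGIC_BYTES.some(
    (sig) => buffer.length >= sig.length && sig.every((byte, i) => buffer[i] === byte),
  );
}

// Hypothetical stand-in for the existing text heuristic: ratio of printable ASCII bytes.
function passesTextHeuristic(buffer: Uint8Array): boolean {
  if (buffer.length === 0) return false;
  let printable = 0;
  for (let i = 0; i < buffer.length; i++) {
    const byte = buffer[i];
    const isWhitespace = byte === 0x09 || byte === 0x0a || byte === 0x0d;
    if (isWhitespace || (byte >= 0x20 && byte < 0x7f)) printable++;
  }
  return printable / buffer.length >= 0.85;
}

function looksLikeUtf8Text(buffer: Uint8Array): boolean {
  // Known binary signatures are rejected before the heuristic runs.
  if (hasBinaryMagicBytes(buffer)) return false;
  return passesTextHeuristic(buffer);
}
```

With this sketch, a buffer holding the bytes of `%PDF-1.7` followed by a newline is entirely printable, so the stand-in heuristic alone would accept it, while `looksLikeUtf8Text` rejects it because of the signature check.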
## Change Type (select all)
- [x] Bug fix
- [ ] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra
## Scope (select all touched areas)
- [ ] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra
## Linked Issue/PR
- Closes #23191
## User-visible / Behavior Changes
- PDF, ZIP, and OOXML files are no longer misclassified as text
- These files are now correctly routed to their dedicated handlers (e.g., PDF.js for PDFs)
## Security Impact (required)
- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
## Repro + Verification
### Environment
- OS: macOS 15.3 (arm64)
- Runtime: Node v22+
- Model/provider: Any
### Steps
1. Attach a PDF file to a message
2. Send the message
3. Check that the PDF content is extracted via PDF.js, not injected as raw binary text
### Expected
- PDF is processed by the dedicated PDF handler
### Actual
- Before fix: PDF binary content passed the `looksLikeUtf8Text` check and was injected as text
- After fix: Magic bytes detection short-circuits, PDF is correctly routed
## Evidence
Magic bytes covered:
| Format | Signature | Bytes |
|--------|-----------|-------|
| PDF | `%PDF-` | `25 50 44 46 2D` |
| ZIP/OOXML | `PK..` | `50 4B 03 04` |
| ZIP (empty) | `PK..` | `50 4B 05 06` |
| ZIP (spanned) | `PK..` | `50 4B 07 08` |
## Human Verification (required)
- Verified scenarios: Confirmed `hasBinaryMagicBytes` correctly identifies PDF and ZIP headers; confirmed `looksLikeUtf8Text` returns `false` for buffers starting with these signatures
- Edge cases checked: Empty buffer returns `false`; buffer shorter than magic bytes is handled safely; text files starting with `%P` or `PK` (unlikely but possible) are only blocked if the full signature matches
- What I did **not** verify: Live PDF attachment processing (no running gateway)
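
The edge cases listed above could be expressed as a small table-driven check. This is a sketch under assumptions: the helper is re-implemented here so the block stands alone, `SIGNATURES` and `cases` are hypothetical names, and the project's actual test setup is not shown in this description.

```typescript
// Signatures taken from the Evidence table; the helper below is an assumed re-implementation.
const SIGNATURES: ReadonlyArray<ReadonlyArray<number>> = [
  [0x25, 0x50, 0x44, 0x46, 0x2d], // %PDF-
  [0x50, 0x4b, 0x03, 0x04],
  [0x50, 0x4b, 0x05, 0x06],
  [0x50, 0x4b, 0x07, 0x08],
];

function hasBinaryMagicBytes(buffer: Uint8Array): boolean {
  return SIGNATURES.some(
    (sig) => buffer.length >= sig.length && sig.every((byte, i) => buffer[i] === byte),
  );
}

const cases: Array<{ name: string; bytes: number[]; expected: boolean }> = [
  { name: "empty buffer", bytes: [], expected: false },
  { name: "shorter than any signature", bytes: [0x50, 0x4b], expected: false },
  { name: "partial PDF prefix (%P)", bytes: [0x25, 0x50, 0x6c, 0x61, 0x6e], expected: false },
  { name: "text starting with PK", bytes: [0x50, 0x4b, 0x20, 0x6e, 0x6f, 0x74, 0x65], expected: false },
  { name: "full PDF signature", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d, 0x31], expected: true },
  { name: "ZIP local file header", bytes: [0x50, 0x4b, 0x03, 0x04], expected: true },
];

// Collect the names of any cases whose result differs from the expectation.
const failures: string[] = cases
  .filter((c) => hasBinaryMagicBytes(new Uint8Array(c.bytes)) !== c.expected)
  .map((c) => c.name);
```

The length guard is what keeps short buffers safe: a two-byte `PK` buffer never reaches the byte-by-byte comparison against a four-byte signature.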
## Compatibility / Migration
- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`
## Failure Recovery (if this breaks)
- How to disable/revert this change quickly: Remove the `hasBinaryMagicBytes` check from `looksLikeUtf8Text`
- Files/config to restore: `src/media-understanding/apply.ts`
- Known bad symptoms: If a legitimate text file starts with `%PDF-` bytes (extremely unlikely), it would be misrouted to binary handling
## Risks and Mitigations
- Risk: Extremely rare text files with coincidental magic byte prefixes
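- Mitigation: The magic byte sequences are standardized format signatures. The ZIP variants contain control bytes that do not occur in ordinary text, and a text file beginning with the literal PDF header is very unlikely, so false positives should be rare in practice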
Most Similar PRs
#17286: fix(media): PDF attachments embedded as raw binary instead of extra...
by yinghaosang · 2026-02-15
75.4%
#4235: fix(media): skip audio in extractFileBlocks + hasBinaryAudioMagic d...
by null-runner · 2026-01-29
69.9%
#11990: Fix media understanding file path suppression + image tool bare-ID ...
by robertbergman2 · 2026-02-08
67.4%
#7454: fix: skip UTF-16 heuristic for audio/video/image MIME types (#7444)
by gavinbmoore · 2026-02-02
66.6%
#5401: fix(media-understanding): detect audio binary by magic bytes to pre...
by RiadJamal07 · 2026-01-31
65.3%
#19675: fix(security): prevent zero-width Unicode chars from bypassing boun...
by williamzujkowski · 2026-02-18
64.4%
#5588: fix(media): skip binary audio in file extraction to prevent false U...
by NSEvent · 2026-01-31
62.6%
#18811: fix(media): require file extension for ambiguous MEDIA: path detection
by aldoeliacim · 2026-02-17
61.9%
#23729: fix : normalize local file paths for Windows compatibility across m...
by jayy-77 · 2026-02-22
61.8%
#23312: fix(gateway): strip inbound metadata in chat history sanitization
by SidQin-cyber · 2026-02-22
61.4%