#9089: feat(memory): semantic markdown chunking for better recall
stale
Cluster: Memory Management Enhancements
## Summary
- Replace fixed character-based chunking with **semantic chunking** that respects markdown structure
- **Preserves code blocks** intact (never splits mid-code)
- **Keeps lists** together when possible
- **Adds header context** to chunks (e.g. `## API > ### Auth` prefix) for better recall
- Splits by natural boundaries: sections → paragraphs → sentences
- Legacy chunking still available via `semantic: false` option
### Problem
The current `chunkMarkdown` function splits content at arbitrary character boundaries (~1600 chars). This means:
- Code blocks get split mid-function
- Chunks lose their section context (a chunk might say "Deadline: Sept 2025" with nothing to show that it belongs to "## PSG Facturation")
- Lists get broken apart
- `memory_search` returns fragments that require `memory_get` follow-ups to understand
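The failure mode above is easy to reproduce with a toy splitter. This is a minimal sketch, not the project's legacy code: `chunkByChars` and the sample document are hypothetical, and the 60-character limit is chosen only to keep the example short (the description cites roughly 1600 characters).

```typescript
// Hypothetical stand-in for fixed-size chunking: cut every `maxChars`
// characters with no awareness of markdown structure.
function chunkByChars(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

// Build the fence marker programmatically so this example can itself
// live inside a fenced block.
const FENCE = "`".repeat(3);

const sample = [
  "## PSG Facturation",
  "",
  "Deadline: Sept 2025",
  "",
  FENCE + "ts",
  "function total(a: number, b: number) {",
  "  return a + b;",
  "}",
  FENCE,
].join("\n");

const pieces = chunkByChars(sample, 60);

// A chunk holding an odd number of fence markers contains an
// unterminated (or unopened) code block.
const brokenFences = pieces.filter((p) => (p.split(FENCE).length - 1) % 2 === 1);

// Chunks that no longer carry the section header they belong to.
const orphaned = pieces.filter((p) => !p.includes("PSG Facturation"));

console.log({ total: pieces.length, brokenFences: brokenFences.length, orphaned: orphaned.length });
```

With this input the cut lands inside the function body, so the code block ends up spread across two chunks and the later chunk has no trace of its section header.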
### Solution
New `chunkMarkdownSemantic()` function that:
1. Parses markdown into blocks (headers, code, lists, paragraphs)
2. Tracks active header hierarchy for context
3. Prefixes each chunk with its header breadcrumb
4. Never splits code blocks
5. Keeps lists together when they fit
6. Falls back to sentence-level splitting for oversized paragraphs
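The six steps above can be sketched roughly as follows. This is an illustrative approximation, not the implementation in `src/memory/internal.ts`: `parseBlocks`, `splitSentences`, `chunkSemantic`, the `MdBlock` shape, and the character-based size accounting are all assumptions, and the parser handles only a small subset of markdown (for example, it does not check that a closing fence matches the opening one).

```typescript
// Hypothetical sketch: every name and shape here is an assumption.
type BlockKind = "header" | "code" | "list" | "paragraph";
interface MdBlock { kind: BlockKind; text: string; level?: number }

const FENCE_RE = /^\s*(`{3}|~{3})/;
const HEADER_RE = /^(#{1,6})\s+(.*)$/;
const LIST_RE = /^\s*([-*+]|\d+\.)\s+/;

// Step 1: parse markdown into header, code, list, and paragraph blocks.
function parseBlocks(markdown: string): MdBlock[] {
  const lines = markdown.split("\n");
  const blocks: MdBlock[] = [];
  let i = 0;
  while (i < lines.length) {
    const line = lines[i];
    if (line.trim() === "") { i++; continue; }
    const header = HEADER_RE.exec(line);
    if (header) {
      blocks.push({ kind: "header", text: line.trim(), level: header[1].length });
      i++;
      continue;
    }
    const start = i;
    if (FENCE_RE.test(line)) {
      // Consume through the closing fence (or end of input if unterminated).
      i++;
      while (i < lines.length && !FENCE_RE.test(lines[i])) i++;
      i = Math.min(i + 1, lines.length);
      blocks.push({ kind: "code", text: lines.slice(start, i).join("\n") });
      continue;
    }
    // A list or paragraph runs until a blank line, header, or fence.
    const kind: BlockKind = LIST_RE.test(line) ? "list" : "paragraph";
    while (i < lines.length && lines[i].trim() !== "" &&
           !HEADER_RE.test(lines[i]) && !FENCE_RE.test(lines[i])) i++;
    blocks.push({ kind, text: lines.slice(start, i).join("\n") });
  }
  return blocks;
}

// Step 6: greedily pack sentences into pieces of at most `maxChars`
// (a single sentence longer than the limit is kept whole).
function splitSentences(text: string, maxChars: number): string[] {
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [text]).map((s) => s.trim());
  const pieces: string[] = [];
  let current = "";
  for (const sentence of sentences) {
    if (sentence === "") continue;
    if (current && current.length + 1 + sentence.length > maxChars) {
      pieces.push(current);
      current = sentence;
    } else {
      current = current ? current + " " + sentence : sentence;
    }
  }
  if (current) pieces.push(current);
  return pieces;
}

function chunkSemantic(markdown: string, maxChars: number): string[] {
  const chunks: string[] = [];
  const headers: string[] = []; // headers[level - 1] holds the active header line
  let body: string[] = [];
  let size = 0; // body characters only; breadcrumb and separators are not counted

  const flush = () => {
    if (body.length === 0) return;
    // Step 3: prefix the chunk with its header breadcrumb.
    const crumb = headers.filter(Boolean).join(" > ");
    chunks.push((crumb ? crumb + "\n\n" : "") + body.join("\n\n"));
    body = [];
    size = 0;
  };
  const add = (text: string) => {
    if (size > 0 && size + text.length > maxChars) flush();
    body.push(text);
    size += text.length;
  };

  for (const block of parseBlocks(markdown)) {
    if (block.kind === "header") {
      flush(); // a header starts a new section
      // Step 2: drop headers at this level or deeper, then record this one.
      const level = block.level ?? 1;
      headers.length = level - 1;
      headers[level - 1] = block.text;
    } else if (block.kind === "paragraph" && block.text.length > maxChars) {
      for (const piece of splitSentences(block.text, maxChars)) add(piece);
    } else {
      // Steps 4 and 5: code and lists are added whole, even when oversized.
      add(block.text);
    }
  }
  flush();
  return chunks;
}
```

In this sketch, a list under `### Auth` inside `## API` comes out as a chunk beginning with `## API > ### Auth`, matching the breadcrumb format mentioned in the summary. Keeping oversized code blocks whole trades a strict size limit for intact code; a real implementation may also need to track line ranges per chunk, which this sketch omits.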
### Backwards Compatibility
- `chunkMarkdown()` uses semantic chunking by default
- Pass `{ semantic: false }` to get the legacy behavior
- Old function preserved as `chunkMarkdownLegacy()`
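The opt-out can be pictured as a thin dispatcher. This is a sketch under assumptions: only the function names and the `semantic` flag come from the description above, while the `ChunkOptions` type, the parameter order, the `string[]` return type, and the placeholder bodies are hypothetical.

```typescript
// Hypothetical options shape; only `semantic` is named in the PR description.
interface ChunkOptions {
  semantic?: boolean;
}

// Placeholder bodies so the dispatch can be exercised on its own.
// The real functions return actual chunks, not tagged strings.
function chunkMarkdownLegacy(content: string): string[] {
  return ["legacy:" + content];
}

function chunkMarkdownSemantic(content: string): string[] {
  return ["semantic:" + content];
}

// Semantic chunking is the default; `{ semantic: false }` selects the old path.
function chunkMarkdown(content: string, options: ChunkOptions = {}): string[] {
  const { semantic = true } = options;
  return semantic ? chunkMarkdownSemantic(content) : chunkMarkdownLegacy(content);
}

console.log(chunkMarkdown("# Notes"));
console.log(chunkMarkdown("# Notes", { semantic: false }));
```

Because the default changes, callers that relied on the old chunk boundaries would need to pass `{ semantic: false }` explicitly to keep them.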
## Test plan
- [x] 25 tests passing (19 new)
- [x] Header context propagation
- [x] Code block preservation
- [x] List grouping
- [x] Paragraph/sentence splitting
- [x] Empty content and edge cases
- [x] Legacy mode backwards compatibility
- [x] Lint clean (oxlint + oxfmt)
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR replaces the legacy fixed-character markdown chunking with a new semantic chunker that parses markdown into headers/code/lists/paragraphs, carries a header “breadcrumb” as context, and avoids splitting code blocks and (when possible) lists. `chunkMarkdown()` now defaults to semantic mode, with the previous behavior kept as `chunkMarkdownLegacy()` and selectable via `{ semantic: false }`. The test suite is expanded to cover block parsing, context propagation, and legacy compatibility.
<h3>Confidence Score: 3/5</h3>
- This PR has clear functional intent, but introduces metadata/behavior regressions that should be fixed before merge.
- Semantic chunking is well-covered by tests, but there are definite correctness issues in chunk metadata (sentence-split startLine, list endLine) and a behavior regression where `overlap` is accepted but ignored in the new default path.
- src/memory/internal.ts
<!-- /greptile_comment -->
Most Similar PRs
- #12737: feat: add maxLines option for memory chunk splitting (by fastroc · 2026-02-09 · 75.8%)
- #14402: fix(feishu): chunk large documents for write/append to avoid API 40... (by lml2468 · 2026-02-12 · 73.6%)
- #18919: feat: importance-weighted temporal decay for memory search (by ruypang · 2026-02-17 · 73.3%)
- #19967: feat(memory): add semantic clustering and enhanced MMR (by alihassan6520 · 2026-02-18 · 73.1%)
- #15251: feat(web-fetch): send Accept: text/markdown header for Cloudflare M... (by wujieli0207 · 2026-02-13 · 73.0%)
- #15307: fix(memory): handle mixed/no-results QMD query output (by MohammadErfan-Jabbari · 2026-02-13 · 73.0%)
- #21217: fix: memory prune command to prevent unbounded MEMORY.md growth (by theognis1002 · 2026-02-19 · 72.8%)
- #20795: fix(markdown): prevent triple newlines after blockquotes (by novalis133 · 2026-02-19 · 72.5%)
- #18655: fix(mattermost): preserve markdown formatting and native tables (by echo931 · 2026-02-16 · 71.7%)
- #10612: fix: trim leading blank lines on first emitted chunk only (#5530) (by 1kuna · 2026-02-06 · 71.4%)