
#9089: feat(memory): semantic markdown chunking for better recall

by Kuroro92100 open 2026-02-04 21:19
stale
## Summary

- Replace fixed character-based chunking with **semantic chunking** that respects markdown structure
- **Preserves code blocks** intact (never splits mid-code)
- **Keeps lists** together when possible
- **Adds header context** to chunks (e.g. `## API > ### Auth` prefix) for better recall
- Splits by natural boundaries: sections → paragraphs → sentences
- Legacy chunking still available via `semantic: false` option

### Problem

The current `chunkMarkdown` function splits content at arbitrary character boundaries (~1600 chars). This means:

- Code blocks get split mid-function
- Chunks lose their section context (a chunk might say "Deadline: Sept 2025" without knowing it's from "## PSG Facturation")
- Lists get broken apart
- `memory_search` returns fragments that require `memory_get` follow-ups to understand

### Solution

New `chunkMarkdownSemantic()` function that:

1. Parses markdown into blocks (headers, code, lists, paragraphs)
2. Tracks active header hierarchy for context
3. Prefixes each chunk with its header breadcrumb
4. Never splits code blocks
5. Keeps lists together when they fit
6. Falls back to sentence-level splitting for oversized paragraphs

### Backwards Compatibility

- `chunkMarkdown()` uses semantic chunking by default
- Pass `{ semantic: false }` to get the legacy behavior
- Old function preserved as `chunkMarkdownLegacy()`

## Test plan

- [x] 25 tests passing (19 new)
- [x] Header context propagation
- [x] Code block preservation
- [x] List grouping
- [x] Paragraph/sentence splitting
- [x] Empty content and edge cases
- [x] Legacy mode backwards compatibility
- [x] Lint clean (oxlint + oxfmt)

## Greptile Overview

### Greptile Summary

This PR replaces the legacy fixed-character markdown chunking with a new semantic chunker that parses markdown into headers/code/lists/paragraphs, carries a header “breadcrumb” as context, and avoids splitting code blocks and (when possible) lists.

`chunkMarkdown()` now defaults to semantic mode, with the previous behavior kept as `chunkMarkdownLegacy()` and selectable via `{ semantic: false }`. The test suite is expanded to cover block parsing, context propagation, and legacy compatibility.

### Confidence Score: 3/5

- This PR has clear functional intent, but introduces metadata/behavior regressions that should be fixed before merge.
- Semantic chunking is well-covered by tests, but there are definite correctness issues in chunk metadata (sentence-split startLine, list endLine) and a behavior regression where `overlap` is accepted but ignored in the new default path.
- src/memory/internal.ts
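To make the approach in the Solution section concrete, here is a minimal sketch of breadcrumb-prefixed chunking that never splits fenced code blocks. It is not the PR's implementation: `chunkMarkdownSketch`, the `Chunk` shape, and the packing rule are hypothetical names and choices made for illustration. Sentence-level splitting and `overlap` handling are deliberately left out, and a list stays together only because its items form one blank-line-delimited block.

```typescript
// Illustrative sketch only; names and behavior are assumptions, not the PR's code.
interface Chunk {
  text: string;
  startLine: number; // 1-based line of the first block in the chunk
  endLine: number; // 1-based line of the last block in the chunk
}

function chunkMarkdownSketch(markdown: string, maxChars = 1600): Chunk[] {
  const chunks: Chunk[] = [];
  const headers: string[] = []; // active header stack, index = level - 1
  let pending: string[] = []; // blocks collected for the current chunk
  let pendingStart = 0;
  let pendingEnd = 0;
  let block: string[] = []; // lines of the block being read
  let blockStart = 0;
  let inFence = false;

  // Emit the collected blocks as one chunk, prefixed with the header breadcrumb.
  const flushChunk = (): void => {
    if (pending.length === 0) return;
    const crumb = headers.filter(Boolean).join(" > ");
    const body = pending.join("\n\n");
    chunks.push({
      text: crumb ? `${crumb}\n\n${body}` : body,
      startLine: pendingStart,
      endLine: pendingEnd,
    });
    pending = [];
  };

  // Finish the current block; start a new chunk first if it would not fit.
  const closeBlock = (endLine: number): void => {
    if (block.length === 0) return;
    const text = block.join("\n");
    const size = pending.join("\n\n").length + text.length;
    if (pending.length > 0 && size > maxChars) flushChunk();
    if (pending.length === 0) pendingStart = blockStart;
    pending.push(text);
    pendingEnd = endLine;
    block = [];
  };

  const lines = markdown.split("\n");
  lines.forEach((line, i) => {
    const lineNo = i + 1;
    const isFence = line.trimStart().startsWith("```");
    const header = inFence ? null : /^(#{1,6})\s+(.*)$/.exec(line);
    if (isFence) {
      if (!inFence) {
        closeBlock(lineNo - 1); // a paragraph may end right before the fence
        blockStart = lineNo;
      }
      block.push(line);
      inFence = !inFence;
      if (!inFence) closeBlock(lineNo); // closing fence ends the code block
    } else if (inFence) {
      block.push(line); // blank lines and '#' inside code stay in the block
    } else if (header) {
      closeBlock(lineNo - 1);
      flushChunk(); // chunks never span a header boundary
      const level = header[1].length;
      headers.length = level - 1; // drop headers at this level and deeper
      headers[level - 1] = line.trim();
    } else if (line.trim() === "") {
      closeBlock(lineNo - 1); // blank line ends a paragraph or list
    } else {
      if (block.length === 0) blockStart = lineNo;
      block.push(line);
    }
  });
  closeBlock(lines.length);
  flushChunk();
  return chunks;
}
```

In this sketch a code block larger than `maxChars` is emitted as a single oversized chunk instead of being split, which matches the "never splits code blocks" goal at the cost of uneven chunk sizes. Line numbers are recorded per block, so `startLine` and `endLine` refer to the source lines of the body, not to the added breadcrumb prefix.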
