
#16411: fix(agents): support CJK sentence punctuation in block chunker

by ciberponk open 2026-02-14 18:41
agents size: XS
## Summary

- fix(agents): support CJK sentence punctuation in block chunker
- Split from our `v2026.2.13` patch train as a single-purpose change for easier review.

## Why

- Keep the diff focused and low-risk so it can be merged or reverted independently.

## Scope

- Branch: `fix/chunker-cjk-sentence-boundaries-en`
- Files changed: 2
- Key files:
  - `src/agents/pi-embedded-block-chunker.e2e.test.ts`
  - `src/agents/pi-embedded-block-chunker.ts`

## Test Plan

- Suggested local command:
  - `./node_modules/.bin/vitest run src/agents/pi-embedded-block-chunker.e2e.test.ts`
- Validation status:
  - [ ] CI checks pass
  - [ ] Maintainer re-ran local tests

## Risk & Rollback

- Risk: low to medium; impact limited to touched module(s).
- Rollback: revert this PR commit(s) cleanly.

## Co-authorship

- Co-authored by @ciberponk and Codex (GPT-5).

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

This PR extends the block chunker's sentence-boundary detection to support CJK (Chinese, Japanese, Korean) punctuation marks (`。!?;…`). Previously, only Latin punctuation (`.!?`) followed by whitespace or end-of-string triggered sentence breaks. CJK text doesn't use inter-sentence spaces, so these characters are matched without a whitespace lookahead.

- Extracted the sentence boundary regex into a shared module-level constant `SENTENCE_BOUNDARY_RE`
- Extended the regex to match CJK sentence-ending punctuation (`。!?;…`) without requiring trailing whitespace
- Added ASCII `;` (semicolon) to the Latin punctuation group (a minor behavioral change for English text)
- Both `#pickSoftBreakIndex` and `#pickBreakIndex` now use the shared regex
- Added an e2e test validating chunking on CJK punctuation in sentence mode

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge: it's a small, well-scoped regex extension with correct test coverage.
- The change is minimal (3 lines of production code) and focused on a single concern: extending sentence boundary detection to CJK punctuation. The regex is correctly structured with separate handling for Latin (with whitespace lookahead) and CJK (without). The test validates the expected chunking behavior. All BMP characters are used so there are no surrogate pair concerns. The drain loop's buffer-length guards correctly prevent the last segment from being emitted prematurely.
- No files require special attention.

<sub>Last reviewed commit: fa257f9</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->
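
For reviewers who want to see the boundary rule concretely, here is a minimal sketch of the behavior the summary describes. It is not the code from this PR: the actual `SENTENCE_BOUNDARY_RE` pattern is not shown on this page, so `SENTENCE_BOUNDARY_SKETCH` and `pickSentenceBreak` are hypothetical names, and the CJK set is assumed to be the ideographic full stop, the fullwidth `!`, `?`, `;`, and the horizontal ellipsis.

```typescript
// Hypothetical reconstruction of the rule described in the summary; the real
// SENTENCE_BOUNDARY_RE in pi-embedded-block-chunker.ts may differ.
// Left branch: Latin punctuation, only when followed by whitespace or end-of-string.
// Right branch: assumed CJK marks (U+3002, U+FF01, U+FF1F, U+FF1B, U+2026), no lookahead.
const SENTENCE_BOUNDARY_SKETCH = /[.!?;](?=\s|$)|[\u3002\uFF01\uFF1F\uFF1B\u2026]/;

// Returns the offset just past the last sentence boundary whose end falls in
// [minChars, maxChars], or -1 when there is none. Hypothetical helper, not the
// PR's #pickBreakIndex.
function pickSentenceBreak(buffer: string, minChars: number, maxChars: number): number {
  // Fresh global regex per call so no lastIndex state leaks between calls.
  const re = new RegExp(SENTENCE_BOUNDARY_SKETCH.source, "g");
  let best = -1;
  let match: RegExpExecArray | null;
  // Scan the full buffer (not a slice) so `$` only matches the true end of input.
  while ((match = re.exec(buffer)) !== null) {
    const end = match.index + match[0].length;
    if (end > maxChars) break;
    if (end >= minChars) best = end;
  }
  return best;
}
```

With this sketch, `"Hello world. Next one"` breaks after the period because a space follows it, `"v1.2 is out"` yields no break because the period is followed by a digit, and a CJK sentence breaks directly after `\u3002` even though the next character is not whitespace.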
