#16411: fix(agents): support CJK sentence punctuation in block chunker
agents
size: XS
## Summary
- fix(agents): support CJK sentence punctuation in block chunker
- Split from our `v2026.2.13` patch train as a single-purpose change for easier review.
## Why
- Keep the diff focused and low-risk so it can be merged or reverted independently.
## Scope
- Branch: `fix/chunker-cjk-sentence-boundaries-en`
- Files changed: 2
- Key files:
  - `src/agents/pi-embedded-block-chunker.e2e.test.ts`
  - `src/agents/pi-embedded-block-chunker.ts`
## Test Plan
- Suggested local command:
  - `./node_modules/.bin/vitest run src/agents/pi-embedded-block-chunker.e2e.test.ts`
- Validation status:
  - [ ] CI checks pass
  - [ ] Maintainer re-ran local tests
## Risk & Rollback
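- Risk: low to medium; impact is limited to the block chunker module. The one behavioral change for non-CJK text is that ASCII `;` now counts as a sentence boundary.
- Rollback: revert this PR's commit(s); the change is self-contained.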
## Co-authorship
- Co-authored by @ciberponk and Codex (GPT-5).
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR extends the block chunker's sentence-boundary detection to support CJK (Chinese, Japanese, Korean) punctuation marks (`。！？；…`). Previously, only Latin punctuation (`.!?`) followed by whitespace or end-of-string triggered sentence breaks. CJK text doesn't use inter-sentence spaces, so these characters are matched without a whitespace lookahead.
- Extracted the sentence boundary regex into a shared module-level constant `SENTENCE_BOUNDARY_RE`
- Extended the regex to match CJK sentence-ending punctuation (`。！？；…`) without requiring trailing whitespace
- Added ASCII `;` (semicolon) to the Latin punctuation group (a minor behavioral change for English text)
- Both `#pickSoftBreakIndex` and `#pickBreakIndex` now use the shared regex
- Added an e2e test validating chunking on CJK punctuation in sentence mode
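
The two-branch matching described above can be sketched as follows. This is a minimal illustration, not the code from the PR: the exact pattern behind `SENTENCE_BOUNDARY_RE` is not shown in this description, so the regex below is an assumed reconstruction (Latin marks with a whitespace or end-of-string lookahead, fullwidth CJK marks without one), and `sentenceBreakIndices` is a hypothetical helper.

```typescript
// Assumed reconstruction of the shared constant; the real pattern in
// pi-embedded-block-chunker.ts may differ.
const SENTENCE_BOUNDARY_RE = /[.!?;](?=\s|$)|[。！？；…]/g;

// Hypothetical helper: returns the index just past each sentence-ending mark,
// i.e. the positions where a chunker could break in sentence mode.
function sentenceBreakIndices(text: string): number[] {
  const indices: number[] = [];
  // matchAll works on a copy of the regex, so the shared constant's
  // lastIndex is never mutated between calls.
  for (const match of text.matchAll(SENTENCE_BOUNDARY_RE)) {
    indices.push((match.index ?? 0) + match[0].length);
  }
  return indices;
}

console.log(sentenceBreakIndices("Hello world. Next one!")); // [12, 22]
console.log(sentenceBreakIndices("你好。世界！")); // [3, 6]
console.log(sentenceBreakIndices("v1.2 is out")); // []
```

Keeping the lookahead on the Latin branch is what stops a `.` inside a token such as `v1.2` from being treated as a sentence end, while the CJK branch can match unconditionally because those marks are not normally followed by a space.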
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge — it's a small, well-scoped regex extension with correct test coverage.
- The change is minimal (3 lines of production code) and focused on a single concern: extending sentence boundary detection to CJK punctuation. The regex is correctly structured, with separate handling for Latin punctuation (with a whitespace lookahead) and CJK punctuation (without one). The test validates the expected chunking behavior. All of the added punctuation characters are in the Basic Multilingual Plane (BMP), so there are no surrogate-pair concerns. The drain loop's buffer-length guards correctly prevent the last segment from being emitted prematurely.
- No files require special attention.
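
The BMP claim can be checked with a short sketch. It assumes the added marks are the fullwidth forms `。！？；` plus `…`; the `CJK_PUNCTUATION` list and the `isSingleCodeUnit` helper are hypothetical names used only for illustration.

```typescript
// Assumed set of added marks; not copied from the PR diff.
const CJK_PUNCTUATION: string[] = ["。", "！", "？", "；", "…"];

// Hypothetical helper: true when the string is exactly one UTF-16 code unit,
// which holds for any single BMP character and fails for surrogate pairs.
function isSingleCodeUnit(ch: string): boolean {
  const codePoint = ch.codePointAt(0);
  return ch.length === 1 && codePoint !== undefined && codePoint <= 0xffff;
}

console.log(CJK_PUNCTUATION.every(isSingleCodeUnit)); // true
console.log(isSingleCodeUnit("😀")); // false (astral character, two code units)
```

Because each mark occupies one code unit, a break index computed as `match.index + 1` cannot land in the middle of a character.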
<sub>Last reviewed commit: fa257f9</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->