#23077: fix: block chunker breaks at arbitrary whitespace after minChars - AI assisted

by TarogStar open 2026-02-22 00:24

agents size: M

The block chunker's whitespace fallback was firing too early: as soon as the buffer exceeded `minChars` (e.g. 300), it fell back to splitting at any whitespace instead of waiting for `maxChars` (e.g. 800) to find proper structural break points. This caused mid-phrase breaks like "**Zoom | Access**" during token-by-token streaming in Teams, which broke formatting and made reading more difficult.

**Fix:** The whitespace fallback in `#pickBreakIndex()` now only fires when `buffer.length >= maxChars`, giving the preferred break type (sentence, paragraph, etc.) the full `minChars..maxChars` window to find a structural break.

## Summary

- **Problem**: Block streaming messages split mid-phrase at arbitrary whitespace past `minChars`, producing broken formatting like bold headers split across messages
- **Why it matters**: Broken formatting in Teams/Discord/Telegram makes bot responses hard to read
- **What changed**: In `#pickBreakIndex()`, the whitespace-only fallback threshold moved from `minChars` to `maxChars`. In `#pickSoftBreakIndex()`, removed the premature whitespace path. The preferred break type still gets first priority; preference semantics are unchanged.
- **What did NOT change**: No API changes, no changes to the coalescer or streaming pipeline. Break preference behavior is preserved.
- **New**: `breakFallbacks` config option for custom fallback chains. Default for paragraph mode: `["newline", "sentence"]` (matches pre-refactor behavior). Consolidated duplicate type aliases into `BreakPreferenceType`. Added schema help text.

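
For reviewers, the threshold change described above can be pictured with a minimal sketch. This is not the code in `pi-embedded-block-chunker.ts`: the function names, the `ChunkLimits` shape, the regex-based boundary detection, and the hard split at `maxChars` are all assumptions made for illustration.

```typescript
// Hypothetical, simplified stand-in for the chunker's break selection.
// Names, signatures, and boundary patterns are illustrative assumptions.
type BreakPreference = "paragraph" | "newline" | "sentence";

interface ChunkLimits {
  minChars: number;
  maxChars: number;
}

// Returns the index just past the last structural boundary of the given kind
// that ends inside the minChars..maxChars window, or -1 if there is none.
function lastStructuralBreak(
  buffer: string,
  kind: BreakPreference,
  limits: ChunkLimits,
): number {
  const window = buffer.slice(0, limits.maxChars);
  const pattern =
    kind === "paragraph" ? /\n\n/g : kind === "newline" ? /\n/g : /[.!?](?=\s)/g;
  let best = -1;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(window)) !== null) {
    const end = m.index + m[0].length;
    if (end >= limits.minChars) best = end;
  }
  return best;
}

// Returns the index at which to split the buffer, or -1 to keep accumulating.
function pickBreakIndex(
  buffer: string,
  preference: BreakPreference,
  limits: ChunkLimits,
): number {
  if (buffer.length < limits.minChars) return -1;

  // The preferred structural break always gets first priority.
  const structural = lastStructuralBreak(buffer, preference, limits);
  if (structural !== -1) return structural;

  // Whitespace fallback: gated on maxChars (the bug gated it on minChars).
  if (buffer.length >= limits.maxChars) {
    for (let i = limits.maxChars - 1; i >= limits.minChars; i--) {
      if (/\s/.test(buffer.charAt(i))) return i;
    }
    return limits.maxChars; // assumed hard split when no whitespace exists
  }
  return -1;
}
```

With `minChars: 300` and `maxChars: 800`, a 350-character buffer with no sentence boundary keeps accumulating in this sketch, whereas the pre-fix behavior would have split it at the nearest whitespace.
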
## Change Type (select all)

- [x] Bug fix
- [x] Refactor
- [ ] Feature
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [ ] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [x] Integrations
- [ ] API / contracts
- [x] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Related #579 (Signal chunking: same premature break behavior)
- Related #17790 (Telegram paragraph splitting: related chunker behavior)
- Related #21329 (Slack streaming truncation: may benefit from this fix)

## User-visible / Behavior Changes

- Streamed block messages now accumulate up to `maxChars` before falling back to whitespace breaks (previously fell back at `minChars`)
- Messages break at the preferred structural boundary (sentence, paragraph, newline) within the `minChars..maxChars` window, as originally intended
- No config changes needed: existing `blockStreamingChunk` settings work correctly now
- New `breakFallbacks` config option allows customizing the fallback chain per break preference

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `No`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`

## Repro + Verification

### Environment

- OS: Linux (WSL2)
- Runtime/container: Node v24.13.1
- Model/provider: qwen3-8b via LMStudio (local)
- Integration/channel: MS Teams
- Relevant config: `blockStreamingChunk: { breakPreference: "sentence" }`, `blockStreamingCoalesce: { minChars: 300, maxChars: 800, idleMs: 1500 }`

### Steps

1. Configure block streaming with `breakPreference: "sentence"`, `minChars: 300`, `maxChars: 800`
2. Ask the bot to summarize emails (produces multi-line formatted output with bold headers)
3. Observe how streamed messages are split in the channel

### Expected

Messages break at sentence boundaries within the 300-800 char window, never mid-phrase.

### Actual (before fix)

Messages split at arbitrary whitespace as soon as buffer exceeds 300 chars (minChars), ignoring the 800 char maxChars window. E.g. "**Zoom" in one message and "Access**" in the next.

## Evidence

- [x] Failing test/log before + passing after
- [x] Trace/log snippets

New test file `pi-embedded-block-chunker.sentence.test.ts` reproduces the exact email content that caused bad splits. Tests cover:

- Bulk append with force flush
- Token-by-token streaming (char-by-char, the real streaming scenario)
- Paragraph mode comparison
- breakFallbacks deduplication and ordering

## Human Verification (required)

- Verified scenarios: Real email summary output in MS Teams; messages now break at sentence boundaries instead of mid-phrase
- Edge cases checked: Token-by-token streaming (char-by-char append + drain), bulk append, fence code blocks, all three break preferences
- What I did **not** verify: Discord and Telegram channels; they use the same chunker so should benefit equally

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No` (new `breakFallbacks` option is optional with backward-compatible defaults)
- Migration needed? `No`

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly: Revert the commits on `pi-embedded-block-chunker.ts`
- Files/config to restore: `src/agents/pi-embedded-block-chunker.ts`
- Known bad symptoms: Messages accumulating too long without splitting (would indicate maxChars threshold too high in user config)

## Risks and Mitigations

- Risk: Messages may accumulate slightly longer before the first split (up to `maxChars` instead of `minChars` before whitespace fallback)
  - Mitigation: This is the intended behavior, since `maxChars` is the configured upper bound. Users who want more frequent splits can lower `maxChars`.

Opus 4.6 assisted
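
As a reviewer aid, the `breakFallbacks` deduplication and ordering listed under Evidence might look roughly like the sketch below. The helper `resolveBreakChain`, the `DEFAULT_FALLBACKS` table, and the three members of `BreakPreferenceType` are assumptions for illustration; only the paragraph default `["newline", "sentence"]` comes from this PR, and the empty defaults for the other preferences are placeholders.

```typescript
// Hypothetical sketch; the real type and defaults live in the chunker source.
type BreakPreferenceType = "paragraph" | "newline" | "sentence";

// Only the paragraph entry is stated in the PR description.
// The newline and sentence entries are placeholder assumptions.
const DEFAULT_FALLBACKS: Record<BreakPreferenceType, BreakPreferenceType[]> = {
  paragraph: ["newline", "sentence"],
  newline: [],
  sentence: [],
};

// Builds the ordered list of break types to try: the preferred type first,
// then the configured (or default) fallbacks, with duplicates removed so
// that each type is attempted at most once, in first-seen order.
function resolveBreakChain(
  preference: BreakPreferenceType,
  breakFallbacks?: BreakPreferenceType[],
): BreakPreferenceType[] {
  const chain = [preference, ...(breakFallbacks ?? DEFAULT_FALLBACKS[preference])];
  return chain.filter((kind, index) => chain.indexOf(kind) === index);
}
```

In this sketch, a custom chain that repeats the preferred type or lists a fallback twice collapses to a single entry per type, so the preferred break always stays first.
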