#17686: fix(memory): support non-ASCII characters in FTS query tokenization
stale
size: S
## Summary
Fixes #17672. `buildFtsQuery()` tokenizes search queries with `/[A-Za-z0-9_]+/g`, which matches only ASCII letters, digits, and underscores. A query written entirely in CJK or another non-Latin script therefore yields zero tokens, and FTS (BM25 keyword search) is skipped completely.
- Changes the regex to `/[\p{L}\p{N}_]+/gu` (Unicode property escapes, which require the `u` flag) so that letters and digits from any script are tokenized
- Adds test cases for Chinese, Japanese, Korean, and mixed CJK+English queries
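
The effect of the regex change can be illustrated with a minimal sketch. It is not the actual implementation: `buildFtsQuerySketch` is a hypothetical stand-in, and the quoting and `AND` joining are inferred from the expected outputs listed in the test plan. The patterns are built with the `RegExp` constructor but are equivalent to the two literals quoted above.

```typescript
// Hypothetical sketch; the real buildFtsQuery() may differ in details.
// Equivalent to the literals /[A-Za-z0-9_]+/g and /[\p{L}\p{N}_]+/gu.
const ASCII_TOKEN = new RegExp("[A-Za-z0-9_]+", "g");
const UNICODE_TOKEN = new RegExp("[\\p{L}\\p{N}_]+", "gu");

function buildFtsQuerySketch(
  raw: string,
  tokenPattern: RegExp = UNICODE_TOKEN,
): string | null {
  // With a global pattern, match() returns every token, or null if none.
  const tokens = raw.match(tokenPattern) ?? [];
  // No tokens means there is nothing to hand to FTS.
  if (tokens.length === 0) return null;
  // Quote each token and require all of them (assumed from the test plan).
  return tokens.map((t) => `"${t}"`).join(" AND ");
}

// Old pattern: a pure-CJK query produces no tokens at all.
console.log(buildFtsQuerySketch("金银价格", ASCII_TOKEN));
// New pattern: the same query becomes a single quoted token.
console.log(buildFtsQuerySketch("金银价格"));
// Mixed input is split on whitespace into two tokens.
console.log(buildFtsQuerySketch("hello 世界"));
```

Because `\p{L}` and `\p{N}` include the ASCII letters and digits, pure-ASCII queries tokenize the same way under both patterns.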
## AI Disclosure
AI-assisted (Claude). Fully tested — build, lint, format, and all unit tests pass.
## Test plan
- [x] `pnpm build` passes
- [x] `pnpm check` passes (lint + format)
- [x] Existing `buildFtsQuery` tests still pass (ASCII behavior unchanged)
- [x] New tests pass: `金银价格` → `"金银价格"`, `hello 世界` → `"hello" AND "世界"`, Japanese/Korean queries tokenized correctly
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Fixed Unicode tokenization in FTS queries and prevented TUI crashes from overflowing lines.
**Memory FTS fix**: Changed `buildFtsQuery()` regex from `/[A-Za-z0-9_]+/g` to `/[\p{L}\p{N}_]+/gu` to properly tokenize CJK and non-Latin scripts. Previously, queries in Chinese, Japanese, Korean, etc. would extract 0 tokens and skip FTS entirely. The new regex uses Unicode property escapes with the `u` flag to match letters and numbers from all scripts.
**TUI crash fix**: Added `SafeContainer` wrapper that truncates lines exceeding terminal width before pi-tui's render verification. This prevents hard crashes when rendered content is even 1 character over width. Uses pi-tui's own `visibleWidth()` and `truncateToWidth()` utilities for consistency.
Both fixes include comprehensive test coverage for ASCII, CJK text, and edge cases.
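
The truncation idea behind the TUI crash fix can be sketched as follows. This is only an illustration under stated assumptions: `clampLines` is a hypothetical helper, the width and truncation functions are injected because pi-tui's real `visibleWidth()` and `truncateToWidth()` signatures are not shown here, and the naive stand-ins handle plain ASCII only (no ANSI escapes, no wide CJK cells).

```typescript
// Assumed shapes for the width utilities; pi-tui's real signatures may differ.
type WidthFn = (text: string) => number;
type TruncateFn = (text: string, maxWidth: number) => string;

// Hypothetical helper: leave lines that fit untouched, truncate the rest.
function clampLines(
  lines: string[],
  width: number,
  visibleWidth: WidthFn,
  truncateToWidth: TruncateFn,
): string[] {
  return lines.map((line) =>
    visibleWidth(line) > width ? truncateToWidth(line, width) : line,
  );
}

// Naive ASCII-only stand-ins, for illustration only.
const naiveWidth: WidthFn = (text) => text.length;
const naiveTruncate: TruncateFn = (text, maxWidth) => text.slice(0, maxWidth);

// Only the first line exceeds the width of 4 and is shortened.
console.log(clampLines(["abcdef", "ab"], 4, naiveWidth, naiveTruncate));
```

Injecting the two functions keeps the sketch independent of any particular width implementation, which matters because the visible width of a string depends on escape sequences and double-width characters.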
<h3>Confidence Score: 5/5</h3>
- This PR is safe to merge with minimal risk
- Both fixes are well-tested, isolated changes that address specific bugs. The memory FTS regex change is a drop-in Unicode-aware replacement with identical behavior for ASCII. The SafeContainer is a defensive wrapper that only truncates when necessary, using pi-tui's own utilities. All existing tests pass and new tests comprehensively cover the fixed behavior.
- No files require special attention
<sub>Last reviewed commit: 80760bd</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #8504: fix: prevent false positives in isSilentReplyText for CJK content (by hanxiao · 2026-02-04 · 76.8% similar)
- #16096: fix(i18n): use Unicode-aware word boundaries for non-ASCII language... (by PeterRosdahl · 2026-02-14 · 75.2% similar)
- #4479: fix(tui): prevent crash when search matches ANSI escape sequences (by bee4come · 2026-01-30 · 74.9% similar)
- #15339: fix: BM25 score normalization and FTS5 query join operator (by echoVic · 2026-02-13 · 74.9% similar)
- #6260: fix(tui): prevent width overflow crashes from nested ANSI escape codes (by 0xktn · 2026-02-01 · 74.6% similar)
- #19920: fix(memory): populate FTS index in FTS-only mode so search returns ... (by forketyfork · 2026-02-18 · 74.4% similar)
- #20516: fix(tui): preserve streamed text on finalize for pure text responses (by MisterGuy420 · 2026-02-19 · 74.2% similar)
- #8706: fix(memory): fall back to better-sqlite3 when node:sqlite lacks FTS5 (by ElmerProject · 2026-02-04 · 74.1% similar)
- #16894: Fix text truncation splitting surrogate pairs in web-fetch, subagen... (by Clawborn · 2026-02-15 · 73.9% similar)
- #19726: Fix HTML entity decoding for astral code points and surrogate-safe ... (by Clawborn · 2026-02-18 · 73.6% similar)