
#17686: fix(memory): support non-ASCII characters in FTS query tokenization

by Phineas1500 open 2026-02-16 02:51
stale size: S
## Summary

Fixes #17672. `buildFtsQuery()` uses `/[A-Za-z0-9_]+/g` to tokenize search queries, which only matches ASCII. Any query in CJK or other non-Latin scripts extracts 0 tokens, causing FTS (BM25 keyword search) to be completely skipped.

- Changes the regex to `/[\p{L}\p{N}_]+/gu` (Unicode property escapes) so characters from all scripts are tokenized correctly
- Adds test cases for Chinese, Japanese, Korean, and mixed CJK+English queries

## AI Disclosure

AI-assisted (Claude). Fully tested: build, lint, format, and all unit tests pass.

## Test plan

- [x] `pnpm build` passes
- [x] `pnpm check` passes (lint + format)
- [x] Existing `buildFtsQuery` tests still pass (ASCII behavior unchanged)
- [x] New tests pass: `金银价格` → `"金银价格"`, `hello 世界` → `"hello" AND "世界"`, Japanese/Korean queries tokenized correctly

<h3>Greptile Summary</h3>

Fixed Unicode tokenization in FTS queries and prevented TUI crashes from overflowing lines.

**Memory FTS fix**: Changed `buildFtsQuery()` regex from `/[A-Za-z0-9_]+/g` to `/[\p{L}\p{N}_]+/gu` to properly tokenize CJK and non-Latin scripts. Previously, queries in Chinese, Japanese, Korean, etc. would extract 0 tokens and skip FTS entirely. The new regex uses Unicode property escapes with the `u` flag to match letters and numbers from all scripts.

**TUI crash fix**: Added `SafeContainer` wrapper that truncates lines exceeding terminal width before pi-tui's render verification. This prevents hard crashes when rendered content is even 1 character over width. Uses pi-tui's own `visibleWidth()` and `truncateToWidth()` utilities for consistency.

Both fixes include comprehensive test coverage for ASCII, CJK text, and edge cases.

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with minimal risk
- Both fixes are well-tested, isolated changes that address specific bugs. The memory FTS regex change is a drop-in Unicode-aware replacement with identical behavior for ASCII. The SafeContainer is a defensive wrapper that only truncates when necessary, using pi-tui's own utilities. All existing tests pass and new tests comprehensively cover the fixed behavior.
- No files require special attention

<sub>Last reviewed commit: 80760bd</sub>
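
For readers who want to see the tokenization change in isolation, here is a minimal sketch of the before/after behavior. The two regexes are the ones quoted in the summary; everything else is an assumption made for illustration: the function names `asciiOnlyTokens` and `buildFtsQuerySketch` are hypothetical, the `null` return for an empty token list is a guess at how "FTS is skipped" might be signalled, and the quoting and `AND` joining are inferred only from the expected outputs listed in the test plan. The real `buildFtsQuery()` may differ.

```typescript
// Old tokenizer from the PR description: matches ASCII letters, digits, and underscore only.
function asciiOnlyTokens(raw: string): string[] {
  return raw.match(/[A-Za-z0-9_]+/g) ?? [];
}

// Hypothetical stand-in for buildFtsQuery(), using the new regex from the PR.
// \p{L} matches letters and \p{N} matches numbers in any script; the `u` flag
// is required for Unicode property escapes to be recognized.
function buildFtsQuerySketch(raw: string): string | null {
  const tokens = raw.match(/[\p{L}\p{N}_]+/gu) ?? [];
  if (tokens.length === 0) {
    // Assumption: no tokens means there is nothing to search for.
    return null;
  }
  // Assumption: each token is double-quoted and tokens are joined with AND,
  // matching the expected outputs in the test plan.
  return tokens.map((t) => `"${t}"`).join(" AND ");
}

console.log(asciiOnlyTokens("金银价格"));        // [] (the old regex finds nothing)
console.log(buildFtsQuerySketch("金银价格"));    // "金银价格"
console.log(buildFtsQuerySketch("hello 世界"));  // "hello" AND "世界"
```

Because the character class excludes the double quote, tokens produced this way never contain a `"`, so the sketch does no escaping before wrapping them.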
