
#16096: fix(i18n): use Unicode-aware word boundaries for non-ASCII language support

by PeterRosdahl open 2026-02-14 08:23
stale size: S
## Summary

Replaces ASCII-only `\b` word boundaries with Unicode property escapes (`\p{L}`, `\p{N}`) so that mention detection, token matching, and text processing work correctly for **all non-ASCII languages**: Swedish (åäö), French (é), German (ü), Spanish (ñ), Japanese, Chinese, Korean, and more.

## Problem

JavaScript `\b` only recognizes `[a-zA-Z0-9_]` as "word characters". Any letter outside ASCII is treated as a non-word character, which causes:

- **False positives**: `\bÅsa\b` matches inside `HejÅsa` (because `j→Å` is seen as a word→non-word boundary)
- **False positives**: `\början\b` matches inside `Hejörjan`
- **Broken mention detection** for agent names containing non-ASCII characters
- **Broken structural prefix stripping** for non-Latin text (only `[A-Za-z0-9]` was matched)

This affects **any language** that uses characters outside the ASCII range.

## Solution

### New shared utility: `src/auto-reply/unicode-boundaries.ts`

Provides Unicode-aware word boundary helpers using `\p{L}` (any Unicode letter) and `\p{N}` (any Unicode number) with the `u` flag:

```typescript
// Zero-width assertions (like \b but Unicode-aware)
UNICODE_WORD_START  // (?:(?<=^)|(?<=(?:[^\p{L}\p{N}_])))
UNICODE_WORD_END    // (?:(?=$)|(?=(?:[^\p{L}\p{N}_])))
UNICODE_NON_WORD    // [^\p{L}\p{N}_]

// Helper to wrap a pattern
wrapWordBoundary(pattern)  // adds start + end boundaries
```

### Updated files

| File | Change |
|------|--------|
| `src/auto-reply/reply/mentions.ts` | `deriveMentionPatterns` uses `wrapWordBoundary()` instead of `\b` |
| `src/auto-reply/reply/mentions.ts` | `buildMentionRegexes` tries `"iu"` flag first (falls back to `"i"` for user patterns) |
| `src/auto-reply/reply/mentions.ts` | `stripStructuralPrefixes` uses `\p{L}\p{N}` instead of `[A-Za-z0-9]` |
| `src/auto-reply/reply/mentions.ts` | `stripMentions` tries `"giu"` flag first |
| `src/auto-reply/tokens.ts` | `isSilentReplyText` uses Unicode-aware patterns |

### Backward compatibility

All changes try the Unicode `u` flag first and gracefully fall back to the non-Unicode version for user-supplied patterns that may not be `u`-compatible. This ensures no regressions for existing configurations.

## Tests

Comprehensive test suite in `src/auto-reply/unicode-boundaries.test.ts` covering:

- ASCII names (baseline)
- Swedish: Björk, Ärlig, Åsa
- French: François, café
- German: Pück, über
- Spanish: José
- CJK characters: 太郎
- False positive regression test demonstrating the `\b` bug

## Verification

```
// The bug (before fix):
/\bÅsa\b/.test("HejÅsa") // true: WRONG (false positive)

// After fix:
new RegExp(wrapWordBoundary("Åsa"), "iu").test("HejÅsa")   // false: CORRECT
new RegExp(wrapWordBoundary("Åsa"), "iu").test("Hej Åsa!") // true: CORRECT
```

Refs: #3460

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Replaces ASCII-only `\b` word boundaries with Unicode property escapes (`\p{L}`, `\p{N}`) across mention detection, token matching, and structural prefix stripping. This fixes false positives and broken matching for non-ASCII languages (Swedish, French, German, Spanish, CJK, etc.).

- Adds new shared utility `src/auto-reply/unicode-boundaries.ts` with `UNICODE_WORD_START`, `UNICODE_WORD_END`, `UNICODE_NON_WORD`, and `wrapWordBoundary()` helper
- Updates `deriveMentionPatterns` to use `wrapWordBoundary()` instead of `\b`
- Updates `buildMentionRegexes` and `stripMentions` to try `"iu"` flag first with graceful fallback for user-supplied patterns
- Updates `stripStructuralPrefixes` to use `[\p{L}\p{N}...]` instead of `[A-Za-z0-9...]` so non-ASCII sender prefixes are correctly stripped
- Updates `isSilentReplyText` to use Unicode-aware boundary and non-word patterns
- Includes comprehensive test suite covering multiple languages and a regression test demonstrating the original `\b` bug

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge: the changes are well-scoped, backward-compatible, and thoroughly tested.
- Score of 4 reflects a well-implemented fix with comprehensive tests. The Unicode boundary patterns are correct and verified against multiple languages. The try/catch fallback pattern ensures backward compatibility for user-supplied patterns. One minor point: CJK names embedded without whitespace separators won't match (same as before this PR for practical purposes, since `\b` also couldn't match them), but this is a known limitation of word-boundary-based matching for CJK text and not a regression. Unused imports in the test file were already flagged in a prior review.
- No files require special attention. All changes are straightforward regex replacements with proper fallback handling.

<sub>Last reviewed commit: e87ef6c</sub>

<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
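
For readers who want to try the boundary behavior locally, here is a minimal sketch of how the helpers described above might fit together. The three constant values are copied from the PR summary. The body of `wrapWordBoundary` and the `compileWithUnicodeFallback` helper (a hypothetical name) are assumptions for illustration, not the PR's actual code.

```typescript
// Sketch only: constant values come from the PR summary; the function bodies
// below are assumptions, not the implementation in unicode-boundaries.ts.
const UNICODE_NON_WORD = "[^\\p{L}\\p{N}_]";
const UNICODE_WORD_START = `(?:(?<=^)|(?<=(?:${UNICODE_NON_WORD})))`;
const UNICODE_WORD_END = `(?:(?=$)|(?=(?:${UNICODE_NON_WORD})))`;

// Wrap a regex source string in Unicode-aware start and end boundaries.
function wrapWordBoundary(pattern: string): string {
  return `${UNICODE_WORD_START}(?:${pattern})${UNICODE_WORD_END}`;
}

// Hypothetical helper: try compiling with the `u` flag added, and fall back
// to the caller's flags if the source is not valid in Unicode mode.
function compileWithUnicodeFallback(source: string, baseFlags: string): RegExp {
  try {
    return new RegExp(source, `${baseFlags}u`);
  } catch {
    return new RegExp(source, baseFlags);
  }
}

const asa = compileWithUnicodeFallback(wrapWordBoundary("Åsa"), "i");
console.log(asa.test("HejÅsa"));   // false (no boundary between "j" and "Å")
console.log(asa.test("Hej Åsa!")); // true

// `\-` outside a character class is a syntax error in `u` mode, so this
// user-style pattern ends up compiled without the `u` flag.
const legacy = compileWithUnicodeFallback("foo\\-bar", "i");
console.log(legacy.flags); // "i"
```

The fallback matters because Unicode mode rejects some escapes that legacy mode tolerates, so a user-supplied pattern that compiled before could otherwise start throwing.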
