#16096: fix(i18n): use Unicode-aware word boundaries for non-ASCII language support
stale
size: S
## Summary
Replaces ASCII-only `\b` word boundaries with Unicode property escapes (`\p{L}`, `\p{N}`) so that mention detection, token matching, and text processing work correctly for **non-ASCII languages**, including Swedish (åäö), French (é), German (ü), Spanish (ñ), Japanese, Chinese, and Korean.
## Problem
JavaScript `\b` only recognizes `[a-zA-Z0-9_]` as "word characters". Any letter outside ASCII is treated as a non-word character, which causes:
- **False positives**: `\bÅsa\b` matches inside `HejÅsa` (because `j→Å` is seen as a word→non-word boundary)
- **False positives**: `\början\b` matches inside `Hejörjan`
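- **Broken mention detection** for agent names containing non-ASCII characters: `\bÅsa\b` does not match `Hej Åsa`, because neither the space nor `Å` is a word character, so `\b` sees no boundary between them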
- **Broken structural prefix stripping** for non-Latin text (only `[A-Za-z0-9]` was matched)
This affects **any language** that uses characters outside the ASCII range.
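The root cause can be seen at the single-character level. The following is a minimal sketch, not code from this PR: it compares the ASCII-only `\w` class (which `\b` is defined in terms of) with `\p{L}` under the `u` flag. The variable names are illustrative only.

```typescript
// ASCII-only \w versus the Unicode letter property \p{L}.
// RegExp constructors are used so the snippet does not depend on
// compiler checks applied to regex literals.
const asciiWord = new RegExp("^\\w$");
const unicodeLetter = new RegExp("^\\p{L}$", "u");

const samples = ["a", "Å", "ö", "ñ", "太"];
const report = samples.map((ch) => ({
  ch,
  asciiWord: asciiWord.test(ch),         // true only for "a"
  unicodeLetter: unicodeLetter.test(ch), // true for every sample
}));

console.log(report);
```

Because `\b` fires wherever a `\w` character meets a non-`\w` character, every non-ASCII letter in the list behaves like punctuation for boundary purposes.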
## Solution
### New shared utility: `src/auto-reply/unicode-boundaries.ts`
Provides Unicode-aware word boundary helpers using `\p{L}` (any Unicode letter) and `\p{N}` (any Unicode numeric character, not only decimal digits) with the `u` flag:
```typescript
// Zero-width assertions (like \b but Unicode-aware)
UNICODE_WORD_START // (?:(?<=^)|(?<=(?:[^\p{L}\p{N}_])))
UNICODE_WORD_END // (?:(?=$)|(?=(?:[^\p{L}\p{N}_])))
UNICODE_NON_WORD // [^\p{L}\p{N}_]
// Helper to wrap a pattern
wrapWordBoundary(pattern) // adds start + end boundaries
```
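For readers who want to try the patterns outside the repository, here is a rough sketch of how such a module could be written. The constant values follow the comments above; the body of `wrapWordBoundary`, in particular wrapping the pattern in a non-capturing group, is an assumption and may differ from the actual file.

```typescript
// Hypothetical reconstruction of the helpers listed above; the real
// src/auto-reply/unicode-boundaries.ts may be implemented differently.
const UNICODE_NON_WORD = "[^\\p{L}\\p{N}_]";
const UNICODE_WORD_START = `(?:(?<=^)|(?<=(?:${UNICODE_NON_WORD})))`;
const UNICODE_WORD_END = `(?:(?=$)|(?=(?:${UNICODE_NON_WORD})))`;

function wrapWordBoundary(pattern: string): string {
  // Assumption: group the pattern so alternations such as "a|b" stay
  // inside both boundaries.
  return `${UNICODE_WORD_START}(?:${pattern})${UNICODE_WORD_END}`;
}

// The `u` flag is required for \p{...} to be interpreted as a property escape.
const asaPattern = new RegExp(wrapWordBoundary("Åsa"), "iu");
console.log(asaPattern.test("Hej Åsa!")); // true
console.log(asaPattern.test("HejÅsa"));   // false
```

Using lookbehind and lookahead keeps the boundaries zero-width, so the match itself still covers only the name, as it would with `\b`.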
### Updated files
| File | Change |
|------|--------|
| `src/auto-reply/reply/mentions.ts` | `deriveMentionPatterns` uses `wrapWordBoundary()` instead of `\b` |
| `src/auto-reply/reply/mentions.ts` | `buildMentionRegexes` tries `"iu"` flag first (falls back to `"i"` for user patterns) |
| `src/auto-reply/reply/mentions.ts` | `stripStructuralPrefixes` uses `\p{L}\p{N}` instead of `[A-Za-z0-9]` |
| `src/auto-reply/reply/mentions.ts` | `stripMentions` tries `"giu"` flag first |
| `src/auto-reply/tokens.ts` | `isSilentReplyText` uses Unicode-aware patterns |
### Backward compatibility
All changes try the Unicode `u` flag first and gracefully fall back to the non-Unicode version for user-supplied patterns that may not be `u`-compatible. This ensures no regressions for existing configurations.
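The fallback described here could look roughly like the sketch below. `compileWithUnicodeFallback` is a hypothetical name used for illustration, not a function from this PR. The example relies on the fact that some patterns accepted without the `u` flag, such as an unnecessary escape like `\-` outside a character class, throw a `SyntaxError` in Unicode mode.

```typescript
// Hypothetical helper: try Unicode mode first, then fall back to the
// legacy flags if the pattern is not valid under the `u` flag.
function compileWithUnicodeFallback(source: string, baseFlags: string): RegExp {
  try {
    return new RegExp(source, `${baseFlags}u`);
  } catch {
    return new RegExp(source, baseFlags);
  }
}

// A plain pattern compiles in Unicode mode.
const unicodeOk = compileWithUnicodeFallback("åsa", "i");
// "\-" outside a character class is rejected in Unicode mode, so this
// user-style pattern is compiled without the `u` flag instead.
const legacyOnly = compileWithUnicodeFallback("foo\\-bot", "i");

console.log(unicodeOk.flags, legacyOnly.flags);
```

One consequence of this design is that a pattern which falls back loses Unicode-aware behavior silently; whether that should be logged is left open here.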
## Tests
Comprehensive test suite in `src/auto-reply/unicode-boundaries.test.ts` covering:
- ASCII names (baseline)
- Swedish: Björk, Ärlig, Åsa
- French: François, café
- German: Pück, über
- Spanish: José
- CJK characters: 太郎
- False positive regression test demonstrating the `\b` bug
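As an illustration of the kind of cases listed above, a table-driven check might look like this. It is a sketch, not the contents of the actual test file: the sample sentences are invented, and `wrapWordBoundary` is re-declared under the assumption that it behaves as described in the Solution section. The last case shows that a CJK name embedded in running text is not matched, since its neighbours are letters too.

```typescript
// Re-declared so the snippet is self-contained; assumed to mirror the PR's helper.
const NON_WORD = "[^\\p{L}\\p{N}_]";
function wrapWordBoundary(pattern: string): string {
  return `(?:(?<=^)|(?<=${NON_WORD}))(?:${pattern})(?:(?=$)|(?=${NON_WORD}))`;
}

interface BoundaryCase {
  name: string;
  text: string;
  expected: boolean;
}

const cases: BoundaryCase[] = [
  { name: "Björk", text: "Hi Björk, how are you?", expected: true },
  { name: "Björk", text: "HiBjörk", expected: false },
  { name: "café", text: "un café.", expected: true },
  { name: "über", text: "überall", expected: false },
  { name: "José", text: "@José:", expected: true },
  { name: "太郎", text: "太郎 さん", expected: true },
  // Embedded CJK name: surrounded by letters, so no boundary is found.
  { name: "太郎", text: "こんにちは太郎さん", expected: false },
];

const failures = cases.filter(
  (c) => new RegExp(wrapWordBoundary(c.name), "iu").test(c.text) !== c.expected,
);
console.log(failures.length === 0 ? "all cases pass" : failures);
```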
## Verification
```typescript
// The bug (before fix):
/\bÅsa\b/.test("HejÅsa") // true — WRONG (false positive)
// After fix:
new RegExp(wrapWordBoundary("Åsa"), "iu").test("HejÅsa") // false — CORRECT
new RegExp(wrapWordBoundary("Åsa"), "iu").test("Hej Åsa!") // true — CORRECT
```
Refs: #3460
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
Replaces ASCII-only `\b` word boundaries with Unicode property escapes (`\p{L}`, `\p{N}`) across mention detection, token matching, and structural prefix stripping. This fixes false positives and broken matching for non-ASCII languages (Swedish, French, German, Spanish, CJK, etc.).
- Adds new shared utility `src/auto-reply/unicode-boundaries.ts` with `UNICODE_WORD_START`, `UNICODE_WORD_END`, `UNICODE_NON_WORD`, and `wrapWordBoundary()` helper
- Updates `deriveMentionPatterns` to use `wrapWordBoundary()` instead of `\b`
- Updates `buildMentionRegexes` and `stripMentions` to try `"iu"` flag first with graceful fallback for user-supplied patterns
- Updates `stripStructuralPrefixes` to use `[\p{L}\p{N}...]` instead of `[A-Za-z0-9...]` so non-ASCII sender prefixes are correctly stripped
- Updates `isSilentReplyText` to use Unicode-aware boundary and non-word patterns
- Includes comprehensive test suite covering multiple languages and a regression test demonstrating the original `\b` bug
<h3>Confidence Score: 4/5</h3>
- This PR is safe to merge — the changes are well-scoped, backward-compatible, and thoroughly tested.
- Score of 4 reflects a well-implemented fix with comprehensive tests. The Unicode boundary patterns are correct and verified against multiple languages, and the try/catch fallback preserves backward compatibility for user-supplied patterns. One minor point: CJK names embedded in running text without whitespace or punctuation separators won't match, because the adjacent characters are themselves letters. This is a known limitation of word-boundary-based matching for CJK text and not a regression, since `\b` could not match them before this PR either. Unused imports in the test file were already flagged in a prior review.
- No files require special attention. All changes are straightforward regex replacements with proper fallback handling.
<sub>Last reviewed commit: e87ef6c</sub>
<!-- /greptile_comment -->