#8504: fix: prevent false positives in isSilentReplyText for CJK content

by hanxiao open 2026-02-04 03:14

stale

Cluster: Memory and Language Support Enhancements

## Problem

The `isSilentReplyText` function uses `\W*$` to allow trailing non-word characters after the silent reply token. However, in JavaScript regex, `\W` matches **any non-ASCII character**, including CJK (Chinese/Japanese/Korean) characters.

This caused false positives where messages containing actual content around `NO_REPLY` were incorrectly filtered:

```
'测试 NO_REPLY 内容' => true // BUG: should be false
'好的 NO_REPLY' => true // BUG: should be false
```

## Root Cause

`\W` in JavaScript regex is equivalent to `[^a-zA-Z0-9_]`, which means every character other than ASCII letters, digits, and the underscore is considered a 'non-word' character. That includes all CJK text.

## Fix

Replace the loose regex with a Unicode-aware pattern using `\p{P}` (Unicode punctuation category) to only allow actual punctuation around the token:

```typescript
// Before (buggy)
const suffix = new RegExp(`\\b${escaped}\\b\\W*$`);

// After (fixed)
const pattern = new RegExp(`^[\\s\\p{P}]*${escaped}[\\s\\p{P}]*$`, 'u');
```

## Test Results

| Input | Before | After |
|-------|--------|-------|
| `NO_REPLY` | ✅ true | ✅ true |
| `NO_REPLY.` | ✅ true | ✅ true |
| ` NO_REPLY ` | ✅ true | ✅ true |
| `测试 NO_REPLY` | ❌ true | ✅ false |
| `NO_REPLY 测试` | ❌ true | ✅ false |
| `这条消息有 NO_REPLY 内容` | ❌ true | ✅ false |

Added unit tests in `tokens.test.ts` to prevent regression.

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR updates `src/auto-reply/tokens.ts` to make `isSilentReplyText` Unicode-aware by replacing the previous `\W*$`-based suffix check (which treated CJK letters as “non-word”) with a `u`-flag regex that only permits whitespace and Unicode punctuation around the silent-reply token. It also adds Vitest coverage in `src/auto-reply/tokens.test.ts` for whitespace/punctuation cases plus CJK regression cases to prevent false positives.
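The difference between the two patterns can be reproduced in isolation. The sketch below is a hypothetical reconstruction, not the actual contents of `tokens.ts`: the helper names `escapeRegExp`, `isSilentLoose`, and `isSilentStrict` are made up for illustration, the token value `NO_REPLY` is assumed from the examples in the description, and the loose variant models only the suffix check quoted in the "Before" snippet.

```typescript
// Hypothetical reconstruction for illustration; not the real tokens.ts.
const SILENT_TOKEN = "NO_REPLY"; // assumed token value, taken from the PR examples

// Escape regex metacharacters so the token is matched literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Old suffix check: \W* also swallows CJK text that follows the token.
function isSilentLoose(text: string, token: string = SILENT_TOKEN): boolean {
  const escaped = escapeRegExp(token);
  return new RegExp("\\b" + escaped + "\\b\\W*$").test(text);
}

// New check: only whitespace and Unicode punctuation may surround the token.
function isSilentStrict(text: string, token: string = SILENT_TOKEN): boolean {
  const escaped = escapeRegExp(token);
  return new RegExp("^[\\s\\p{P}]*" + escaped + "[\\s\\p{P}]*$", "u").test(text);
}

const samples = ["NO_REPLY", "NO_REPLY.", " NO_REPLY ", "测试 NO_REPLY", "NO_REPLY 测试"];
for (const sample of samples) {
  // Prints the input followed by the loose and strict results.
  console.log(JSON.stringify(sample), isSilentLoose(sample), isSilentStrict(sample));
}
```

Building the patterns with the `RegExp` constructor rather than regex literals keeps the `u` flag and `\p{P}` out of compile-time target checks, which may matter depending on the project's TypeScript `target` setting.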
One thing to double-check is the tightened matching semantics: the new pattern matches only when the *entire message* is token + optional whitespace/punctuation, and `\p{P}` may be narrower than desired for real-world “punctuation-like” characters. Separately, the PR introduces a new `package-lock.json`, which may be unintentional given the repo’s pnpm-first workflow.

<h3>Confidence Score: 4/5</h3>

- This PR is likely safe to merge; changes are localized and covered by targeted unit tests.
- The regex change directly addresses the reported `\W`/Unicode behavior and the added tests cover the original CJK false-positive scenarios. Main remaining risks are subtle behavior changes in what characters are considered ignorable around the token (Unicode category coverage) and the accidental addition of a new lockfile.
- `src/auto-reply/tokens.ts` (regex semantics) and `package-lock.json` (confirm it’s intended).
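To make the `\p{P}` coverage concern concrete: characters such as `~` and `$` and most emoji belong to the Unicode Symbol categories (`\p{S}`), not Punctuation, so the new pattern treats them as real content. The sketch below compares the PR's pattern with a hypothetical wider variant that also accepts `\p{S}`. The wider variant is not part of this PR and is shown only to illustrate the trade-off; the token value `NO_REPLY` is assumed from the examples.

```typescript
// Same shape as the PR's "After" pattern, with the assumed token inlined.
const strictPattern = new RegExp("^[\\s\\p{P}]*NO_REPLY[\\s\\p{P}]*$", "u");

// Hypothetical wider variant that also tolerates Symbol characters; not in the PR.
const widerPattern = new RegExp("^[\\s\\p{P}\\p{S}]*NO_REPLY[\\s\\p{P}\\p{S}]*$", "u");

const probes = [
  "NO_REPLY。",    // ideographic full stop: punctuation
  "NO_REPLY!",     // exclamation mark: punctuation
  "NO_REPLY~",     // tilde: math symbol, not punctuation
  "NO_REPLY $",    // dollar sign: currency symbol
  "NO_REPLY 🙂",   // emoji: other symbol
  "NO_REPLY 内容", // CJK letters: must be rejected by both patterns
];

for (const probe of probes) {
  // Prints the input followed by the strict and wider results.
  console.log(JSON.stringify(probe), strictPattern.test(probe), widerPattern.test(probe));
}
```

Whether symbols should count as ignorable is a product decision; both variants still reject messages that contain CJK letters around the token.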