#17304: feat(gemini): robust handling for non-XML reasoning headers (`Thinking:`, `Analysis:`) & split tags

by YoshiaKefasu open 2026-02-15 16:47

gateway agents stale size: L

## Description

This PR introduces a robust mechanism to detect and strip "internal monologue" text that leaks into the user chat when models (specifically high-reasoning ones like `google/gemini-3-pro-preview`) deviate from the standard `<think>` XML tag format.

### The Problem (Context)

Even with strict system prompts, advanced models sometimes "forget" to use `<think>` tags and instead use structured Markdown headers such as:

```text
Thinking: [Internal reasoning...]
Output: [Actual response...]
```

Additionally, streaming chunks often split these headers (e.g., `Thin` + `king:`), causing simple regex-based filters to fail. This results in raw thought processes being exposed to the user, breaking the immersion and flooding the chat with debug-like logs.

### The Solution

This patch implements a stateful buffering strategy that acts as a fail-safe for model non-compliance:

1. **Custom Header Detection:**
   - Added logic to detect and strip blocks starting with keywords like `Thinking:` or `Analysis:` and ending with `Output:`.
   - This effectively silences untagged internal monologues that the model "hallucinates" as part of its response structure.
2. **Smart Buffering (`detectPartialTagsOrHeaders`):**
   - Implemented a look-behind buffer to handle fragmented keywords across stream chunks.
   - Example: If chunk A ends with `Analy` and chunk B starts with `sis:`, the system buffers chunk A and combines it with chunk B to successfully trigger the filter.
3. **State Management (`customHeaderThinking`):**
   - Added a new state flag `customHeaderThinking` to the session context.
   - This tracks whether the stream is currently inside a "Custom Header Block" across multiple network packets, ensuring the entire thought block is suppressed until the `Output:` marker is found.

### Why this should be merged

As models evolve, their adherence to specific output formatting rules (like strictly using XML tags) can be unpredictable.
This change ensures a clean, "magic-like" user experience by preventing internal logic leaks, regardless of whether the model strictly follows the system prompt or invents its own reasoning format. It turns OpenClaw into a more resilient gateway for experimental and high-reasoning models.

Fixes https://github.com/openclaw/openclaw/issues/6328

Also related to:

- https://github.com/openclaw/openclaw/issues/9675
- https://github.com/openclaw/openclaw/issues/15353
- https://github.com/openclaw/openclaw/issues/5946

---

<img width="1218" height="760" alt="image" src="https://github.com/user-attachments/assets/e53c1228-1784-4a37-9485-5bb0bf878a82" />

> My agent realized their own mistake, as they tend to think in their own way, which caused their thoughts to leak into the chats. They suggested that this works really well for the Gemini 3 Pro and 3 Flash models.
> If you can read Japanese → [正規表現は正しい。じゃあ、なんで漏れた？.txt](https://github.com/user-attachments/files/25327720/default.txt) (file name in English: "The regex is correct. So why did it leak?")

<h3>Greptile Summary</h3>

This PR adds robust handling for non-XML reasoning headers (`Thinking:`, `Analysis:`, `Output:`) and improves support for `<think>` tags with attributes (e.g., `<think id="trace-123">`). The changes are well-structured across the codebase:

- **Regex hardening**: `THINKING_TAG_SCAN_RE` and `FINAL_TAG_SCAN_RE` now use `\b[^<>]*>` instead of `\s*>`, correctly matching tags with attributes like `<think id="...">`. This is a clean, backward-compatible improvement.
- **Custom header stripping**: A new stateful `stripCustomThinkingHeaders` function detects and strips `Thinking: ... Output:` / `Analysis: ... Output:` blocks, preventing internal monologue from leaking to users when models deviate from `<think>` tags.
- **Smart buffering**: `detectPartialTagsOrHeaders` handles fragmented keywords across stream chunks (e.g., `"Thin"` + `"king:"`), buffering partial matches until the next chunk resolves them.
- **Type annotations**: Minor type narrowing fixes in `message-handler.ts` add explicit `string` annotations to `.find`/`.map`/`.filter` callbacks; these are harmless TypeScript strictness improvements.
- **New tests**: Three new test files cover the key scenarios (custom header stripping, attribute-bearing think tags, and split thinking tags across chunks).

Key finding: `END_HEADER_RE` (`/(?:^|\n)(Output:)/`) requires `Output:` to appear at start-of-string or after a newline, but `START_HEADER_RE` accepts any whitespace before `Thinking:`/`Analysis:`. This asymmetry means that single-line patterns like `"Thinking: ... Output: answer"` (space before `Output:`, no newline) silently drop the output content. The current test suite only covers the newline-separated case.

<h3>Confidence Score: 3/5</h3>

- PR is generally safe to merge but has a regex asymmetry bug that can silently drop output content in certain model response patterns.
- The regex improvement for think tags with attributes is solid and well-tested. The custom header detection/stripping is a useful defense-in-depth feature. However, the asymmetry between `START_HEADER_RE` (accepts spaces) and `END_HEADER_RE` (requires newline) creates a class of inputs where the `Output:` end marker is never matched, causing all post-`Thinking:` content to be silently dropped. The existing tests only cover the newline-separated case, masking this gap. The buffering mechanism is correctly implemented but the buffer flush at `message_end` uses a throwaway state, which is acceptable given that `handleMessageEnd` re-processes the full text independently.
- `src/agents/pi-embedded-subscribe.ts`: the `END_HEADER_RE` regex definition needs attention to match the same whitespace patterns as `START_HEADER_RE`.

<sub>Last reviewed commit: 82f9b60</sub>
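
The regex asymmetry described in the key finding can be reproduced with a small sketch. `END_HEADER_RE` is copied from the review text; `START_HEADER_RE` is an assumed reconstruction (the review only says it accepts any whitespace before the keyword), and `stripHeaderBlock` and `END_HEADER_RE_SYMMETRIC` are hypothetical names for illustration, not the PR's stateful `stripCustomThinkingHeaders`.

```typescript
// Assumption: the review does not quote START_HEADER_RE, only its behaviour.
const START_HEADER_RE = /(?:^|\s)(Thinking:|Analysis:)/;
// Quoted from the review.
const END_HEADER_RE = /(?:^|\n)(Output:)/;
// Hypothetical fix: accept the same leading whitespace as the start pattern.
const END_HEADER_RE_SYMMETRIC = /(?:^|\s)(Output:)/;

// Hypothetical one-shot stripper: removes everything from the start header
// through the end marker, or through end-of-text if no end marker matches.
function stripHeaderBlock(text: string, endRe: RegExp): string {
  const start = START_HEADER_RE.exec(text);
  if (!start) return text;
  const before = text.slice(0, start.index);
  const rest = text.slice(start.index + start[0].length);
  const end = endRe.exec(rest);
  // No end marker found: the remainder is treated as thinking and dropped.
  if (!end) return before.trim();
  const after = rest.slice(end.index + end[0].length);
  return (before + after).trim();
}

const multiLine = "Thinking: plan the reply\nOutput: hello";
const singleLine = "Thinking: plan the reply Output: hello";

console.log(JSON.stringify(stripHeaderBlock(multiLine, END_HEADER_RE))); // "hello"
console.log(JSON.stringify(stripHeaderBlock(singleLine, END_HEADER_RE))); // ""
console.log(JSON.stringify(stripHeaderBlock(singleLine, END_HEADER_RE_SYMMETRIC))); // "hello"
```

A looser end pattern is not free: it would also end the block at any whitespace-preceded `Output:` that appears inside the reasoning text itself, so a regression test for both the single-line and newline-separated forms seems worth adding alongside any change.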
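
The split-keyword buffering that the PR describes (`Analy` + `sis:`) can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's `detectPartialTagsOrHeaders`: the `KEYWORDS` list, `partialKeywordSuffixLength`, and `HeaderBuffer` are hypothetical names, and the sketch covers only plain-text headers, not `<think>` tags.

```typescript
// Assumption: keyword list taken from the headers named in the PR text.
const KEYWORDS = ["Thinking:", "Analysis:", "Output:"];

// Length of the longest suffix of `chunk` that is a proper prefix of any
// keyword, i.e. text that the next chunk might complete into a keyword.
function partialKeywordSuffixLength(chunk: string): number {
  let longest = 0;
  for (const keyword of KEYWORDS) {
    const max = Math.min(keyword.length - 1, chunk.length);
    for (let len = max; len > longest; len--) {
      if (chunk.endsWith(keyword.slice(0, len))) {
        longest = len;
        break;
      }
    }
  }
  return longest;
}

// Holds back a possible partial keyword and releases everything before it.
class HeaderBuffer {
  private pending = "";

  // Returns text that is safe to hand to the downstream header filter.
  push(chunk: string): string {
    const combined = this.pending + chunk;
    const held = partialKeywordSuffixLength(combined);
    this.pending = combined.slice(combined.length - held);
    return combined.slice(0, combined.length - held);
  }

  // End of stream: whatever is still held never became a keyword.
  flush(): string {
    const rest = this.pending;
    this.pending = "";
    return rest;
  }
}

const buffer = new HeaderBuffer();
const first = buffer.push("Sure.\nAnaly");
const second = buffer.push("sis: weigh the options\n");
const tail = buffer.flush();
console.log(JSON.stringify([first, second, tail]));
// ["Sure.\n","Analysis: weigh the options\n",""]
```

Because the held-back fragment is prepended to the next chunk, the downstream filter always sees `Analysis:` as one contiguous string. The cost is a small delay for any chunk that happens to end in a keyword prefix such as a lone `T` or `O`, which is only released on the next chunk or at `flush()`.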