#21110: fix(tts): deliver audio via structured mediaUrl instead of MEDIA: text tokens

by hydro13 open 2026-02-19 17:34

agents size: S

Cluster: Voice Call and TTS Improvements

## Problem

The built-in `tts` tool generates audio to absolute paths (`/tmp/tts-xxx/voice-xxx.opus`) and returns them as `MEDIA:/tmp/...` text tokens. The media parser security policy in `splitMediaFromOutput` blocks absolute paths, so users see raw file paths as text instead of receiving voice messages.

The `/tts audio` slash command works fine because it sets `mediaUrl` directly on the reply payload, bypassing text-based parsing entirely.

Fixes #14174.

## Solution

Option A from the issue: deliver audio through structured tool result fields instead of `MEDIA:` text tokens.

### Changes

**Core fix (`tts-tool.ts`):**
- Replaced `MEDIA:${audioPath}` text token with `details.mediaUrl` and `details.audioAsVoice` fields
- Content text returns `SILENT_REPLY_TOKEN` instead of a `MEDIA:` token that gets blocked

**Media extraction (`pi-embedded-subscribe.tools.ts`):**
- Added strategy 0 (highest priority): check `details.mediaUrl` / `details.mediaUrls` before falling back to text-based `MEDIA:` parsing
- Added `detailsOnly` option to skip text-based extraction when `emitToolOutput` already handles it (prevents duplicates)
- New `extractToolResultAudioAsVoice()` helper

**Handler (`pi-embedded-subscribe.handlers.tools.ts`):**
- Media delivery runs regardless of `shouldEmitToolOutput()`, using `detailsOnly: true` when emit is on
- Extracts and passes `audioAsVoice` through the callback

**Type + propagation:**
- `audioAsVoice?: boolean` added to `onToolResult` callback type
- Forwarded through `pi-embedded-subscribe.ts` and `agent-runner-execution.ts`

## Why not just allow /tmp/ in the security policy?

Punching holes in the path security policy for specific directories would weaken the sandbox model. The structured approach is cleaner: trusted built-in tools deliver media through typed fields, untrusted LLM text output stays sandboxed.
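To illustrate the details-first extraction described under Changes, here is a minimal sketch, not the PR's actual code. The `ToolResult` shape, the `extractToolResultMedia` name, the `parseMediaTokens` stand-in for the text parser, and the placeholder value of `SILENT_REPLY_TOKEN` are all assumptions made for this example. Only `details.mediaUrl`, `details.mediaUrls`, `details.audioAsVoice`, `detailsOnly`, and `extractToolResultAudioAsVoice` are names taken from the description above.

```typescript
// Hypothetical shapes for illustration; the real types in the PR may differ.
interface ToolResultDetails {
  mediaUrl?: string;
  mediaUrls?: string[];
  audioAsVoice?: boolean;
}

interface ToolResult {
  content: string;
  details?: ToolResultDetails;
}

interface ExtractOptions {
  // When true, skip text-based MEDIA: parsing (assumed to be handled elsewhere).
  detailsOnly?: boolean;
}

// Placeholder value; the real constant's value is not shown in the PR description.
const SILENT_REPLY_TOKEN = "<silent-reply-placeholder>";

// Simplified stand-in for the text-based parser: accepts `MEDIA:<target>` lines
// but rejects absolute paths, mirroring the policy described in the Problem section.
function parseMediaTokens(text: string): string[] {
  const urls: string[] = [];
  for (const line of text.split("\n")) {
    const target = /^MEDIA:(\S+)$/.exec(line.trim())?.[1];
    if (!target) continue;
    if (target.startsWith("/")) continue; // absolute paths are blocked
    urls.push(target);
  }
  return urls;
}

// Hypothetical extractor: strategy 0 reads structured fields first,
// then falls back to text parsing unless `detailsOnly` is set.
function extractToolResultMedia(
  result: ToolResult,
  options: ExtractOptions = {},
): string[] {
  const details = result.details;
  if (details) {
    const structured = [
      ...(details.mediaUrl ? [details.mediaUrl] : []),
      ...(details.mediaUrls ?? []),
    ];
    if (structured.length > 0) return structured;
  }
  if (options.detailsOnly) return [];
  return parseMediaTokens(result.content);
}

// Helper named in the PR; this body is an assumed implementation.
function extractToolResultAudioAsVoice(result: ToolResult): boolean {
  return result.details?.audioAsVoice === true;
}

// Usage: a tts-style result carries the path in typed fields, not in text.
const ttsResult: ToolResult = {
  content: SILENT_REPLY_TOKEN,
  details: { mediaUrl: "/tmp/tts-xxx/voice-xxx.opus", audioAsVoice: true },
};
const media = extractToolResultMedia(ttsResult, { detailsOnly: true });
const asVoice = extractToolResultAudioAsVoice(ttsResult);
console.log(media, asVoice);
```

In this sketch the absolute path reaches the caller only because it arrives through `details`; the same path embedded in `content` as a `MEDIA:` token would still be dropped by the text parser, which matches the trust split argued for above.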
<h3>Greptile Summary</h3>

This PR fixes TTS audio delivery by replacing text-based `MEDIA:` tokens with structured `details.mediaUrl` fields, bypassing security policies that block absolute paths. The changes successfully implement Option A from issue #14174.

**Key changes:**
- `tts-tool.ts`: Returns `details.mediaUrl` and `details.audioAsVoice` instead of `MEDIA:${audioPath}` text tokens
- Media extraction: Added strategy 0 (highest priority) to check `details.mediaUrl`/`details.mediaUrls` before text parsing
- `detailsOnly` option prevents duplicate extraction when `emitToolOutput` already handles text-based `MEDIA:` parsing
- `audioAsVoice` flag propagated through callback chain to preserve voice-bubble metadata

**Note:** This PR also includes an unrelated security commit (31b12562) that strips hidden content from `web_fetch` to prevent prompt injection attacks (#8027). This security fix adds comprehensive HTML sanitization and invisible Unicode stripping.

<h3>Confidence Score: 4/5</h3>

- This PR is safe to merge with minor considerations about scope
- The TTS fix is well-structured and follows a clean pattern of delivering media through typed fields instead of text tokens. The implementation correctly propagates `audioAsVoice` through the callback chain and uses the `detailsOnly` option to avoid duplicate media extraction. The web-fetch security fix is comprehensive with thorough test coverage. However, the PR combines two unrelated features (TTS fix + web-fetch security), which slightly reduces confidence as they should ideally be separate PRs.
- No files require special attention - the implementation is clean and follows existing patterns

<sub>Last reviewed commit: 663f98e</sub>