#19073: feat(voice-call): streaming TTS, barge-in, silence filler, hangup, and voice agent config

by odrobnik open 2026-02-17 09:55

docs channel: voice-call size: XL

Cluster: Voice Call and TTS Improvements

## What this PR does

This PR brings five major improvements to the voice-call plugin, each making phone conversations with your OpenClaw agent feel dramatically more natural and responsive.

### 🎙️ Streaming TTS with Barge-in

**Before:** The agent had to generate the *entire* audio response before playing any of it, leading to awkward multi-second silences after every turn.

**After:** Audio streams to the caller as it's generated, word by word. First audio arrives in ~200ms instead of 2-4 seconds. Supports both ElevenLabs (native µ-law, zero conversion) and OpenAI (PCM→µ-law resampled in 100ms blocks).

**Barge-in:** If the caller starts talking while the agent is speaking, playback stops immediately and the agent listens. No more talking over each other.

### ⌨️ Silence Filler

**Before:** While the agent was thinking or running tools, the caller heard dead silence, making it unclear whether the call was still connected.

**After:** Gentle keyboard typing sounds play during processing, giving a natural "someone is working on this" feel. Automatically stops the moment the agent starts speaking. Configurable threshold (default 3.5s), sound set, and on/off toggle.

### 📞 Agent-Initiated Hangup

**Before:** The agent could never end a call. Even after saying goodbye, the caller had to hang up manually.

**After:** When the conversation naturally concludes, the agent speaks its farewell and gracefully hangs up after a 1-second buffer. Uses a simple `[END_CALL]` text marker: reliable and provider-agnostic.

### 🎭 Voice Agent Configuration

**Before:** Voice calls used the same agent config as text chats, and the response model was hardcoded to `gpt-4o-mini`.

**After:** Configure a dedicated `responseAgent` for voice (custom personality, skills, workspace) and `responseModel` that inherits from your default model config. The voice system prompt is now TTS-optimized: numbers, dates, and units are spelled out; no markdown.

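To make the agent-initiated hangup flow described above more concrete, here is a minimal sketch of how an `[END_CALL]` marker might be handled. Only the marker string and the 1-second buffer come from the PR description. The names `parseAgentReply`, `deliverReply`, and the `speak`/`hangUp` callbacks are hypothetical and are not taken from the plugin's source.

```typescript
// Illustrative sketch only: function names and callback shapes are assumptions.
const END_CALL_MARKER = "[END_CALL]";
const HANGUP_DELAY_MS = 1000; // the 1-second buffer mentioned in the PR description

interface ParsedReply {
  spokenText: string; // text to hand to TTS, with the marker removed
  shouldHangUp: boolean; // true when the agent asked to end the call
}

// Strip the marker from the model output and report whether it was present.
function parseAgentReply(raw: string): ParsedReply {
  const shouldHangUp = raw.includes(END_CALL_MARKER);
  const spokenText = raw
    .split(END_CALL_MARKER)
    .join(" ")
    .replace(/\s+/g, " ")
    .trim();
  return { spokenText, shouldHangUp };
}

// Speak the farewell first, then wait for the buffer before hanging up.
async function deliverReply(
  raw: string,
  speak: (text: string) => Promise<void>,
  hangUp: () => Promise<void>,
  delayMs: number = HANGUP_DELAY_MS,
): Promise<void> {
  const { spokenText, shouldHangUp } = parseAgentReply(raw);
  if (spokenText.length > 0) {
    await speak(spokenText);
  }
  if (shouldHangUp) {
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    await hangUp();
  }
}
```

Because the marker is plain text in the model output, this approach does not depend on any provider-specific tool-calling API, which is presumably why the description calls it provider-agnostic. Stripping the marker before TTS keeps the caller from hearing it read aloud.
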
### 🗣️ ElevenLabs Scribe STT

**Before:** Speech-to-text was OpenAI Realtime only.

**After:** ElevenLabs Scribe v2 is available as an alternative STT provider via `streaming.sttProvider: "elevenlabs-scribe"`. WebSocket-based with tuned VAD thresholds for noisy phone environments and barge-in on partial transcripts for faster responsiveness.

---

Supersedes #9553 (restructured as clean feature commits). Closes #9635.

<h3>Greptile Summary</h3>

This PR adds five major improvements to the voice-call plugin: streaming TTS with barge-in support (ElevenLabs native mu-law + OpenAI PCM→mu-law), ambient silence filler during agent processing, agent-initiated hangup via `[END_CALL]` marker, configurable voice agent identity (`responseAgent`/`responseModel`), and ElevenLabs Scribe v2 as an alternative STT provider.

- **Streaming TTS** is well-implemented with proper abort handling for barge-in, reader cancellation, and partial-audio-aware error recovery that avoids jarring double-responses
- **Barge-in** strategy correctly triggers on partial transcripts (actual speech recognition) rather than raw VAD events, reducing false positives from background noise
- **Config scope change** (voice-call TTS is now fully independent from core `messages.tts`) is a clean breaking change with updated docs
- **Scribe STT provider** has a minor issue: the connection timeout doesn't clean up the WebSocket, which could leave orphaned connections
- **Bug in model resolution**: `response-generator.ts` casts `agents.defaults.model` to an object unconditionally, but this config can also be a plain string, causing string-form configs to be silently ignored
- **Dead code**: `deepMerge`/`isPlainObject` in `telephony-tts.ts` are no longer called after the merge logic was removed

<h3>Confidence Score: 3/5</h3>

- Generally safe to merge, but the model resolution bug in `response-generator.ts` will cause incorrect model selection for users with string-form agent config
- Score of 3 reflects a real logic bug in model resolution that affects users who configure `agents.defaults.model` as a string (a supported config format), plus a minor resource leak in the Scribe STT timeout handler. The streaming TTS, barge-in, silence filler, and hangup features are well-implemented with proper cleanup and error handling.
- `extensions/voice-call/src/response-generator.ts` (model resolution bug), `extensions/voice-call/src/providers/stt-elevenlabs-scribe.ts` (WebSocket cleanup on timeout)

<sub>Last reviewed commit: 6cab209</sub>
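
The model-resolution bug flagged in the review comes down to treating a string-or-object setting as always being an object. One way to handle both forms is sketched below. This is an illustration under stated assumptions, not the fix applied in the PR: the `primary` field name on the object form, the function names, and the placeholder model IDs are all hypothetical.

```typescript
// Assumption: the object form carries the model ID in a `primary` field.
// The review only states that the setting may be a plain string or an object.
type AgentModelConfig = string | { primary?: string } | undefined;

// Reduce either config form to a non-empty model ID, or undefined.
function normalizeModelConfig(model: AgentModelConfig): string | undefined {
  const candidate = typeof model === "string" ? model : model?.primary;
  const trimmed = candidate?.trim();
  return trimmed ? trimmed : undefined;
}

// Prefer the voice-specific responseModel, else fall back to the default model.
function resolveVoiceModel(
  responseModel: string | undefined,
  defaultModel: AgentModelConfig,
): string | undefined {
  return normalizeModelConfig(responseModel) ?? normalizeModelConfig(defaultModel);
}

// Usage with placeholder model IDs:
const fromString = resolveVoiceModel(undefined, "provider/model-a");
const fromObject = resolveVoiceModel(undefined, { primary: "provider/model-b" });
console.log(fromString, fromObject);
```

Narrowing with `typeof model === "string"` before touching object fields is what prevents the string form from being silently dropped, which is the failure mode the review describes.
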