#19073: feat(voice-call): streaming TTS, barge-in, silence filler, hangup, and voice agent config
docs
channel: voice-call
size: XL
Cluster: Voice Call and TTS Improvements
## What this PR does
This PR adds five improvements to the voice-call plugin, each aimed at making phone conversations with your OpenClaw agent feel more natural and responsive.
### 🎙️ Streaming TTS with Barge-in
**Before:** The agent had to generate the *entire* audio response before playing any of it — leading to awkward multi-second silences after every turn.
**After:** Audio streams to the caller as it's generated instead of waiting for the full response. First audio arrives in ~200ms instead of 2-4 seconds. Supports both ElevenLabs (native µ-law, no conversion needed) and OpenAI (PCM resampled and converted to µ-law in 100ms blocks).
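As a rough illustration of the OpenAI path, the sketch below converts one block of PCM to µ-law. It assumes the input is 16-bit little-endian mono PCM at 24 kHz and that the telephony leg expects 8 kHz, so it keeps every third sample. The function names are hypothetical, and the naive decimation (no low-pass filter) is a simplification rather than a description of what this PR ships.

```typescript
// Hypothetical helpers: a sketch of G.711 µ-law encoding, not the PR's actual code.
const MULAW_BIAS = 0x84; // standard G.711 bias (132)
const MULAW_CLIP = 32635; // largest magnitude that stays in range after adding the bias

/** Encode one signed 16-bit PCM sample as an 8-bit µ-law byte. */
function linearToMulaw(sample: number): number {
  const sign = (sample >> 8) & 0x80;
  let magnitude = sign !== 0 ? -sample : sample;
  if (magnitude > MULAW_CLIP) magnitude = MULAW_CLIP;
  magnitude += MULAW_BIAS;

  // The exponent is the segment number: walk down from bit 14 to the highest set bit.
  let exponent = 7;
  for (let mask = 0x4000; (magnitude & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (magnitude >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

/**
 * Convert a block of 24 kHz PCM16LE (assumed input format) to 8 kHz µ-law
 * by keeping every third sample. Naive decimation, for illustration only.
 */
function pcm24kBlockToMulaw8k(pcm: Uint8Array): Uint8Array {
  const view = new DataView(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  const inputSamples = Math.floor(pcm.byteLength / 2);
  const outputSamples = Math.floor(inputSamples / 3);
  const out = new Uint8Array(outputSamples);
  for (let i = 0; i < outputSamples; i++) {
    out[i] = linearToMulaw(view.getInt16(i * 3 * 2, true));
  }
  return out;
}
```

Under these assumptions, a 100 ms block is 2400 input samples (4800 bytes) and yields 800 µ-law bytes.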
**Barge-in:** If the caller starts talking while the agent is speaking, playback stops immediately and the agent listens. No more talking over each other.
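One way to picture barge-in is a playback loop that checks a stop flag before sending each chunk. This is a minimal sketch, not the PR's implementation: `playUntilBargeIn` and `sendToCaller` are hypothetical names, and the flag is typed structurally so that a standard `AbortSignal` (or any object with an `aborted` boolean) can be passed in.

```typescript
// Hypothetical sketch: `sendToCaller` stands in for whatever writes audio to the call.
interface StopFlag {
  readonly aborted: boolean; // an AbortSignal satisfies this shape
}

function playUntilBargeIn(
  chunks: Iterable<Uint8Array>,
  stop: StopFlag,
  sendToCaller: (chunk: Uint8Array) => void,
): { sent: number; interrupted: boolean } {
  let sent = 0;
  for (const chunk of chunks) {
    // Check before every chunk so playback halts as soon as the caller speaks.
    if (stop.aborted) return { sent, interrupted: true };
    sendToCaller(chunk);
    sent++;
  }
  return { sent, interrupted: false };
}
```

In a real call, the flag would be flipped by the speech-recognition side (for example, by calling `abort()` on an `AbortController` when a partial transcript arrives).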
### ⌨️ Silence Filler
**Before:** While the agent was thinking or running tools, the caller heard dead silence — making it unclear whether the call was still connected.
**After:** Gentle keyboard typing sounds play during processing, giving a natural "someone is working on this" feel. Automatically stops the moment the agent starts speaking. Configurable threshold (default 3.5s), sound set, and on/off toggle.
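The threshold logic can be sketched as a small pure function. The config shape and names below are hypothetical; only the 3.5 s default comes from the description above, and `enabled: true` as the default is an assumption.

```typescript
// Hypothetical config shape; field names are illustrative, not the plugin's schema.
interface SilenceFillerConfig {
  enabled: boolean;
  thresholdMs: number;
}

// 3500 ms mirrors the stated default; the enabled default is assumed.
const DEFAULT_FILLER: SilenceFillerConfig = { enabled: true, thresholdMs: 3500 };

/** Decide whether filler audio should be playing at time `nowMs`. */
function shouldPlayFiller(
  nowMs: number,
  processingStartedMs: number | null, // null when the agent is not processing
  agentSpeaking: boolean,
  config: SilenceFillerConfig = DEFAULT_FILLER,
): boolean {
  if (!config.enabled || agentSpeaking || processingStartedMs === null) return false;
  return nowMs - processingStartedMs >= config.thresholdMs;
}
```

Keeping the decision separate from timers and audio playback makes it easy to test the rule that filler stops the moment the agent starts speaking.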
### 📞 Agent-Initiated Hangup
**Before:** The agent could never end a call. Even after saying goodbye, the caller had to hang up manually.
**After:** When the conversation naturally concludes, the agent speaks its farewell and gracefully hangs up after a 1-second buffer. Uses a simple `[END_CALL]` text marker — reliable and provider-agnostic.
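A minimal sketch of how such a marker might be handled, assuming the marker is stripped before the text reaches TTS. `parseAgentReply` and its return shape are hypothetical; only the `[END_CALL]` marker and the 1-second buffer come from the description above.

```typescript
const END_CALL_MARKER = "[END_CALL]";
const HANGUP_DELAY_MS = 1000; // the 1-second buffer after the farewell

/**
 * Split an agent reply into the text to speak and an optional hangup delay.
 * Hypothetical helper: the marker is removed so it is never read aloud.
 */
function parseAgentReply(reply: string): { speech: string; hangUpAfterMs: number | null } {
  const hangUp = reply.includes(END_CALL_MARKER);
  const speech = reply.split(END_CALL_MARKER).join("").replace(/\s+/g, " ").trim();
  return { speech, hangUpAfterMs: hangUp ? HANGUP_DELAY_MS : null };
}
```

A caller of this function would speak `speech` first and, if `hangUpAfterMs` is not null, schedule the provider's hangup that many milliseconds after playback finishes.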
### 🎭 Voice Agent Configuration
**Before:** Voice calls used the same agent config as text chats, and the response model was hardcoded to `gpt-4o-mini`.
**After:** Configure a dedicated `responseAgent` for voice (custom personality, skills, workspace) and a `responseModel` that inherits from your default model config. The voice system prompt is now TTS-optimized: numbers, dates, and units are spelled out, and markdown is omitted.
### 🗣️ ElevenLabs Scribe STT
**Before:** Speech-to-text was OpenAI Realtime only.
**After:** ElevenLabs Scribe v2 is available as an alternative STT provider via `streaming.sttProvider: "elevenlabs-scribe"`. It is WebSocket-based, uses VAD thresholds tuned for noisy phone environments, and triggers barge-in on partial transcripts for faster response.
---
Supersedes #9553 (restructured as clean feature commits). Closes #9635.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds five major improvements to the voice-call plugin: streaming TTS with barge-in support (ElevenLabs native mu-law + OpenAI PCM→mu-law), ambient silence filler during agent processing, agent-initiated hangup via `[END_CALL]` marker, configurable voice agent identity (`responseAgent`/`responseModel`), and ElevenLabs Scribe v2 as an alternative STT provider.
- **Streaming TTS** is well-implemented with proper abort handling for barge-in, reader cancellation, and partial-audio-aware error recovery that avoids jarring double-responses
- **Barge-in** strategy correctly triggers on partial transcripts (actual speech recognition) rather than raw VAD events, reducing false positives from background noise
- **Config scope change** (voice-call TTS is now fully independent from core `messages.tts`) is a clean breaking change with updated docs
- **Scribe STT provider** has a minor issue: the connection timeout doesn't clean up the WebSocket, which could leave orphaned connections
- **Bug in model resolution**: `response-generator.ts` casts `agents.defaults.model` to an object unconditionally, but this config can also be a plain string — causing string-form configs to be silently ignored
- **Dead code**: `deepMerge`/`isPlainObject` in `telephony-tts.ts` are no longer called after the merge logic was removed
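The model resolution issue above comes down to handling both config forms. A hedged sketch of one possible fix follows; `resolveModel` and the `primary` field on the object form are assumptions for illustration, not the actual shape used in `response-generator.ts`.

```typescript
// Hypothetical types: the object form's `primary` field is an assumption.
type ModelConfig = string | { primary?: string } | undefined;

/** Resolve a model name from either the string form or the object form. */
function resolveModel(configured: ModelConfig, fallback: string): string {
  if (typeof configured === "string") {
    // String-form config: use it directly instead of casting to an object.
    return configured.trim() || fallback;
  }
  return configured?.primary?.trim() || fallback;
}
```

Checking `typeof` before reading object fields avoids the unconditional cast, so a string-form config is no longer silently ignored.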
<h3>Confidence Score: 3/5</h3>
- Generally safe to merge, but the model resolution bug in response-generator.ts will cause incorrect model selection for users with string-form agent config
- Score of 3 reflects a real logic bug in model resolution that affects users who configure agents.defaults.model as a string (a supported config format), plus a minor resource leak in the Scribe STT timeout handler. The streaming TTS, barge-in, silence filler, and hangup features are well-implemented with proper cleanup and error handling.
- extensions/voice-call/src/response-generator.ts (model resolution bug), extensions/voice-call/src/providers/stt-elevenlabs-scribe.ts (WebSocket cleanup on timeout)
<sub>Last reviewed commit: 6cab209</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #8922: feat(voice-call): Add ElevenLabs WebSocket streaming TTS · by mikiships · 2026-02-04 · 80.1%
- #23572: feat(voice): enable voice note conversation loop for Telegram and W... · by davidrudduck · 2026-02-22 · 79.8%
- #21566: feat(voice-call): bridge call transcripts to main agent session · by MegaPhoenix92 · 2026-02-20 · 78.8%
- #19489: fix(voice-call): add echo suppression for TTS playback · by kalichkin · 2026-02-17 · 77.8%
- #8251: fix(voice-call): remove redundant transcript from extraSystemPrompt · by geodeterra · 2026-02-03 · 77.1%
- #16089: fix(tts): clarify directive syntax in prompts and strip malformed tags · by kmixter · 2026-02-14 · 76.3%
- #12597: voice-call: add Asterisk ARI provider + core STT · by w0s1nsk1 · 2026-02-09 · 76.1%
- #23778: feat: chat UI facelift — speech, themes, config categories, and polish · by BunsDev · 2026-02-22 · 76.0%
- #7965: feat(tts): add Speechify as TTS provider · by chaerla · 2026-02-03 · 75.9%
- #10351: feat: Add Mumble voice chat extension · by emadomedher · 2026-02-06 · 75.8%