#8428: voice-call: add low-latency streaming infrastructure
## Summary
Major improvements to the voice call extension for **faster response times**. Previously, users had to wait for the entire LLM response before hearing anything. Now, audio starts playing as soon as the AI generates its first sentence.
## What's New
### Faster Speech-to-Text (Deepgram Flux)
- **Model-based end-of-turn detection**: Instead of waiting for silence (VAD), the model predicts when you've finished speaking
- **Speculative processing**: Starts generating a response *before* you finish speaking, then uses it immediately if the prediction was correct
- **Native telephony audio**: Accepts mu-law 8kHz directly from Twilio (no conversion overhead)
### Faster Text-to-Speech (Cartesia)
- **Persistent WebSocket connection**: Eliminates per-request connection overhead
- **Native mu-law output**: No PCM→mu-law conversion needed
- **Streaming chunks**: Audio starts playing while still being generated
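For reference, the per-sample conversion that native mu-law output eliminates is the standard G.711 mu-law companding transform. A minimal sketch (not the extension's code) of encoding one 16-bit PCM sample:

```typescript
// G.711 mu-law encoding of one 16-bit linear PCM sample.
function linearToMulaw(sample: number): number {
  const BIAS = 0x84;   // standard G.711 bias
  const CLIP = 32635;  // clip magnitude so BIAS cannot overflow 15 bits
  const sign = (sample >> 8) & 0x80;  // save the sign bit
  if (sign !== 0) sample = -sample;   // work with the magnitude
  if (sample > CLIP) sample = CLIP;
  sample += BIAS;
  // Segment (exponent): position of the highest set bit above the bias.
  let exponent = 7;
  for (let mask = 0x4000; (sample & mask) === 0 && exponent > 0; mask >>= 1) {
    exponent--;
  }
  const mantissa = (sample >> (exponent + 3)) & 0x0f;
  // Mu-law bytes are transmitted inverted.
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}
```

Doing this (plus any resampling to 8 kHz) for every sample of every response adds up; having the TTS provider emit mu-law directly skips the step entirely.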
### Streaming LLM → TTS Pipeline
- **Sentence-by-sentence delivery**: First sentence plays while LLM generates the rest
- **Barge-in support**: Interrupt the AI mid-response by speaking
- **Graceful cancellation**: If you continue speaking after an early prediction, speculative work is cancelled
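The sentence-by-sentence handoff can be sketched as a small buffer that turns streamed LLM deltas into TTS-ready sentences. This is illustrative only; the class name and the punctuation-based splitting heuristic are assumptions, not the extension's actual logic:

```typescript
// Accumulates streamed LLM text deltas and emits complete sentences for TTS.
class SentenceChunker {
  private buffer = "";

  /** Feed a streamed text delta; returns any complete sentences now available. */
  push(delta: string): string[] {
    this.buffer += delta;
    const out: string[] = [];
    let idx: number;
    // Heuristic: a sentence ends at ., !, or ? followed by whitespace.
    while ((idx = this.buffer.search(/[.!?]\s/)) !== -1) {
      out.push(this.buffer.slice(0, idx + 1).trim());
      this.buffer = this.buffer.slice(idx + 2);
    }
    return out;
  }

  /** Flush any trailing partial sentence when the LLM stream ends. */
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

Each emitted sentence can be sent to TTS immediately, so the first sentence is playing while the LLM is still generating the rest; barge-in then only has to cancel the in-flight TTS and drop the buffer.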
## Before vs After
```
BEFORE: You stop speaking → Silence detection → Full LLM response → TTS → Audio
(You wait for the entire response before hearing anything)
AFTER: You stop speaking → Model detects end → First sentence → Audio
(Audio starts immediately, rest streams in parallel)
```
## Configuration
```yaml
plugins:
  entries:
    voice-call:
      config:
        streaming:
          enabled: true
          sttProvider: "deepgram-flux" # or "openai-realtime"
          deepgramApiKey: "..."
        tts:
          provider: "cartesia"
          cartesia:
            apiKey: "..."
            voiceId: "..."
```
## Test plan
- [ ] Test call with Deepgram Flux STT
- [ ] Test call with Cartesia TTS
- [ ] Verify sentence streaming (ask for something long like "explain quantum computing")
- [ ] Test barge-in during response (interrupt the AI)
- [ ] Test EagerEndOfTurn speculation (pause briefly mid-sentence)
🤖 Generated with [Claude Code](https://claude.ai/code)