#8922: feat(voice-call): Add ElevenLabs WebSocket streaming TTS
channel: voice-call
stale
Cluster: Voice Call and TTS Improvements
## Summary
Adds ElevenLabs as a low-latency TTS provider for voice calls via WebSocket streaming. Audio chunks stream directly to Twilio Media Streams as they arrive from ElevenLabs, which reduces time-to-first-audio relative to the existing batch TTS approach (first audio chunk in roughly 150-300 ms in live testing).
**Relates to #8582** (ElevenLabs integration request)
## What Changed
### New file: `src/elevenlabs-stream.ts` (~250 lines)
WebSocket streaming TTS client for ElevenLabs with:
- Persistent WebSocket connection pool (60s idle timeout, 15s cleanup interval)
- `ulaw_8000` output format — streams directly to Twilio without transcoding
- `auto_mode` for optimal chunking from ElevenLabs
- 30s overall timeout per TTS request
- AbortSignal support for barge-in cancellation
- TTFB and timing instrumentation
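The output format and chunking options listed above are passed as query parameters on the ElevenLabs `stream-input` WebSocket URL. The following is a minimal sketch of building that URL, assuming the publicly documented endpoint shape. `buildStreamUrl` and `StreamUrlOptions` are hypothetical names for illustration and may not match what `elevenlabs-stream.ts` actually contains.

```typescript
// Hypothetical helper: the PR's real URL construction is not shown in this description.
interface StreamUrlOptions {
  voiceId: string;
  modelId?: string; // falls back to the default documented in this PR
}

function buildStreamUrl(options: StreamUrlOptions): string {
  const modelId = options.modelId ?? "eleven_flash_v2_5";
  // Encode the path segment so an unusual voice ID cannot change the URL structure.
  const url = new URL(
    "wss://api.elevenlabs.io/v1/text-to-speech/" +
      encodeURIComponent(options.voiceId) +
      "/stream-input",
  );
  url.searchParams.set("model_id", modelId);
  url.searchParams.set("output_format", "ulaw_8000"); // 8 kHz mu-law, no transcoding for Twilio
  url.searchParams.set("auto_mode", "true"); // let ElevenLabs decide chunk boundaries
  return url.toString();
}
```

Building the query with `URL.searchParams` rather than string concatenation keeps every value percent-encoded, which is relevant to the URL encoding concern raised in review.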
### Modified: `src/providers/twilio.ts`
- Added `ElevenLabsStreamConfig` field and `setElevenLabsStreamConfig()` setter
- Modified `playTts()` condition to also check ElevenLabs config
- Added ElevenLabs streaming path in `playTtsViaStream()` — preferred over batch TTS when configured, with graceful fallback to existing batch TTS
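The preference order in `playTtsViaStream()` can be pictured as an ordered list of playback paths that are tried until one succeeds. This is only a sketch under that assumption. `TtsPath` and `playWithFallback` are hypothetical names, and the real method may structure its fallback differently.

```typescript
// Hypothetical shape for one playback strategy (streaming, batch, TwiML, ...).
interface TtsPath {
  name: string;
  play: (text: string) => Promise<void>;
}

// Tries each path in order and returns the name of the first one that succeeds.
async function playWithFallback(text: string, paths: TtsPath[]): Promise<string> {
  const failures: string[] = [];
  for (const path of paths) {
    try {
      await path.play(text);
      return path.name;
    } catch (err) {
      const reason = err instanceof Error ? err.message : String(err);
      failures.push(path.name + ": " + reason);
    }
  }
  throw new Error("all TTS paths failed (" + failures.join("; ") + ")");
}
```

Passing the ElevenLabs streaming path first and the batch path second mirrors the order described above. If ElevenLabs is not configured, the caller would simply omit the first entry.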
### Modified: `src/runtime.ts`
- Added ElevenLabs config resolution from `config.tts.elevenlabs` (apiKey, voiceId, modelId, voiceSettings)
- Wires config to `TwilioProvider.setElevenLabsStreamConfig()` when streaming is enabled
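A rough sketch of what the config resolution step might look like follows. The field names come from this PR, but the interface shapes, the `resolveElevenLabsConfig` name, and the rule that `apiKey` plus `voiceId` together enable streaming are assumptions made for illustration.

```typescript
interface VoiceSettings {
  stability?: number;
  similarityBoost?: number;
  speed?: number;
}

// Assumed shape: the PR names this type but does not show its fields.
interface ElevenLabsStreamConfig {
  apiKey: string;
  voiceId: string;
  modelId: string;
  voiceSettings?: VoiceSettings;
}

interface RawTtsConfig {
  elevenlabs?: {
    apiKey?: string;
    voiceId?: string;
    modelId?: string;
    voiceSettings?: VoiceSettings;
  };
}

// Returns undefined when ElevenLabs is not configured, so the caller
// leaves the existing batch TTS path untouched.
function resolveElevenLabsConfig(tts: RawTtsConfig | undefined): ElevenLabsStreamConfig | undefined {
  const raw = tts?.elevenlabs;
  if (!raw?.apiKey || !raw.voiceId) return undefined;
  const resolved: ElevenLabsStreamConfig = {
    apiKey: raw.apiKey,
    voiceId: raw.voiceId,
    modelId: raw.modelId ?? "eleven_flash_v2_5", // documented default
  };
  if (raw.voiceSettings) resolved.voiceSettings = raw.voiceSettings;
  return resolved;
}
```

The runtime would then call `setElevenLabsStreamConfig()` only when this function returns a value.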
## Configuration
```yaml
extensions:
  voice-call:
    tts:
      elevenlabs:
        apiKey: "sk_..."
        voiceId: "JBFqnCBsd6RMkjVDRZzb"  # any ElevenLabs voice ID
        modelId: "eleven_flash_v2_5"     # optional, this is the default
        voiceSettings:                   # optional
          stability: 0.5
          similarityBoost: 0.75
          speed: 1.0
```
When ElevenLabs is not configured, the existing batch TTS path is used unchanged.
## Design Decisions
- **Additive only** — all existing functionality is preserved; ElevenLabs is opt-in
- **Connection pooling** — WebSocket connections are reused across TTS calls to avoid reconnection overhead
- **Direct streaming** — audio chunks are forwarded to Twilio as they arrive (no buffering), giving sub-300ms TTFB
- **Graceful degradation** — if ElevenLabs streaming fails, falls through to batch TTS, then to TwiML `<Say>` fallback
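The connection pooling decision can be sketched as a keyed pool with idle eviction. This is a simplified illustration, not the PR's implementation: `IdlePool` and `Closable` are hypothetical names, the clock is passed in explicitly to keep the example deterministic, and concurrent use of one connection is not handled.

```typescript
interface Closable {
  close(): void;
}

interface PoolEntry<T> {
  conn: T;
  lastUsed: number;
}

class IdlePool<T extends Closable> {
  private readonly entries = new Map<string, PoolEntry<T>>();
  private readonly create: (key: string) => T;
  private readonly idleMs: number;

  constructor(create: (key: string) => T, idleMs: number = 60000) {
    this.create = create;
    this.idleMs = idleMs;
  }

  // Reuses the pooled connection for this key, or opens a new one.
  acquire(key: string, now: number): T {
    const entry = this.entries.get(key);
    if (entry) {
      entry.lastUsed = now;
      return entry.conn;
    }
    const conn = this.create(key);
    this.entries.set(key, { conn, lastUsed: now });
    return conn;
  }

  // Closes and drops connections idle for at least idleMs; returns how many were closed.
  sweep(now: number): number {
    let closed = 0;
    this.entries.forEach((entry, key) => {
      if (now - entry.lastUsed >= this.idleMs) {
        entry.conn.close();
        this.entries.delete(key);
        closed++;
      }
    });
    return closed;
  }
}
```

With the timings quoted in this PR, a caller would run something like `setInterval(() => pool.sweep(Date.now()), 15000)` against a pool created with the default 60 s idle limit.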
## Testing
- Tested with live Twilio calls via public webhook endpoint
- ElevenLabs WebSocket connects in ~50-100ms, first audio chunk arrives in ~150-300ms
- Barge-in (user interruption) works correctly via AbortSignal
- Connection pooling verified: subsequent TTS calls reuse WebSocket
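The barge-in behaviour can be illustrated with a small forwarder that stops passing audio to Twilio once the request's `AbortSignal` fires. This is a sketch only. `forwardUntilAborted` and `ChunkForwarder` are hypothetical names, and the PR's actual cancellation logic may do more, for example closing or resetting the ElevenLabs request.

```typescript
interface ChunkForwarder {
  // Returns false when the chunk was dropped because the caller interrupted.
  push(chunk: Uint8Array): boolean;
}

function forwardUntilAborted(
  send: (chunk: Uint8Array) => void,
  signal: AbortSignal,
  onCancel: () => void,
): ChunkForwarder {
  if (signal.aborted) {
    // Already interrupted before any audio arrived.
    onCancel();
  } else {
    signal.addEventListener("abort", onCancel, { once: true });
  }
  return {
    push(chunk: Uint8Array): boolean {
      if (signal.aborted) return false;
      send(chunk);
      return true;
    },
  };
}
```

In use, the call handler would own an `AbortController` per utterance and call `abort()` when it detects the user speaking. Chunks that arrive after that point are dropped instead of being forwarded.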
## AI Disclosure
This PR was AI-assisted (Claude). All code has been reviewed and tested manually with live calls.
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
Adds an opt-in ElevenLabs WebSocket streaming TTS path for Twilio voice calls. The new `elevenlabs-stream.ts` implements a pooled WebSocket client that streams `ulaw_8000` audio chunks via callback, and `TwilioProvider.playTtsViaStream` now prefers this streaming path when an ElevenLabs config is present, falling back to the existing batch TTS streaming and then TwiML `<Say>`.
Main concerns are around WebSocket pooling lifecycle: listener cleanup on error/close paths is incomplete, which can leak listeners across pooled reuse and lead to confusing double-callback behavior in long-running processes. There are also a couple of smaller correctness issues (URL encoding and metric accounting) and a security-related footgun in the timing-safe token comparison path.
<h3>Confidence Score: 3/5</h3>
- This PR is mergeable but has a few correctness issues in the new WebSocket pooling code that should be addressed first.
- Core integration approach is reasonable and scoped, but the new pooled WebSocket client has incomplete listener cleanup on error/close paths which can cause memory leaks and stale handlers firing across requests. Remaining findings are lower impact (URL encoding, metrics accuracy, timingSafeEqual usage).
- extensions/voice-call/src/elevenlabs-stream.ts
<!-- /greptile_comment -->
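One way to address the listener cleanup concern raised in the review is to route every exit path (final message, error, close, cancel) through a single function that detaches all per-request listeners before reporting the result. The sketch below assumes a socket with `on`/`off` methods, as `ws` sockets provide. `attachRequest`, `SocketLike`, and `RequestHandlers` are hypothetical names, not code from this PR.

```typescript
type Listener = (...args: unknown[]) => void;

interface SocketLike {
  on(event: string, listener: Listener): unknown;
  off(event: string, listener: Listener): unknown;
}

interface RequestHandlers {
  // Return true when this message is the last one for the request.
  onMessage(data: unknown): boolean;
  // Called exactly once, with an error if the request did not complete.
  onDone(err?: Error): void;
}

// Attaches per-request listeners to a pooled socket and returns a cancel function.
function attachRequest(socket: SocketLike, handlers: RequestHandlers): () => void {
  let settled = false;
  const finish = (err?: Error): void => {
    if (settled) return;
    settled = true;
    // Detach everything so the next request on this socket starts clean.
    socket.off("message", onMessage);
    socket.off("error", onError);
    socket.off("close", onClose);
    handlers.onDone(err);
  };
  const onMessage: Listener = (data) => {
    if (handlers.onMessage(data)) finish();
  };
  const onError: Listener = (err) => {
    finish(err instanceof Error ? err : new Error(String(err)));
  };
  const onClose: Listener = () => {
    finish(new Error("socket closed before the final chunk"));
  };
  socket.on("message", onMessage);
  socket.on("error", onError);
  socket.on("close", onClose);
  return () => finish(new Error("cancelled"));
}
```

Because `finish` is idempotent, an `error` event followed by a `close` event reports only one result, and no handler from a finished request remains on the socket when it is reused.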
Most Similar PRs
- #8317: fix(tts): add dynamic timeout and retry logic for ElevenLabs TTS (by camtang26 · 2026-02-03, 81.5% similar)
- #19073: feat(voice-call): streaming TTS, barge-in, silence filler, hangup, ... (by odrobnik · 2026-02-17, 80.1% similar)
- #7965: feat(tts): add Speechify as TTS provider (by chaerla · 2026-02-03, 77.3% similar)
- #20794: feat(tts): add Fish Audio provider with full docs, tests & gateway ... (by twangodev · 2026-02-19, 76.9% similar)
- #7704: fix(voice-call): add authentication to WebSocket media stream endpoint (by coygeek · 2026-02-03, 75.7% similar)
- #8339: fix(tts): validate ElevenLabs base URL against allowlist (by yubrew · 2026-02-03, 75.6% similar)
- #19489: fix(voice-call): add echo suppression for TTS playback (by kalichkin · 2026-02-17, 74.7% similar)
- #7258: feat(tts): add Inworld AI TTS provider (by willsinghwilson · 2026-02-02, 74.2% similar)
- #21050: security(voice-call): path-based stream token for Twilio WebSocket ... (by richvincent · 2026-02-19, 74.0% similar)
- #21193: fix(tts): send voice messages as Opus bubbles on Telegram (by aris-katkova · 2026-02-19, 73.4% similar)