#8922: feat(voice-call): Add ElevenLabs WebSocket streaming TTS

by mikiships open 2026-02-04 16:07

channel: voice-call stale

Cluster: Voice Call and TTS Improvements

## Summary

Adds ElevenLabs as a low-latency TTS provider for voice calls via WebSocket streaming. Audio chunks stream directly to Twilio Media Streams as they arrive from ElevenLabs, dramatically reducing time-to-first-audio compared to the existing batch TTS approach.

**Relates to #8582** (ElevenLabs integration request)

## What Changed

### New file: `src/elevenlabs-stream.ts` (~250 lines)

WebSocket streaming TTS client for ElevenLabs with:

- Persistent WebSocket connection pool (60s idle timeout, 15s cleanup interval)
- `ulaw_8000` output format: streams directly to Twilio without transcoding
- `auto_mode` for optimal chunking from ElevenLabs
- 30s overall timeout per TTS request
- AbortSignal support for barge-in cancellation
- TTFB and timing instrumentation

### Modified: `src/providers/twilio.ts`

- Added `ElevenLabsStreamConfig` field and `setElevenLabsStreamConfig()` setter
- Modified `playTts()` condition to also check ElevenLabs config
- Added ElevenLabs streaming path in `playTtsViaStream()`, preferred over batch TTS when configured, with graceful fallback to existing batch TTS

### Modified: `src/runtime.ts`

- Added ElevenLabs config resolution from `config.tts.elevenlabs` (apiKey, voiceId, modelId, voiceSettings)
- Wires config to `TwilioProvider.setElevenLabsStreamConfig()` when streaming is enabled

## Configuration

```yaml
extensions:
  voice-call:
    tts:
      elevenlabs:
        apiKey: "sk_..."
        voiceId: "JBFqnCBsd6RMkjVDRZzb"  # any ElevenLabs voice ID
        modelId: "eleven_flash_v2_5"     # optional, this is the default
        voiceSettings:                   # optional
          stability: 0.5
          similarityBoost: 0.75
          speed: 1.0
```

When ElevenLabs is not configured, the existing batch TTS path is used unchanged.
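To illustrate the direct-streaming path described above, here is a minimal sketch of forwarding one audio chunk to Twilio with barge-in support. It is not the code from `src/elevenlabs-stream.ts`. The names `MediaSink`, `AbortLike`, and `forwardChunk` are hypothetical, and the sketch assumes each ElevenLabs chunk arrives as a base64 string of `ulaw_8000` audio, so it can be placed into a Twilio Media Streams `media` message without decoding or transcoding.

```typescript
// Hypothetical minimal surface of the Twilio Media Streams WebSocket.
interface MediaSink {
  send(frame: string): void;
}

// Structural stand-in for AbortSignal, so the sketch has no runtime dependencies.
interface AbortLike {
  readonly aborted: boolean;
}

// Forward one base64-encoded mu-law chunk to Twilio as a "media" message.
// Returns false without sending if the caller has aborted (e.g. on barge-in).
function forwardChunk(
  sink: MediaSink,
  streamSid: string,
  base64Ulaw: string,
  signal?: AbortLike,
): boolean {
  if (signal !== undefined && signal.aborted) {
    return false;
  }
  sink.send(
    JSON.stringify({
      event: "media",
      streamSid,
      media: { payload: base64Ulaw },
    }),
  );
  return true;
}
```

Because the chunk is forwarded as soon as it arrives, nothing is buffered on the server side. Checking the abort flag per chunk is one simple way to stop playback quickly when the caller interrupts.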
## Design Decisions

- **Additive only**: all existing functionality is preserved; ElevenLabs is opt-in
- **Connection pooling**: WebSocket connections are reused across TTS calls to avoid reconnection overhead
- **Direct streaming**: audio chunks are forwarded to Twilio as they arrive (no buffering), giving sub-300ms TTFB
- **Graceful degradation**: if ElevenLabs streaming fails, falls through to batch TTS, then to TwiML `<Say>` fallback

## Testing

- Tested with live Twilio calls via public webhook endpoint
- ElevenLabs WebSocket connects in ~50-100ms, first audio chunk arrives in ~150-300ms
- Barge-in (user interruption) works correctly via AbortSignal
- Connection pooling verified: subsequent TTS calls reuse WebSocket

## AI Disclosure

This PR was AI-assisted (Claude). All code has been reviewed and tested manually with live calls.

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

Adds an opt-in ElevenLabs WebSocket streaming TTS path for Twilio voice calls. The new `elevenlabs-stream.ts` implements a pooled WebSocket client that streams `ulaw_8000` audio chunks via callback, and `TwilioProvider.playTtsViaStream` now prefers this streaming path when an ElevenLabs config is present, falling back to the existing batch TTS streaming and then TwiML `<Say>`.

Main concerns are around WebSocket pooling lifecycle: listener cleanup on error/close paths is incomplete, which can leak listeners across pooled reuse and lead to confusing double-callback behavior in long-running processes. There are also a couple of smaller correctness issues (URL encoding and metric accounting) and a security-related footgun in the timing-safe token comparison path.

<h3>Confidence Score: 3/5</h3>

- This PR is mergeable but has a few correctness issues in the new WebSocket pooling code that should be addressed first.
- Core integration approach is reasonable and scoped, but the new pooled WebSocket client has incomplete listener cleanup on error/close paths which can cause memory leaks and stale handlers firing across requests. Remaining findings are lower impact (URL encoding, metrics accuracy, timingSafeEqual usage).
- extensions/voice-call/src/elevenlabs-stream.ts
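The listener-cleanup concern raised in the review can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `SocketLike` and `streamOnce` are hypothetical names, and messages are assumed to be already parsed into objects with optional `audio` and `isFinal` fields. The point is that every exit path (final chunk, error, close) removes all per-request listeners before the socket goes back to the pool.

```typescript
// Hypothetical minimal emitter surface of a pooled socket.
interface SocketLike {
  on(event: string, listener: (...args: any[]) => void): void;
  off(event: string, listener: (...args: any[]) => void): void;
}

// Assumed shape of an already-parsed server message.
interface ChunkMessage {
  audio?: string;
  isFinal?: boolean;
}

// Stream one TTS request over a reusable socket, detaching every
// per-request listener on success, error, and close alike.
function streamOnce(
  socket: SocketLike,
  onAudio: (chunk: string) => void,
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    function cleanup(): void {
      socket.off("message", onMessage);
      socket.off("error", onError);
      socket.off("close", onClose);
    }
    function onMessage(msg: ChunkMessage): void {
      if (msg.audio) onAudio(msg.audio);
      if (msg.isFinal) {
        cleanup();
        resolve();
      }
    }
    function onError(err: Error): void {
      cleanup();
      reject(err);
    }
    function onClose(): void {
      cleanup();
      reject(new Error("socket closed before final chunk"));
    }
    socket.on("message", onMessage);
    socket.on("error", onError);
    socket.on("close", onClose);
  });
}
```

Routing all three exit paths through a single `cleanup` function keeps a reused connection from accumulating handlers that belong to earlier requests, which is the source of the double-callback behavior the review describes.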