#8955: feat(tts): Add Kokoro-82M as first-class TTS provider

by emadomedher open 2026-02-04 17:11

stale

Cluster: Text-to-Speech Provider Enhancements

## Summary

Adds **Kokoro-82M** as a first-class TTS provider - the fastest local text-to-speech system available.

## Provider Overview

### Kokoro-82M TTS

- **Speed**: 35-100x realtime on CUDA GPUs (sub-0.3s for any text length)
- **Size**: Only 82 million parameters (tiny, efficient)
- **Quality**: Comparable to much larger models
- **Voices**: 67 voices across 8 languages (English, Japanese, Chinese, Spanish, Hindi, Italian, French, Portuguese)
- **Voice Mixing**: Supports blending voices (e.g., `af_bella+jf_alpha`)
- **License**: Apache 2.0 (fully open)
- **OpenAI-compatible API**: Drop-in replacement for existing tools

## Configuration

**⚠️ Important:** Kokoro requires `enabled: true` to be explicitly set in the configuration, even when specified as the primary provider.

```json
{
  "messages": {
    "tts": {
      "provider": "kokoro",
      "kokoro": {
        "enabled": true,
        "baseUrl": "http://localhost:8102",
        "voice": "af_bella"
      }
    }
  }
}
```

## Auth Profiles

Like other first-class providers, Kokoro requires an auth profile entry in `~/.openclaw/agents/main/agent/auth-profiles.json`, even for local services:

```json
{
  "profiles": {
    "kokoro:local": {
      "type": "token",
      "provider": "kokoro",
      "token": "not-needed"
    }
  }
}
```

Without this entry, TTS calls will silently fail with "No API key found" errors.
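The two requirements above (an explicit `enabled: true` and a `kokoro` auth profile entry) can be sketched as a small pre-flight check. This is a minimal TypeScript sketch under stated assumptions, not the code in this PR: `resolveKokoroConfig`, `hasKokoroAuthProfile`, `preflightKokoro`, and all interface names are hypothetical, and the defaults are the ones this PR describes (localhost:8102, `af_bella`, speed 1.0).

```typescript
// Hypothetical shapes for illustration only; the PR's real types may differ.
interface KokoroConfigInput {
  enabled?: boolean;
  baseUrl?: string;
  voice?: string; // a single voice or a blend such as "af_bella+jf_alpha"
  speed?: number;
}

interface ResolvedKokoroConfig {
  enabled: boolean;
  baseUrl: string;
  voice: string;
  speed: number;
}

interface AuthProfile {
  type: string;
  provider: string;
  token?: string;
}

interface AuthProfilesFile {
  profiles: Record<string, AuthProfile>;
}

// Defaults as described in this PR.
const KOKORO_DEFAULTS = {
  baseUrl: "http://localhost:8102",
  voice: "af_bella",
  speed: 1.0,
};

// Fill in defaults; `enabled` stays false unless it is explicitly `true`.
function resolveKokoroConfig(input: KokoroConfigInput = {}): ResolvedKokoroConfig {
  return {
    enabled: input.enabled === true,
    baseUrl: input.baseUrl ?? KOKORO_DEFAULTS.baseUrl,
    voice: input.voice ?? KOKORO_DEFAULTS.voice,
    speed: input.speed ?? KOKORO_DEFAULTS.speed,
  };
}

// True when at least one profile entry targets the "kokoro" provider.
function hasKokoroAuthProfile(file: AuthProfilesFile): boolean {
  return Object.values(file.profiles).some((p) => p.provider === "kokoro");
}

// Collect readable problems up front instead of failing at call time.
function preflightKokoro(
  input: KokoroConfigInput | undefined,
  auth: AuthProfilesFile,
): string[] {
  const problems: string[] = [];
  if (!resolveKokoroConfig(input).enabled) {
    problems.push("messages.tts.kokoro.enabled must be set to true");
  }
  if (!hasKokoroAuthProfile(auth)) {
    problems.push("auth-profiles.json has no profile with provider \"kokoro\"");
  }
  return problems;
}
```

Running `preflightKokoro` against the parsed config and auth-profiles file before the first TTS call would surface both misconfigurations as explicit messages.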
## Changes

- Added Kokoro to `TtsProvider` type
- Added Kokoro config block in `TtsConfig` with voice mixing support
- Added Zod schema validation for Kokoro
- Implemented `kokoroTTS()` function with OpenAI-compatible API calls
- Added Kokoro to provider fallback chain
- Added config resolution with defaults (localhost:8102, af_bella voice)

## Installation

Kokoro can be self-hosted using the [Kokoro-FastAPI](https://github.com/remsky/Kokoro-FastAPI) server:

```bash
git clone https://github.com/remsky/Kokoro-FastAPI.git
cd Kokoro-FastAPI
./start-gpu.sh  # For CUDA support
# or
./start-cpu.sh  # For CPU-only
```

## Testing

- Tested with local Kokoro server on CUDA GPU
- Verified voice message generation on Matrix
- Confirmed auth profile requirement
- Tested voice mixing feature
- Measured 35-100x realtime speed on RTX 5080

## Resources

- Model: https://huggingface.co/hexgrad/Kokoro-82M
- Server: https://github.com/remsky/Kokoro-FastAPI
- Benchmark: https://www.inferless.com/learn/comparing-different-text-to-speech---tts--models-part-2

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

Adds a new `kokoro` TTS provider across config types, Zod validation, and runtime TTS dispatch. The provider is resolved with defaults (localhost:8102, `af_bella`, speed=1.0), added to the provider fallback order, and implemented via an OpenAI-compatible `/v1/audio/speech` request that writes the returned audio buffer to a temp file for downstream message sending.

<h3>Confidence Score: 3/5</h3>

- This PR is close to mergeable but has a couple of runtime-behavior issues to address first.
- Kokoro integration is self-contained and follows existing provider patterns, but there are correctness issues around the new resolved config shape assumptions and audio compatibility flagging that can affect message sending behavior.
- src/tts/tts.ts

**Context used:**

- Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8))
- Context from `dashboard` - AGENTS.md ([source](https://app.greptile.com/review/custom-context?memory=0d0c8278-ef8e-4d6c-ab21-f5527e322f13))
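
The dispatch path described in the Greptile summary (an OpenAI-compatible `/v1/audio/speech` request whose audio buffer is written to a temp file) might look roughly like the sketch below. It is not the `kokoroTTS()` implementation from this PR. The function name `kokoroSpeechToTempFile`, the `FetchLike` type, the `"kokoro"` model name, the `mp3` response format, and the temp file naming are all assumptions made for illustration. The fetch function is injected so the sketch can be exercised without a running server.

```typescript
import { mkdtemp, writeFile } from "node:fs/promises";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Minimal fetch-like contract (hypothetical) so a stub can stand in for the network.
interface SpeechRequestInit {
  method: string;
  headers: Record<string, string>;
  body: string;
}

interface SpeechResponse {
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}

type FetchLike = (url: string, init: SpeechRequestInit) => Promise<SpeechResponse>;

// Hypothetical options; field names mirror the config block in this PR.
interface KokoroSpeechOptions {
  baseUrl: string; // e.g. "http://localhost:8102"
  voice: string;   // single voice or a blend such as "af_bella+jf_alpha"
  speed?: number;  // falls back to 1.0
}

async function kokoroSpeechToTempFile(
  text: string,
  opts: KokoroSpeechOptions,
  fetchImpl: FetchLike,
): Promise<{ path: string; bytes: number }> {
  // Strip trailing slashes so the joined URL has exactly one separator.
  const url = `${opts.baseUrl.replace(/\/+$/, "")}/v1/audio/speech`;

  const response = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "kokoro", // assumption: model name expected by the local server
      input: text,
      voice: opts.voice,
      response_format: "mp3", // assumption: format chosen for this sketch
      speed: opts.speed ?? 1.0,
    }),
  });

  if (!response.ok) {
    throw new Error(`Kokoro TTS request failed with HTTP ${response.status}`);
  }

  // Persist the returned audio so a downstream sender can attach the file.
  const audio = new Uint8Array(await response.arrayBuffer());
  const dir = await mkdtemp(join(tmpdir(), "kokoro-tts-"));
  const path = join(dir, "speech.mp3");
  await writeFile(path, audio);
  return { path, bytes: audio.byteLength };
}
```

On Node 18 or later, the global `fetch` could be passed as `fetchImpl`, for example `kokoroSpeechToTempFile("Hello", { baseUrl: "http://localhost:8102", voice: "af_bella" }, fetch)`. Throwing on a non-2xx status keeps a server-side failure from being mistaken for an empty audio file.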