#19427: feat: add Soniox speech-to-text provider

by matjaz open 2026-02-17 19:53

agents size: L

Cluster: Voice Transcription Enhancements

Add Soniox as a new audio transcription provider for the media-understanding system. Soniox uses an async workflow: upload file → create transcription → poll → get transcript.

Features:

- Multipart file upload
- Async transcription with polling
- Language hints support (60+ languages)
- Excellent for Slovenian and other non-English languages
- ~$0.10/hour pricing
- Requires `SONIOX_API_KEY` env var

All polling is handled internally, so the provider exposes the same `transcribeAudio()` interface as other providers.

Resolves: openclaw/openclaw#7325

## Summary

Describe the problem and fix in 2–5 bullets:

- Problem: added
- Why it matters: new provider with powerful models
- What changed: new audio transcription provider

## Change Type (select all)

- [ ] Bug fix
- [x] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra

## Scope (select all touched areas)

- [ ] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra

## Linked Issue/PR

- Closes #7325

## User-visible / Behavior Changes

None.

## Security Impact (required)

- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `Yes`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
- If any `Yes`, explain risk + mitigation: New audio transcription provider API calls.

## Repro + Verification

### Environment

- OS:
- Runtime/container:
- Model/provider:
- Integration/channel (if any): Telegram
- Relevant config (redacted):

### Steps

1. Send an audio message via Telegram
2. The response is the transcription

### Expected

- Audio transcription

## Human Verification (required)

What you personally verified (not just CI), and how:

- Verified scenarios: sent audio via Telegram; audio transcription is working perfectly

## Compatibility / Migration

- Backward compatible? `Yes`
- Config/env changes? `No`
- Migration needed? `No`

## Failure Recovery (if this breaks)

- How to disable/revert this change quickly: just revert code
- Files/config to restore: none
- Known bad symptoms reviewers should watch for: no

## Risks and Mitigations

List only real risks for this PR. Add/remove entries as needed. If none, write `None`.

<h3>Greptile Summary</h3>

This PR adds Soniox as a new async audio transcription provider, following the established `MediaUnderstandingProvider` interface. The integration wiring (env key lookup, default model, provider registry, `AUTO_AUDIO_KEY_PROVIDERS`) is done correctly and consistently with existing providers. The main concern is in the core implementation file `soniox/audio.ts`:

- **Timeout budget not shared across sequential steps**: `params.timeoutMs` is passed unchanged to each of the four async helpers (`uploadFile`, `createTranscription`, `pollTranscription`, `getTranscript`). Because each step starts its own independent timer from `Date.now()`, the total wall-clock duration can reach up to `4 × timeoutMs` rather than being bounded by it. With the default 60-second audio timeout this allows up to 4 minutes of blocking.
- **Unescaped `fileName` in manual multipart body**: the filename is interpolated directly into the `Content-Disposition` header string without escaping double-quotes or CRLF sequences. A malformed or adversarial filename (derived from a user-supplied attachment) can corrupt the multipart structure or inject additional header lines. The OpenAI provider avoids this entirely by using the browser `FormData` API.
- **No tests**: unlike `deepgram` and `openai`, no `audio.test.ts` (unit) or `audio.live.test.ts` (live/integration) is provided, leaving the multi-step async workflow, timeout handling, and response parsing untested.

<h3>Confidence Score: 2/5</h3>

- Not safe to merge without fixing the timeout budget leak and the unescaped filename in the multipart body.
- Two logic bugs in the core implementation: (1) the timeout is not tracked as a shared budget across the four sequential async steps, allowing total elapsed time up to 4× the intended limit; (2) the filename is embedded raw into a manually-constructed `Content-Disposition` header without escaping, which can corrupt the multipart request for filenames containing double-quotes or CRLF. Additionally, no tests are present for any part of the new provider.
- `src/media-understanding/providers/soniox/audio.ts` requires the most attention: both logic issues are contained here, and it is the only file without tests.

<sub>Last reviewed commit: a262241</sub>
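
The two logic issues flagged in the review could be addressed along the following lines. This is only a sketch under stated assumptions, not the code in the PR: `createDeadline`, `escapeMultipartFilename`, and `contentDisposition` are hypothetical names, and the actual signatures of the four helpers in `soniox/audio.ts` are not shown in this review. The deadline helper keeps one shared budget for all steps. The filename helper uses the same percent-encoding that browsers apply to filenames in `multipart/form-data` bodies (`"` → `%22`, CR → `%0D`, LF → `%0A`).

```typescript
// Hypothetical helpers for illustration; none of these names appear in the PR.

/** Tracks one overall time budget that several sequential steps share. */
interface Deadline {
  /** Milliseconds left in the budget; throws once the budget is spent. */
  remainingMs(): number;
}

// `now` is injectable so the clock can be faked in tests.
function createDeadline(totalMs: number, now: () => number = Date.now): Deadline {
  const expiresAt = now() + totalMs;
  return {
    remainingMs(): number {
      const left = expiresAt - now();
      if (left <= 0) {
        throw new Error(`Transcription timed out after ${totalMs}ms`);
      }
      return left;
    },
  };
}

/**
 * Percent-encodes the three characters that can break a quoted
 * Content-Disposition parameter: double quote, CR, and LF.
 */
function escapeMultipartFilename(fileName: string): string {
  return fileName
    .replace(/"/g, "%22")
    .replace(/\r/g, "%0D")
    .replace(/\n/g, "%0A");
}

/** Builds the header line for a file part using the escaped filename. */
function contentDisposition(fieldName: string, fileName: string): string {
  const safeName = escapeMultipartFilename(fileName);
  return `Content-Disposition: form-data; name="${fieldName}"; filename="${safeName}"`;
}
```

With a deadline created once from `params.timeoutMs`, each sequential step would receive `deadline.remainingMs()` as its own timeout instead of the full value, so the total wall-clock time stays bounded by the original budget. Switching to `FormData`, as the review notes the OpenAI provider does, would make the manual escaping unnecessary.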