#19427: feat: add Soniox speech-to-text provider
agents
size: L
Cluster: Voice Transcription Enhancements
Add Soniox as a new audio transcription provider for the media-understanding system. Soniox uses an async workflow: upload file → create transcription → poll → get transcript.
Features:
- Multipart file upload
- Async transcription with polling
- Language hints support (60+ languages)
- Excellent for Slovenian and other non-English languages
- ~$0.10/hour pricing
- Requires SONIOX_API_KEY env var.
All polling is handled internally — same transcribeAudio() interface as other providers.
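The upload → create → poll → fetch flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `AsyncSttClient` interface, its method names, the status strings, and the `pollIntervalMs`/`maxPolls` options are all hypothetical stand-ins, since the actual Soniox endpoints and response fields are not shown in this description.

```typescript
// Hypothetical status values; the real Soniox API may use different names.
type TranscriptionStatus = "queued" | "processing" | "completed" | "error";

// Hypothetical client abstraction over the four steps of the async workflow.
interface AsyncSttClient {
  uploadFile(audio: Uint8Array, fileName: string): Promise<string>; // resolves to a file id
  createTranscription(fileId: string, languageHints?: string[]): Promise<string>; // resolves to a transcription id
  getStatus(transcriptionId: string): Promise<TranscriptionStatus>;
  getTranscript(transcriptionId: string): Promise<string>;
}

interface PollOptions {
  languageHints?: string[];
  pollIntervalMs?: number; // assumed default below, not taken from the PR
  maxPolls?: number; // assumed default below, not taken from the PR
}

const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs all four steps and hides the polling loop from the caller.
async function transcribeViaPolling(
  client: AsyncSttClient,
  audio: Uint8Array,
  fileName: string,
  options: PollOptions = {},
): Promise<string> {
  const { languageHints, pollIntervalMs = 1000, maxPolls = 60 } = options;

  const fileId = await client.uploadFile(audio, fileName);
  const transcriptionId = await client.createTranscription(fileId, languageHints);

  for (let attempt = 0; attempt < maxPolls; attempt++) {
    const status = await client.getStatus(transcriptionId);
    if (status === "completed") {
      return client.getTranscript(transcriptionId);
    }
    if (status === "error") {
      throw new Error(`Transcription ${transcriptionId} failed`);
    }
    // Still queued or processing: wait before asking again.
    await sleep(pollIntervalMs);
  }
  throw new Error(
    `Transcription ${transcriptionId} did not complete after ${maxPolls} polls`,
  );
}
```

Because the loop lives inside a single function, a caller only awaits one promise, which is what lets an async provider sit behind the same `transcribeAudio()` interface as synchronous ones.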
Resolves: openclaw/openclaw#7325
## Summary
Describe the problem and fix in 2–5 bullets:
- Problem: Soniox was not available as an audio transcription provider
- Why it matters: new provider with powerful models
- What changed: new audio transcription provider
## Change Type (select all)
- [ ] Bug fix
- [x] Feature
- [ ] Refactor
- [ ] Docs
- [ ] Security hardening
- [ ] Chore/infra
## Scope (select all touched areas)
- [ ] Gateway / orchestration
- [ ] Skills / tool execution
- [ ] Auth / tokens
- [ ] Memory / storage
- [ ] Integrations
- [ ] API / contracts
- [ ] UI / DX
- [ ] CI/CD / infra
## Linked Issue/PR
- Closes #7325
## User-visible / Behavior Changes
None.
## Security Impact (required)
- New permissions/capabilities? `No`
- Secrets/tokens handling changed? `No`
- New/changed network calls? `Yes`
- Command/tool execution surface changed? `No`
- Data access scope changed? `No`
- If any `Yes`, explain risk + mitigation:
New audio transcription provider API calls.
## Repro + Verification
### Environment
- OS:
- Runtime/container:
- Model/provider:
- Integration/channel (if any): Telegram
- Relevant config (redacted):
### Steps
1. Send an audio message via Telegram.
2. The reply contains the transcription.
### Expected
- audio transcription
## Human Verification (required)
What you personally verified (not just CI), and how:
- Verified scenarios: sent an audio message via Telegram; the transcription was returned correctly.
## Compatibility / Migration
- Backward compatible? `Yes`
- Config/env changes? `Yes` (requires the `SONIOX_API_KEY` env var)
- Migration needed? `No`
## Failure Recovery (if this breaks)
- How to disable/revert this change quickly: revert this PR's commits
- Files/config to restore: none
- Known bad symptoms reviewers should watch for: none
## Risks and Mitigations
List only real risks for this PR. Add/remove entries as needed. If none, write `None`.
<!-- greptile_comment -->
<h3>Greptile Summary</h3>
This PR adds Soniox as a new async audio transcription provider, following the established `MediaUnderstandingProvider` interface. The integration wiring (env key lookup, default model, provider registry, `AUTO_AUDIO_KEY_PROVIDERS`) is done correctly and consistently with existing providers.
The main concern is in the core implementation file `soniox/audio.ts`:
- **Timeout budget not shared across sequential steps** — `params.timeoutMs` is passed unchanged to each of the four async helpers (`uploadFile`, `createTranscription`, `pollTranscription`, `getTranscript`). Because each step starts its own independent timer from `Date.now()`, the total wall-clock duration can reach up to `4 × timeoutMs` rather than being bounded by it. With the default 60-second audio timeout this allows up to 4 minutes of blocking.
- **Unescaped `fileName` in manual multipart body** — the filename is interpolated directly into the `Content-Disposition` header string without escaping double-quotes or CRLF sequences. A malformed or adversarial filename (derived from a user-supplied attachment) can corrupt the multipart structure or inject additional header lines. The OpenAI provider avoids this entirely by using the browser `FormData` API.
- **No tests** — unlike `deepgram` and `openai`, no `audio.test.ts` (unit) or `audio.live.test.ts` (live/integration) is provided, leaving the multi-step async workflow, timeout handling, and response parsing untested.
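One way to address the timeout concern above is to compute a single deadline up front and hand each step only the time that remains. The sketch below is illustrative: `createDeadline` and `remainingMs` are hypothetical names, and the exact parameter shapes of `uploadFile`, `createTranscription`, `pollTranscription`, and `getTranscript` are assumed rather than taken from the PR.

```typescript
// Hypothetical helper: one deadline shared by every sequential step.
// The clock is injectable so the behaviour can be tested without waiting.
function createDeadline(timeoutMs: number, now: () => number = Date.now) {
  const deadline = now() + timeoutMs;
  return {
    // Milliseconds left in the shared budget; throws once it is spent.
    remainingMs(step: string): number {
      const left = deadline - now();
      if (left <= 0) {
        throw new Error(`Timed out before ${step} (budget ${timeoutMs}ms)`);
      }
      return left;
    },
  };
}

// Intended use inside the provider (assumed call shapes, shown as comments):
//   const budget = createDeadline(params.timeoutMs);
//   const fileId = await uploadFile({ ...params, timeoutMs: budget.remainingMs("upload") });
//   const id = await createTranscription({ fileId, timeoutMs: budget.remainingMs("create") });
//   await pollTranscription({ id, timeoutMs: budget.remainingMs("poll") });
//   return getTranscript({ id, timeoutMs: budget.remainingMs("fetch") });
```

With this shape, a 60-second `timeoutMs` bounds the whole call: if the upload takes 45 seconds, the remaining three steps share the last 15 seconds instead of each starting a fresh 60-second timer.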
<h3>Confidence Score: 2/5</h3>
- Not safe to merge without fixing the timeout budget leak and the unescaped filename in the multipart body.
- Two logic bugs in the core implementation: (1) the timeout is not tracked as a shared budget across the four sequential async steps, allowing total elapsed time up to 4× the intended limit; (2) the filename is embedded raw into a manually-constructed Content-Disposition header without escaping, which can corrupt the multipart request for filenames containing double-quotes or CRLF. Additionally, no tests are present for any part of the new provider.
- src/media-understanding/providers/soniox/audio.ts requires the most attention — both logic issues are contained here, and it is the only file without tests.
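If the multipart body continues to be built by hand, the filename needs escaping before it is placed in the header. The sketch below follows the escaping rule from the WHATWG `multipart/form-data` encoding algorithm (CR, LF, and double-quote become `%0D`, `%0A`, and `%22`). The function names are hypothetical, and the field name is assumed to be a trusted constant. Switching to the `FormData` API, as the OpenAI provider does, would avoid manual escaping altogether.

```typescript
// Hypothetical helper: percent-escape the characters that could terminate the
// quoted filename or start a new header line.
function escapeMultipartFileName(fileName: string): string {
  return fileName
    .replace(/\r/g, "%0D")
    .replace(/\n/g, "%0A")
    .replace(/"/g, "%22");
}

// Builds the Content-Disposition line for a file part.
// fieldName is assumed to be a trusted constant, so it is not escaped here.
function contentDispositionFor(fieldName: string, fileName: string): string {
  const safeName = escapeMultipartFileName(fileName);
  return `Content-Disposition: form-data; name="${fieldName}"; filename="${safeName}"`;
}
```

For a user-supplied name such as `a"b\r\nX-Injected: 1.ogg`, the quote and line break are neutralised, so the header stays on a single line and the multipart structure is preserved.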
<sub>Last reviewed commit: a262241</sub>
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
Most Similar PRs
- #14208: feat(media): add AssemblyAI audio transcription provider (by jmoraispk · 2026-02-11 · 82.2%)
- #7965: feat(tts): add Speechify as TTS provider (by chaerla · 2026-02-03 · 75.0%)
- #11334: feat: add Mistral/Voxtral audio transcription provider (by JamesEBall · 2026-02-07 · 74.6%)
- #12717: fix: add "audio" to openai provider capabilities (by openjay · 2026-02-09 · 73.6%)
- #8388: fix(media): auto-skip tiny/empty audio files before transcription (... (by Glucksberg · 2026-02-04 · 72.9%)
- #20794: feat(tts): add Fish Audio provider with full docs, tests & gateway ... (by twangodev · 2026-02-19 · 72.9%)
- #8048: Media: add regression test for audio text blocks (#7970) (by Abhishek-B-R · 2026-02-03 · 72.5%)
- #8922: feat(voice-call): Add ElevenLabs WebSocket streaming TTS (by mikiships · 2026-02-04 · 71.0%)
- #19246: feat(media): add Google Vertex AI media provider (by ronaldslc · 2026-02-17 · 71.0%)
- #13389: feat(telegram): support native voice notes with automatic OGG/Opus ... (by leavingme · 2026-02-10 · 70.7%)