#14208: feat(media): add AssemblyAI audio transcription provider
docs
size: L
Cluster: Voice Transcription Enhancements
#### Summary
Adds AssemblyAI as a new audio transcription provider for inbound media understanding. AssemblyAI offers competitive pricing (~$0.10-0.65/hr vs OpenAI Whisper's ~$0.36/hr), high accuracy, and features like automatic language detection and speaker diarization.
This is the first async (upload → submit → poll) transcription provider in OpenClaw, complementing the existing synchronous providers (OpenAI, Groq, Deepgram, Google).
#### Use Cases
- Users who prefer AssemblyAI's pricing or accuracy for audio transcription
- Users who already have AssemblyAI API keys for other projects
- Environments where AssemblyAI's language detection or specialized models are preferred
- Auto-detection: when `ASSEMBLYAI_API_KEY` is present, AssemblyAI is available as an audio provider
#### Behavior Changes
- New provider `assemblyai` available for `media.audio.provider` config
- `ASSEMBLYAI_API_KEY` is auto-detected in the provider key scan order (OpenAI → Groq → Deepgram → Google → **AssemblyAI**)
- Default model: `best` (AssemblyAI's highest-accuracy model); also supports `nano` for faster/cheaper transcription
- No changes to existing provider behavior or defaults
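A minimal sketch of how a key scan in the order above could pick a provider. Only `ASSEMBLYAI_API_KEY` and the provider order come from this PR; the other environment variable names, `KEY_SCAN_ORDER`, and `detectAudioProvider` are hypothetical and are not the contents of `defaults.ts`.

```typescript
// Hypothetical illustration of first-match key scanning; not the project's code.
type AudioProviderId = "openai" | "groq" | "deepgram" | "google" | "assemblyai";

// Order follows the PR description. Env var names other than
// ASSEMBLYAI_API_KEY are assumptions made for this example.
const KEY_SCAN_ORDER: ReadonlyArray<{ id: AudioProviderId; envVar: string }> = [
  { id: "openai", envVar: "OPENAI_API_KEY" },
  { id: "groq", envVar: "GROQ_API_KEY" },
  { id: "deepgram", envVar: "DEEPGRAM_API_KEY" },
  { id: "google", envVar: "GEMINI_API_KEY" },
  { id: "assemblyai", envVar: "ASSEMBLYAI_API_KEY" },
];

// Returns the first provider whose key is set to a non-blank value.
function detectAudioProvider(
  env: Record<string, string | undefined>,
): AudioProviderId | undefined {
  for (const { id, envVar } of KEY_SCAN_ORDER) {
    const value = env[envVar];
    if (value !== undefined && value.trim() !== "") {
      return id;
    }
  }
  return undefined;
}
```

Because AssemblyAI is last in the scan order, it would only be auto-selected under this scheme when none of the earlier keys are present; setting `media.audio.provider` explicitly is the way to prefer it otherwise.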
#### Existing Functionality Check
- [x] I searched the codebase for existing functionality.
Searches performed:
- Searched `src/media-understanding/providers/` for existing provider patterns (used Deepgram and OpenAI as implementation references)
- Searched for `assemblyai` across the codebase — no prior references found
- Reviewed `src/media-understanding/types.ts` for `AudioTranscriptionRequest`/`AudioTranscriptionResult` interfaces
#### Tests
- 6 unit tests in `src/media-understanding/providers/assemblyai/audio.test.ts` covering:
  - Full upload → submit → poll happy path
  - Multi-poll retry with exponential backoff
  - Transcription error status handling
  - Upload HTTP error handling
  - Language and custom model passthrough
  - Authorization header correctness
- All tests use mocked `fetch` — no network calls required
- Existing tests unaffected
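For readers unfamiliar with the mocking approach, here is a framework-free sketch of a sequenced `fetch` stand-in that replays canned replies and records each call. `createSequencedFetch`, `MockResponse`, and `RecordedCall` are hypothetical names; the actual tests in `audio.test.ts` may be structured differently.

```typescript
// Hypothetical helper: replays canned replies in order and records every call.
interface MockResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}

interface RecordedCall {
  url: string;
  headers: Record<string, string>;
}

function createSequencedFetch(replies: Array<{ status: number; body: unknown }>) {
  const calls: RecordedCall[] = [];
  let next = 0;

  const fetchMock = async (
    url: string,
    init?: { headers?: Record<string, string> },
  ): Promise<MockResponse> => {
    // Record the call first so header assertions work even if we throw below.
    calls.push({ url, headers: init?.headers ?? {} });
    const reply = replies[next];
    if (reply === undefined) {
      throw new Error(`unexpected fetch call #${next + 1} to ${url}`);
    }
    next += 1;
    return {
      ok: reply.status >= 200 && reply.status < 300,
      status: reply.status,
      json: async () => reply.body,
    };
  };

  return { fetchMock, calls };
}
```

A test for authorization header correctness could then inspect `calls[i].headers` after the provider runs, and an upload failure could be simulated by making the first reply a `500`.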
#### Files Changed
| File | Change |
|------|--------|
| `src/media-understanding/providers/assemblyai/audio.ts` | Core provider: 3-step async flow (upload buffer → submit job → poll for result) |
| `src/media-understanding/providers/assemblyai/index.ts` | Provider registration (id, capabilities, entry point) |
| `src/media-understanding/providers/assemblyai/audio.test.ts` | 6 unit tests with mocked fetch |
| `src/media-understanding/providers/index.ts` | Register `assemblyaiProvider` in PROVIDERS array |
| `src/media-understanding/defaults.ts` | Add `assemblyai: "best"` to DEFAULT_AUDIO_MODELS, add to AUTO_AUDIO_KEY_PROVIDERS |
| `docs/providers/assemblyai.md` | New doc page: setup, models, async flow explanation, pricing |
| `docs/providers/index.md` | Add AssemblyAI to transcription providers list |
| `docs/nodes/media-understanding.md` | Add AssemblyAI to provider matrix, auto-detection order, default models |
| `docs/nodes/audio.md` | Add AssemblyAI to audio provider detection order |
| `docs/reference/api-usage-costs.md` | Add AssemblyAI to audio providers list |
#### Implementation Notes
- **Async flow**: Unlike existing providers, which make a single synchronous API call, AssemblyAI requires a 3-step flow: (1) upload the audio buffer to get a temporary URL, (2) submit a transcription job, (3) poll until completion. Polling uses exponential backoff (1s → 1.5s → 2.25s, capped at 3s) and stays within the caller's overall timeout budget.
- **Timeout management**: The total `timeoutMs` is distributed across all three steps, with remaining time tracked via `Date.now()` deltas.
- **Follows existing patterns**: Provider structure mirrors `deepgram/` and `openai/` — exports a `MediaUnderstandingProvider` with `id`, `capabilities`, and `transcribeAudio`.
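The three steps and the backoff schedule described above can be sketched roughly as follows. This is not the code in `audio.ts`: `transcribeSketch`, `FlowDeps`, and `pollDelayMs` are hypothetical names, the HTTP client, clock, and sleep are injected so the sketch stays self-contained, and the endpoint paths and JSON field names (`upload_url`, `audio_url`, `speech_model`) reflect my understanding of AssemblyAI's public REST API and should be checked against its documentation.

```typescript
// Minimal fetch-like surface so the sketch needs no DOM or Node typings.
interface HttpResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body?: unknown },
) => Promise<HttpResponse>;

interface FlowDeps {
  fetchFn: FetchLike;
  now: () => number; // e.g. Date.now
  sleep: (ms: number) => Promise<void>; // e.g. a setTimeout wrapper
}

const BASE_URL = "https://api.assemblyai.com/v2"; // assumed base URL

// Backoff from the notes above: 1s, multiplied by 1.5 per attempt, capped at 3s.
function pollDelayMs(attempt: number): number {
  return Math.min(1000 * Math.pow(1.5, attempt), 3000);
}

async function transcribeSketch(
  audio: Uint8Array,
  apiKey: string,
  model: string,
  timeoutMs: number,
  deps: FlowDeps,
): Promise<string> {
  const deadline = deps.now() + timeoutMs;
  const headers = { authorization: apiKey };

  // Step 1: upload the raw bytes and receive a temporary URL.
  const upload = await deps.fetchFn(`${BASE_URL}/upload`, { method: "POST", headers, body: audio });
  if (!upload.ok) throw new Error(`upload failed: HTTP ${upload.status}`);
  const { upload_url } = (await upload.json()) as { upload_url: string };

  // Step 2: submit the transcription job.
  const submit = await deps.fetchFn(`${BASE_URL}/transcript`, {
    method: "POST",
    headers: { ...headers, "content-type": "application/json" },
    body: JSON.stringify({ audio_url: upload_url, speech_model: model }),
  });
  if (!submit.ok) throw new Error(`submit failed: HTTP ${submit.status}`);
  const { id } = (await submit.json()) as { id: string };

  // Step 3: poll until the job completes, fails, or the budget runs out.
  for (let attempt = 0; ; attempt++) {
    const poll = await deps.fetchFn(`${BASE_URL}/transcript/${id}`, { method: "GET", headers });
    if (!poll.ok) throw new Error(`poll failed: HTTP ${poll.status}`);
    const job = (await poll.json()) as { status: string; text?: string; error?: string };
    if (job.status === "completed") return job.text ?? "";
    if (job.status === "error") throw new Error(`transcription failed: ${job.error ?? "unknown"}`);

    const remaining = deadline - deps.now();
    if (remaining <= 0) throw new Error("transcription timed out");
    // Never sleep past the caller's deadline.
    await deps.sleep(Math.min(pollDelayMs(attempt), remaining));
  }
}
```

Injecting `now` and `sleep` is a choice made for this sketch so the backoff can be exercised with a fake clock; the PR itself tracks remaining time with `Date.now()` deltas.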
**Sign-Off**
- Models used: Claude claude-4.6-opus
- Submitter effort: Medium — new provider implementation + docs + tests, based on existing provider patterns
- Agent notes: Conflict resolved during cherry-pick — `AUTO_AUDIO_KEY_PROVIDERS` was moved from inline in `runner.ts` to `defaults.ts` since the original commit; resolved by adding `"assemblyai"` to the new location in `defaults.ts`.
NOTE: maintainers have branch access. Thanks!
<!-- greptile_comment -->
<h2>Greptile Overview</h2>
<h3>Greptile Summary</h3>
This PR adds a new `assemblyai` media-understanding provider for audio transcription, including an async 3-step flow (upload -> submit transcript -> poll until complete), registers it in the provider registry, wires it into default model selection and API-key auto-detection order, and documents usage. Unit tests cover the upload/submit/poll behavior and some error paths using mocked `fetch`.
Main integration points are `src/media-understanding/providers/index.ts` (provider registration) and `src/media-understanding/defaults.ts` (default model + auto key scan order), with the core implementation in `src/media-understanding/providers/assemblyai/audio.ts` using the shared SSRF-guarded fetch helper.
<h3>Confidence Score: 4/5</h3>
- This PR is close to mergeable, but has a real timeout-budget bug that can cause immediate aborts instead of clean timeouts.
- Core provider wiring and SSRF-guarded fetch usage look consistent with existing patterns, and tests cover the happy path and some failures. The remaining issue is that remaining timeout math can go non-positive and is passed into the abort-timer, which deterministically causes immediate abort errors under small timeout budgets or slow upstream responses.
- src/media-understanding/providers/assemblyai/audio.ts
<!-- greptile_other_comments_section -->
<!-- /greptile_comment -->
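One possible guard for the timeout-budget issue flagged in the review is to check the remaining time before each step and fail with an explicit timeout error, so a zero or negative value is never handed to an abort timer. The helper below is a hypothetical sketch (`remainingBudgetMs` is not a function in the PR), not the fix that was applied.

```typescript
// Hypothetical guard: returns the time left before the deadline, or throws a
// clear timeout error when the budget is already spent.
function remainingBudgetMs(deadlineMs: number, nowMs: number, step: string): number {
  const remaining = deadlineMs - nowMs;
  if (remaining <= 0) {
    throw new Error(`AssemblyAI ${step} timed out: no time left in the overall budget`);
  }
  return remaining;
}
```

Each of the upload, submit, and poll requests could call this with the shared deadline and `Date.now()` and pass the result as that request's timeout, so small budgets or slow upstream responses surface as a clean timeout rather than an immediate abort.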
Most Similar PRs

| PR | Title | Author | Date | Similarity |
|----|-------|--------|------|------------|
| #19427 | feat: add Soniox speech-to-text provider | matjaz | 2026-02-17 | 82.2% |
| #11334 | feat: add Mistral/Voxtral audio transcription provider | JamesEBall | 2026-02-07 | 76.6% |
| #19246 | feat(media): add Google Vertex AI media provider | ronaldslc | 2026-02-17 | 75.0% |
| #12717 | fix: add "audio" to openai provider capabilities | openjay | 2026-02-09 | 74.7% |
| #8388 | fix(media): auto-skip tiny/empty audio files before transcription (... | Glucksberg | 2026-02-04 | 73.6% |
| #8848 | feat(stt): Add Whisper as first-class audio transcription provider | emadomedher | 2026-02-04 | 73.6% |
| #12597 | voice-call: add Asterisk ARI provider + core STT | w0s1nsk1 | 2026-02-09 | 73.5% |
| #7965 | feat(tts): add Speechify as TTS provider | chaerla | 2026-02-03 | 73.2% |
| #12020 | feat: add AIsa provider, production-grade Chinese AI models | renning22 | 2026-02-08 | 72.1% |
| #14239 | Add Azure OpenAI Completions provider | KJFromMicromonic | 2026-02-11 | 72.0% |