#14208: feat(media): add AssemblyAI audio transcription provider

by jmoraispk open 2026-02-11 17:47

docs size: L

Cluster: Voice Transcription Enhancements

#### Summary

Adds AssemblyAI as a new audio transcription provider for inbound media understanding. AssemblyAI offers competitive pricing (~$0.10-0.65/hr vs OpenAI Whisper's ~$0.36/hr), high accuracy, and features like automatic language detection and speaker diarization. This is the first async (upload → submit → poll) transcription provider in OpenClaw, complementing the existing synchronous providers (OpenAI, Groq, Deepgram, Google). 🦞

#### Use Cases

- Users who prefer AssemblyAI's pricing or accuracy for audio transcription
- Users who already have AssemblyAI API keys for other projects
- Environments where AssemblyAI's language detection or specialized models are preferred
- Auto-detection: when `ASSEMBLYAI_API_KEY` is present, AssemblyAI is available as an audio provider

#### Behavior Changes

- New provider `assemblyai` available for `media.audio.provider` config
- `ASSEMBLYAI_API_KEY` is auto-detected in the provider key scan order (OpenAI → Groq → Deepgram → Google → **AssemblyAI**)
- Default model: `best` (AssemblyAI's highest-accuracy model); also supports `nano` for faster/cheaper transcription
- No changes to existing provider behavior or defaults

#### Existing Functionality Check

- [x] I searched the codebase for existing functionality.
Searches performed:

- Searched `src/media-understanding/providers/` for existing provider patterns (used Deepgram and OpenAI as implementation references)
- Searched for `assemblyai` across the codebase; no prior references found
- Reviewed `src/media-understanding/types.ts` for `AudioTranscriptionRequest`/`AudioTranscriptionResult` interfaces

#### Tests

- 6 unit tests in `src/media-understanding/providers/assemblyai/audio.test.ts` covering:
  - Full upload → submit → poll happy path
  - Multi-poll retry with exponential backoff
  - Transcription error status handling
  - Upload HTTP error handling
  - Language and custom model passthrough
  - Authorization header correctness
- All tests use mocked `fetch`; no network calls are required
- Existing tests unaffected

#### Files Changed

| File | Change |
|------|--------|
| `src/media-understanding/providers/assemblyai/audio.ts` | Core provider: 3-step async flow (upload buffer → submit job → poll for result) |
| `src/media-understanding/providers/assemblyai/index.ts` | Provider registration (id, capabilities, entry point) |
| `src/media-understanding/providers/assemblyai/audio.test.ts` | 6 unit tests with mocked fetch |
| `src/media-understanding/providers/index.ts` | Register `assemblyaiProvider` in PROVIDERS array |
| `src/media-understanding/defaults.ts` | Add `assemblyai: "best"` to DEFAULT_AUDIO_MODELS, add to AUTO_AUDIO_KEY_PROVIDERS |
| `docs/providers/assemblyai.md` | New doc page: setup, models, async flow explanation, pricing |
| `docs/providers/index.md` | Add AssemblyAI to transcription providers list |
| `docs/nodes/media-understanding.md` | Add AssemblyAI to provider matrix, auto-detection order, default models |
| `docs/nodes/audio.md` | Add AssemblyAI to audio provider detection order |
| `docs/reference/api-usage-costs.md` | Add AssemblyAI to audio providers list |

#### Implementation Notes

- **Async flow**: Unlike existing providers that use a single synchronous API call, AssemblyAI requires a 3-step flow: (1)
upload audio buffer to get a temporary URL, (2) submit a transcription job, (3) poll until completion. Polling uses exponential backoff (1s → 1.5s → 2.25s, capped at 3s) within the caller's overall timeout budget.
- **Timeout management**: The total `timeoutMs` is distributed across all three steps, with remaining time tracked via `Date.now()` deltas.
- **Follows existing patterns**: Provider structure mirrors `deepgram/` and `openai/`: it exports a `MediaUnderstandingProvider` with `id`, `capabilities`, and `transcribeAudio`.

**Sign-Off**

- Models used: Claude claude-4.6-opus
- Submitter effort: Medium (new provider implementation + docs + tests, based on existing provider patterns)
- Agent notes: Conflict resolved during cherry-pick: `AUTO_AUDIO_KEY_PROVIDERS` had moved from inline in `runner.ts` to `defaults.ts` since the original commit, so `"assemblyai"` was added at the new location in `defaults.ts`.

NOTE: maintainers have branch access. Thanks!

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR adds a new `assemblyai` media-understanding provider for audio transcription, including an async 3-step flow (upload -> submit transcript -> poll until complete), registers it in the provider registry, wires it into default model selection and API-key auto-detection order, and documents usage. Unit tests cover the upload/submit/poll behavior and some error paths using mocked `fetch`.

Main integration points are `src/media-understanding/providers/index.ts` (provider registration) and `src/media-understanding/defaults.ts` (default model + auto key scan order), with the core implementation in `src/media-understanding/providers/assemblyai/audio.ts` using the shared SSRF-guarded fetch helper.

<h3>Confidence Score: 4/5</h3>

- This PR is close to mergeable, but has a real timeout-budget bug that can cause immediate aborts instead of clean timeouts.
- Core provider wiring and SSRF-guarded fetch usage look consistent with existing patterns, and tests cover the happy path and some failures. The remaining issue is that the remaining-timeout calculation can go non-positive and the result is still passed to the abort timer, which deterministically causes immediate abort errors under small timeout budgets or slow upstream responses.
- src/media-understanding/providers/assemblyai/audio.ts
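
The backoff schedule from the implementation notes and the timeout-budget issue flagged above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in `audio.ts`: `nextDelayMs`, `remainingMs`, `pollWithBudget`, the `check` callback, and the error message are all hypothetical names. The idea is to fail with an explicit timeout error as soon as the budget is exhausted, so a non-positive value never reaches an abort timer.

```typescript
// Hypothetical sketch: none of these names come from the PR's audio.ts.

/** Next poll delay: grows 1.5x per attempt, capped at 3s (1000 → 1500 → 2250 → 3000). */
function nextDelayMs(currentMs: number): number {
  return Math.min(currentMs * 1.5, 3000);
}

/** Remaining budget in ms; throws a clean timeout error instead of returning <= 0. */
function remainingMs(deadline: number, now: number = Date.now()): number {
  const left = deadline - now;
  if (left <= 0) {
    throw new Error("AssemblyAI transcription timed out");
  }
  return left;
}

type PollResult<T> = { done: true; value: T } | { done: false };

/**
 * Polls `check` until it reports completion or the overall budget runs out.
 * `check` receives the time left, which it can use for its own request timeout.
 */
async function pollWithBudget<T>(
  check: (stepTimeoutMs: number) => Promise<PollResult<T>>,
  timeoutMs: number,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  const deadline = Date.now() + timeoutMs;
  let delayMs = 1000;
  for (;;) {
    // Guard before every request: always positive, or a clean timeout error.
    const result = await check(remainingMs(deadline));
    if (result.done) {
      return result.value;
    }
    // Never sleep past the deadline.
    await sleep(Math.min(delayMs, remainingMs(deadline)));
    delayMs = nextDelayMs(delayMs);
  }
}
```

With a guard like this, a small `timeoutMs` or a slow upstream response surfaces as a timeout error from `remainingMs` rather than as an immediate abort from a timer armed with zero or negative milliseconds. Whether the provider should throw, or instead clamp to a small positive minimum, is a design choice left to the author.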