#22402: Add runtime.stt.transcribeAudioFile for plugin STT access

by benthecarman open 2026-02-21 03:53

channel: bluebubbles size: S

Cluster: Voice Transcription Enhancements

## Summary

- Add `runtime.stt.transcribeAudioFile()` to `PluginRuntime` so external plugins can use openclaw's media-understanding provider framework for speech-to-text
- New `src/media-understanding/transcribe-audio.ts` wraps `runCapability({capability: "audio"})`, the same pattern as the Discord VC implementation in #18774
- Reads provider/model/apiKey from `tools.media.audio` in the config, with automatic provider fallback

## Motivation

The marmot plugin needs to transcribe call audio chunks but can't import internal media-understanding modules (`ERR_PACKAGE_PATH_NOT_EXPORTED`). This mirrors how `runtime.tts.textToSpeechTelephony` already exposes TTS to plugins.

## Usage (from a plugin)

```typescript
const result = await runtime.stt.transcribeAudioFile({
  filePath: "/tmp/audio-chunk.wav",
  cfg: runtime.config.loadConfig(),
});
if (result.text) {
  // dispatch transcript to agent
}
```

## Test plan

- [ ] TypeScript compiles
- [ ] Existing media-understanding tests still pass
- [ ] Marmot plugin can call `runtime.stt.transcribeAudioFile()` after openclaw is rebuilt

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<h3>Greptile Summary</h3>

Adds `runtime.stt.transcribeAudioFile()` to expose speech-to-text functionality to external plugins. The implementation follows the same pattern as the Discord voice manager's `transcribeAudio()` function, wrapping `runCapability({capability: "audio"})` from the media-understanding framework.

Key changes:

- New `src/media-understanding/transcribe-audio.ts` provides a standalone wrapper function
- Function exported via `PluginRuntime.stt.transcribeAudioFile`
- Uses same provider/model/apiKey resolution from `tools.media.audio` config
- Properly handles cleanup via `cache.cleanup()` in finally block

The implementation is clean and matches established patterns in the codebase.
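For reviewers who have not seen the Discord VC code, the shape described above (try configured providers in order, and always release temporary resources in a `finally` block) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `transcribe-audio.ts`: `AudioProvider`, `Cache`, `TranscribeResult`, and `transcribeWithFallback` are hypothetical names, and the real wrapper delegates provider selection to `runCapability` rather than looping itself.

```typescript
// Hypothetical types for illustration; none of these names come from openclaw.
type AudioProvider = {
  id: string;
  transcribe: (filePath: string) => Promise<string>;
};

// Stand-in for whatever temporary-resource cache the wrapper cleans up.
type Cache = { cleanup: () => Promise<void> };

type TranscribeResult = { text: string | undefined; provider?: string };

async function transcribeWithFallback(
  filePath: string,
  providers: AudioProvider[],
  cache: Cache,
): Promise<TranscribeResult> {
  try {
    for (const provider of providers) {
      try {
        const text = await provider.transcribe(filePath);
        return { text, provider: provider.id };
      } catch {
        // This provider failed; fall through and try the next one.
      }
    }
    // Every provider failed (or none were configured).
    return { text: undefined };
  } finally {
    // Runs on success and on failure, so temporary files are always released.
    await cache.cleanup();
  }
}
```

Returning `text: undefined` instead of throwing matches the `if (result.text)` check in the usage snippet, though whether the actual function throws on total failure is not stated in this PR.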
<h3>Confidence Score: 5/5</h3>

- Safe to merge - straightforward implementation following existing patterns
- The implementation directly mirrors the proven Discord voice manager pattern, properly handles resource cleanup, and uses the existing media-understanding provider framework without introducing new dependencies or risks. The only minor suggestion is around MIME type flexibility.
- No files require special attention

<sub>Last reviewed commit: 70009ce</sub>