#12597: voice-call: add Asterisk ARI provider + core STT

by w0s1nsk1 open 2026-02-09 10:49

channel: voice-call stale

Cluster: Voice Transcription Enhancements

AI-assisted PR.

## Problem

We need a stable way to handle voice calls across multiple telephony worlds (SIP endpoints, SIP trunks, GSM gateways) while keeping **one consistent OpenClaw integration**. Asterisk is the natural telecom router, but **ARI (Stasis) has no built‑in STT**. That means call control alone isn’t enough if we want `call.speech` events and the same transcription semantics as other providers. RTP media can also be fragile (codec/PT/NAT), and without a deterministic setup you end up with “rings but silence.”

## Solution

Add/refresh the **`asterisk-ari`** provider and split responsibilities cleanly:

- **Asterisk ARI (Stasis)** for call control (channels, bridges, events, DTMF)
- **ExternalMedia / UnicastRTP** for deterministic audio bridging
- **OpenClaw core transcription** for STT (with auto‑fallback across configured engines)

Key decision: **STT lives in core**, not inside the provider. This keeps behavior consistent across providers (events, VAD, fallback), regardless of the audio source.

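
The ExternalMedia leg of this split can be sketched as a single ARI request. The snippet below is a minimal illustration, not the PR's actual client: `AriMediaConfig` and `buildExternalMediaUrl` are hypothetical names, the `/ari` prefix on `baseUrl` is an assumption, and the codec values are assumed to be Asterisk format names. The query parameters themselves (`external_host`, `encapsulation`, `transport`, `connection_type`, `format`, `direction`) are those of ARI's `POST /channels/externalMedia` endpoint.

```typescript
// Hypothetical config shape: field names mirror the `asteriskAri` keys
// mentioned in this PR, but the exact types are an assumption.
interface AriMediaConfig {
  baseUrl: string; // e.g. "http://127.0.0.1:8088" (assumed to exclude "/ari")
  app: string; // Stasis application name
  rtpHost: string; // host of the local RTP socket for this call
  rtpPort: number; // per-call RTP port
  codec: "ulaw" | "alaw"; // assumed to use Asterisk format names
}

// Builds the URL for ARI's `POST /channels/externalMedia` request.
// ARI calls the codec parameter `format`; here it is fed from `codec`.
function buildExternalMediaUrl(cfg: AriMediaConfig, channelId: string): string {
  const params: Array<[string, string]> = [
    ["channelId", channelId],
    ["app", cfg.app],
    ["external_host", `${cfg.rtpHost}:${cfg.rtpPort}`],
    ["encapsulation", "rtp"],
    ["transport", "udp"],
    ["connection_type", "client"],
    ["format", cfg.codec],
    ["direction", "both"],
  ];
  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join("&");
  const root = cfg.baseUrl.replace(/\/+$/, ""); // tolerate a trailing slash
  return `${root}/ari/channels/externalMedia?${query}`;
}
```

The resulting channel would then be added to the same mixing bridge as the SIP channel, so that Asterisk sends and receives the call audio on the given RTP host and port.
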
## Functionality

**1) Call handling (outbound/inbound)**

- Outbound: originate → Stasis → mixing bridge → ExternalMedia → RTP → TTS playback
- Inbound: Stasis entry → bridge → ExternalMedia → RTP → core STT → `call.speech`

**2) Audio + codecs**

- Uses `asteriskAri.codec` (no `format` field in schemas)
- RTP payload type matches codec (PCMU=0 / PCMA=8)
- μ‑law ↔ A‑law conversion supported
- **Per‑call RTP sockets/ports** + deterministic media setup

**3) STT via OpenClaw core**

- In‑memory buffering → WAV → core transcription
- VAD with dynamic noise floor + hangover + pre‑roll + backpressure
- Emits standard `call.speech` events

**4) DTMF**

- `ChannelDtmfReceived` → `call.dtmf`

**5) Cleanup + resilience**

- Best‑effort ExternalMedia cleanup (idempotent)
- Inbound reject: hangup by SIP channel id (best‑effort; channel may already be gone)

## Testing

**Unit tests:**

- `npx -y vitest run extensions/voice-call/src/providers/asterisk-ari.test.ts`
- `npx -y vitest run extensions/voice-call/src/providers/asterisk-ari/ari-client.test.ts`
- `npx -y vitest run extensions/voice-call/src/providers/asterisk-ari/ari-media.test.ts`

**Manual checklist (summary):**

- Asterisk config + Stasis app name matches `asteriskAri.app`
- Outbound: call → TTS audible → `call.speech`
- Inbound: Stasis route → greeting → `call.speech`
- DTMF: digits → `call.dtmf`
- STT/VAD: silence vs short utterances vs noise

<h2>Greptile Overview</h2>

<h3>Greptile Summary</h3>

This PR adds an `asterisk-ari` voice-call provider that uses Asterisk ARI (Stasis) for call control plus deterministic RTP media bridging (ExternalMedia / UnicastRTP). It also introduces core audio transcription wiring via `src/media-understanding/transcribe.ts` so providers can feed buffers into the existing media-understanding runner and get standard `call.speech` events with the same fallback/decision semantics as other sources.

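
The buffer hand-off described here (RTP payload → in-memory buffer → WAV → core transcription) might look roughly like the sketch below for a PCMU call. It is an illustration under stated assumptions, not the code in this PR: `muLawToPcm16`, `decodeMuLaw`, and `pcm16ToWav` are hypothetical helper names, and the sample rate is assumed to be the 8 kHz mono used by G.711. The payload type numbers are the static assignments from RFC 3551.

```typescript
// Static RTP payload types for G.711 (RFC 3551).
const RTP_PAYLOAD_TYPE = { pcmu: 0, pcma: 8 } as const;

// Decodes one G.711 mu-law byte into a signed 16-bit PCM sample.
function muLawToPcm16(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored bit-inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84;
  return sign ? -magnitude : magnitude;
}

// Decodes a whole RTP payload (header already stripped) into PCM16 samples.
function decodeMuLaw(payload: Uint8Array): Int16Array {
  const pcm = new Int16Array(payload.length);
  for (let i = 0; i < payload.length; i++) pcm[i] = muLawToPcm16(payload[i]);
  return pcm;
}

// Wraps mono PCM16 samples in a 44-byte RIFF/WAVE header.
function pcm16ToWav(samples: Int16Array, sampleRate = 8000): Uint8Array {
  const dataBytes = samples.length * 2;
  const out = new Uint8Array(44 + dataBytes);
  const view = new DataView(out.buffer);
  const writeTag = (offset: number, tag: string): void => {
    for (let i = 0; i < tag.length; i++) out[offset + i] = tag.charCodeAt(i);
  };
  writeTag(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // RIFF chunk size
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");
  view.setUint32(16, 16, true); // fmt chunk size
  view.setUint16(20, 1, true); // audio format: linear PCM
  view.setUint16(22, 1, true); // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate (mono, 16-bit)
  view.setUint16(32, 2, true); // block align
  view.setUint16(34, 16, true); // bits per sample
  writeTag(36, "data");
  view.setUint32(40, dataBytes, true);
  samples.forEach((sample, i) => view.setInt16(44 + i * 2, sample, true));
  return out;
}
```

A provider would accumulate decoded samples for one utterance (as delimited by VAD), call `pcm16ToWav` once, and pass the resulting bytes to the core transcription entrypoint.
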
On the voice-call side, config/schema is extended to include `asteriskAri` (baseUrl/credentials/app/rtpHost/rtpPort/codec/trunk), the runtime can instantiate and shut down the ARI provider (including websocket cleanup), and the CallManager gains an explicit `ensureInboundCall()` path to avoid inbound call record races and to support early inbound rejection by providerCallId.

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with minimal risk.
- Reviewed the changes around voice-call config resolution/validation, CallManager inbound creation/rejection logic, provider shutdown hooks, and the new core transcription entrypoint; the previously reported race/leak/idempotency issues appear addressed in this head SHA, and no new deterministic runtime/type failures were found.
- No files require special attention
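
For readers unfamiliar with the `ensureInboundCall()` idea mentioned in the summary, the sketch below shows one way an idempotent get-or-create keyed by `providerCallId` can avoid duplicate inbound records and allow early rejection. It is a hypothetical illustration only: `InboundCallRegistry`, `CallRecord`, and `rejectInbound` are invented names, and the actual CallManager in this PR may be structured differently.

```typescript
// Hypothetical record shape; the real call record in the PR is not shown here.
interface CallRecord {
  callId: string;
  providerCallId: string;
  direction: "inbound";
  state: "ringing" | "rejected";
}

class InboundCallRegistry {
  private readonly byProviderId = new Map<string, CallRecord>();
  private nextId = 1;

  // Returns the existing record for this provider call id, or creates one.
  // Repeated calls (e.g. from overlapping provider events) yield the same record.
  ensureInboundCall(providerCallId: string): CallRecord {
    const existing = this.byProviderId.get(providerCallId);
    if (existing) return existing;
    const record: CallRecord = {
      callId: `call-${this.nextId++}`,
      providerCallId,
      direction: "inbound",
      state: "ringing",
    };
    this.byProviderId.set(providerCallId, record);
    return record;
  }

  // Marks a known inbound call as rejected; returns false if the id is unknown
  // (best-effort: the channel may already be gone on the provider side).
  rejectInbound(providerCallId: string): boolean {
    const record = this.byProviderId.get(providerCallId);
    if (!record) return false;
    record.state = "rejected";
    return true;
  }
}
```

Because the lookup and the insert happen synchronously in one method, two events for the same channel cannot interleave and create two records.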