#23572: feat(voice): enable voice note conversation loop for Telegram and WhatsApp

by davidrudduck open 2026-02-22 13:10

channel: telegram channel: whatsapp-web agents size: XL trusted-contributor

## Summary

Enables a full voice-note conversation loop for Telegram and WhatsApp: user sends a voice note → STT transcription → agent processes text → TTS reply → voice note response.

Most of the pipeline already existed (STT via `transcribeFirstAudio`, TTS via `maybeApplyTtsToPayload`, inbound audio detection, Telegram voice sending, WhatsApp PTT delivery). This PR makes three surgical changes to wire up the missing connections.

## Key Changes

### 1. Telegram: Broaden preflight transcription gate

**`src/telegram/bot-message-context.ts`**

Previously, voice note transcription only ran in groups with `requireMention: true` (for mention detection). Voice notes in DMs and unrestricted groups arrived as raw `<media:audio>` placeholders and were never transcribed.

Now transcription runs whenever `tools.media.audio.enabled` is configured, covering DMs and all group types. The existing mention-detection path is preserved as an OR condition.

### 2. WhatsApp: Add voice note transcription

**`src/web/inbound/monitor.ts`**

WhatsApp voice messages were never transcribed: the body was set to `<media:audio>` and never replaced.

Added preflight transcription after media download: when the body is `<media:audio>` and the media is audio, `transcribeFirstAudio()` replaces the placeholder with spoken text. On failure, falls back gracefully to the existing `<media:audio>` behavior.

### 3. TTS: WhatsApp Opus output format + voice flag

**`src/tts/tts.ts`**

- Added `WHATSAPP_OUTPUT` format constant (Opus @48kHz/64kbps, matching Telegram's optimal voice note format)
- `resolveOutputFormat()` now returns Opus for the WhatsApp channel (was defaulting to MP3)
- `audioAsVoice` flag now set for both Telegram and WhatsApp, so TTS audio is delivered as a playable voice note (PTT) rather than a file attachment

## Configuration

No new config flags.
Uses two existing settings:

| Setting | Purpose |
|---------|---------|
| `tools.media.audio.enabled: true` | Enables STT transcription of voice notes |
| `messages.tts.auto: "inbound"` | Enables TTS reply only when user sent audio |

## Testing

- **Build**: `pnpm build` (clean, no type errors)
- **TTS tests**: `pnpm vitest run src/tts/tts.test.ts` (19/19 passing)
- **Lint**: 0 warnings, 0 errors

## Files Changed

| File | Changes |
|------|---------|
| `src/telegram/bot-message-context.ts` | Broadened `needsPreflightTranscription` condition (+4/-1) |
| `src/web/inbound/monitor.ts` | Added voice note transcription block after media download (+29) |
| `src/tts/tts.ts` | Added `WHATSAPP_OUTPUT` constant, WhatsApp branch in `resolveOutputFormat`, expanded `shouldVoice` check (+12/-1) |

<h3>Greptile Summary</h3>

This PR combines 4 separate features/changes into one pull request:

1. **Voice note conversation loop** (HEAD commit 05289873): Enables bidirectional voice communication for Telegram and WhatsApp by connecting existing STT and TTS infrastructure
2. **Dynamic model router** (commit 024a42b9): Adds complexity-based LLM routing to automatically select models based on conversation complexity
3. **Plugin hooks for context manipulation** (commit 223ffce3): Adds `before_context_send` and enhanced `before_prompt_build` hooks for plugins
4. **PostgreSQL dependencies** (commit 80edf84f): Adds `pg` and `@types/pg` packages for Monday extension

The voice note changes (title feature) are clean and minimal - they wire up preflight transcription for both platforms and configure TTS to use Opus format with voice note delivery. The implementation correctly reuses existing transcription infrastructure and follows the codebase patterns.

**However**, bundling 4 unrelated features in a single PR makes it difficult to review, test, and potentially revert individual changes.
Per the repository's commit guidelines (AGENTS.md line 106: "Group related changes; avoid bundling unrelated refactors"), these should be separate PRs.

<h3>Confidence Score: 4/5</h3>

- This PR is relatively safe to merge with minor organizational concerns
- The voice note implementation is clean and follows existing patterns correctly. The model router and plugin hooks have comprehensive test coverage (410 lines of new tests). All changes pass existing tests. The main concern is organizational - bundling 4 unrelated features makes it harder to isolate issues if they arise, but the code quality itself is solid.
- No files require special attention - all changes follow existing patterns and have appropriate error handling

<sub>Last reviewed commit: 0528987</sub>

**Context used:**

- Context from `dashboard` - CLAUDE.md ([source](https://app.greptile.com/review/custom-context?memory=fd949e91-5c3a-4ab5-90a1-cbe184fd6ce8))
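For readers without the diff open, the three voice-note changes described under Key Changes can be roughly sketched as below. This is an illustration under stated assumptions, not the code in this PR: the parameter shapes, the `OutputFormat` type, `DEFAULT_OUTPUT`, `TELEGRAM_OUTPUT`, and `resolveInboundBody` are hypothetical. Only the names `needsPreflightTranscription`, `WHATSAPP_OUTPUT`, `resolveOutputFormat`, `shouldVoice`, and the `<media:audio>` placeholder come from the description.

```typescript
// Simplified stand-ins. Signatures and types are assumptions for illustration,
// not the repository's actual definitions.

type OutputFormat = { codec: "opus" | "mp3"; sampleRateHz?: number; bitrateKbps?: number };

// Hypothetical constants; the PR only states that WhatsApp now uses
// Opus @48kHz/64kbps (matching Telegram) and previously defaulted to MP3.
const DEFAULT_OUTPUT: OutputFormat = { codec: "mp3" };
const TELEGRAM_OUTPUT: OutputFormat = { codec: "opus", sampleRateHz: 48000, bitrateKbps: 64 };
const WHATSAPP_OUTPUT: OutputFormat = { codec: "opus", sampleRateHz: 48000, bitrateKbps: 64 };

const AUDIO_PLACEHOLDER = "<media:audio>";

// 1. Telegram gate: transcribe when audio is enabled, OR on the older
//    mention-detection path (groups with requireMention: true).
function needsPreflightTranscription(opts: {
  audioEnabled: boolean;
  isGroup: boolean;
  requireMention: boolean;
}): boolean {
  const mentionPath = opts.isGroup && opts.requireMention;
  return opts.audioEnabled || mentionPath;
}

// 2. WhatsApp: replace the placeholder with the transcript, falling back to
//    the placeholder if transcription fails or yields nothing.
//    `transcribe` stands in for a call such as transcribeFirstAudio().
async function resolveInboundBody(
  body: string,
  mediaIsAudio: boolean,
  transcribe: () => Promise<string | undefined>,
): Promise<string> {
  if (body !== AUDIO_PLACEHOLDER || !mediaIsAudio) return body;
  try {
    const text = await transcribe();
    if (text !== undefined && text.trim() !== "") return text;
    return body;
  } catch {
    return body; // keep the existing <media:audio> behavior on failure
  }
}

// 3. TTS: pick Opus for both voice-note channels and flag the audio as voice.
function resolveOutputFormat(channel: string): OutputFormat {
  if (channel === "telegram") return TELEGRAM_OUTPUT;
  if (channel === "whatsapp") return WHATSAPP_OUTPUT;
  return DEFAULT_OUTPUT;
}

function shouldVoice(channel: string): boolean {
  return channel === "telegram" || channel === "whatsapp";
}
```

In this sketch a DM with `audioEnabled: true` passes the gate even though `requireMention` is false, which is the case the description says was previously skipped.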