
#11965: feat(ui): add speech-to-text dictation to web chat via Deepgram Flux

by **billgetman** · open · 2026-02-08 16:29
Labels: `docs` · `app: web-ui` · `gateway` · `stale`
## Summary

- Add real-time speech-to-text dictation to the web chat compose area using Deepgram's Flux model
- Gateway proxies browser audio to Deepgram, keeping API keys server-side
- Feature auto-enables when the `DEEPGRAM_API_KEY` env var is set — zero config otherwise

### Architecture

```
Browser mic → AudioWorklet (PCM 16kHz) → Gateway WS (/dictation) → Deepgram v2/listen → Transcripts → Textarea
```

### What is Deepgram Flux?

[Deepgram](https://deepgram.com) is a speech-to-text API provider (similar to Google Speech, AWS Transcribe). **Flux** is their conversational model with ~260ms end-of-turn detection — it knows when the speaker has finished a thought and signals `EndOfTurn`, which we use to auto-stop recording.

### How it works

1. **Gateway** (`server-dictation.ts`): WebSocket upgrade handler at `/dictation` that proxies raw PCM audio to Deepgram's streaming API and returns transcript JSON. Requires `DEEPGRAM_API_KEY` in the environment.
2. **Browser client** (`dictation.ts`): `DictationClient` class that captures mic audio via an `AudioWorklet` (16kHz mono PCM), streams it over the gateway WebSocket, and dispatches transcript callbacks.
3. **UI integration** (`app.ts`, `views/chat.ts`): Mic button in the compose area, `Cmd/Ctrl+Shift+D` keyboard shortcut, recording visual indicators, a mic-permission modal, and textarea population from transcripts.

### Feature detection

- Gateway advertises `dictation: true` in the hello response when `DEEPGRAM_API_KEY` is configured
- Browser checks `navigator.mediaDevices` and `AudioWorklet` support
- Mic button only appears when both sides are ready

### Difference from PR #10012

PR #10012 ("Webui voice") uses the browser-native `SpeechRecognition` API.
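For reference, the browser-native path looks roughly like this. This is an illustrative sketch, not code from either PR; Chrome exposes the API under the `webkitSpeechRecognition` prefix, and the callback shape is simplified:

```typescript
// Minimal sketch of browser-native dictation (the PR #10012 approach).
// Returns a stop function, or null when the API is unavailable
// (Firefox, non-browser environments, etc.).
type RecognitionLike = {
  lang: string;
  interimResults: boolean;
  onresult: ((ev: any) => void) | null;
  start(): void;
  stop(): void;
};

const SpeechRecognitionImpl: (new () => RecognitionLike) | undefined =
  (globalThis as any).SpeechRecognition ??
  (globalThis as any).webkitSpeechRecognition;

function startNativeDictation(
  onText: (text: string) => void,
): (() => void) | null {
  if (!SpeechRecognitionImpl) return null; // unsupported environment
  const rec = new SpeechRecognitionImpl();
  rec.lang = "en-US";
  rec.interimResults = true; // stream partial transcripts while the user speaks
  rec.onresult = (ev) => {
    const last = ev.results[ev.results.length - 1];
    onText(last[0].transcript);
  };
  rec.start();
  return () => rec.stop();
}
```

The null return is exactly the support gap in the table below: outside Chrome/Edge this path simply has no engine to start.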
This PR takes a different approach:

| | #10012 (Browser native) | This PR (Deepgram Flux) |
|---|---|---|
| Engine | Browser `SpeechRecognition` | Deepgram Flux via gateway proxy |
| Browser support | Chrome/Edge only | Any browser with `AudioWorklet` |
| End-of-turn | Browser-dependent | ~260ms Flux detection |
| API key | None needed | `DEEPGRAM_API_KEY` on gateway |
| Privacy | Audio sent to browser vendor | Audio sent to Deepgram via gateway |

### No new dependencies

All implementation uses built-in Web APIs (`AudioWorklet`, `MediaDevices`, `WebSocket`) and the existing gateway WebSocket infrastructure. No new npm packages.

## Files changed

**Gateway (new + modified):**

- `src/gateway/server-dictation.ts` — WebSocket proxy to Deepgram (NEW)
- `src/gateway/server-dictation.test.ts` — tests (NEW)
- `src/gateway/server-http.ts` — register upgrade handler
- `src/gateway/server-runtime-state.ts` — create handler
- `src/gateway/server.impl.ts` — add dictation logger
- `src/gateway/server/ws-connection/message-handler.ts` — feature flag in hello

**Browser client (new + modified):**

- `ui/src/ui/dictation.ts` — browser dictation client (NEW)
- `ui/src/ui/dictation.test.ts` — tests (NEW)
- `ui/src/ui/audio-worklet-processor.ts` — AudioWorklet PCM capture (NEW)
- `ui/src/ui/components/mic-permission-modal.ts` — permission modal (NEW)
- `ui/src/ui/icons.ts` — mic SVG icon
- `ui/src/styles/chat/dictation.css` — recording animations (NEW)
- `ui/src/styles/chat.css` — import dictation styles
- `ui/src/ui/gateway.ts` — dictation field in hello type

**UI integration (modified):**

- `ui/src/ui/app.ts` — state, handlers, Cmd+Shift+D shortcut
- `ui/src/ui/app-gateway.ts` — feature detection on connect
- `ui/src/ui/app-render.ts` — pass dictation props
- `ui/src/ui/views/chat.ts` — mic button, recording UI

**Docs:**

- `docs/plans/2026-02-07-dictation-design.md` — design document
- `docs/plans/2026-02-07-dictation-impl.md` — implementation plan
- `CHANGELOG.md` — added entry

## Test plan

- [x] `pnpm build` passes
- [x] `pnpm check` (lint + format) passes
- [x] `pnpm test` passes (249 tests, including new dictation tests)
- [x] Manual test: mic button appears when `DEEPGRAM_API_KEY` is set
- [x] Manual test: recording starts/stops via button and keyboard shortcut
- [x] Manual test: transcribed text populates the compose textarea
- [ ] Test without `DEEPGRAM_API_KEY` — mic button should not appear
- [ ] Test in Firefox (AudioWorklet support)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Greptile Overview

### Greptile Summary

Adds a new browser dictation client that captures 16kHz PCM via an AudioWorklet and streams it to a new gateway WebSocket upgrade endpoint (`/dictation/stream`), which proxies audio to Deepgram's streaming API and forwards transcript JSON back to the UI. The chat compose view gains a mic button, keyboard shortcut, interim "Listening…" placeholder, and a permission-help modal; the gateway hello response now advertises `features.dictation` when `DEEPGRAM_API_KEY` is configured so the UI can feature-detect availability.

### Confidence Score: 3/5

- Reasonably safe to merge once the two functional issues below are addressed.
- Core wiring is straightforward and tests pass, but there are two real behavioral problems introduced: (1) gateway-side unbounded buffering of audio while waiting for Deepgram, which can cause memory growth on bad upstream connections, and (2) the UI currently renders the mic button even when dictation isn't actually available, because `dictationEnabled` defaults to undefined/true-ish rendering logic.
- `src/gateway/server-dictation.ts`, `ui/src/ui/views/chat.ts`
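The transcript flow described in "How it works" (the gateway returns transcript JSON, and a Flux `EndOfTurn` signal auto-stops recording) can be sketched as a small reducer from raw WebSocket messages to UI events. The message shape below is an assumption for illustration — the real Deepgram payload and the PR's handler in `dictation.ts` may differ:

```typescript
// Sketch: map one Deepgram-style streaming message to a UI-facing event.
// Field names (`type`, `channel.alternatives`, `is_final`) are assumed here,
// not taken from the PR.
type DictationEvent =
  | { kind: "transcript"; text: string; final: boolean }
  | { kind: "stop" } // EndOfTurn → auto-stop recording
  | { kind: "ignore" }; // keepalives, empty results, malformed frames

function toDictationEvent(raw: string): DictationEvent {
  let msg: any;
  try {
    msg = JSON.parse(raw);
  } catch {
    return { kind: "ignore" }; // never let a bad frame break the stream
  }
  if (msg.type === "EndOfTurn") return { kind: "stop" };
  const text = msg.channel?.alternatives?.[0]?.transcript;
  if (typeof text === "string" && text.length > 0) {
    return { kind: "transcript", text, final: msg.is_final === true };
  }
  return { kind: "ignore" };
}
```

Keeping this mapping pure (string in, event out) is what makes the client easy to unit-test without a live Deepgram connection, which is presumably what `dictation.test.ts` relies on.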
