#21298: fix(config): extend model input schema for video/audio modalities

by Alfa-ai-ccvs-tech open 2026-02-19 21:59

docs commands agents size: XL

Cluster: Input Schema Enhancements and Bug Fixes

## Summary

- **Fixes gateway startup crash** when `openclaw.json` declares `"video"` or `"audio"` as input modalities (e.g. `gemini-3.1-pro-preview`)
- **Extends Zod validation** from `"text" | "image"` to `"text" | "image" | "video" | "audio"`: purely additive, fully backward-compatible
- **Adds `modelSupportsVideo()` / `modelSupportsAudio()` helpers** and native skip logic in the media-understanding runner (mirrors the existing image skip pattern)

## Root Cause

`models.providers.google.models[1].input = ["text", "image", "video", "audio"]` in `~/.openclaw/openclaw.json` fails Zod validation at `src/config/zod-schema.core.ts:41`, which only allows `"text" | "image"`.

## Changes (8 source files + 2 docs)

### Part 1: Extend input type union (7 files)

| File | Change |
|------|--------|
| `src/config/zod-schema.core.ts` | Add `z.literal("video")`, `z.literal("audio")` to union |
| `src/config/types.models.ts` | Widen `ModelDefinitionConfig.input` type |
| `src/agents/model-catalog.ts` | Widen `ModelCatalogEntry.input` and `DiscoveredModel.input` |
| `src/agents/model-scan.ts` | Extend `parseModality()` to detect video/audio |
| `src/agents/huggingface-models.ts` | Detect video/audio in `architecture.input_modalities` |
| `src/commands/onboard-auth.config-litellm.ts` | Widen local type annotation |
| `src/agents/cloudflare-ai-gateway.ts` | Widen parameter type annotation |

### Part 2: Capability helpers (1 file)

- `src/agents/model-catalog.ts`: Added `modelSupportsVideo()` and `modelSupportsAudio()` next to existing `modelSupportsVision()`

### Part 3: Runner skip logic (1 file)

- `src/media-understanding/runner.ts`: When the primary model natively supports video/audio, skip the separate understanding pipeline (same pattern as existing image skip)

### Documentation

- [`docs/MODIFICATION-CARDS-video-audio-input.md`](docs/MODIFICATION-CARDS-video-audio-input.md): 9 detailed modification cards with before/after code, rationale, risk assessment
- [`docs/TECH-GUIDE-video-audio-input.md`](docs/TECH-GUIDE-video-audio-input.md): Full architecture overview, change layers, external dependency notes, verification steps

## Upstream Compatibility

Fetched `upstream/main` (`6cdcb5904`, 2026-02-19). Of 8 modified source files, only `runner.ts` was also changed upstream, and in different regions (imports + new function at lines 79-101 vs our skip blocks at lines 721-790). **No merge conflicts expected.**

## Verification

- `npx tsc --noEmit` passes with **0 errors** (clean compilation)
- All type widening is additive: existing `["text", "image"]` configs are unaffected
- Default values remain `["text"]` everywhere

## Test plan

- [ ] `npx tsc --noEmit`: no type errors
- [ ] `node openclaw.mjs gateway run --port 18789`: clean startup with Gemini config including video/audio
- [ ] `ss -tlnp | grep :18789`: confirm port is listening
- [ ] `npx vitest run --config vitest.unit.config.ts`: no regressions
- [ ] Verify existing `"text" + "image"` only configs still work unchanged

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<h3>Greptile Summary</h3>

This PR extends input modality validation from `"text" | "image"` to include `"video"` and `"audio"`, fixing gateway crashes when models declare these modalities in config. The changes are backward-compatible and follow existing patterns.

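To illustrate why widening the union is additive, here is a minimal, dependency-free sketch of the validation behaviour. It is not the PR's code: the PR extends a Zod union with `z.literal("video")` and `z.literal("audio")`, whereas this stand-in uses a plain type guard. The names `INPUT_MODALITIES`, `isInputModality`, `ValidationResult`, and `validateInput` are hypothetical and exist only for this example.

```typescript
// Hypothetical stand-in for the widened input union. The PR itself
// implements this with z.literal(...) entries in a Zod schema.
const INPUT_MODALITIES = ["text", "image", "video", "audio"] as const;
type InputModality = (typeof INPUT_MODALITIES)[number];

// Type guard: true only for one of the four accepted modality strings.
function isInputModality(value: unknown): value is InputModality {
  return (
    typeof value === "string" &&
    (INPUT_MODALITIES as readonly string[]).indexOf(value) !== -1
  );
}

// Hypothetical result shape: `input` holds the accepted entries,
// `rejected` holds anything that is not a known modality.
interface ValidationResult {
  ok: boolean;
  input: InputModality[];
  rejected: unknown[];
}

// Validate the `input` array of a model definition.
function validateInput(raw: unknown): ValidationResult {
  if (!Array.isArray(raw)) {
    return { ok: false, input: [], rejected: [raw] };
  }
  const input: InputModality[] = [];
  const rejected: unknown[] = [];
  for (const entry of raw) {
    if (isInputModality(entry)) {
      input.push(entry);
    } else {
      rejected.push(entry);
    }
  }
  return { ok: rejected.length === 0, input, rejected };
}
```

Under this sketch, a legacy `["text", "image"]` array passes exactly as before, the four-entry Gemini array now passes too, and an unknown string such as `"pdf"` is still rejected. That is the sense in which the change only enlarges the accepted set.
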
**Key changes:**

- Extends Zod schema and TypeScript types across 7 config/agent files
- Adds `modelSupportsVideo()` and `modelSupportsAudio()` helper functions
- Implements native skip logic in media-understanding runner (mirrors existing image skip pattern)
- Includes comprehensive documentation in `docs/refactor/`

**Observations:**

- Core schema changes are clean and consistent
- Runner skip blocks follow the existing `modelSupportsVision()` pattern correctly
- Type narrowing in `model-scan.ts:483` may silently drop video/audio modalities for OpenRouter models
- Most file additions are custom skills being restored after upstream reset (as noted in PR description)
- Upstream changes in `runner.ts` are in different regions (imports/removed functions vs new skip blocks), so no merge conflicts are expected

<h3>Confidence Score: 4/5</h3>

- Safe to merge with minor attention to type narrowing in model scanning
- The PR implements a straightforward additive schema extension following established patterns. All changes are backward-compatible since video/audio are added to an existing union type. The media-understanding skip logic correctly mirrors the existing image handling. One type narrowing cast in `model-scan.ts` could silently drop modalities, but this affects OpenRouter model scanning only and likely needs external dependency updates. No breaking changes or runtime risks in the core functionality.
- Pay attention to `src/agents/model-scan.ts` line 483, where type narrowing may drop video/audio modalities

<sub>Last reviewed commit: 43936fa</sub>
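
The capability helpers and the runner skip decision mentioned in the key changes can be sketched roughly as follows. Only the helper names `modelSupportsVision`, `modelSupportsVideo`, and `modelSupportsAudio` come from the PR; the `ModelCatalogEntry` shape shown here, the `supports` helper, the `MediaKind` type, and `shouldSkipUnderstanding` are assumptions made for illustration and may not match the actual code in `model-catalog.ts` or `runner.ts`.

```typescript
type InputModality = "text" | "image" | "video" | "audio";

// Assumed minimal shape; the real ModelCatalogEntry likely has more fields.
interface ModelCatalogEntry {
  id: string;
  input?: InputModality[];
}

// Hypothetical shared check: does the entry list the given modality?
// A missing entry or missing `input` array counts as "not supported".
function supports(
  entry: ModelCatalogEntry | undefined,
  modality: InputModality,
): boolean {
  const input = entry && entry.input ? entry.input : [];
  return input.indexOf(modality) !== -1;
}

// Helper names as listed in the PR; the bodies are illustrative.
const modelSupportsVision = (entry?: ModelCatalogEntry): boolean =>
  supports(entry, "image");
const modelSupportsVideo = (entry?: ModelCatalogEntry): boolean =>
  supports(entry, "video");
const modelSupportsAudio = (entry?: ModelCatalogEntry): boolean =>
  supports(entry, "audio");

type MediaKind = "image" | "video" | "audio";

// Hypothetical skip decision: when the primary model accepts the media
// kind natively, the separate understanding pipeline is not needed.
function shouldSkipUnderstanding(
  kind: MediaKind,
  primary?: ModelCatalogEntry,
): boolean {
  switch (kind) {
    case "image":
      return modelSupportsVision(primary);
    case "video":
      return modelSupportsVideo(primary);
    case "audio":
      return modelSupportsAudio(primary);
  }
}
```

With an entry whose `input` is `["text", "image", "video", "audio"]`, this sketch skips the pipeline for all three media kinds. A text-only or unknown primary model falls through to the separate understanding pipeline. This also shows why the narrowing concern matters: if scanning drops `"video"` or `"audio"` from `input`, the skip never triggers.
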