#20738: Fix model input schema to accept audio and video modalities

by Clawborn open 2026-02-19 08:07

size: S trusted-contributor

Cluster: Input Schema Enhancements and Bug Fixes

## Problem

`ModelDefinitionSchema.input` was limited to `["text", "image"]` in both the Zod schema and the TypeScript type, so any config declaring `"audio"` or `"video"` as a model input fails validation:

```
Invalid input at models.providers.google.models.0.input.2 — expected "text" or "image"
```

This blocks users from declaring native multimodal capabilities for providers like Gemini that support audio/video input.

## Fix

Extend the union in `ModelDefinitionSchema.input` to include `"audio"` and `"video"`, and update the matching TypeScript type in `types.models.ts`.

The runtime already has full support for these modalities:

- `MAX_VIDEO_BYTES` / `MAX_AUDIO_BYTES` constants
- `MediaUnderstandingCapabilitiesSchema` (line 405 in same file) already accepts `["image", "audio", "video"]`
- `media-understanding/providers/google` declares `capabilities: ["image", "audio", "video"]`

## Tests

3 new test cases in `config-misc.test.ts` verifying text/image acceptance, audio/video acceptance, and rejection of unknown modalities.

Fixes #20721

<h3>Greptile Summary</h3>

Extends `ModelDefinitionSchema.input` to accept `"audio"` and `"video"` modalities in addition to the existing `"text"` and `"image"` values. The change unblocks users from declaring native multimodal capabilities for providers like Gemini that support audio/video input.

- Updated Zod schema in `zod-schema.core.ts:41-44` to include audio and video literals in the union
- Updated TypeScript type in `types.models.ts:31` to match the schema
- Added 3 test cases validating text/image acceptance, audio/video acceptance, and rejection of unknown modalities

The runtime already has full support for these modalities (`MAX_AUDIO_BYTES` and `MAX_VIDEO_BYTES` constants, `MediaUnderstandingCapabilitiesSchema` accepting `["image", "audio", "video"]`, and the Google provider declaring `["image", "audio", "video"]` capabilities). The fix aligns the config schema with existing runtime capabilities.

<h3>Confidence Score: 5/5</h3>

- This PR is safe to merge with no risk: it's a simple schema extension that aligns with existing runtime capabilities.
- The change is minimal (two lines updated across schema and types), well-tested (3 new test cases), and directly addresses a validation bug. The runtime already fully supports audio/video modalities through existing constants, providers, and capabilities schemas. No breaking changes or edge cases identified.
- No files require special attention

<sub>Last reviewed commit: bb9b6d0</sub>
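
To illustrate the behavior the fix targets, here is a minimal dependency-free sketch of a widened modality union with index-aware rejection of unknown values. It is not the project's Zod code: `MODEL_INPUT_MODALITIES`, `ModalityIssue`, `validateModelInput`, and the `"pdf"` value are hypothetical names chosen for this example, and the message format is only loosely modeled on the error quoted in the Problem section.

```typescript
// Hypothetical stand-in for the widened union described in this PR.
// The real change lives in the Zod schema and in types.models.ts.
const MODEL_INPUT_MODALITIES = ["text", "image", "audio", "video"] as const;

// Derive the TypeScript type from the same list so schema and type cannot drift.
type ModelInputModality = (typeof MODEL_INPUT_MODALITIES)[number];

// Describes the first invalid entry found, if any.
interface ModalityIssue {
  path: string;
  message: string;
}

function isModality(value: unknown): value is ModelInputModality {
  return (
    typeof value === "string" &&
    MODEL_INPUT_MODALITIES.some((m) => m === value)
  );
}

// Returns null when every entry is a known modality; otherwise returns
// the dotted path (including the array index) of the first bad entry.
function validateModelInput(
  input: unknown,
  basePath: string = "input"
): ModalityIssue | null {
  if (!Array.isArray(input)) {
    return { path: basePath, message: "expected an array" };
  }
  for (let i = 0; i < input.length; i++) {
    if (!isModality(input[i])) {
      const allowed = MODEL_INPUT_MODALITIES.map((m) => `"${m}"`).join(", ");
      return { path: `${basePath}.${i}`, message: `expected one of ${allowed}` };
    }
  }
  return null;
}

// Usage: audio/video pass, an unknown modality is rejected at its index.
const accepted = validateModelInput(["text", "image", "audio", "video"]);
const rejected = validateModelInput(
  ["text", "image", "pdf"],
  "models.providers.google.models.0.input"
);
console.log(accepted, rejected);
```

Deriving the type from the literal list is one way to keep the runtime check and the static type in sync. In the PR itself the two are updated separately, in the Zod schema and in `types.models.ts`.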