#20878: fix: Widen models.input to accept "video" and "audio" modalities

by marcodelpin · open · 2026-02-19 11:17
commands agents size: XS
## Summary

Fixes #20721

The `models.input` type union only accepted `"text"` and `"image"`, causing zod validation to reject `"video"` and `"audio"` values. This blocked users from declaring Gemini native multimodal capabilities through the config system, even though the runtime already has infrastructure for video/audio processing (`MAX_VIDEO_BYTES`, `MAX_AUDIO_BYTES`, `MediaKind`).

## Root Cause

`ModelDefinitionSchema` in `zod-schema.core.ts` restricts the input array to:

```typescript
z.array(z.union([z.literal("text"), z.literal("image")]))
```

This same `"text" | "image"` constraint was duplicated across 9 files (types, model catalog, provider discovery, etc.).

## Changes (9 files, +22/-16)

**Config layer (user-facing):**

- `src/config/zod-schema.core.ts`: widened zod union to include `"video"` and `"audio"`
- `src/config/types.models.ts`: widened `ModelDefinitionConfig.input` type

**Provider discovery:**

- `src/agents/bedrock-discovery.ts`: `mapInputModalities()` now passes through `"video"` and `"audio"` from Bedrock model summaries
- `src/agents/model-catalog.ts`: widened `ModelCatalogEntry` and `DiscoveredModel` input types
- `src/agents/cloudflare-ai-gateway.ts`: widened param type
- `src/agents/huggingface-models.ts`: widened local variable type
- `src/commands/onboard-auth.config-litellm.ts`: widened return type

**SDK boundary:**

- `src/agents/model-scan.ts`: `parseModality()` kept narrow (`"text" | "image"`) for SDK `Model<Api>` compatibility; added comment explaining the constraint
- `src/agents/pi-embedded-runner/model.test.ts`: loosened test helper type

## Backward Compatibility

Fully backward compatible:

- Existing configs without `"video"` / `"audio"` are unaffected
- All downstream consumers only check `.includes("image")`, so extra values pass through harmlessly
- Display strings (e.g., `list.registry.ts:206`) use `.join("+")`, which handles any value

## Test Plan

- [x] 444 existing tests pass (1 pre-existing, unrelated Windows symlink failure)
- [x] TypeScript type check passes (no new errors)
- [x] Zero runtime logic changes needed; consumers are agnostic to extra input values

<!-- greptile_comment -->

<h3>Greptile Summary</h3>

Extends the `models.input` type union from `"text" | "image"` to include `"video"` and `"audio"` modalities, enabling users to declare Gemini and other multimodal capabilities through the config system. The change is applied consistently across the type system (Zod schema, TypeScript types, provider discovery, model catalog).

- Core schema and types updated in `zod-schema.core.ts` and `types.models.ts`
- Provider discovery updated to pass through video/audio from Bedrock model summaries
- SDK boundary (`model-scan.ts`) intentionally kept narrow for `Model<Api>` compatibility, with clear documentation
- Fully backward compatible: existing configs unaffected, consumers only check `.includes("image")`, display uses `.join("+")`, which handles any value
- One issue found: HuggingFace discovery widened the type but didn't update the logic to detect video/audio from API responses

<h3>Confidence Score: 4/5</h3>

- Safe to merge with one fix: HuggingFace discovery logic needs updating
- The PR is well-structured and backward compatible, with consistent type changes across 9 files. The approach is sound: widening the config layer to accept video/audio while keeping the SDK boundary narrow. However, one logical error was found in `huggingface-models.ts`, where the type was widened but the detection logic wasn't updated to actually extract video/audio modalities from API responses. This is a straightforward fix that won't affect existing behavior.
- Pay attention to `src/agents/huggingface-models.ts`: the logic needs updating to match the widened type

<sub>Last reviewed commit: f8cb9f8</sub>

<!-- greptile_other_comments_section -->

<!-- /greptile_comment -->

---

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Testing: fully tested (automated + manual verification)

I understand and can explain all changes in this PR.
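For reviewers, the split between the widened config/discovery types and the narrow SDK boundary can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the type names `InputModality` and `SdkInputModality`, the helpers `narrowForSdk`, `supportsImages`, and `formatInput`, the text-only fallback, and the sample raw values are all assumptions made for the example. Only the four modality strings, the `.includes("image")` check, and the `.join("+")` display come from the description above; `mapInputModalities` here is a stand-in with the same name, not the actual function body.

```typescript
// Hypothetical names throughout; only the modality strings come from the PR.
type InputModality = "text" | "image" | "video" | "audio";
type SdkInputModality = "text" | "image";

const KNOWN_MODALITIES: readonly InputModality[] = ["text", "image", "video", "audio"];

// Discovery side: keep every recognized modality, drop anything unknown.
function mapInputModalities(raw: readonly string[] | undefined): InputModality[] {
  const seen = new Set<InputModality>();
  for (const value of raw ?? []) {
    const normalized = value.trim().toLowerCase();
    const match = KNOWN_MODALITIES.find((m) => m === normalized);
    if (match !== undefined) {
      seen.add(match);
    }
  }
  // Assumption for this sketch: fall back to text-only when nothing matches.
  return seen.size > 0 ? Array.from(seen) : ["text"];
}

// SDK boundary: narrow back to the modalities the SDK model type accepts.
function narrowForSdk(input: readonly InputModality[]): SdkInputModality[] {
  return input.filter((m): m is SdkInputModality => m === "text" || m === "image");
}

// Downstream consumer check: extra values do not change the result.
function supportsImages(input: readonly InputModality[]): boolean {
  return input.includes("image");
}

// Display string: joins whatever values are present.
function formatInput(input: readonly InputModality[]): string {
  return input.join("+");
}

// Illustrative raw values; "speech" stands in for an unrecognized entry.
const discovered = mapInputModalities(["TEXT", "IMAGE", "VIDEO", "speech"]);
const sdkInput = narrowForSdk(discovered);
console.log(formatInput(discovered), supportsImages(discovered), formatInput(sdkInput));
```

The point of the sketch is that widening the union only adds members: a consumer that checks for `"image"` behaves the same with or without `"video"` or `"audio"` present, and the narrowing filter keeps the SDK-facing value within `"text" | "image"`.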