From 985d80332599ba0f230c150aa067275fe0c4d01b Mon Sep 17 00:00:00 2001
From: Maosheng Liao
Date: Sun, 7 Jun 2026 22:39:37 -0700
Subject: [PATCH 01/20] Add design spec: video input for reasoner model-mode inference

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../2026-06-07-video-reasoner-input-design.md | 257 ++++++++++++++++++
 1 file changed, 257 insertions(+)
 create mode 100644 docs/superpowers/specs/2026-06-07-video-reasoner-input-design.md

diff --git a/docs/superpowers/specs/2026-06-07-video-reasoner-input-design.md b/docs/superpowers/specs/2026-06-07-video-reasoner-input-design.md
new file mode 100644
index 0000000..60b82da
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-07-video-reasoner-input-design.md
@@ -0,0 +1,257 @@
+# Video input for the `reasoner` model-mode of inference: design
+
+**Date:** 2026-06-07
+**Branch:** `maoshengl/video_reasoner_inference`
+**Status:** approved design, ready for implementation plan
+
+## Goal
+
+Let `model_mode=reasoner` in the Cosmos inference engine
+(`python -m cosmos_framework.scripts.inference`) accept a **local mp4 video**
+as conditioning input, producing text that reasons over the clip, for both
+`Cosmos3-Nano` and `Cosmos3-Super`. Today the reasoner accepts only a text
+prompt or a single still image.
+
+## Background: why this is a gap
+
+The reasoner text-generation path runs entirely inside the Cosmos engine:
+
+```
+inference.py:_get_reasoner_sample_data            # loads ONE PIL image via Image.open
+  -> OmniMoTModel.generate_reasoner_text          # builds {"type":"image",...} chat block
+    -> net.generate_reasoner_text                 # pass-through
+      -> unified_mot._impl_generate_reasoner_text # pixel_values + image_grid_thw only
+        -> prepare_multimodal_reasoner_inputs     # image recipe only
+```
+
+Two hard blocks:
+
+1. `_get_reasoner_sample_data` (`cosmos_framework/inference/inference.py`) calls
+   `Image.open(vision_path)` unconditionally, but PIL cannot decode mp4.
+2. `_impl_generate_reasoner_text` and `prepare_multimodal_reasoner_inputs`
+   **explicitly reject video** ("for I2V conditioning, frames must be passed as
+   images"; they have no `pixel_values_videos` / `video_grid_thw` params).
+
+Separately, `cosmos_framework/scripts/vlm/eval_videophy2.py` *does* consume
+video, but through a **different, standalone path**: a raw HuggingFace
+`Qwen3VLForConditionalGeneration` + `processor.apply_chat_template([{"type":
+"video",...}])` + `model.generate()`. It never touches the Cosmos engine, so it
+does not satisfy the goal of supporting `model_mode=reasoner` in
+`scripts.inference`.
+
+**Key enabling fact:** the vendored Qwen3-VL model under
+`cosmos_framework/model/vfm/vlm/qwen3_vl/` already implements video end to end:
+`get_video_features`, `get_rope_index(video_grid_thw=...)`,
+`get_placeholder_mask(pixel_values_videos=...)`, a `video_token_id`, and a full
+`video_processing_qwen3_vl.py`. Only the Cosmos reasoner **wrapper layers** are
+hardcoded to images. So the change is additive plumbing, not new model logic.
+
+## Approach (chosen)
+
+**B1: add a parallel video lane through the existing reasoner stack.**
+
+Add optional video parameters alongside the existing image parameters through
+the wrapper layers, leaving the image and text-only paths bit-identical. A given
+prompt carries **either** an image, **or** a video, **or** neither, never both.
+No mixed image+video support (not needed).
+
+Approaches considered and rejected:
+
+- **B2: unify image+video into one "media item" abstraction.** Cleaner
+  long-term and enables mixed media in one prompt, but larger blast radius, more
+  validation/tests, and supports a capability not requested (YAGNI).
+- **B3: expose the HF `Qwen3VLForConditionalGeneration` route instead.** Bypasses
+  the Cosmos engine entirely (no `model_mode=reasoner`, no parallelism /
+  guardrails / output plumbing), so it does not meet the goal.
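The either/or rule of approach B1 (an image, a video, or neither, never both) can be illustrated with a small sketch. This is not code from the repository: `ReasonerMedia`, `classify_reasoner_media`, and the extension sets are hypothetical names chosen for illustration, and the actual wrapper layers may validate their inputs differently.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

# Hypothetical extension sets; the spec itself only names local mp4 video
# and a single still image as supported media.
VIDEO_EXTS = {".mp4"}
IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


@dataclass(frozen=True)
class ReasonerMedia:
    """Hypothetical container: at most one of image_path / video_path is set."""

    image_path: Optional[str] = None
    video_path: Optional[str] = None

    def __post_init__(self) -> None:
        # B1 rule: a prompt carries an image, a video, or neither, never both.
        if self.image_path is not None and self.video_path is not None:
            raise ValueError("a reasoner prompt may carry an image or a video, not both")


def classify_reasoner_media(vision_path: Optional[str]) -> ReasonerMedia:
    """Route a vision_path to the image lane, the video lane, or text-only."""
    if vision_path is None:
        return ReasonerMedia()  # text-only prompt
    ext = Path(vision_path).suffix.lower()
    if ext in VIDEO_EXTS:
        return ReasonerMedia(video_path=vision_path)
    if ext in IMAGE_EXTS:
        return ReasonerMedia(image_path=vision_path)
    raise ValueError(f"unsupported vision_path extension: {ext!r}")
```

Keeping the two lanes as separate optional fields, rather than one list of media items, mirrors the choice of B1 over B2: the image and text-only cases stay exactly as they were, and the video case is purely additive.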
+
+## Data flow
+
+```
+inputs/reasoner/reasoner_video.json
+  { model_mode: "reasoner", prompt, vision_path: "clip.mp4", video_*: ... }
+        |
+        v  args.py: vision_path resolves; extension -> ConditionVisionMode.VIDEO (already detected)
+_get_reasoner_sample_data()
+        |  detect .mp4 -> {prompt, "reasoner_videos": [path], "