Hi, thank you for open-sourcing this great project!
I have a question regarding the post-training/SFT support for the Cosmos 3 Reasoner.
In the previous Cosmos Reason2, there were guidelines on performing LoRA SFT using trl and cosmos-rl. For Cosmos 3 Reasoner, I noticed that SFT is now supported through the cosmos-framework.
While reviewing the training documentation (https://github.com/NVIDIA/cosmos-framework/blob/main/docs/training.md), I had a question about the starting weights used in the examples:
- In the "Reasoner Alignment SFT with LLaVA-OneVision (vfm-vlm)" example, the backbone used is Qwen/Qwen3-VL-8B-Instruct. Why does this process not start directly from the Cosmos 3 Reasoner weights? Additionally, what does "vfm-vlm" at the end of this example's title stand for?
- In contrast, the "Reasoner Alignment SFT with VideoPhy-2 (Cosmos3-Nano)" example seems to start with the Cosmos 3 Nano weights. Could you please explain the key differences between these two examples and the reasoning behind using different starting weights for them?
Also, in this setup, Qwen's vision encoder is frozen and only the LM of Cosmos 3 is used. Could you share the reasoning behind this design choice?
- Lastly, as with previous Cosmos models, are there plans to release recipes that use the cosmos-framework in the cosmos-cookbook?
Thank you so much for your time and support!