Question: Support for Cosmos 3 Reasoner Post-training

Hi, thank you for open-sourcing this great project!

I have a question regarding the post-training/SFT support for the Cosmos 3 Reasoner.

For the previous Cosmos Reason2, there were guidelines on performing LoRA SFT using `trl` and `cosmos-rl`. For Cosmos 3 Reasoner, I noticed that SFT is now supported through the `cosmos-framework`.

While reviewing the training documentation (https://github.com/NVIDIA/cosmos-framework/blob/main/docs/training.md), I had a few questions, mainly about the starting weights used in the examples:

1. In the **"Reasoner Alignment SFT with LLaVA-OneVision (vfm-vlm)"** example, the backbone used is `Qwen/Qwen3-VL-8B-Instruct`. Why does this process not start directly from the Cosmos 3 Reasoner weights? Additionally, what does "**vfm-vlm**" at the end of this example's title stand for?
2. In contrast, the **"Reasoner Alignment SFT with VideoPhy-2 (Cosmos3-Nano)"** example seems to start from the Cosmos 3 Nano weights. Could you please explain the key differences between these two examples and the reasoning behind using different starting weights for them?
   Additionally, in this setup, Qwen's vision encoder is frozen and only the LM of Cosmos 3 is used. Could you share the reasoning behind this design choice?
3. Lastly, similar to previous Cosmos models, are there plans to release recipes utilizing the `cosmos-framework` in the `cosmos-cookbook`?

Thank you so much for your time and support!
