21 changes: 13 additions & 8 deletions docs/training.md
@@ -79,7 +79,7 @@ uvx hf@latest download Wan-AI/Wan2.2-TI2V-5B Wan2.2_VAE.pth \

<details><summary><b>Reasoner Alignment SFT with LLaVA-OneVision (vfm-vlm)</b></summary>

Alignment SFT for the Reasoner variant on the [lmms-lab/LLaVA-OneVision-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) dataset (streamed from HF Hub). Skips Step 2: the backbone is `Qwen/Qwen3-VL-8B-Instruct` (set by the parent experiment's `vlm_policy=qwen3_vl_8b_instruct` default) and is fetched from the HF Hub by the model downloader at startup — no DCP conversion needed and no env-var plumbing required.
Alignment SFT for the Reasoner variant on the [lmms-lab/LLaVA-OneVision-Data](https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data) dataset (streamed from HF Hub). Skips Step 2: by default the backbone `Qwen/Qwen3-VL-8B-Instruct` is fetched from the HF Hub by the model downloader at startup — no DCP conversion needed and no required env vars. To instead start from a merged Cosmos3 reasoner snapshot (Cosmos3-Nano LM merged onto the Qwen3-VL visual tower), build it with `convert_model_to_vlm_safetensors` (see [Step 2](#step-2--prepare-checkpoint)) and point `VLM_SAFETENSORS_PATH` at it — same mechanism as the VideoPhy-2 recipe below.

Launch shell: `examples/launch_sft_llava_ov.sh`

@@ -91,6 +91,11 @@ Launch shell: `examples/launch_sft_llava_ov.sh`
# (optional) HF_TOKEN raises HF Hub rate limits for the streamed dataset
# revision lookup — useful if you're running 8-rank fan-out from a single IP:
# export HF_TOKEN=hf_...
#
# (optional) VLM_SAFETENSORS_PATH starts training from a local pre-converted
# Qwen3-VL safetensors snapshot (e.g. Cosmos3-Nano LM merged with the Qwen3-VL
# visual tower) instead of the public HF backbone:
# export VLM_SAFETENSORS_PATH=$PWD/examples/checkpoints/Cosmos3-Nano-VLM
```

</details>
@@ -127,7 +132,7 @@ python -m cosmos_framework.scripts.convert_model_to_dcp \

`$BASE_CHECKPOINT_NAME` (e.g. `Cosmos3-Nano`, `Cosmos3-Super`) is a registered name in the checkpoint catalog; the converter downloads the matching repo from the Hugging Face Hub and writes the DCP into `examples/checkpoints/$BASE_CHECKPOINT_NAME`.

**Reasoner Alignment SFT with LLaVA-OneVision (vfm-vlm):** Skip this step — the Reasoner alignment SFT loads `Qwen/Qwen3-VL-8B-Instruct` from the HF Hub at startup (no DCP conversion, no env vars).
**Reasoner Alignment SFT with LLaVA-OneVision (vfm-vlm):** Skip this step — the Reasoner alignment SFT loads `Qwen/Qwen3-VL-8B-Instruct` from the HF Hub at startup (no DCP conversion required). To start from a merged Cosmos3 reasoner snapshot instead, build one with `convert_model_to_vlm_safetensors` (see the VideoPhy-2 note below) and pass it via `VLM_SAFETENSORS_PATH`.

**Reasoner Alignment SFT with VideoPhy-2 (Cosmos3-Nano):** Use `cosmos_framework.scripts.convert_model_to_vlm_safetensors` instead.

@@ -154,12 +159,12 @@ bash examples/launch_sft_vision_nano.sh

Each launcher's default paths come from the `DATASET_PATH` + `BASE_CHECKPOINT_PATH` defaults declared at the top of its `.sh` (each uses `: "${VAR:=…}"` so any value you `export` in the shell before launching wins over the default):

| Launch shell | Post-Training Task | Default $DATASET_PATH (under examples/data/) | Default $BASE_CHECKPOINT_PATH (under examples/checkpoints/) |
| ------------------------------ | ------------------ | ---------------------------------------------------------- | ----------------------------------------------------------- |
| `launch_sft_vision_nano.sh` | Generator SFT | `BridgeData2-Subset-Synthetic-Captions/sft_dataset_bridge` | `Cosmos3-Nano` |
| `launch_sft_vision_super.sh` | Generator SFT | `BridgeData2-Subset-Synthetic-Captions/sft_dataset_bridge` | `Cosmos3-Super` |
| `launch_sft_llava_ov.sh` | Reasoner SFT | (none; dataset streams from HF Hub) | (none; backbone fetched at startup) |
| `launch_sft_videophy2_nano.sh` | Reasoner SFT | (none; set `VIDEOPHYSICS_ROOT` env) | (none; set `VLM_SAFETENSORS_PATH` env) |
| Launch shell | Post-Training Task | Default $DATASET_PATH (under examples/data/) | Default $BASE_CHECKPOINT_PATH (under examples/checkpoints/) |
| ------------------------------ | ------------------ | ---------------------------------------------------------- | ------------------------------------------------------------------ |
| `launch_sft_vision_nano.sh` | Generator SFT | `BridgeData2-Subset-Synthetic-Captions/sft_dataset_bridge` | `Cosmos3-Nano` |
| `launch_sft_vision_super.sh` | Generator SFT | `BridgeData2-Subset-Synthetic-Captions/sft_dataset_bridge` | `Cosmos3-Super` |
| `launch_sft_llava_ov.sh` | Reasoner SFT | (none; dataset streams from HF Hub) | (none; backbone fetched at startup, or set `VLM_SAFETENSORS_PATH`) |
| `launch_sft_videophy2_nano.sh` | Reasoner SFT | (none; set `VIDEOPHYSICS_ROOT` env) | (none; set `VLM_SAFETENSORS_PATH` env) |

`WAN_VAE_PATH` defaults to `examples/checkpoints/wan22_vae/Wan2.2_VAE.pth` for every non-reasoner recipe.
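
A minimal sketch of the `: "${VAR:=…}"` default idiom described above, assuming the launcher declares its defaults roughly this way. The `/custom/...` override path is hypothetical, and the two default values are taken from the table rather than quoted from the script:

```shell
#!/usr/bin/env bash
# Sketch only: mirrors the default-assignment idiom, not the actual launcher source.

# Simulate a user who exported one variable before launching and left the other unset.
unset DATASET_PATH
export BASE_CHECKPOINT_PATH="/custom/checkpoints/Cosmos3-Nano"  # hypothetical override

# `:` is a no-op builtin; `${VAR:=default}` assigns the default only when
# VAR is unset or empty, so a pre-exported value is left untouched.
: "${DATASET_PATH:=examples/data/BridgeData2-Subset-Synthetic-Captions/sft_dataset_bridge}"
: "${BASE_CHECKPOINT_PATH:=examples/checkpoints/Cosmos3-Nano}"

echo "DATASET_PATH=$DATASET_PATH"
echo "BASE_CHECKPOINT_PATH=$BASE_CHECKPOINT_PATH"
```

Here `DATASET_PATH` picks up the default while the exported `BASE_CHECKPOINT_PATH` is kept, which is the behaviour the paragraph describes as "wins over the default".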

26 changes: 21 additions & 5 deletions examples/launch_sft_llava_ov.sh
@@ -10,16 +10,32 @@
# [job].task = "vlm" — picks cosmos_framework/configs/base/vlm/config.py as the base config.
#
# The dataset streams from the HuggingFace Hub, so DATASET_PATH /
# WAN_VAE_PATH / BASE_CHECKPOINT_PATH are NOT required; only HF_TOKEN may
# be needed for gated tokenizer downloads. Two model knobs that the
# SFTExperimentConfig dataclass does not model live in TAIL_OVERRIDES:
# WAN_VAE_PATH / BASE_CHECKPOINT_PATH are NOT required.
#
#   model.config.policy.backbone.model_name=<HF or local path>
#   data_setting.max_tokens=<int>
# Optional env:
#   HF_TOKEN              for gated Qwen3-VL-8B-Instruct downloads.
#   VLM_SAFETENSORS_PATH  local directory of pre-converted Qwen3-VL safetensors
#                         (e.g. a Cosmos3-Nano LM merged with Qwen3-VL visual via
#                         `cosmos_framework.scripts.convert_model_to_vlm_safetensors`).
#                         When set, plumbed to backbone.safetensors_path via a
#                         tail override. When unset, the framework falls back
#                         to the public Qwen/Qwen3-VL-8B-Instruct HF snapshot.
#
# Usage (8-GPU allocation, inside the training container, from the repo root):
# bash examples/launch_sft_llava_ov.sh

TOML_FILE="examples/toml/sft_config/llava_ov.toml"

TAIL_OVERRIDES=(
    ${EXTRA_TAIL_OVERRIDES:-}
)

# When VLM_SAFETENSORS_PATH is set, plumb it to backbone.safetensors_path so the
# framework loads weights from the local snapshot (e.g. a Cosmos3-Nano LM merged
# with Qwen3-VL visual via `cosmos_framework.scripts.convert_model_to_vlm_safetensors`)
# while keeping the public HF model_name for tokenizer/architecture discovery.
if [[ -n "${VLM_SAFETENSORS_PATH:-}" ]]; then
    TAIL_OVERRIDES+=("model.config.policy.backbone.safetensors_path=$VLM_SAFETENSORS_PATH")
fi

source "$(dirname "${BASH_SOURCE[0]}")/_sft_launcher_common.sh"
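
To illustrate the conditional override in isolation, here is a small sketch that reproduces only the `VLM_SAFETENSORS_PATH` branch. `build_overrides` is a hypothetical helper, not part of the launcher or `_sft_launcher_common.sh`, and `/tmp/Cosmos3-Nano-VLM` is a placeholder path:

```shell
#!/usr/bin/env bash
# Sketch only: build_overrides is a hypothetical stand-in for the launcher logic.

build_overrides() {
    local overrides=()
    # Append the safetensors override only when the env var is set and non-empty.
    if [[ -n "${VLM_SAFETENSORS_PATH:-}" ]]; then
        overrides+=("model.config.policy.backbone.safetensors_path=$VLM_SAFETENSORS_PATH")
    fi
    # Print one override per line; print nothing when the list is empty.
    if (( ${#overrides[@]} > 0 )); then
        printf '%s\n' "${overrides[@]}"
    fi
}

# Case 1: variable unset, so no override is produced and the HF backbone would be used.
unset_result="$(unset VLM_SAFETENSORS_PATH; build_overrides)"

# Case 2: variable set (placeholder path), so exactly one override is produced.
set_result="$(VLM_SAFETENSORS_PATH=/tmp/Cosmos3-Nano-VLM build_overrides)"

echo "unset: '${unset_result}'"
echo "set:   '${set_result}'"
```

The `${VLM_SAFETENSORS_PATH:-}` form keeps the check safe under `set -u`. By contrast, the unquoted `${EXTRA_TAIL_OVERRIDES:-}` in the launcher is word-split by the shell, so an individual extra override cannot contain spaces.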