Video action models are an appealing foundation for Vision–Language–Action systems because they can learn visual dynamics from large-scale video data and transfer this knowledge to downstream robot control. Yet current diffusion-based video predictors are trained with likelihood-surrogate objectives, which encourage globally plausible predictions without explicitly optimizing the precision-critical visual dynamics needed for manipulation. This objective mismatch often leads to subtle errors in object pose, spatial relations, and contact timing that can be amplified by downstream policies. We propose VAMPO, a post-training framework that directly improves visual dynamics in video action models through policy optimization. Our key idea is to formulate multi-step denoising as a sequential decision process and optimize the denoising policy with rewards defined over expert visual dynamics in latent space. To make this optimization practical, we introduce an Euler Hybrid sampler that injects stochasticity only at the first denoising step, enabling tractable low-variance policy-gradient estimation while preserving the coherence of the remaining denoising trajectory. We further combine this design with GRPO and a verifiable non-adversarial reward based on L1 distance and cosine similarity. Across diverse simulated and real-world manipulation tasks, VAMPO improves task-relevant visual dynamics, leading to better downstream action generation and stronger generalization.
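As a rough illustration of the reward and GRPO ingredients named above, the sketch below scores predicted latents against expert latents with an L1 term and a cosine-similarity term, then normalizes the rewards within a group of samples. It is not the repository's implementation: the function names, the weights `w_l1` and `w_cos`, the sign convention, and the use of flat Python lists as latents are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev


def l1_distance(pred, target):
    """Mean absolute difference between two equal-length latent vectors."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def cosine_similarity(pred, target, eps=1e-8):
    """Cosine of the angle between two latent vectors (eps avoids division by zero)."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    return dot / (norm + eps)


def latent_reward(pred, target, w_l1=1.0, w_cos=1.0):
    """Hypothetical reward: lower L1 distance and higher cosine similarity score higher."""
    return -w_l1 * l1_distance(pred, target) + w_cos * cosine_similarity(pred, target)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy usage: three sampled predictions for one expert latent (values are made up).
expert = [1.0, 0.0, -1.0]
samples = [[1.0, 0.0, -1.0], [0.5, 0.0, -0.5], [-1.0, 0.0, 1.0]]
rewards = [latent_reward(s, expert) for s in samples]
advantages = group_relative_advantages(rewards)
```

Samples closer to the expert latent receive positive advantages and samples further away receive negative ones, which is the signal a policy-gradient update would use.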
- Inference and evaluation code on Calvin
- Reinforcement learning post-training code
conda create -n VAMPO python==3.10
conda activate VAMPO
# Install calvin as described in (https://github.com/mees/calvin).
git clone --recurse-submodules https://github.com/mees/calvin.git
export CALVIN_ROOT=$(pwd)/calvin
cd $CALVIN_ROOT
sh install.sh
# Install VAMPO requirements
cd ..
pip install -r requirements.txt

| Ckpt name | Training type | Size |
|---|---|---|
| VPP_svd | Baseline SVD video model | ~8G |
| VAMPO_svd | SVD video model trained by our method | ~8G |
| VAMPO_policy | Action model trained on the annotated CALVIN ABC dataset | ~1G |
| clip-vit-base-patch32 | CLIP text encoder | ~600M |
First, follow the instructions in the official CALVIN repo to install the CALVIN environments and download the official CALVIN ABC-D dataset (about 500 GB).
Next, download the VAMPO_svd video model and the VAMPO_policy action model. In the script, set video_model_folder and action_model_folder to the folders where you saved the models.
bash scripts/eval_calvin.sh

If you only want to try RL training in VAMPO, download the VPP_svd checkpoint and skip this section.
(1) Since the video diffusion model runs in the latent space of the image encoder, first extract the latents of each video. This saves GPU memory and reduces training time. Run step1_prepare_latent_data.py to prepare the latents. The dataset format should be similar to video_dataset_instance.
You can also directly download the extracted features for something-something-v2, bridge, rt1 and calvin from the Hugging Face dataset vpp_svd_latent.
(2) After preparing the latents, set the following parameters in video_conf/train_svd.yaml: dataset_dir is the root path of the datasets; dataset lists the video datasets used for finetuning, joined with +; prob is the sampling ratio of each dataset.
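To make the meaning of dataset and prob concrete, here is a minimal sketch of how a +-joined dataset string and per-dataset sampling ratios could drive dataset selection. It does not reproduce the repository's loader: `parse_mixture` and `sample_datasets` are hypothetical helpers, the dataset names are placeholders, and passing prob as a plain list of floats is an assumption, since the exact YAML format is not shown here.

```python
import random


def parse_mixture(dataset, prob):
    """Split a '+'-joined dataset string and pair it with normalized sampling ratios.

    `prob` is assumed to be a list of non-negative floats, one per dataset.
    """
    names = dataset.split("+")
    if len(names) != len(prob):
        raise ValueError("need exactly one sampling ratio per dataset")
    total = float(sum(prob))
    if total <= 0:
        raise ValueError("sampling ratios must sum to a positive value")
    weights = [p / total for p in prob]
    return names, weights


def sample_datasets(names, weights, k, rng=random):
    """Draw k dataset names, each chosen according to its sampling ratio."""
    return rng.choices(names, weights=weights, k=k)


# Toy usage with placeholder names and ratios.
names, weights = parse_mixture("calvin+bridge", [0.75, 0.25])
draws = sample_datasets(names, weights, k=1000, rng=random.Random(0))
```

With these ratios, roughly three out of four training samples would come from the first dataset.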
accelerate launch --main_process_port 29506 step1_train_svd.py --config video_conf/train_calvin_svd.yaml --pretrained_model_path ${path to svd-robot}

After downloading the finetuned SVD checkpoint or training it from scratch, you can modify the configs and start RL training!
bash scripts/train_calvin_svd_grpo.sh

Important: We strongly encourage you to check the video prediction results before policy learning, since policy performance depends heavily on video prediction quality. Some samples are automatically saved during training. You can also generate more predictions by following the instructions in the video prediction section.
Set the argument video_model_path to the video model you finetuned, root_data_dir to the location of the Calvin-ABC dataset, and text_encoder_path to the path of clip-vit-base-patch32.
accelerate launch step2_train_action_calvin.py --root_data_dir ${path to Calvin dataset} --video_model_path ${path to video model} --text_encoder_path ${path to clip}

VAMPO is developed from Video Prediction Policy. We thank the authors for their efforts!
If you find this project useful in your research, please cite:
@article{ge2026vampo,
title={VAMPO: Policy Optimization for Improving Visual Dynamics in Video Action Models},
author={Ge, Zirui and Ding, Pengxiang and Yin, Baohua and Wang, Qishen and Xie, Zhiyong and Wang, Yemin and Wang, Jinbo and Li, Hengtao and Suo, Runze and Song, Wenxuan and Zhao, Han and Lyu, Shangke and Fan, Zhaoxin and Li, Haoang and Cheng, Ran and Chi, Cheng and Ge, Huibin and Luo, Yaozhi and Wang, Donglin},
journal={arXiv preprint arXiv:2603.19370},
year={2026}
}