Vision-Language-Action (VLA) models often struggle with precise spatial grounding and robustness due to monolithic end-to-end designs. In this project, we introduce VP-VLA, which decouples high-level reasoning from low-level execution via a structured visual prompting interface, enabling more precise and reliable robotic manipulation.
VP-VLA_supp_video.mp4
VP-VLA demonstrates the following features:
- Dual-System Architecture: VP-VLA decomposes robotic manipulation into:
  - System 2 Planner (high-level reasoning)
  - System 1 Controller (low-level execution)
- Visual Prompt Interface: Instead of relying solely on text, VP-VLA converts language instructions into structured visual prompts (crosshairs and bounding boxes), enabling precise spatial grounding.
- Improved Spatial Precision & Robustness: By grounding actions in visual space, the framework significantly improves performance in:
  - Novel object scenarios
  - Out-of-distribution (OOD) spatial configurations
- General Multi-Stage Manipulation: VP-VLA supports complex, multi-step tasks via:
  - Task decomposition
  - Event-driven planning
  - Dynamic visual prompt updates
[Apr 12th, 2026] 🔥 Code released!
[Mar 24th, 2026] 🔥 Paper released!
Instead of solving everything in one forward pass, VP-VLA does the following:
- Language → Visual Prompts → Actions
This transforms the problem into visuomotor tracking of explicit spatial cues, improving precision and interpretability.
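The prompt-rendering idea can be sketched as a minimal overlay function: given a crosshair center and a bounding box in pixel coordinates, it burns the cues directly into the image. The shapes and color here are illustrative assumptions, not VP-VLA's actual rendering (which is driven by SAM3 segmentation):

```python
import numpy as np

def overlay_visual_prompt(image, center, bbox, color=(255, 0, 0)):
    """Draw a crosshair at `center` (cx, cy) and a rectangle at
    `bbox` (x0, y0, x1, y1) onto an HxWx3 RGB image; returns a copy."""
    out = image.copy()
    cx, cy = center
    x0, y0, x1, y1 = bbox
    out[cy, :, :] = color          # horizontal crosshair line
    out[:, cx, :] = color          # vertical crosshair line
    out[y0, x0:x1, :] = color      # bounding-box top edge
    out[y1 - 1, x0:x1, :] = color  # bounding-box bottom edge
    out[y0:y1, x0, :] = color      # bounding-box left edge
    out[y0:y1, x1 - 1, :] = color  # bounding-box right edge
    return out
```

The System 1 controller then only has to track these explicit spatial cues rather than re-ground the language instruction at every step.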
First, prepare the training environment. We follow the same installation as starVLA:

```bash
git clone https://github.com/JIA-Lab-research/VP-VLA.git
cd VP-VLA

# Create conda environment
conda create -n starVLA python=3.10 -y
conda activate starVLA

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install StarVLA
pip install -e .
```

Next, construct the evaluation environment:
- Robocasa-Tabletop: Please first follow the RoboCasa installation guide in starVLA to install the base robocasa environment. Note: the robosuite package must be pinned to version 1.5.1:

  ```bash
  pip install robosuite==1.5.1
  ```

- SimplerEnv: Please first follow the SimplerEnv installation guide in starVLA to install the SimplerEnv environment.

- SAM Environment: Please follow the SAM3 installation guide in sam3 to install the SAM 3 environment. Additionally, install transformers from source to support loading the VLM:

  ```bash
  pip install git+https://github.com/huggingface/transformers
  ```
- Download the base VLM model Qwen3-VL-4B-Instruct and place it under `./playground/Pretrained_models/Qwen3-VL-4B-Instruct`.
- Download the SAM 3 checkpoint. Remember to change the default path in `examples/Robocasa_tabletop/visual_prompt_utility/sam3_server.py`.
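With the environments and checkpoints in place, a quick import check can confirm that each conda environment resolved its dependencies. This is a minimal sketch; the package list passed in is an assumption about what the stack needs, not an official manifest:

```python
import importlib.util

def check_env(packages):
    """Return which of the given top-level packages are importable
    in the currently active (conda) environment."""
    return {pkg: importlib.util.find_spec(pkg) is not None for pkg in packages}

# Example (package names are illustrative):
#   check_env(["torch", "transformers", "flash_attn"])
```

Run it once inside each environment (`starVLA`, `sam3`, and the simulation env) before launching any servers.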
VP-VLA evaluation requires three concurrent services: a Policy Server (the trained VLA model), a SAM3 Server (for visual prompt generation), and a VLM Server (for task decomposition and subtask detection), plus the environment-specific evaluation entry point.
The evaluation relies on three separate conda environments:
| Environment | Purpose |
|---|---|
| `starVLA` | Policy server |
| `sam3` | SAM3 + VLM servers |
| `simpler_env` (SimplerEnv only) | Simulation environment |
| `robocasa` (Robocasa only) | Simulation environment |
```bash
cd VP-VLA
# Edit the python paths and SimplerEnv path in the script first
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash examples/SimplerEnv/eval_files/auto_eval_scripts/run_eval.sh /path/to/checkpoint.pt
```

The script automatically:
- Launches policy / SAM3 / VLM servers on each GPU
- Runs all bridge tasks with visual prompting
- Collects results and saves overlay videos
Modify the following paths at the top of the script:
- `star_vla_python`: path to starVLA conda python
- `sim_python`: path to simpler_env conda python
- `sam3_python`: path to SAM3 conda python
- `SimplerEnv_PATH`: path to the SimplerEnv repository that you cloned
```bash
cd VP-VLA
# Edit the python paths in the script first
bash examples/Robocasa_tabletop/eval_files/run_eval.sh /path/to/checkpoint.pt
```

The script automatically:
- Launches policy / SAM3 / VLM servers (one per GPU)
- Dispatches 24 tabletop environments across GPUs
- Monitors for crashes and re-queues failed evaluations (up to 5 retries)
Modify the following paths at the top of the script:
- `starVLA_PYTHON`: path to starVLA conda python
- `ROBOCASA_PYTHON`: path to robocasa conda python
- `SAM3_PYTHON`: path to SAM3 conda python
VP-VLA requires pre-computed visual prompt data for training. The data preparation pipeline uses a VLM (for subtask decomposition and target identification) and SAM3 (for segmentation-based visual prompt generation) to process each episode in the dataset. The scripts below require the frames to be pre-extracted from the dataset; the pre-extracted frames are also used during training for visual prompt prediction.
```bash
cd VP-VLA
# Edit the paths at the top of the script first (DATASET_ROOT, FRAMES_ROOT, OUTPUT_DIR, SAM_MODEL_PATH, VLM_MODEL_PATH)
bash data_preparation/Robocasa_tabletop/run_parallel_processing.sh [NUM_GPUS] [SAM_SERVERS_PER_GPU] [VLM_SERVERS_PER_GPU] [WORKERS_PER_TASK]
```

The script:
- Launches multiple SAM3 and VLM servers across GPUs
- Distributes episode processing across worker processes, each paired with a dedicated SAM+VLM server
- Outputs `.npz` files containing visual prompt overlays for each episode
Required paths to configure:
- `DATASET_ROOT`: path to the Robocasa dataset (LeRobot format)
- `FRAMES_ROOT`: path to pre-extracted JPEG frames
- `OUTPUT_DIR`: output directory for visual prompt `.npz` files
- `SAM_MODEL_PATH`: path to the SAM3 checkpoint
- `VLM_MODEL_PATH`: path to the Qwen3-VL-4B-Instruct checkpoint
```bash
cd VP-VLA
# Edit the paths at the top of the script first (DATASETS_ROOT, OUTPUT_DIR, SAM_MODEL_PATH, VLM_MODEL_PATH)
bash data_preparation/SimplerEnv/run_parallel_processing.sh [NUM_GPUS] [SAM_SERVERS_PER_GPU] [VLM_SERVERS_PER_GPU]
```

The script processes the bridge_orig_lerobot and fractal20220817_data_lerobot datasets with the same parallel SAM+VLM server architecture.
Required paths to configure:
- `DATASETS_ROOT`: path to the OXE datasets directory
- `OUTPUT_DIR`: output directory for visual prompt `.npz` files
- `SAM_MODEL_PATH`: path to the SAM3 checkpoint
- `VLM_MODEL_PATH`: path to the Qwen3-VL-4B-Instruct checkpoint
VP-VLA's System 1 Controller is trained with two concurrent objectives:
- VLA action prediction: predicts continuous robot actions from observations with visual prompt overlays
- VP location prediction: predicts visual prompt coordinates (crosshair center, bounding box) from the overlayed image
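The two objectives above can be sketched as a weighted sum. The MSE loss forms are illustrative assumptions, not the repo's actual implementation; only the 0.1 weight mirrors the `--trainer.loss_scale.visual_prompt` default:

```python
import numpy as np

def cotrain_loss(pred_actions, gt_actions, pred_vp, gt_vp, vp_scale=0.1):
    """Combine the two concurrent System 1 objectives into one scalar.

    vp_scale mirrors the --trainer.loss_scale.visual_prompt default (0.1);
    the MSE forms themselves are assumptions for illustration.
    """
    action_loss = np.mean((pred_actions - gt_actions) ** 2)  # continuous actions
    vp_loss = np.mean((pred_vp - gt_vp) ** 2)                # crosshair/bbox coords
    return action_loss + vp_scale * vp_loss
```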
```bash
cd VP-VLA
bash examples/Robocasa_tabletop/train_files/run_train.sh
```

Before running, modify the following paths in the script:
- `base_vlm`: path to the Qwen3-VL checkpoint
- `data_root_dir`: path to the RoboCasa dataset (LeRobot format)
- `visual_prompt_dir`: path to the pre-computed visual prompt data
- `extracted_frames_dir`: path to the pre-extracted frame images for VP prediction
Config file: examples/Robocasa_tabletop/train_files/starvla_cotrain_robocasa_visual_prompt.yaml
```bash
cd VP-VLA
bash examples/SimplerEnv/train_files/run_train.sh
```

Before running, modify the following paths in the script:
- `base_vlm`: path to the Qwen3-VL checkpoint
- `data_root_dir`: path to the OXE dataset (LeRobot format)
- `visual_prompt_dir`: path to the pre-computed visual prompt data
- `extracted_frames_dir`: path to the pre-extracted frame images for VP prediction
Config file: examples/SimplerEnv/train_files/starvla_cotrain_oxe_visual_prompt.yaml
| Argument | Description |
|---|---|
| `--framework.name` | Model framework (default: QwenOFT) |
| `--framework.qwenvl.base_vlm` | Path to base VLM checkpoint |
| `--datasets.vla_data.data_mix` | Dataset mixture name |
| `--datasets.vla_data.feed_both_images true` | Feed both original and overlayed images to the VLA |
| `--trainer.loss_scale.visual_prompt` | Loss weight for VP prediction (default: 0.1) |
| `--trainer.max_train_steps` | Total training steps |
| `--trainer.learning_rate.base` | Base learning rate |
| `--trainer.learning_rate.qwen_vl_interface` | Learning rate for the Qwen VL interface |
```bibtex
@article{wang2026vp,
  title={VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models},
  author={Wang, Zixuan and Chen, Yuxin and Liu, Yuqi and Ye, Jinhui and Chen, Pengguang and Lu, Changsheng and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2603.22003},
  year={2026}
}
```

We would like to thank the following repos for their great work:

