VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

Vision-Language-Action (VLA) models often struggle with precise spatial grounding and robustness due to monolithic end-to-end designs. In this project, we introduce VP-VLA, a dual-system framework that decouples high-level reasoning from low-level execution via a structured visual prompting interface, enabling more precise and reliable robotic manipulation.

VP-VLA_supp_video.mp4

Overview of VP-VLA

VP-VLA demonstrates the following features:

- Dual-System Architecture: VP-VLA decomposes robotic manipulation into:
  - System 2 Planner (high-level reasoning)
  - System 1 Controller (low-level execution)
- Visual Prompt Interface: Instead of relying solely on text, VP-VLA converts language instructions into structured visual prompts (crosshairs and bounding boxes), enabling precise spatial grounding.
- Improved Spatial Precision & Robustness: By grounding actions in visual space, the framework significantly improves performance in:
  - Novel object scenarios
  - Out-of-distribution (OOD) spatial configurations
- General Multi-Stage Manipulation: VP-VLA supports complex, multi-step tasks via:
  - Task decomposition
  - Event-driven planning
  - Dynamic visual prompt updates
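
The multi-stage behaviour listed above (task decomposition, event-driven planning, dynamic prompt updates) can be illustrated with a small control loop. This is a minimal sketch, not the repository's implementation: `Subtask`, `run_episode`, `locate`, and `subtask_done` are hypothetical names, and the two callables stand in for the segmentation model and the VLM check respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Subtask:
    description: str  # e.g. "pick up the mug"
    target: str       # object or location the visual prompt should mark


def run_episode(
    subtasks: List[Subtask],
    locate: Callable[[str, int], Optional[Box]],   # hypothetical stand-in for segmentation
    subtask_done: Callable[[Subtask, int], bool],  # hypothetical stand-in for the VLM check
    max_steps: int = 100,
) -> List[Tuple[int, str, Optional[Box]]]:
    """Return a log of (step, subtask description, prompt box) per control step."""
    log = []
    index = 0
    prompt = locate(subtasks[0].target, 0) if subtasks else None
    for step in range(max_steps):
        if index >= len(subtasks):
            break  # every subtask has completed
        current = subtasks[index]
        log.append((step, current.description, prompt))
        # A low-level controller would consume `prompt` here and emit an action.
        if subtask_done(current, step):
            # Event: subtask finished, so advance and refresh the visual prompt.
            index += 1
            if index < len(subtasks):
                prompt = locate(subtasks[index].target, step)
    return log
```

The prompt is recomputed only when a completion event fires, which is one plausible reading of "event-driven"; the actual trigger logic in VP-VLA may differ.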

News

[Apr 12th, 2026] 🔥 Code released!

[Mar 24th, 2026] 🔥 📖 Paper released!

Model

Instead of solving everything in one forward pass, VP-VLA does the following:

Language → Visual Prompts → Actions

This transforms the problem into visuomotor tracking of explicit spatial cues, improving precision and interpretability.
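
To make the "visual prompt" step concrete, the sketch below draws a bounding box and a crosshair onto an RGB frame with NumPy. It is an illustration only: the function names, colors, and line widths are assumptions, and the repository's own overlay code may render prompts differently.

```python
import numpy as np


def draw_box(image, box, color=(0, 255, 0), thickness=2):
    """Draw a rectangle outline on a copy of `image`.

    `box` is (x_min, y_min, x_max, y_max), inclusive, and is assumed to lie
    fully inside the image.
    """
    out = image.copy()
    x0, y0, x1, y1 = box
    t = thickness
    out[y0:y0 + t, x0:x1 + 1] = color          # top edge
    out[y1 - t + 1:y1 + 1, x0:x1 + 1] = color  # bottom edge
    out[y0:y1 + 1, x0:x0 + t] = color          # left edge
    out[y0:y1 + 1, x1 - t + 1:x1 + 1] = color  # right edge
    return out


def draw_crosshair(image, center, half_length=6, color=(255, 0, 0)):
    """Draw a one-pixel-wide crosshair centered at `center` = (x, y)."""
    out = image.copy()
    h, w = out.shape[:2]
    cx, cy = center
    # Clip the arms to the image bounds.
    x_lo, x_hi = max(cx - half_length, 0), min(cx + half_length + 1, w)
    y_lo, y_hi = max(cy - half_length, 0), min(cy + half_length + 1, h)
    out[cy, x_lo:x_hi] = color  # horizontal arm
    out[y_lo:y_hi, cx] = color  # vertical arm
    return out


frame = np.zeros((64, 64, 3), dtype=np.uint8)
overlay = draw_crosshair(draw_box(frame, (10, 10, 30, 30)), (20, 20))
```

Both helpers return a new array, so the original frame stays available alongside the overlaid one.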

Installation

First, prepare the training environment. We follow the same installation procedure as starVLA:

git clone https://github.com/JIA-Lab-research/VP-VLA.git
cd VP-VLA
# Create conda environment
conda create -n starVLA python=3.10 -y
conda activate starVLA

# Install requirements
pip install -r requirements.txt

# Install FlashAttention2
pip install flash-attn --no-build-isolation

# Install StarVLA
pip install -e .

Next, construct the evaluation environment:

Robocasa-Tabletop: First follow the RoboCasa installation guide in starVLA to install the base robocasa environment.

Note: Pin the robosuite package to version 1.5.1:

pip install robosuite==1.5.1

SimplerEnv: First follow the SimplerEnv installation guide in starVLA to install the SimplerEnv environment.

SAM Environment: Follow the SAM3 installation guide in sam3 to install the SAM 3 environment. Additionally, install transformers from source to support loading the VLM:

pip install git+https://github.com/huggingface/transformers

Pre-trained Models

Download the base VLM model, Qwen3-VL-4B-Instruct, and place it under ./playground/Pretrained_models/Qwen3-VL-4B-Instruct.
Download the SAM 3 checkpoint, then update the default checkpoint path in examples/Robocasa_tabletop/visual_prompt_utility/sam3_server.py to point to it.

Evaluation

VP-VLA evaluation requires three concurrent services — a Policy Server (the trained VLA model), a SAM3 Server (for visual prompt generation), and a VLM Server (for task decomposition and subtask detection) — plus the environment-specific evaluation entry point.

Prerequisites

Each benchmark's evaluation relies on three separate conda environments (the simulation environment depends on the benchmark):

| Environment | Purpose |
| --- | --- |
| `starVLA` | Policy server |
| `sam3` | SAM3 + VLM servers |
| `simpler_env` (SimplerEnv only) | Simulation environment |
| `robocasa` (Robocasa only) | Simulation environment |

SimplerEnv

cd VP-VLA

# Edit the python paths and SimplerEnv path in the script first
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash examples/SimplerEnv/eval_files/auto_eval_scripts/run_eval.sh /path/to/checkpoint.pt

The script automatically:

Launches policy / SAM3 / VLM servers on each GPU
Runs all bridge tasks with visual prompting
Collects results and saves overlay videos

Modify the following paths at the top of the script:

star_vla_python: path to starVLA conda python
sim_python: path to simpler_env conda python
sam3_python: path to SAM3 conda python
SimplerEnv_PATH: path to the SimplerEnv repository that you cloned

Robocasa

cd VP-VLA

# Edit the python paths in the script first
bash examples/Robocasa_tabletop/eval_files/run_eval.sh /path/to/checkpoint.pt

The script automatically:

Launches policy / SAM3 / VLM servers (one per GPU)
Dispatches 24 tabletop environments across GPUs
Monitors for crashes and re-queues failed evaluations (up to 5 retries)
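
The crash-monitoring behaviour described above can be pictured as a retry queue. The following is a minimal sketch under assumptions, not the logic of `run_eval.sh`: `run_with_retries` and `run_job` are hypothetical names, and a crash is modelled as a raised exception.

```python
from collections import deque


def run_with_retries(jobs, run_job, max_retries=5):
    """Run each job, re-queueing failures until `max_retries` is exhausted.

    Returns (finished, failed). A job is attempted at most 1 + max_retries times.
    """
    queue = deque((job, 0) for job in jobs)  # (job, retries used so far)
    finished, failed = [], []
    while queue:
        job, retries = queue.popleft()
        try:
            run_job(job)
        except Exception:
            if retries < max_retries:
                # Put the job at the back so other environments are not blocked.
                queue.append((job, retries + 1))
            else:
                failed.append(job)
        else:
            finished.append(job)
    return finished, failed
```

Re-queueing at the back of the queue, rather than retrying immediately, keeps a repeatedly crashing environment from starving the others.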

Modify the following paths at the top of the script:

starVLA_PYTHON: path to starVLA conda python
ROBOCASA_PYTHON: path to robocasa conda python
SAM3_PYTHON: path to SAM3 conda python

Data Preparation

VP-VLA requires pre-computed visual prompt data for training. The data preparation pipeline uses a VLM (for subtask decomposition and target identification) and SAM3 (for segmentation-based visual prompt generation) to process each episode in the dataset. The scripts below require the frames to be pre-extracted from the dataset. The pre-extracted frames are also used during training for visual prompt prediction.
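
As an illustration of segmentation-based prompt generation, the sketch below converts a binary mask into a bounding box and a crosshair center. It assumes the mask is a 2-D array and uses the mask centroid as the crosshair position; the helper name `mask_to_prompt`, the output layout, and the centroid choice are assumptions rather than the repository's documented behaviour.

```python
import numpy as np


def mask_to_prompt(mask):
    """Derive a box and a crosshair center from a 2-D binary mask.

    Returns None when the mask is empty (the target was not found).
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    # Assumption: place the crosshair at the mask centroid, rounded to a pixel.
    center = (int(round(float(xs.mean()))), int(round(float(ys.mean()))))
    return {"box": box, "center": center}


mask = np.zeros((20, 20), dtype=bool)
mask[5:10, 8:13] = True
prompt = mask_to_prompt(mask)
```

For concave objects the centroid can fall outside the mask, so a real pipeline might prefer a point guaranteed to lie on the object.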

Robocasa

cd VP-VLA

# Edit the paths at the top of the script first (DATASET_ROOT, FRAMES_ROOT, OUTPUT_DIR, SAM_MODEL_PATH, VLM_MODEL_PATH)
bash data_preparation/Robocasa_tabletop/run_parallel_processing.sh [NUM_GPUS] [SAM_SERVERS_PER_GPU] [VLM_SERVERS_PER_GPU] [WORKERS_PER_TASK]

The script:

Launches multiple SAM3 and VLM servers across GPUs
Distributes episode processing across worker processes, each paired with a dedicated SAM+VLM server
Outputs .npz files containing visual prompt overlays for each episode
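
One way to picture the worker-to-server pairing is a simple static plan that shards episodes round-robin. This sketch simplifies the script's separate SAM and VLM server counts into a single paired count per GPU; `plan_workers`, the port numbering, and the dictionary layout are all hypothetical.

```python
def plan_workers(num_gpus, servers_per_gpu, episodes, base_port=9000):
    """Assign each worker a GPU, a dedicated SAM/VLM port pair, and an episode shard."""
    num_workers = num_gpus * servers_per_gpu
    workers = []
    for worker_id in range(num_workers):
        workers.append({
            "worker_id": worker_id,
            "gpu": worker_id // servers_per_gpu,
            "sam_port": base_port + 2 * worker_id,      # hypothetical port scheme
            "vlm_port": base_port + 2 * worker_id + 1,
            "episodes": episodes[worker_id::num_workers],  # round-robin shard
        })
    return workers


plan = plan_workers(num_gpus=2, servers_per_gpu=2, episodes=list(range(10)))
```

Round-robin sharding keeps shard sizes within one episode of each other without needing to know episode lengths in advance.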

Required paths to configure:

DATASET_ROOT: path to the Robocasa dataset (LeRobot format)
FRAMES_ROOT: path to pre-extracted JPEG frames
OUTPUT_DIR: output directory for visual prompt .npz files
SAM_MODEL_PATH: path to the SAM3 checkpoint
VLM_MODEL_PATH: path to the Qwen3-VL-4B-Instruct checkpoint
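
For readers unfamiliar with the .npz format, the sketch below round-trips per-frame prompt data through `numpy.savez_compressed` and `numpy.load`. The array names (`boxes`, `centers`, `subtask_ids`) are hypothetical; the actual keys and contents are defined by the data preparation scripts.

```python
import io

import numpy as np


def save_prompts(file, boxes, centers, subtask_ids):
    """Write per-frame prompt arrays to a compressed .npz file or file object."""
    np.savez_compressed(
        file,
        boxes=np.asarray(boxes, dtype=np.int32),          # (num_frames, 4)
        centers=np.asarray(centers, dtype=np.int32),      # (num_frames, 2)
        subtask_ids=np.asarray(subtask_ids, dtype=np.int32),
    )


def load_prompts(file):
    """Load every array from an .npz archive into a plain dict."""
    with np.load(file) as data:
        return {key: data[key] for key in data.files}


buffer = io.BytesIO()  # an in-memory stand-in for a file under OUTPUT_DIR
save_prompts(buffer, [(8, 5, 12, 9), (9, 5, 13, 9)], [(10, 7), (11, 7)], [0, 0])
buffer.seek(0)
loaded = load_prompts(buffer)
```

To inspect a real output file, passing its path to `np.load` and listing `data.files` shows which arrays it contains.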

SimplerEnv (OXE)

cd VP-VLA

# Edit the paths at the top of the script first (DATASETS_ROOT, OUTPUT_DIR, SAM_MODEL_PATH, VLM_MODEL_PATH)
bash data_preparation/SimplerEnv/run_parallel_processing.sh [NUM_GPUS] [SAM_SERVERS_PER_GPU] [VLM_SERVERS_PER_GPU]

The script processes bridge_orig_lerobot and fractal20220817_data_lerobot datasets with the same parallel SAM+VLM server architecture.

Required paths to configure:

DATASETS_ROOT: path to the OXE datasets directory
OUTPUT_DIR: output directory for visual prompt .npz files
SAM_MODEL_PATH: path to the SAM3 checkpoint
VLM_MODEL_PATH: path to the Qwen3-VL-4B-Instruct checkpoint

Training

VP-VLA's System 1 Controller is trained with two concurrent objectives:

VLA action prediction — predicts continuous robot actions from observations with visual prompt overlays
VP location prediction — predicts visual prompt coordinates (crosshair center, bounding box) from the overlaid image

RoboCasa

cd VP-VLA
bash examples/Robocasa_tabletop/train_files/run_train.sh

Before running, modify the following paths in the script:

base_vlm: path to the Qwen3-VL checkpoint
data_root_dir: path to the RoboCasa dataset (LeRobot format)
visual_prompt_dir: path to the pre-computed visual prompt data
extracted_frames_dir: path to the pre-extracted frame images for VP prediction

Config file: examples/Robocasa_tabletop/train_files/starvla_cotrain_robocasa_visual_prompt.yaml

SimplerEnv (OXE)

cd VP-VLA
bash examples/SimplerEnv/train_files/run_train.sh

Before running, modify the following paths in the script:

base_vlm: path to the Qwen3-VL checkpoint
data_root_dir: path to the OXE dataset (LeRobot format)
visual_prompt_dir: path to the pre-computed visual prompt data
extracted_frames_dir: path to the pre-extracted frame images for VP prediction

Config file: examples/SimplerEnv/train_files/starvla_cotrain_oxe_visual_prompt.yaml

Key Training Arguments

| Argument | Description |
| --- | --- |
| `--framework.name` | Model framework (default: `QwenOFT`) |
| `--framework.qwenvl.base_vlm` | Path to base VLM checkpoint |
| `--datasets.vla_data.data_mix` | Dataset mixture name |
| `--datasets.vla_data.feed_both_images true` | Feed both original and overlaid images to the VLA |
| `--trainer.loss_scale.visual_prompt` | Loss weight for VP prediction (default: `0.1`) |
| `--trainer.max_train_steps` | Total training steps |
| `--trainer.learning_rate.base` | Base learning rate |
| `--trainer.learning_rate.qwen_vl_interface` | Learning rate for the Qwen VL interface |
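
The role of the VP loss weight can be sketched as a weighted sum of the two training objectives. This is an assumption about how the terms combine: the L1 losses and the name `cotrain_loss` are illustrative, and only the default weight of 0.1 comes from the table above.

```python
import numpy as np


def l1(pred, target):
    """Mean absolute error between two array-likes of the same shape."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean(np.abs(pred - target)))


def cotrain_loss(action_pred, action_target, vp_pred, vp_target, vp_weight=0.1):
    """Return (total, action_loss, vp_loss) with total = action + vp_weight * vp."""
    action_loss = l1(action_pred, action_target)
    vp_loss = l1(vp_pred, vp_target)
    return action_loss + vp_weight * vp_loss, action_loss, vp_loss


total, action_loss, vp_loss = cotrain_loss(
    action_pred=[0.0, 0.0], action_target=[1.0, 1.0],
    vp_pred=[0.2, 0.2, 0.6, 0.6], vp_target=[0.4, 0.4, 0.8, 0.8],
)
```

With a small weight such as 0.1, the VP term acts as an auxiliary signal while action prediction dominates the gradient.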

Citation

@article{wang2026vp,
  title={VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models},
  author={Wang, Zixuan and Chen, Yuxin and Liu, Yuqi and Ye, Jinhui and Chen, Pengguang and Lu, Changsheng and Liu, Shu and Jia, Jiaya},
  journal={arXiv preprint arXiv:2603.22003},
  year={2026}
}

Acknowledgement

We would like to thank the following repos for their great work:

This work is built upon starVLA
This work utilizes models from Qwen3-VL and SAM3
