
QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization

Yuhao Xu1   Yantai Yang1,2   Zhenyang Fan1   Yufan Liu3,4  
Yuming Li5   Bing Li3   Zhipeng Zhang1  

1AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University  
2Anyverse Dynamics  
3State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences  
4School of Artificial Intelligence, University of Chinese Academy of Sciences  
5Terminal Technology Department, Alipay, Ant Group

arXiv: https://arxiv.org/abs/2602.03782

QVLA provides a quantization workflow for VLA models: proxy sensitivity estimation, greedy gate assignment, and weight-only fake-quant injection for evaluation or checkpoint export.

📰 News

  • [Jan 26, 2026] Accepted to ICLR 2026.

📖 Abstract

The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Intuitively, low-bit quantization is a prevalent and preferred technique for large-scale model compression. However, we find that a systematic analysis of VLA model's quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-based methods, QVLA introduces a highly granular, channel-wise bit allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity when quantizing each individual channel to various bit-widths. This process yields a precise, per-channel importance metric that guides a global optimization, which elegantly unifies quantization and pruning (0-bit) into a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. In LIBERO, the quantized OpenVLA-OFT with QVLA requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup. This translates to a 22.6% performance improvement over SmoothQuant.

✨ Highlights

  • Channel-wise gates over {0,2,4,8,16} with a target global average bit-width.
  • Works on language_model.* and vision_backbone.* Linear/Conv2d layers.
  • Excludes projector.*, action_head, and language_model.lm_head.
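The include/exclude rules above amount to a name-based module filter. A minimal sketch of that idea (the exact matching logic used by the QVLA scripts is an assumption here):

```python
# Toy name-based filter mirroring the include/exclude bullets above.
# This is a sketch, not the repo's actual implementation.
INCLUDE_PREFIXES = ("language_model.", "vision_backbone.")
EXCLUDE_PREFIXES = ("projector.", "action_head", "language_model.lm_head")

def is_quantizable(module_name: str) -> bool:
    """Quantize modules under the included subtrees, skipping the
    projector, action head, and LM head."""
    if module_name.startswith(EXCLUDE_PREFIXES):
        return False
    return module_name.startswith(INCLUDE_PREFIXES)
```

In the real scripts the filter would additionally check that the module is a Linear or Conv2d layer.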

🖼️ Figure

Sensitivity Analysis

📁 Repository Layout

  • openvla/ OpenVLA source code and dependencies
  • openvla/qvla/ QVLA workflow scripts

⚙️ Installation

The examples below use OpenVLA as the backend.

# Create and activate conda environment
conda create -n openvla python=3.10 -y
conda activate openvla

# Install PyTorch. Update CUDA version to match your system.
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y  # UPDATE ME

# Install OpenVLA in editable mode
pip install -e openvla

# Minimal dependencies for QVLA scripts
pip install -r openvla/requirements-min.txt
pip install pillow tqdm

# (Optional) Flash Attention 2 for training
pip install packaging ninja
ninja --version; echo $?  # should return exit code 0
pip install "flash-attn==2.5.5" --no-build-isolation

🚀 Usage

All commands below are run from the repository root.

1) Build proxy sensitivity

python openvla/qvla/sensitivity_hessian_proxy.py \
  --pretrained_checkpoint path/to/openvla_checkpoint \
  --calib_jsonl path/to/calib.jsonl \
  --out_path out/proxy.pt \
  --bits 0,2,4,8
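Conceptually, a Hessian-diagonal proxy scores each channel by the second-order loss increase its quantization error would cause. A minimal sketch of that idea (not the script's actual implementation; the symmetric round-to-nearest quantizer and the exact scoring formula are assumptions):

```python
def quant_error(row, bits):
    """Per-weight error of symmetric round-to-nearest quantization.
    0 bits means pruning, so the error is the weight itself."""
    if bits == 0:
        return list(row)
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(x) for x in row) / qmax
    if scale == 0:
        return [0.0] * len(row)
    dequant = [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in row]
    return [x - d for x, d in zip(row, dequant)]

def channel_proxy(row, hessian_diag, bits):
    """Second-order proxy for one channel: sum_i H_ii * (delta w_i)^2."""
    return sum(h * e * e for h, e in zip(hessian_diag, quant_error(row, bits)))
```

Scoring every channel at every candidate bit-width yields the per-layer proxy tensors written to out/proxy.pt.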

2) Assign gates

python openvla/qvla/assign_gates_from_sensitivity.py \
  --proxy_pt out/proxy.pt \
  --bits 0,4,8,16 \
  --target_avg_bits 8.0 \
  --out_json out/greedy_bits.json
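The greedy idea: start every channel at the highest bit-width, then repeatedly demote whichever channel's next-lower bit-width adds the least proxy sensitivity, until the global average reaches the target. A toy sketch under that assumption (not the script's actual code):

```python
def greedy_assign(sensitivity, bits=(0, 4, 8, 16), target_avg=8.0):
    """sensitivity[c][b] = proxy loss of channel c quantized to b bits.
    Returns one bit-width per channel with average <= target_avg."""
    n = len(sensitivity)
    level = [len(bits) - 1] * n          # every channel starts at bits[-1]
    while sum(bits[l] for l in level) / n > target_avg:
        best_c, best_cost = None, float("inf")
        for c in range(n):
            if level[c] == 0:
                continue                 # already at the lowest bit-width
            lo, hi = bits[level[c] - 1], bits[level[c]]
            cost = sensitivity[c][lo] - sensitivity[c][hi]
            if cost < best_cost:
                best_c, best_cost = c, cost
        if best_c is None:
            break                        # nothing left to demote
        level[best_c] -= 1
    return [bits[l] for l in level]
```

Because 0 bits is one of the candidate levels, pruning falls out of the same greedy loop rather than needing a separate pass.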

3) Inject weight-only fake quant and save model

python openvla/qvla/inject_fake_w.py \
  --pretrained_checkpoint path/to/openvla_checkpoint \
  --gates_path out/greedy_bits.json \
  --out_dir out/openvla_qvla_fakew
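"Fake" weight quantization keeps weights in full precision but passes them through a quantize-dequantize round trip, so the forward pass sees the quantized values. A per-channel sketch of that round trip (the symmetric scheme and the 0/16-bit special cases are assumptions mirroring the gate set above):

```python
def fake_quant_row(row, bits):
    """Quantize-dequantize one output channel's weights.
    16 bits passes through; 0 bits prunes the channel to zeros."""
    if bits >= 16:
        return list(row)
    if bits == 0:
        return [0.0] * len(row)
    qmax = 2 ** (bits - 1) - 1           # e.g. 127 for 8-bit
    scale = max(abs(x) for x in row) / qmax
    if scale == 0:
        return list(row)
    return [max(-qmax - 1, min(qmax, round(x / scale))) * scale for x in row]

def fake_quant_weight(weight, gates):
    """Apply one gate (bit-width) per output channel of a weight matrix."""
    return [fake_quant_row(row, b) for row, b in zip(weight, gates)]
```

The injection script applies this kind of transform to each gated layer's weights and saves the result as a regular checkpoint.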

4) Evaluate on LIBERO

Set LIBERO_ROOT to your local LIBERO checkout if needed.

export LIBERO_ROOT=path/to/LIBERO

python openvla/qvla/run_eval_with_qvla_fakew.py \
  --pretrained_checkpoint path/to/openvla_checkpoint \
  --gates_path out/greedy_bits.json \
  --task_suite_name libero_spatial \
  --num_trials_per_task 1 \
  --local_log_dir out/rollouts_qvla

📦 Outputs

  • out/proxy.pt: per-layer proxy_{b} sensitivity tensors, one per candidate bit-width b
  • out/greedy_bits.json: channel-wise gate assignment
  • out/openvla_qvla_fakew/: exported fake-quantized checkpoint
  • out/rollouts_qvla/: evaluation logs
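To sanity-check a run, you can load out/greedy_bits.json and confirm the achieved average bit-width. The schema below ({layer_name: [per-channel bits]}) is an assumption; adjust to the file's actual layout:

```python
import json

def average_bits(path):
    """Average bit-width across all gated channels, assuming the file
    maps layer names to per-channel bit lists (hypothetical schema)."""
    with open(path) as f:
        gates = json.load(f)
    all_bits = [b for bits in gates.values() for b in bits]
    return sum(all_bits) / len(all_bits)

# e.g. print(average_bits("out/greedy_bits.json"))
```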

📚 Citation

@misc{xu2026qvlachannelsequalvisionlanguageaction,
      title={QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization}, 
      author={Yuhao Xu and Yantai Yang and Zhenyang Fan and Yufan Liu and Yuming Li and Bing Li and Zhipeng Zhang},
      year={2026},
      eprint={2602.03782},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.03782}, 
}
