Yuhao Xu1
Yantai Yang1,2
Zhenyang Fan1
Yufan Liu3,4
Yuming Li5
Bing Li3
Zhipeng Zhang1
1AutoLab, School of Artificial Intelligence, Shanghai Jiao Tong University
2Anyverse Dynamics
3State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences
4School of Artificial Intelligence, University of Chinese Academy of Sciences
5Terminal Technology Department, Alipay, Ant Group
QVLA provides a quantization workflow for VLA models, including proxy sensitivity estimation, greedy gate assignment, and quantization for evaluation or checkpoint export.
- [Jan 26, 2026] Accepted to ICLR 2026.
The advent of Vision-Language-Action (VLA) models represents a significant leap for embodied intelligence, yet their immense computational demands critically hinder deployment on resource-constrained robotic platforms. Low-bit quantization is a prevalent and preferred technique for large-scale model compression; however, we find that a systematic analysis of VLA model quantization is fundamentally lacking. We argue that naively applying uniform-bit quantization from Large Language Models (LLMs) to robotics is flawed, as these methods prioritize passive data fidelity while ignoring how minor action deviations compound into catastrophic task failures. To bridge this gap, we introduce QVLA, the first action-centric quantization framework specifically designed for embodied control. In a sharp departure from the rigid, uniform-bit quantization of LLM-oriented methods, QVLA introduces a highly granular, channel-wise bit-allocation strategy. Its core mechanism is to directly measure the final action-space sensitivity of quantizing each individual channel to various bit-widths. This yields a precise, per-channel importance metric that guides a global optimization, unifying quantization and pruning (0-bit) in a single, cohesive framework. Extensive evaluations on different baselines demonstrate the superiority of our approach. On LIBERO, OpenVLA-OFT quantized with QVLA requires only 29.2% of the original model's VRAM while maintaining 98.9% of its original performance and achieving a 1.49x speedup, a 22.6% performance improvement over SmoothQuant.
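To make the channel-wise idea concrete, below is a minimal, self-contained sketch of per-channel fake quantization in which every output channel gets its own bit-width and a 0-bit gate degenerates into pruning. It is an illustrative toy, not the quantizer implemented by the QVLA scripts.

```python
import torch

def fake_quantize_channelwise(weight: torch.Tensor, bits_per_channel: list[int]) -> torch.Tensor:
    """Toy per-channel fake quantization: each output channel (row) gets its own bit-width.

    A 0-bit gate zeroes the channel (pruning); 16 bits is treated as full precision.
    Values stay in floating point; only the set of representable levels shrinks.
    """
    out = weight.clone()
    for c, bits in enumerate(bits_per_channel):
        if bits == 0:                        # 0-bit gate == channel pruning
            out[c].zero_()
        elif bits < 16:                      # symmetric uniform quantization
            qmax = 2 ** (bits - 1) - 1       # e.g. 7 for 4 bits
            scale = weight[c].abs().max().clamp(min=1e-8) / qmax
            out[c] = torch.round(weight[c] / scale).clamp(-qmax, qmax) * scale
    return out

# Toy example: 4 output channels with mixed gates drawn from {0, 2, 4, 8, 16}.
w = torch.randn(4, 8)
w_q = fake_quantize_channelwise(w, [0, 4, 8, 16])
```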
- Channel-wise gates over `{0, 2, 4, 8, 16}` with a target global average bit-width (see the greedy-assignment sketch after this list).
- Works on `language_model.*` and `vision_backbone.*` Linear/Conv2d layers.
- Excludes `projector.*`, `action_head`, and `language_model.lm_head`.
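As a rough illustration of how channel-wise gates can be chosen under a global average-bit budget, the sketch below greedily demotes the channel whose demotion costs the least sensitivity per bit saved. The `proxy` layout and the greedy rule are assumptions for illustration only; the actual logic lives in `assign_gates_from_sensitivity.py`.

```python
import torch

def greedy_assign_bits(proxy, bits=(0, 2, 4, 8, 16), target_avg_bits=8.0):
    """Toy greedy channel-wise bit assignment under a global average-bit budget.

    Assumed input: proxy[layer][b] is a 1-D tensor holding the sensitivity of each
    output channel of `layer` when quantized to b bits (one entry per bit-width,
    with the top bit-width costing ~0). Every channel starts at the largest
    bit-width; each step demotes the channel with the cheapest cost per bit saved.
    """
    levels = sorted(bits)
    # Per layer: index into `levels` currently assigned to each channel.
    state = {n: [len(levels) - 1] * len(next(iter(p.values()))) for n, p in proxy.items()}
    n_channels = sum(len(v) for v in state.values())

    def avg_bits():
        return sum(levels[i] for v in state.values() for i in v) / n_channels

    while avg_bits() > target_avg_bits:
        best = None  # (cost per saved bit, layer, channel)
        for name, assigned in state.items():
            for c, i in enumerate(assigned):
                if i == 0:
                    continue  # already at the lowest gate
                lo, hi = levels[i - 1], levels[i]
                cost = float(proxy[name][lo][c] - proxy[name][hi][c])
                cand = (cost / (hi - lo), name, c)
                if best is None or cand[0] < best[0]:
                    best = cand
        if best is None:
            break  # every channel is already at the lowest gate
        state[best[1]][best[2]] -= 1

    return {n: [levels[i] for i in v] for n, v in state.items()}

# Toy usage: two layers with 3 channels each; sensitivities shrink as bits grow.
proxy = {
    "layer_a": {b: torch.rand(3) / (b + 1) for b in (0, 2, 4, 8, 16)},
    "layer_b": {b: torch.rand(3) / (b + 1) for b in (0, 2, 4, 8, 16)},
}
print(greedy_assign_bits(proxy, target_avg_bits=6.0))
```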
- `openvla/`: OpenVLA source code and dependencies
- `openvla/qvla/`: QVLA workflow scripts
The examples below use OpenVLA as the backend.
```bash
# Create and activate conda environment
conda create -n openvla python=3.10 -y
conda activate openvla

# Install PyTorch. Update CUDA version to match your system.
conda install pytorch torchvision torchaudio pytorch-cuda=12.4 -c pytorch -c nvidia -y  # UPDATE ME

# Install OpenVLA in editable mode
pip install -e openvla

# Minimal dependencies for QVLA scripts
pip install -r openvla/requirements-min.txt
pip install pillow tqdm

# (Optional) Flash Attention 2 for training
pip install packaging ninja
ninja --version; echo $?  # should return exit code 0
pip install "flash-attn==2.5.5" --no-build-isolation
```

All commands below are run from the repository root.
```bash
python openvla/qvla/sensitivity_hessian_proxy.py \
    --pretrained_checkpoint path/to/openvla_checkpoint \
    --calib_jsonl path/to/calib.jsonl \
    --out_path out/proxy.pt \
    --bits 0,2,4,8
```
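The outputs section below describes `out/proxy.pt` as per-layer `proxy_{b}` tensors. Assuming it is a `torch.load`-able nested dict keyed by layer name and then by `proxy_{b}` (the exact key layout is an assumption), a quick inspection could look like this:

```python
import torch

# Peek at the saved sensitivity proxies (assumed layout: {layer_name: {"proxy_0": tensor, ...}}).
proxy = torch.load("out/proxy.pt", map_location="cpu")
for layer_name, entry in list(proxy.items())[:5]:
    shapes = {k: tuple(v.shape) for k, v in entry.items() if torch.is_tensor(v)}
    print(layer_name, shapes)
```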
```bash
python openvla/qvla/assign_gates_from_sensitivity.py \
    --proxy_pt out/proxy.pt \
    --bits 0,4,8,16 \
    --target_avg_bits 8.0 \
    --out_json out/greedy_bits.json
```
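Before injecting the gates, it can be useful to confirm that the assignment actually meets the requested budget. Assuming `out/greedy_bits.json` maps layer names to per-channel bit lists (the layout is an assumption), a quick check:

```python
import json

# Verify the achieved global average bit-width of the gate assignment
# (assumed layout: {layer_name: [bits for each output channel]}).
with open("out/greedy_bits.json") as f:
    gates = json.load(f)

all_bits = [b for channel_bits in gates.values() for b in channel_bits]
print(f"channels: {len(all_bits)}")
print(f"average bits: {sum(all_bits) / len(all_bits):.2f}")
print(f"pruned (0-bit) channels: {sum(b == 0 for b in all_bits)}")
```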
```bash
python openvla/qvla/inject_fake_w.py \
    --pretrained_checkpoint path/to/openvla_checkpoint \
    --gates_path out/greedy_bits.json \
    --out_dir out/openvla_qvla_fakew
```
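Fake quantization keeps weights in their original floating-point dtype but snaps each channel onto its assigned grid, so a channel gated to b bits should carry at most 2^b distinct values. Assuming the exported directory loads like a regular OpenVLA checkpoint via `transformers` (both this loading path and the layer chosen below are assumptions), a spot check might look like:

```python
import torch
from transformers import AutoModelForVision2Seq

# Assumption: the fake-quantized export loads like the original OpenVLA checkpoint.
model = AutoModelForVision2Seq.from_pretrained(
    "out/openvla_qvla_fakew", trust_remote_code=True, torch_dtype=torch.bfloat16
)

# A channel fake-quantized to b bits should show at most 2**b distinct weight values.
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear) and name.startswith("language_model."):
        levels = module.weight[0].float().unique().numel()
        print(f"{name}: ~{levels} distinct values in output channel 0")
        break
```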
Set `LIBERO_ROOT` to your local LIBERO checkout if needed.

```bash
export LIBERO_ROOT=path/to/LIBERO

python openvla/qvla/run_eval_with_qvla_fakew.py \
    --pretrained_checkpoint path/to/openvla_checkpoint \
    --gates_path out/greedy_bits.json \
    --task_suite_name libero_spatial \
    --num_trials_per_task 1 \
    --local_log_dir out/rollouts_qvla
```

Outputs:

- `out/proxy.pt`: per-layer `proxy_{b}` tensors
- `out/greedy_bits.json`: channel-wise gate assignment
- `out/openvla_qvla_fakew/`: exported fake-quantized checkpoint
- `out/rollouts_qvla/`: evaluation logs
```bibtex
@misc{xu2026qvlachannelsequalvisionlanguageaction,
      title={QVLA: Not All Channels Are Equal in Vision-Language-Action Model's Quantization},
      author={Yuhao Xu and Yantai Yang and Zhenyang Fan and Yufan Liu and Yuming Li and Bing Li and Zhipeng Zhang},
      year={2026},
      eprint={2602.03782},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2602.03782},
}
```