Hi, thanks for your great work on QuantVLA!
I have two questions regarding the practical deployment pipeline:
1. ONNX export
Can the quantized model (selective quantization layout, with the ATM/OHB scalars folded into the dequantization scales) be exported to ONNX cleanly?
Are there any known issues with ops such as Round or Clip, or with the fused rotation matrices (from DuQuant), when exporting to ONNX? Do you have any recommended export flags or a sample script?
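To make the export question concrete, here is a minimal sketch of the folding being asked about, in plain Python. It is an illustration under stated assumptions, not QuantVLA code: the helper names, the per-tensor scale, and the scalar `alpha` (a stand-in for an ATM/OHB-style scalar) are all hypothetical.

```python
# Hypothetical illustration only: names and values are assumptions,
# not taken from the QuantVLA code base.

def quantize(x, scale, bits=8):
    """Symmetric quantization: divide by scale, Round, then Clip to the signed range."""
    qmax = 2 ** (bits - 1) - 1
    qmin = -qmax - 1
    # Python's round() rounds halves to even, the same tie rule as the ONNX Round op.
    return [max(qmin, min(qmax, round(v / scale))) for v in x]

def dequantize(q, scale):
    """Map integers back to real values with a single multiply."""
    return [v * scale for v in q]

x = [0.31, -1.20, 0.05, 3.00]
scale = 0.02   # assumed per-tensor quantization scale
alpha = 0.5    # assumed stand-in for a calibration scalar

q = quantize(x, scale)

# Unfolded: dequantize, then apply the scalar as a separate multiply.
unfolded = [alpha * v for v in dequantize(q, scale)]

# Folded: the scalar is absorbed into the dequantization scale,
# so the exported graph would need no extra Mul node.
folded = dequantize(q, scale * alpha)

max_diff = max(abs(a - b) for a, b in zip(unfolded, folded))
```

If the folding works this way, the exported graph should only need Round, Clip, and a single Mul per dequantization, which is the part I would like to confirm.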
2. TensorRT engine & inference speedup
After converting to a TensorRT engine (e.g., using INT8/FP16 mixed precision), can we achieve a measurable latency reduction compared to the FP16 baseline on edge GPUs (e.g., Jetson AGX Orin)?
In the paper, you mainly report memory savings. Could you share any rough inference-time speedup numbers (e.g., ms/step or FPS) under W4A8 or W4A4 on edge-like hardware? If these have not been measured yet, do you have any expectations based on your internal experiments?
Any guidance or known limitations would be very helpful. Thanks in advance!