[Question] About ONNX export and TensorRT deployment of quantized models (W4A8) #7

@walker2026-v

Hi, thanks for your great work on QuantVLA!

I have two questions regarding the practical deployment pipeline:

ONNX export
Can the quantized model (with selective quantization layout + ATM/OHB scalars folded into dequantization scales) be exported to ONNX smoothly?
Are there any known issues with ops like Round, Clip, or the fused rotation matrices (from DuQuant) when exporting to ONNX? Do you have any recommended export flags or a sample script?
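To make the question concrete, here is a minimal pure-Python sketch of the pattern I have in mind: symmetric 4-bit quantization expressed as Round followed by Clip, with one extra scalar folded into the dequantization scale. The names `quantize`, `dequantize`, and `alpha` are my own placeholders; `alpha` only stands in for an ATM/OHB-style scalar and is not taken from the QuantVLA code.

```python
import math


def quantize(x, scale, bits=4):
    """Symmetric quantization: Round, then Clip to the signed integer range."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    # Python's round() rounds halves to even, the same rule as the ONNX Round op.
    q = round(x / scale)
    return max(qmin, min(qmax, q))


def dequantize(q, scale):
    """Map an integer code back to a real value."""
    return q * scale


weights = [0.31, -0.12, 0.05, -0.44]
scale = max(abs(w) for w in weights) / 7  # per-tensor scale for signed 4-bit
alpha = 0.5  # hypothetical extra scalar (placeholder for an ATM/OHB-style factor)

codes = [quantize(w, scale) for w in weights]

# Applying alpha after dequantization ...
separate = [alpha * dequantize(q, scale) for q in codes]
# ... matches folding alpha into the dequantization scale up front.
folded = [dequantize(q, alpha * scale) for q in codes]

assert all(math.isclose(a, b) for a, b in zip(separate, folded))
```

If the exported graph keeps this form, the scalar should not need its own multiply node, so my question is mainly whether Round and Clip survive export unchanged and how the rotation matrices end up being represented.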

TensorRT engine & inference speedup
After converting to a TensorRT engine (e.g., using INT8/FP16 mixed precision), can we actually achieve a measurable latency reduction compared to the FP16 baseline on edge GPUs (e.g., Jetson AGX Orin)?
In the paper, you mainly report memory savings. Could you share any rough inference time speedup numbers (e.g., ms/step or FPS) under W4A8 or W4A4 on edge-like hardware? If not yet measured, do you have any expectations based on your internal experiments?
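For reference, this is roughly how I would measure ms/step and FPS so that numbers are comparable across FP16 and quantized engines. It is only a sketch: `benchmark` and `run_step` are hypothetical names, and `run_step` is assumed to be a callable that runs one inference step and blocks until the result is ready (on a GPU that means synchronizing before returning, since kernel launches are asynchronous).

```python
import statistics
import time


def benchmark(run_step, warmup=10, iters=100):
    """Time a blocking callable and report median latency and throughput."""
    # Warm-up iterations are discarded so one-time setup costs do not skew results.
    for _ in range(warmup):
        run_step()

    samples_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_step()
        samples_ms.append((time.perf_counter() - start) * 1e3)

    median_ms = statistics.median(samples_ms)
    fps = 1e3 / median_ms if median_ms > 0 else float("inf")
    return {"ms_per_step": median_ms, "fps": fps}


def dummy_step():
    """Stand-in workload; replace with one real inference call."""
    return sum(range(10_000))


result = benchmark(dummy_step, warmup=5, iters=50)
print(f"{result['ms_per_step']:.3f} ms/step, {result['fps']:.1f} FPS")
```

Reporting the median rather than the mean keeps occasional scheduling spikes from dominating the figure, which matters on shared edge devices.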

Any guidance or known limitations would be very helpful. Thanks in advance!
