Added fp16/bf16 based export and compile support for VLMs by asmigosw · Pull Request #819 · quic/efficient-transformers

asmigosw · 2026-03-02T08:10:47Z

Added fp16/bf16 based export and compile support for VLMs

QEfficient/base/modeling_qeff.py

QEfficient/transformers/models/internvl/modeling_internvl.py

QEfficient/transformers/models/llama4/modeling_llama4.py

vbaddi · 2026-03-04T04:58:05Z

QEfficient/transformers/models/modeling_auto.py

                retained_state=True,
                specializations=specializations["lang"],
-                convert_to_fp16=True,
+                convert_to_fp16=(DTYPE_TO_STRING_MAP[needed_dtype] == "float16"),


nit: why is this condition needed? Is it required for AI200? @quic-rishinr

This condition is required in case the user wants bf16 support, which will come with AI200. I have updated the code to set convert_to_fp16 = True when the passed dtype is either fp16 or fp32.
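The rule described in this reply can be sketched as follows. This is only an illustration, not the code merged in the PR: `should_convert_to_fp16` and `FP16_CONVERTIBLE` are hypothetical names, and dtypes are written as plain strings instead of `torch.dtype` objects so the snippet runs without PyTorch.

```python
# Hypothetical helper illustrating the rule from the reply above:
# request the fp16 conversion for fp16 and fp32 models, skip it for bf16.
FP16_CONVERTIBLE = {"float16", "float32"}


def should_convert_to_fp16(dtype_name: str) -> bool:
    """Return True when the compile step should be asked to convert to fp16."""
    return dtype_name in FP16_CONVERTIBLE


for name in ("float16", "float32", "bfloat16"):
    print(name, should_convert_to_fp16(name))
```

Keeping the decision in one small predicate means every call site that passes `convert_to_fp16` would apply the same rule.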

tests/transformers/models/test_causal_lm_models.py

QEfficient/utils/generate_inputs.py

QEfficient/transformers/models/modeling_auto.py

QEfficient/transformers/models/llama_swiftkv/modeling_llama_swiftkv.py

QEfficient/transformers/models/internvl/modeling_internvl.py

QEfficient/transformers/models/llama4/modeling_llama4.py

QEfficient/transformers/models/molmo/modeling_molmo.py

QEfficient/transformers/models/modeling_auto.py

vbaddi · 2026-03-11T07:56:41Z

QEfficient/base/modeling_qeff.py

            self.model, transformed = transform.apply(self.model)
            any_transformed = any_transformed or transformed

+        self._normalize_torch_dtype()


nit: does this take care of embedding and ASR models too?

vbaddi · 2026-03-11T08:01:00Z

tests/transformers/models/image_text_to_text/test_image_text_to_text_models.py

+        "allenai/Molmo-7B-D-0924",
+        "meta-llama/Llama-3.2-11B-Vision-Instruct",
+    ]:
+        pytest.skip("Test skipped for this model due to some issues.")


nit: with our dummy configs, can we run all sample LM models with this test quickly?

quic-rishinr · 2026-03-13T08:57:46Z

QEfficient/transformers/models/modeling_auto.py

            torch.nn.functional.pad(inputs["input_values"], (0, self.seq_len - input_ids_len), "constant", 0)
        )
+        needed_dtype = self.model.config.torch_dtype
+        input_values = input_values.astype(CUSTOM_IO_DTYPE_MAP[needed_dtype])


Since the inputs are in numpy format, we should be using Torch_to_numpy_map, right?

QEfficient/transformers/models/gemma3/modeling_gemma3.py

quic-rishinr · 2026-03-13T09:18:13Z

QEfficient/transformers/models/mixtral_moe/modeling_mixtral.py

        router_logits = self.gate(hidden_states)
-
-        routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
+        routing_weights = F.softmax(router_logits, dim=1, dtype=self.gate.weight.dtype)


Can we update the softmax to be in original precision?
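To get a feel for what is at stake in this question, the sketch below compares a softmax computed in Python's native double precision with one where every intermediate result is rounded to IEEE half precision. It is a rough standard-library approximation, not PyTorch's fp16 kernel: `to_fp16`, `softmax`, and the sample `router_logits` values are all made up for illustration.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def softmax(logits, round_fn=lambda v: v):
    """Numerically stable softmax; round_fn is applied after every operation."""
    m = max(logits)
    exps = [round_fn(math.exp(round_fn(v - m))) for v in logits]
    total = 0.0
    for e in exps:
        total = round_fn(total + e)
    return [round_fn(e / total) for e in exps]


# Made-up router logits for a four-expert gate.
router_logits = [2.5, 1.0, 0.25, -1.5]
full = softmax(router_logits)
half = softmax(router_logits, round_fn=to_fp16)
max_abs_diff = max(abs(a - b) for a, b in zip(full, half))
print(f"max abs difference: {max_abs_diff:.2e}")
```

A small difference on a toy input does not guarantee that token outputs match on a real model, so the comparison against reference tokens in the tests remains the deciding check.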

QEfficient/transformers/models/gpt2/modeling_gpt2.py

QEfficient/transformers/models/grok_1/modeling_grok1.py

quic-rishinr · 2026-03-13T09:32:40Z

QEfficient/transformers/models/gemma3/modeling_gemma3.py

    def forward(hidden_states: torch.Tensor, weight: torch.Tensor, epsilon: float):
-        hidden_states = hidden_states.to(torch.float32)
-        div_first = hidden_states * torch.rsqrt(torch.tensor(hidden_states.shape[-1], dtype=torch.float32))
+        div_first = hidden_states * torch.rsqrt(torch.tensor(hidden_states.shape[-1]))


RMS norm would create an issue if we run it in default precision, similar to softmax. We should verify whether it's causing the issue; if it is, we should revert this.

I tested with changing softmax to default, and it ran successfully.

QEfficient/transformers/models/olmo2/modeling_olmo2.py

QEfficient/transformers/models/modeling_auto.py

QEfficient/base/modeling_qeff.py

QEfficient/transformers/models/modeling_auto.py

… and inference. Almost all LLMs can now be compiled and run for inference in fp16. The test_causal_lm_models script uses the following notation to record how the tests went: # means the model wasn't tested due to its size, so it is not known whether it will run through or show an accuracy mismatch. ## means the outputs match for fp16 and things worked fine. ### means outputs are produced but don't match the HF tokens properly. #### means the model is quantized and additional effort is needed to enable it. These commits cover almost all LLMs currently supported. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Enabled CI tests for fp16 based LMs, embedding and sequence classification models. Modified CI based config for LLM tests. Embedding models have a high MAD for fp16 exported models (~0.015). Certain CausalLMs cause a token mismatch after a few tokens for the fp16 setup. The Whisper model has a clip operator issue for fp16 exported models, so it's not enabled yet. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added a try/except around the dtype casting of model weights after loading, since GPTQ-type models don't allow such a conversion. Fixed a few dtype-related issues for audio-based models. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>
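The guarded cast described in this commit message might look roughly like the sketch below. Everything here is hypothetical: `FakeModel`, `FakeQuantizedModel`, and `cast_if_possible` are illustrative stand-ins, and the exception types a real GPTQ model raises are an assumption, not something the PR text states.

```python
class FakeModel:
    """Hypothetical stand-in for an ordinary floating-point model."""

    def __init__(self):
        self.dtype = "float32"

    def to(self, dtype):
        self.dtype = dtype
        return self


class FakeQuantizedModel:
    """Hypothetical stand-in for a model whose weights cannot be recast."""

    dtype = "int4-packed"

    def to(self, dtype):
        raise TypeError("quantized weights cannot be cast")


def cast_if_possible(model, dtype):
    """Try to cast the model; keep it unchanged if the cast is rejected."""
    try:
        return model.to(dtype), True
    except (TypeError, ValueError, RuntimeError) as exc:
        # Assumed exception types; adjust to what the real model raises.
        print(f"skipping dtype cast: {exc}")
        return model, False


plain, plain_ok = cast_if_possible(FakeModel(), "float16")
quant, quant_ok = cast_if_possible(FakeQuantizedModel(), "float16")
print(plain.dtype, plain_ok, quant.dtype, quant_ok)
```

Returning a flag alongside the model lets the caller record whether the requested dtype actually took effect instead of failing silently.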

…both in bfloat16. Added a patch in cloud infer to map bfloat16 or 11 key type to np.float16 for AI200 inference. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

asmigosw requested review from ochougul, quic-amitraj, quic-hemagnih and quic-rishinr as code owners March 2, 2026 08:10

quic-rishinr marked this pull request as draft March 2, 2026 09:55

vbaddi requested changes Mar 4, 2026

quic-rishinr requested changes Mar 11, 2026

vbaddi requested changes Mar 11, 2026

quic-rishinr marked this pull request as ready for review March 13, 2026 08:34

quic-dhirajku force-pushed the custom_dtype branch from d84668e to c658e0f Compare March 13, 2026 09:07

quic-rishinr requested changes Mar 13, 2026

asmigosw force-pushed the custom_dtype branch from a0db0d6 to 9cd7729 Compare March 13, 2026 12:09

quic-rishinr force-pushed the custom_dtype branch from 1e15c1a to 9df3b31 Compare March 17, 2026 17:31

asmigosw and others added 17 commits March 18, 2026 05:49

Added fp16/bf16 based export and compile for InternVL Model

f76e043

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Ruff format

edef200

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added bf16/fp16/fp32 support for mistral3

853d999

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added changes for Llama4

ebda0e8

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Ruff check

28d6499

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added custom dtype support for Molmo

aa659cb

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added custom dtype support for llava_next

848577e

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Ruff format

8da9eac

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added custom_dtype support for Qwen2_5_vl

41addfa

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added custom_dtype support for mllama

c274445

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Ruff format

fd9a5a7

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added custom_dtype support for Gemma3

2ad6706

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

BF16 changes to be used

0326bd0

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added custom dtype support for llava

64fa655

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Ruff format

4f81aa8

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Updated logits to dtype float32

97b515a

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

asmigosw and others added 16 commits March 18, 2026 05:49

Updated the test file

43b351c

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added custom_dtype support for wav2vec2

e938886

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Comments Addressed

9cdcb08

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Addressed Comments

60a088f

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Updated custom fp16 models for causalLM

8d21947

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Updated needed_dtype to handle edge cases

ae00b72

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Removed grok config from CI models list

fc9e39b

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Fixed grok1 model CI issue, added the custom config back

d376e7b

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Added default dtype for string and None case

2b19a39

Signed-off-by: asmigosw <asmigosw@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Updating QAIC LLM Test Time

541f9df

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

Replaced some model configs for quicker CI tests for LLMs

babf230

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com> Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

CI failures addressed

61422cb

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

removing comments

2915e49

Signed-off-by: Asmita Goswami <asmigosw@qti.qualcomm.com>

asmigosw force-pushed the custom_dtype branch from 9df3b31 to 8680d1a Compare March 18, 2026 05:51

quic-dhirajku added 2 commits March 18, 2026 08:16

Added additional check to default bf16 model dtype and pkv cache dtyp…

a9494aa

…e to False for _compile when appropriate params are missing. Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>

Undo unit test for HL API Tests.

fc45acd

Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
