
Conversation

@quic-xiyushi commented Oct 24, 2025

Overview

On-device sampling can significantly reduce host overhead and improve inference throughput; however, so far it has only been implemented for QEffForCausalLM models. This PR extends on-device sampling support to the language decoder of dual-QPC vision-language models, QEffCausalLMForTextImageToTextModel. In addition, it fixes a bug in the Gumbel noise computation so that random sampling correctly simulates drawing from a multinomial distribution.
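For context, the multinomial behavior relies on the Gumbel-max trick: perturbing the logits with Gumbel(0, 1) noise and taking the argmax is equivalent to drawing one sample from softmax(logits). A minimal PyTorch sketch of the trick (illustrative only, not the PR's exact code):

import torch

def gumbel_max_sample(logits: torch.Tensor) -> torch.Tensor:
    # u ~ Uniform(0, 1); clamp away from 0 so the double log stays finite
    u = torch.rand_like(logits).clamp_min(1e-10)
    gumbel_noise = -torch.log(-torch.log(u))
    # argmax over the noisy logits == one multinomial draw from softmax(logits)
    return torch.argmax(logits + gumbel_noise, dim=-1)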

Implementation details

class _QEffAutoModelForImageTextToTextDualQPC:
    def __init__(
        self,
        model: nn.Module,
        continuous_batching: bool = False,
        qaic_config: Optional[dict] = None,
        **kwargs,
    ):
        # Omitting unchanged parts
        self.lang_model = QEffCausalLMForTextImageToTextModel(model, qaic_config=qaic_config, **kwargs)

        # ---Sampling---
        # Note: SamplerTransform should be applied after all other transforms
        # are done. The role of the sampler is to just add nodes at the output
        # of the previous transform function.
        self.lang_model.model, _ = SamplerTransform.apply(self.lang_model.model, qaic_config, **kwargs)

Usage

Usage is similar to enabling on-device sampling for QEffForCausalLM.

from QEfficient import QEFFAutoModelForImageTextToText

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

qeff_model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_id,
    attn_implementation="eager",
    kv_offload=True,
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
    },
)
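At decode time, the compiled language decoder then consumes per-sequence sampling parameters as tensors (see the top_ps, min_ps, and random_numbers arguments in the review diff further down). A hypothetical sketch of what such inputs could look like for a batch of two sequences; the names top_ks, the shapes, and the dtypes here are assumptions for illustration, not the PR's exact interface:

import torch

batch_size = 2
# One value per sequence in the batch (shapes assumed for illustration)
top_ks = torch.tensor([[512], [40]], dtype=torch.int32)      # top-k cutoffs, bounded by max_top_k_ids
top_ps = torch.tensor([[1.0], [0.9]], dtype=torch.float32)   # nucleus (top-p) thresholds
min_ps = torch.tensor([[0.0], [0.05]], dtype=torch.float32)  # min-p filters
random_numbers = torch.rand(batch_size, 1)                   # per-sequence draws used for random sampling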

@quic-xiyushi force-pushed the on-device-sampling-vlm branch 2 times, most recently from af8e673 to df3501a on October 30, 2025 07:13
@quic-hemagnih (Contributor) left a comment:

Can you please add the CI test cases.

@quic-xiyushi force-pushed the on-device-sampling-vlm branch from df3501a to d722a5a on November 10, 2025 17:22
@quic-xiyushi force-pushed the on-device-sampling-vlm branch from d722a5a to e06e175 on November 10, 2025 17:25
quic-xiyushi and others added 2 commits on November 10, 2025 16:35
Contributor:

Can you add a test for the Intern model, i.e., a VLM in dual-QPC mode?

Author:

InternVL does not support the new generation interface yet; that support will be added later in PR #610. So instead of making changes to the legacy API, I added tests for Qwen2.5-VL, which supports continuous batching in the current mainline.

QEffGPTJForCausalLM,
QEffGraniteForCausalLM,
QEffGraniteMoeForCausalLM,
QEffInternDecoderWrapper,
Contributor:

Does this mean we are enabling sampling only for the Intern model?
Will other VLMs also be supported?

Author:

Other VLMs should also be supported, but currently only InternVL and Qwen2.5-VL have been tested.

@ochougul added the enhancement (New feature or request) label on Nov 12, 2025
@quic-xiyushi force-pushed the on-device-sampling-vlm branch from 8d00cb1 to 5e2afb7 on November 21, 2025 02:15
@quic-xiyushi (Author):

Can you please add the CI test cases.

@quic-hemagnih CI added. Please review this PR again. Thank you!

@quic-xiyushi force-pushed the on-device-sampling-vlm branch from 7d06a75 to a0716fa on November 25, 2025 22:19
top_ps: Optional[torch.Tensor] = None,
min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
Contributor:

Please add both vision_embeds and image_idx to the Args list in the docstring.

min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
image_idx: Optional[torch.Tensor] = None,
Contributor:

Please keep the dtype of these two consistent with lines 27-28, and update the function docstring for the newly added args.
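For reference, the requested docstring entries might look like the following; the descriptions are assumptions based on the dual-QPC design, not taken from the PR:

    vision_embeds (torch.Tensor, optional):
        Precomputed embeddings produced by the vision encoder QPC, consumed
        by the language decoder.
    image_idx (torch.Tensor, optional):
        Index of the position(s) at which the vision embeddings are inserted
        into the input sequence.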

@quic-mamta (Contributor):

Please resolve the conflicts.
