
Conversation

@quic-xiyushi commented Oct 24, 2025

Overview

On-device sampling can significantly reduce host overhead and improve inference throughput; however, so far it has only been implemented for QEffForCausalLM models. This PR extends on-device sampling support to the language decoder of dual-QPC vision-language models, QEffCausalLMForTextImageToTextModel. In addition, it fixes a bug in the Gumbel noise computation so that random sampling correctly simulates drawing from a multinomial distribution.
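For context, the multinomial behavior relies on the Gumbel-max trick: perturbing the logits with Gumbel(0, 1) noise and taking the argmax is equivalent to drawing one sample from softmax(logits). A minimal PyTorch sketch of the trick (illustrative only, not the PR's exact code):

import torch

def gumbel_max_sample(logits: torch.Tensor) -> torch.Tensor:
    # u ~ Uniform(0, 1); clamp away from 0 so the double log stays finite
    u = torch.rand_like(logits).clamp_min(1e-10)
    gumbel_noise = -torch.log(-torch.log(u))
    # argmax over the noisy logits == one multinomial draw from softmax(logits)
    return torch.argmax(logits + gumbel_noise, dim=-1)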

Implementation details

class _QEffAutoModelForImageTextToTextDualQPC:
    def __init__(
        self,
        model: nn.Module,
        continuous_batching: bool = False,
        qaic_config: Optional[dict] = None,
        **kwargs,
    ):
        # Omitting unchanged parts
        self.lang_model = QEffCausalLMForTextImageToTextModel(model, qaic_config=qaic_config, **kwargs)

        # ---Sampling---
        # Note: SamplerTransform should be applied after all other transforms
        # are done. The role of the sampler is to just add nodes at the output
        # of the previous transform function.
        self.lang_model.model, _ = SamplerTransform.apply(self.lang_model.model, qaic_config, **kwargs)

Usage

Usage is similar to enabling on-device sampling for QEffForCausalLM.

from QEfficient import QEFFAutoModelForImageTextToText

model_id = "Qwen/Qwen2.5-VL-3B-Instruct"

qeff_model = QEFFAutoModelForImageTextToText.from_pretrained(
    model_id,
    attn_implementation="eager",
    kv_offload=True,
    continuous_batching=True,
    qaic_config={
        "include_sampler": True,
        "return_pdfs": False,
        "max_top_k_ids": 512,
    },
)
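At decode time, the compiled language decoder then consumes per-sequence sampling parameters as tensors (see the top_ps, min_ps, and random_numbers arguments in the review diff further down). A hypothetical sketch of what such inputs could look like for a batch of two sequences; the names top_ks, the shapes, and the dtypes here are assumptions for illustration, not the PR's exact interface:

import torch

batch_size = 2
# One value per sequence in the batch (shapes assumed for illustration)
top_ks = torch.tensor([[512], [40]], dtype=torch.int32)      # top-k cutoffs, bounded by max_top_k_ids
top_ps = torch.tensor([[1.0], [0.9]], dtype=torch.float32)   # nucleus (top-p) thresholds
min_ps = torch.tensor([[0.0], [0.05]], dtype=torch.float32)  # min-p filters
random_numbers = torch.rand(batch_size, 1)                   # per-sequence draws used for random sampling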

@quic-xiyushi force-pushed the on-device-sampling-vlm branch 2 times, most recently from af8e673 to df3501a on October 30, 2025 07:13
@quic-hemagnih (Contributor) left a comment:

Can you please add the CI test cases.

@quic-xiyushi force-pushed the on-device-sampling-vlm branch from df3501a to d722a5a on November 10, 2025 17:22
@quic-xiyushi force-pushed the on-device-sampling-vlm branch from d722a5a to e06e175 on November 10, 2025 17:25
quic-xiyushi and others added 2 commits on November 10, 2025 16:35
Contributor:

Can you add a test for the Intern model, i.e., a VLM in dual-QPC mode?

Author:

InternVL does not support the new generation interface yet; that support will be added later in PR #610. So instead of making changes to the legacy API, I added tests for Qwen2.5-VL, which supports continuous batching in the current mainline.

QEffGPTJForCausalLM,
QEffGraniteForCausalLM,
QEffGraniteMoeForCausalLM,
QEffInternDecoderWrapper,
Contributor:

Does this mean we are enabling sampling only for the Intern model?
Will other VLMs also be supported?

Author:

Other VLMs should also be supported, but currently only InternVL and Qwen2.5-VL have been tested.

@ochougul added the enhancement (New feature or request) label on Nov 12, 2025
@quic-xiyushi force-pushed the on-device-sampling-vlm branch from 8d00cb1 to 5e2afb7 on November 21, 2025 02:15
@quic-xiyushi (Author):

Can you please add the CI test cases.

@quic-hemagnih CI added. Please review this PR again. Thank you!

@quic-xiyushi force-pushed the on-device-sampling-vlm branch from 7d06a75 to a0716fa on November 25, 2025 22:19
top_ps: Optional[torch.Tensor] = None,
min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
Contributor:

Please add both vision_embeds and image_idx to the Args list in the docstring.

min_ps: Optional[torch.Tensor] = None,
random_numbers: Optional[torch.Tensor] = None,
vision_embeds: Optional[torch.Tensor] = None,
image_idx: Optional[torch.Tensor] = None,
Contributor:

Please keep the dtype of these two consistent with lines 27-28, and update the function docstring for the newly added args.
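For reference, the requested docstring entries might look like the following; the descriptions are assumptions based on the dual-QPC design, not taken from the PR:

    vision_embeds (torch.Tensor, optional):
        Precomputed embeddings produced by the vision encoder QPC, consumed
        by the language decoder.
    image_idx (torch.Tensor, optional):
        Index of the position(s) at which the vision embeddings are inserted
        into the input sequence.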

@quic-mamta (Contributor):

Please resolve the conflicts.
