
Onboarding Qwen3VL Dense#780

Draft
qcdipankar wants to merge 31 commits into quic:main from qcdipankar:qwen3_vl

Conversation

@qcdipankar
Contributor

Adding Qwen3VL Support to QEff

requires-python = ">=3.8,<3.11"
dependencies = [
"transformers==4.55.0",
"transformers==4.57.0",
Contributor

@quic-rishinr / @quic-hemagnih: can we trigger TA?

Contributor

Yes, we should raise it and start the run of all the models with 4.57 in parallel; it typically takes one week.

attention_mask, torch.tensor(MIN_MASKED_ATTENTION_VALUE, dtype=torch.float32), attn_weights
)

attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query.dtype)
Contributor

Can you set this to the dtype passed from `from_pretrained()`?
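For reference, the pattern under discussion can be sketched in plain Python. This is a minimal sketch only: the real code operates on torch tensors, and `MIN_MASKED_ATTENTION_VALUE` here is a hypothetical stand-in for the library constant.

```python
import math

# Hypothetical stand-in for the library's masked-attention fill constant.
MIN_MASKED_ATTENTION_VALUE = -1e4

def masked_softmax(scores, mask):
    """Fill masked positions with a large negative value, then take a
    numerically stable softmax in high precision (the float32 step in
    the quoted diff)."""
    filled = [s if keep else MIN_MASKED_ATTENTION_VALUE
              for s, keep in zip(scores, mask)]
    m = max(filled)                      # subtract max for stability
    exps = [math.exp(s - m) for s in filled]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([2.0, 1.0, 0.5], [True, True, False])
```

In the actual model, the float32 softmax result would then be cast back to the configured dtype (e.g. the one passed at load time) rather than a hard-coded `query.dtype`, which is the change the comment asks for.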

Contributor

@quic-hemagnih left a comment

I am still reviewing the modelling file.


messages = [messages] * batch_size

inputs = processor.apply_chat_template(
Contributor

I think we can combine the code from lines 62 to 77 and 122 to 140 in one place.

The idea is to avoid code repetition.
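To illustrate the suggestion, a hypothetical shared helper could hold the repeated block; the helper name and signature are illustrative, not from the PR.

```python
def build_inputs(processor, messages, batch_size, **kwargs):
    """Hypothetical shared helper: batch the chat messages once and call
    the processor's chat template in a single place, so both call sites
    reuse the same code path instead of repeating the block."""
    batched = [messages] * batch_size
    return processor.apply_chat_template(batched, **kwargs)
```

Both code regions the comment points at could then call this helper with their own keyword arguments.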

Contributor Author

We can discuss this.


qcdipankar and others added 7 commits February 16, 2026 13:12
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
@qcdipankar qcdipankar marked this pull request as draft February 19, 2026 09:16
Contributor

Could you add QEffQwen3VLDecoderWrapper here under SamplerTransform? The on-device sampling is generic, so it can support new VLMs. Thank you.

If not, we can also raise a new patch @quic-sanising

Contributor
Yes, please add this here @qcdipankar. Thanks!

@qcdipankar qcdipankar changed the base branch from main to qwen3_vl_mainline February 24, 2026 11:43
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dhiraj Kumar Sah <dhirajku@qti.qualcomm.com>
@qcdipankar qcdipankar force-pushed the qwen3_vl_mainline branch 3 times, most recently from 5bc0eb7 to 19a163b Compare March 2, 2026 06:37
@qcdipankar qcdipankar force-pushed the qwen3_vl_mainline branch 2 times, most recently from 46d18ab to 47dd748 Compare March 11, 2026 03:27
@qcdipankar qcdipankar changed the base branch from qwen3_vl_mainline to main March 17, 2026 05:42
@qcdipankar qcdipankar changed the base branch from main to qwen3_vl_mainline March 17, 2026 17:20
@qcdipankar qcdipankar changed the base branch from qwen3_vl_mainline to main March 17, 2026 17:21
@qcdipankar qcdipankar changed the base branch from main to qwen3_vl_mainline March 17, 2026 17:25
vjanfaza and others added 16 commits March 17, 2026 18:09
…gated Serving (quic#776)

This PR addresses the compilation error that occurs when we enable CCL during decode QPC generation of the gpt-oss model in
Disaggregated Serving, for example with the following command:
python3 -m qaic_disagg \
     --prefill-port 9802 \
     --decode-port 9902 \
     --port 8002 \
     --decode-device-group 16,17,18,19 \
     --prefill-device-group 20,21,22,23 \
     --model openai/gpt-oss-20b \
     --prefill-max-num-seqs 1 \
     --decode-max-num-seqs 1 \
     --prefill-max-seq-len-to-capture 128 \
     --max-model-len 4096 \
--prefill-override-qaic-config "split_retained_state_io:True
mxfp6_matmul:True enable_chunking:True" \
--decode-override-qaic-config "mxfp6_matmul:True retain_full_kv:True
ccl_enabled=True comp_ctx_lengths_decode=1024,2048,4096" \
     -vvv \
     --dtype bfloat16 \
     --kv-cache-dtype mxint8 \
     --kv-handOff-port 5068 \
     --tool-call-parser openai \
     --enable-auto-tool-choice \
     --enable-log-outputs 

We are activating CCL during decoding; however, this causes the compilation
error "Error message: No input that uniquely identifies specialization".
The source of this error is recent changes in the
modeling_gpt_oss.py script that were made to support disaggregated
serving in gpt-oss but conflict with the CCL feature.

---------

Signed-off-by: Vahid Janfaza <vjanfaza@qti.qualcomm.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
Granitemoe export issue fixed and added to CI.

---------

Signed-off-by: Ann <akuruvil@qti.qualcomm.com>
Co-authored-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com>
…) (quic#779)

Needed for passing custom config via vllm.

---------

---------

Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Signed-off-by: Mamta Singh <mamtsing@qti.qualcomm.com>
Co-authored-by: Mamta Singh <mamtsing@qti.qualcomm.com>
This feature adds support for exporting a proxy model, which disables
the Embedding Layer and LM Head of the model.

Set `enable_proxy = True` to export the proxy model.
Set `write_io = True` to save input/output files during the generation
stage.

Refer to the example script for implementation details.
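A minimal sketch of the proxy idea described above, assuming illustrative attribute names (`embed_tokens`, `lm_head`) rather than the PR's actual module names:

```python
class Identity:
    """Pass-through stand-in for a disabled layer."""
    def __call__(self, x):
        return x

def make_proxy(model, enable_proxy=False):
    """Hypothetical sketch of the proxy-export idea: with enable_proxy=True,
    the embedding layer and LM head are replaced by pass-throughs, so the
    exported graph keeps only the transformer trunk. Attribute names here
    are illustrative."""
    if enable_proxy:
        model.embed_tokens = Identity()
        model.lm_head = Identity()
    return model
```

With `enable_proxy=False`, the model is returned unchanged, matching the default behavior described above.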

## Testing

1. Text Models
2. Embedding Models
3. Vision Models
4. Audio Models

Note: see the example script covering each of these.

---------

Signed-off-by: Abukhoyer Shaik <abukhoye@qti.qualcomm.com>
Gemma3 NPI file update. With the new file, gemma_updated_npi.yaml, the MMMU
metric is met.

---------

Signed-off-by: Hem Agnihotri <quic_hemagnih@quicinc.com>
Minor updates for better rendering in FT docs

---------

Signed-off-by: Ann Kuruvilla <akuruvil@qti.qualcomm.com>
Automated daily PR dashboard that generates a report of all open pull
requests and emails it to a configured list of recipients.

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Updated the SMTP server

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Removed the Git workflow and email test changes as we are moving to a
Jenkins-based approach

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Updating the QEff Python version to 3.12 while still keeping support for
3.10 and 3.11.

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Co-authored-by: Hem Agnihotri <hemagnih@qti.qualcomm.com>
**Adding disagg support to Qwen3Moe**

> Config used

PL = 128

CL = 128 * 3

<img width="726" height="1077" alt="image"
src="https://github.com/user-attachments/assets/7b9afa00-8505-4df5-9a91-68b55e89b416"
/>

---------

Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Summary
- Keep `use_onnx_subfunctions` disabled by default in
`QEfficient.cloud.infer`
- Provide explicit opt-in via `--use-onnx-subfunctions` only
- Remove `--no-use-onnx-subfunctions`
- Update infer unit tests for explicit-enable and default-disabled
behavior
- Update quick-start and text-generation docs to reflect explicit opt-in
behavior

Why
- Align infer behavior with reviewer feedback to keep defaults unchanged
and avoid model-specific auto-enable behavior.

Fixes
- Fixes quic#702

Validation
- `python -m py_compile QEfficient/cloud/infer.py
tests/cloud/test_infer.py`
- `ruff check QEfficient/cloud/infer.py tests/cloud/test_infer.py`
- `pytest -q tests/cloud/test_infer.py -m "not on_qaic"` (2 passed, 5
deselected)
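The explicit opt-in semantics summarized above can be sketched with argparse; this is a sketch of the behavior, not the actual infer.py code.

```python
import argparse

def build_parser():
    """Sketch of the opt-in flag: disabled by default, enabled only by
    passing --use-onnx-subfunctions explicitly; there is deliberately no
    --no-use-onnx-subfunctions variant."""
    p = argparse.ArgumentParser()
    p.add_argument(
        "--use-onnx-subfunctions",
        action="store_true",
        default=False,
        help="Explicitly enable ONNX subfunctions during export.",
    )
    return p
```

With a `store_true` action and `default=False`, omitting the flag leaves the feature off, which matches the default-disabled behavior the unit tests check.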

---------

Signed-off-by: jd316 <jd316biswas@gmail.com>
Removed the following packages from pyproject.toml:
multidict==6.0.4
urllib3<2

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Pytest unit tests designed as a preflight check before submitting a PR. They
run fully on CPU and focus on module-level testing, transformation
correctness, and accuracy comparison between HF, transformed HF, and ORT
for representative models.

---------

Signed-off-by: Rishin Raj <rishinr@qti.qualcomm.com>
Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
Co-authored-by: vbaddi <vbaddi@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Dipankar Sarkar <quic_dipankar@quicinc.com>
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
Signed-off-by: Onkar Chougule <ochougul@qti.qualcomm.com>
Co-authored-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
@qcdipankar qcdipankar changed the base branch from qwen3_vl_mainline to main March 17, 2026 19:02
Signed-off-by: Dipankar Sarkar <dipankar@qti.qualcomm.com>
