Conversation
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
for more information, see https://pre-commit.ci
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Thank you for the PR! Could you help verify all inferences (vLLM, Transformers 4, and Transformers 5) before merging?
for more information, see https://pre-commit.ci
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Quantize: inference with Transformers 5.1.0
vLLM tests are currently blocked because the latest vLLM version depends on an outdated Transformers release. Qwen3-Omni requires Transformers >= 5.1.0 to address several known issues.
Pull request overview
Adds quantization support for the Qwen3-Omni MoE model family by integrating model-specific loading/version gating, calibration forward behavior for thinker/talker, and custom multimodal block discovery.
Changes:
- Added an explicit Transformers version guard for qwen3_omni_moe (a hedged sketch of such a gate follows this list).
- Introduced Qwen3-Omni processor/template registration and model-specific multimodal block name discovery.
- Implemented a Qwen3-Omni-specific forward path to run thinker (and optionally talker) during calibration.
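For context, here is a minimal sketch of what such a version gate can look like. The function name and error message are illustrative assumptions; only the >= 5.1.0 requirement comes from this PR:

```python
# Hypothetical version gate for qwen3_omni_moe; not the PR's actual code.
import transformers
from packaging import version

def check_qwen3_omni_version() -> None:
    """Fail fast if the installed Transformers cannot load Qwen3-Omni MoE."""
    installed = version.parse(transformers.__version__)
    if installed < version.parse("5.1.0"):
        raise ImportError(
            f"qwen3_omni_moe requires transformers >= 5.1.0, but {installed} is installed."
        )
```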
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Summary per file:
| File | Description |
|---|---|
| pyproject.toml | Adds a project-specific word to the `typos` tool's allowlist. |
| auto_round/utils/model.py | Adds Transformers version guard and adjusts lm_head discovery logic. |
| auto_round/utils/common.py | Adds _no_split_modules normalization (sketched after this table) and extends multimodal ignore-key lists. |
| auto_round/special_model_handler.py | Adds Qwen3-Omni special forward + block discovery + ignore-layer rule. |
| auto_round/compressors/shard_writer.py | Improves tie_word_embeddings lookup for nested multimodal configs. |
| auto_round/compressors/mllm/utils.py | Extends multimodal ignore-key list for Qwen3-Omni components. |
| auto_round/compressors/mllm/template.py | Registers a Qwen3-Omni model template with the new processor. |
| auto_round/compressors/mllm/processor.py | Adds a custom processor for Qwen3-Omni chat-template inputs. |
| auto_round/compressors/base.py | Imports the new normalization helper. |
| auto_round/auto_scheme/utils.py | Uses normalized _no_split_modules when dispatching across devices. |
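As a rough illustration of the `_no_split_modules` normalization listed for `auto_round/utils/common.py`, here is a hedged sketch; the helper name and the traversal over sub-models are assumptions, not the PR's actual code:

```python
# Hypothetical normalization of _no_split_modules for nested multimodal
# models; the helper name and traversal are assumptions.
import torch.nn as nn

def normalize_no_split_modules(model: nn.Module) -> list[str]:
    """Flatten _no_split_modules declared as None, str, or list,
    including declarations on nested sub-models (e.g. thinker/talker)."""
    names: set[str] = set()
    for module in (model, *model.children()):
        value = getattr(module, "_no_split_modules", None)
        if value is None:
            continue
        names.update([value] if isinstance(value, str) else value)
    return sorted(names)
```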
| ) | ||
|
|
||
| # Run talker forward if available (for calibration purposes) | ||
| if hasattr(model, "talker") and model.has_talker: |
This can raise AttributeError when model.has_talker doesn’t exist (the hasattr only checks talker). Use getattr(model, "has_talker", False) (and optionally also ensure model.talker is not None) to make this guard safe.
```diff
-if hasattr(model, "talker") and model.has_talker:
+if getattr(model, "has_talker", False) and getattr(model, "talker", None) is not None:
```
```python
# Use text projection to convert thinker embeddings to talker space
if hasattr(model.talker, "text_projection"):
    # Get thinker embeddings
    thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
    talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
```
This path assumes input_ids is provided; if calibration runs with inputs_embeds (or other modalities without input_ids), this will throw and then be silently ignored (due to the broad except), meaning the talker forward never runs. Consider deriving inputs from inputs_embeds when present, or projecting from thinker_output.hidden_states[-1] (which you already compute) instead of re-embedding input_ids.
```diff
-# Use text projection to convert thinker embeddings to talker space
-if hasattr(model.talker, "text_projection"):
-    # Get thinker embeddings
-    thinker_embeds = model.thinker.get_input_embeddings()(input_ids)
-    talker_inputs_embeds = model.talker.text_projection(thinker_embeds)
+# Use text projection to convert thinker hidden states to talker space
+if hasattr(model.talker, "text_projection"):
+    # Project thinker hidden states directly into the talker embedding space
+    talker_inputs_embeds = model.talker.text_projection(thinker_hidden)
```
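Putting both suggestions together, here is a hedged sketch of the calibration-time talker forward; the talker call signature and the use of `hidden_states[-1]` are assumptions layered on the snippets above, not the PR's final code:

```python
# Hedged sketch combining both review suggestions; the talker call
# signature is an assumption, not verified against the PR.
import torch

@torch.no_grad()
def run_talker_for_calibration(model, thinker_output):
    # getattr with a default never raises, unlike hasattr + attribute access.
    if getattr(model, "has_talker", False) and getattr(model, "talker", None) is not None:
        if hasattr(model.talker, "text_projection"):
            # Project the thinker's last hidden states instead of re-embedding
            # input_ids, so this path also works when calibration supplies
            # inputs_embeds or other modalities without input_ids.
            thinker_hidden = thinker_output.hidden_states[-1]
            talker_inputs_embeds = model.talker.text_projection(thinker_hidden)
            return model.talker(inputs_embeds=talker_inputs_embeds)
    return None
```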
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Description
This update adds quantization support for Qwen3-Omni by integrating a custom MLLM processor and template, implementing dedicated forward logic for thinker/talker calibration, and introducing model-specific block discovery.
Note: This feature requires Transformers >= 5.1.0, as earlier versions contain compatibility issues with Qwen3-Omni.
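For readers unfamiliar with the registration pattern, here is a generic, hypothetical sketch of a processor/template registry like the one this PR extends; none of these names are auto_round's real API, and the ChatML-style user format is assumed from the Qwen family:

```python
# Generic, hypothetical registry sketch; MLLMTemplate, register_template,
# and Qwen3OmniProcessor stand in for auto_round's internals.
from dataclasses import dataclass

@dataclass
class MLLMTemplate:
    model_type: str   # key used to look the template up
    processor: type   # builds chat-template inputs for calibration
    format_user: str  # per-turn user prompt format

TEMPLATES: dict[str, MLLMTemplate] = {}

def register_template(template: MLLMTemplate) -> None:
    TEMPLATES[template.model_type] = template

class Qwen3OmniProcessor:
    """Placeholder for the PR's custom chat-template processor."""

register_template(
    MLLMTemplate(
        model_type="qwen3_omni_moe",
        processor=Qwen3OmniProcessor,
        format_user="<|im_start|>user\n{content}<|im_end|>\n",  # assumed ChatML style
    )
)
```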
Type of Change
Related Issues
#1387
Fixes or relates to #
Checklist Before Submitting