Add Qwen3VL Eagle3 Inference Support #17276
Summary of Changes
Hello @ardenma, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request integrates EAGLE3 speculative decoding capabilities into the Qwen3VL model within the SGLang framework. The changes primarily involve adapting the model's forward pass to manage the auxiliary hidden states required by EAGLE3 and introducing new methods to control which intermediate layers' outputs are captured for this purpose. This enhancement aims to improve inference efficiency for Qwen3VL by leveraging speculative decoding.
Highlights
Code Review
This pull request adds support for EAGLE3 speculative decoding to the Qwen3VL model. The changes correctly introduce the necessary mechanisms to capture and handle auxiliary hidden states from intermediate layers, which is a key requirement for EAGLE3. The implementation includes adding a control flag, modifying the forward pass to propagate the states, and adding helper methods for configuration. The code is well-structured and follows existing patterns in the codebase. I have one minor suggestion to improve code readability.
if hasattr(self.config, "text_config") and self.config.text_config is not None:
    num_layers = self.config.text_config.num_hidden_layers
else:
    num_layers = self.config.num_hidden_layers
The logic to determine num_layers can be simplified by using a conditional expression to get the correct text_config first. This avoids the if/else block and makes the code more concise and easier to read.
text_config = self.config.text_config if hasattr(self.config, "text_config") and self.config.text_config is not None else self.config
num_layers = text_config.num_hidden_layers
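The same lookup can also be written as a small helper. This is a minimal sketch, not code from the PR: `resolve_text_config` is a hypothetical name, and the `SimpleNamespace` objects only stand in for real model configs.

```python
from types import SimpleNamespace


def resolve_text_config(config):
    """Return config.text_config when it exists and is not None, else config itself.

    Hypothetical helper for illustration; not part of the PR or of SGLang.
    """
    text_config = getattr(config, "text_config", None)
    return text_config if text_config is not None else config


# Stand-in configs (assumed shapes, for illustration only).
vl_config = SimpleNamespace(text_config=SimpleNamespace(num_hidden_layers=36))
text_only_config = SimpleNamespace(num_hidden_layers=28)

num_layers_vl = resolve_text_config(vl_config).num_hidden_layers
num_layers_text = resolve_text_config(text_only_config).num_hidden_layers
```

Using `getattr` with a default folds the `hasattr` check and the `None` check into one step, so the fallback to the top-level config is handled in a single place.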
/tag-and-rerun-ci
Great work!
@ardenma Great work! Could you help check this feedback? #17935 (comment) Thanks~
Hi, thanks for this PR. I’ve recently been trying speculative decoding on Qwen3-VL as well, and I observed the same issue in both my own fork and your PR: after enabling speculative decoding, the output differs from the baseline output. Interestingly, I did not observe this issue on Qwen2.5-VL under similar testing, so I wanted to ask: have you also seen this behavior on Qwen3-VL (output mismatch when speculative decoding is enabled)?
Environment
Launch Commands
Baseline
CUDA_VISIBLE_DEVICES=1 \
python -m sglang.launch_server \
--model-path /models/Qwen3-VL-8B-Instruct \
--tp-size 2 \
--dtype bfloat16 \
--mem-fraction-static 0.75 \
--cuda-graph-max-bs 32 \
--context-length 40960 \
--port 30000
EAGLE3 (speculative decoding enabled)
CUDA_VISIBLE_DEVICES=1 \
python -m sglang.launch_server \
--model-path /models/Qwen3-VL-8B-Instruct \
--speculative-algorithm EAGLE3 \
--speculative-draft-model-path /models/Qwen3-VL-8B-Instruct-Eagle3-bak/ \
--tp-size 1 \
--dtype bfloat16 \
--mem-fraction-static 0.75 \
--cuda-graph-max-bs 32 \
--context-length 40960
Test Method
I used the test file from this PR diff:
Result
Using a single MMStar sample, the outputs are completely inconsistent between baseline and EAGLE3 (including the final answer).
Baseline output (full)
EAGLE3 output (full)
In my case, this is not just a stylistic difference; it changes the final answer. If you have seen this as well, do you have any ideas about the root cause (e.g. VL-specific preprocessing, vision token handling, or speculative verification path differences for Qwen3-VL)?
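When comparing the two outputs, it can help to locate where they first diverge rather than only noting that the final answers differ. A minimal sketch, assuming both responses have already been collected as strings; `first_divergence` is a hypothetical helper, and the sample strings are placeholders for illustration, not the actual outputs from this test.

```python
def first_divergence(baseline: str, candidate: str) -> int:
    """Return the index of the first differing character, or -1 if the strings are equal."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    # One string is a strict prefix of the other: they diverge where the shorter ends.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return -1


# Placeholder strings (not real model outputs).
baseline_text = "The answer is B."
eagle3_text = "The answer is C."
idx = first_divergence(baseline_text, eagle3_text)
```

If token ids are available from both runs, applying the same comparison to the token sequences would be more precise than comparing decoded text.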
Motivation
Add support for EAGLE3 for Qwen3VL. I have not extensively tried to optimize benchmark performance, but the accuracy/performance numbers at the bottom are there to show that inference works.
Modifications
Adds a few things to the qwen3vl model:
- `get_embed_and_head`
- `set_eagle3_layers_to_capture`
- `logits_processor`
Starts hidden state capture after the deepstack injection to avoid a mismatch with the specforge training code. In sglang we capture the hidden states after `hidden_states + residual`, but specforge/hf captures after `hidden_states + residual + deepstack`. We could probably fix the timing sync, but this is less error prone.
Accuracy Test
See sgl-project/SpecForge#431 for the specforge training implementation and results.
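The capture-point mismatch described under Modifications can be illustrated with a toy loop. This is a sketch under stated assumptions, not the SGLang or SpecForge code: scalars stand in for tensors, the layer update is invented, and `run_layers`, `include_deepstack`, and the deepstack values are all hypothetical. In this toy model, a capture taken as `hidden_states + residual` differs from one taken as `hidden_states + residual + deepstack` only at layers that receive a deepstack injection.

```python
def run_layers(x, num_layers, deepstack, capture_ids, include_deepstack):
    """Toy decoder loop over scalars that records captures at the chosen layers.

    include_deepstack=False captures hidden + residual.
    include_deepstack=True captures hidden + residual + deepstack.
    The layer dynamics are identical in both modes; only the captured value changes.
    """
    hidden, residual = x, 0.0
    captured = {}
    for i in range(num_layers):
        inject = deepstack.get(i, 0.0)  # 0.0 for layers with no deepstack injection
        if i in capture_ids:
            value = hidden + residual
            if include_deepstack:
                value += inject
            captured[i] = value
        # Invented layer update, for illustration only.
        residual = hidden + residual + inject
        hidden = 0.5 * residual
    return captured


# Hypothetical setup: injections at the first three layers, captures at layers 1 and 4.
deepstack = {0: 1.0, 1: 2.0, 2: 3.0}
capture_ids = {1, 4}
without_ds = run_layers(1.0, 6, deepstack, capture_ids, include_deepstack=False)
with_ds = run_layers(1.0, 6, deepstack, capture_ids, include_deepstack=True)
```

Here the captures at layer 1 differ by the injected value, while the captures at layer 4 are identical, which is the reason capturing only after the injections avoids having to reconcile the two conventions.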
Benchmark & Profiling
Sglang command:
python3 -m sglang.launch_server --model Qwen/Qwen3-VL-8B-Instruct --speculative-algorithm EAGLE3 --speculative-draft-model-path ../SpecForge/outputs/Qwen3-VL-8B-eagle3/epoch_11_step_140000 --speculative-num-steps 6 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --mem-fraction-static 0.75 --cuda-graph-max-bs 1 --tp 1 --trust-remote-code --host 0.0.0.0 --port 30000 --dtype bfloat16
Benchmark command:
python3 benchmarks/bench_eagle3.py --model-path Qwen/Qwen3-VL-8B-Instruct --port 30000 --config-list 1,0,0,0 1,3,1,4 --benchmark-list mmstar:100 --skip-launch-server
Checklist
Review Process
/tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci