Add Qwen3VL Eagle3 Inference Support#17276

Open
ardenma wants to merge 2 commits into sgl-project:main from reve-ai:arden/qwen3vl-eagle3-support

Conversation


@ardenma ardenma commented Jan 18, 2026

Motivation

Adds EAGLE3 speculative decoding support for Qwen3VL. I have not extensively optimized benchmark performance; the accuracy/performance numbers at the bottom demonstrate that inference works.

Modifications

Adds a few things to the qwen3vl model:

  • get_embed_and_head
  • set_eagle3_layers_to_capture
  • wiring of the auxiliary hidden states into the logits_processor

Hidden-state capture starts after the deepstack injection to avoid a mismatch with the SpecForge training code: sglang captures the hidden states after hidden_states + residual, while SpecForge/HF captures them after hidden_states + residual + deepstack. We could probably fix the timing sync instead, but this approach is less error prone.
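To make the new hooks concrete, here is a minimal, hypothetical sketch of set_eagle3_layers_to_capture. The class is a toy stand-in, and the default low/mid/high layer choice is an illustrative assumption (modeled on the EAGLE3 convention used for other sglang models), not the actual Qwen3VL implementation:

```python
# Hypothetical sketch, not the real sglang code: ToyQwen3VL is a toy
# stand-in, and the default [2, n // 2, n - 3] layer choice is an
# assumption based on the EAGLE3 convention used elsewhere in sglang.
class ToyQwen3VL:
    def __init__(self, num_hidden_layers: int):
        self.num_hidden_layers = num_hidden_layers
        self.capture_aux_hidden_states = False
        self.layers_to_capture: list[int] = []

    def set_eagle3_layers_to_capture(self, layer_ids=None):
        # Enable auxiliary hidden-state capture for the EAGLE3 draft model.
        self.capture_aux_hidden_states = True
        if layer_ids is None:
            # Default: one low, one middle, and one high text layer.
            n = self.num_hidden_layers
            self.layers_to_capture = [2, n // 2, n - 3]
        else:
            self.layers_to_capture = list(layer_ids)


model = ToyQwen3VL(num_hidden_layers=36)
model.set_eagle3_layers_to_capture()
print(model.layers_to_capture)  # [2, 18, 33]
```

The states captured at these layers are what gets wired into the logits_processor as auxiliary hidden states.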

Accuracy Test

See sgl-project/SpecForge#431 for the specforge training implementation and results.

Benchmark & Profiling

SGLang command: python3 -m sglang.launch_server --model Qwen/Qwen3-VL-8B-Instruct --speculative-algorithm EAGLE3 --speculative-draft-model-path ../SpecForge/outputs/Qwen3-VL-8B-eagle3/epoch_11_step_140000 --speculative-num-steps 6 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --mem-fraction-static 0.75 --cuda-graph-max-bs 1 --tp 1 --trust-remote-code --host 0.0.0.0 --port 30000 --dtype bfloat16
Benchmark command: python3 benchmarks/bench_eagle3.py --model-path Qwen/Qwen3-VL-8B-Instruct --port 30000 --config-list 1,0,0,0 1,3,1,4 --benchmark-list mmstar:100 --skip-launch-server

{
    "mmstar": [
        {
            "batch_size": 1,
            "steps": null,
            "topk": null,
            "num_draft_tokens": null,
            "metrics": [
                {
                    "latency": 108.34728950168937,
                    "output_throughput": 172.75928254493292,
                    "accept_length": 2.1631803998613197,
                    "accuracy": 0.24,
                    "num_questions": 100,
                    "num_valid_predictions": 56,
                    "categorical_performance": null
                }
            ],
            "num_samples": 100
        }
    ]
}
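A quick cross-check of these numbers (my own back-of-envelope arithmetic, not part of the benchmark output): the reported latency, throughput, and accept length are mutually consistent.

```python
# Back-of-envelope cross-check of the benchmark JSON above; the input
# values are copied from the results, the arithmetic is mine.
latency_s = 108.34728950168937          # total wall-clock time
output_throughput = 172.75928254493292  # output tokens per second
accept_length = 2.1631803998613197      # mean tokens accepted per verify step

total_output_tokens = latency_s * output_throughput
# Each target-model verify step emits `accept_length` tokens on average,
# so the run needed roughly this many verify steps:
verify_steps = total_output_tokens / accept_length

print(round(total_output_tokens))  # 18718 output tokens
print(round(verify_steps))         # 8653 verify steps
```

An accept length of ~2.16 means each verification step of the target model yields a bit over two tokens on average, which bounds the achievable speedup over plain decoding.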

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist

Summary of Changes

Hello @ardenma, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates EAGLE3 speculative decoding capabilities into the Qwen3VL model within the SGLang framework. The changes primarily involve adapting the model's forward pass to manage auxiliary hidden states required by EAGLE3 and introducing new methods to control which intermediate layers' outputs are captured for this purpose. This enhancement aims to improve inference efficiency for Qwen3VL by leveraging speculative decoding.

Highlights

  • EAGLE3 Support for Qwen3VL: Introduces the necessary infrastructure to enable EAGLE3 speculative decoding for the Qwen3VL model.
  • Auxiliary Hidden State Handling: Modifies the model's forward pass to capture and pass auxiliary hidden states to the logits_processor, crucial for EAGLE3's operation.
  • New API Methods: Adds get_embed_and_head for retrieving embedding and head weights, and set_eagle3_layers_to_capture for dynamic configuration of layers whose hidden states are captured.
  • Benchmarking Results: Provides initial performance metrics for Qwen3-VL-8B-Instruct with EAGLE3 on the mmstar dataset, demonstrating functional inference.



@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for EAGLE3 speculative decoding to the Qwen3VL model. The changes correctly introduce the necessary mechanisms to capture and handle auxiliary hidden states from intermediate layers, which is a key requirement for EAGLE3. The implementation includes adding a control flag, modifying the forward pass to propagate the states, and adding helper methods for configuration. The code is well-structured and follows existing patterns in the codebase. I have one minor suggestion to improve code readability.

Comment thread: python/sglang/srt/models/qwen3_vl.py (outdated), lines +1033 to +1036
if hasattr(self.config, "text_config") and self.config.text_config is not None:
num_layers = self.config.text_config.num_hidden_layers
else:
num_layers = self.config.num_hidden_layers

Severity: medium

The logic to determine num_layers can be simplified by using a conditional expression to get the correct text_config first. This avoids the if/else block and makes the code more concise and easier to read.

            text_config = (
                self.config.text_config
                if hasattr(self.config, "text_config") and self.config.text_config is not None
                else self.config
            )
            num_layers = text_config.num_hidden_layers

@ardenma ardenma changed the title Add Qwen3VL Eagle3 Support Add Qwen3VL Eagle3 Inference Support Jan 18, 2026
@ardenma ardenma marked this pull request as ready for review January 18, 2026 00:45
@JustinTong0323

/tag-and-rerun-ci


liusy58 commented Jan 30, 2026

Great work!

@JustinTong0323

@ardenma Great work! Could you help check this feedback? #17935 (comment) Thanks~

@EanWang211123

Hi, thanks for this PR.

I’ve recently been trying speculative decoding on Qwen3-VL as well, and I observed the same issue in both my own fork and your PR: after enabling speculative decoding, the output becomes different from the baseline output.

Interestingly, I did not observe this issue on Qwen2.5-VL under similar testing, so I wanted to ask:

Have you also seen this behavior on Qwen3-VL (output mismatch when speculative decoding is enabled)?

Environment

  • GPU: 4090D
  • Models: Qwen3-VL-8B-Instruct and Qwen3-VL-8B-Instruct-Eagle3

Launch Commands

Baseline

CUDA_VISIBLE_DEVICES=1 \
python -m sglang.launch_server \
  --model-path /models/Qwen3-VL-8B-Instruct \
  --tp-size 2 \
  --dtype bfloat16 \
  --mem-fraction-static 0.75 \
  --cuda-graph-max-bs 32 \
  --context-length 40960 \
  --port 30000

EAGLE3 (speculative decoding enabled)

CUDA_VISIBLE_DEVICES=1 \
python -m sglang.launch_server \
  --model-path /models/Qwen3-VL-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /models/Qwen3-VL-8B-Instruct-Eagle3-bak/ \
  --tp-size 1 \
  --dtype bfloat16 \
  --mem-fraction-static 0.75 \
  --cuda-graph-max-bs 32 \
  --context-length 40960

Test Method

I used the test file from this PR diff:
https://github.com/sgl-project/sglang/pull/18387/changes#diff-37ee3735f2eae0c7f3bedf7931ccca202759474126de338a0dcfa6f6ef1e4c04

Result

Using a single MMStar sample, the outputs are completely inconsistent between baseline and EAGLE3 (including the final answer).

Baseline output (full)

Looking at the image, we can see a brown suitcase sitting inside what appears to be the trunk of a car. On top of the suitcase, there are various stickers. In front of the suitcase, there is an open magazine or book lying on the car's trunk floor.

Let's evaluate the options:

- **A: The suitcase is on the book.** — This is incorrect. The suitcase is behind the book/magazine, and the book is in front of the suitcase. The suitcase is not resting on top of the book.

- **B: The suitcase is beneath the cat.** — There is no cat visible in the image.

- **C: The suitcase is beneath the bed.** — There is no bed in the image.

- **D: The suitcase is beneath the book.** — This is incorrect. The book is in front of the suitcase, not above it. The suitcase is behind the book.

None of the options seem to describe the relationship correctly based on the image. However, if we must choose the *least incorrect* or the one that might be intended, we need to reconsider.

Actually, looking again, the book is lying *in front of* the suitcase, not *on top of* it. The suitcase is *behind* the book. So none of the options accurately describe the spatial relationship.

But let's re-examine option D: "The suitcase is beneath the book." This would mean the book is on top of the suitcase. In the image, the book is *in front* of the suitcase, not on top of it. So D is also incorrect.

Wait — perhaps there's a misinterpretation. The book is lying on the trunk floor, and the suitcase is sitting on the trunk floor behind it. So the suitcase is *behind* the book, not beneath it.

Given that, none of the options are correct. However, if we consider the book as being "in front of" the suitcase, and the suitcase is "behind" the book, then the suitcase is not beneath the book — it's behind it.

Perhaps the question is flawed, or perhaps we are meant to interpret "beneath" as "behind" in a loose sense, but that's not accurate.

Let me check the image again. The book is lying flat on the trunk floor, and the suitcase is sitting upright behind it. So the suitcase is not beneath the book — it's behind it.

Therefore, **none of the options are correct**

EAGLE3 output (full)

Looking at the image, we can see:

- A brown suitcase is sitting inside what appears to be the trunk of a car.
- A magazine (book) is lying on the floor of the trunk, partially underneath the suitcase.
- The suitcase is positioned on top of the magazine.

Therefore, the suitcase is beneath the book? No, that's incorrect. The suitcase is *on top of* the book.

Let's evaluate the options:

A: The suitcase is on the book. — This is incorrect. The suitcase is *on top of* the book, but the book is *underneath* the suitcase. The phrasing "on the book" is ambiguous, but in spatial relationships, "on" typically means directly above and resting on top of. However, looking at the options, this is the only one that describes the suitcase being in a position relative to the book.

B: The suitcase is beneath the cat. — There is no cat visible in the image.

C: The suitcase is beneath the bed. — There is no bed visible in the image.

D: The suitcase is beneath the book. — This is incorrect. The suitcase is *on top of* the book, not beneath it.

Actually, re-examining the image, the suitcase is resting on the magazine, which is lying on the floor. So the suitcase is *on* the magazine (the book). The book is *underneath* the suitcase.

Option A says "The suitcase is on the book." This is the most accurate description among the given choices, even if the wording is slightly imprecise. The suitcase is positioned on top of the book.

Therefore, the correct answer is **A: The suitcase is on the book.**

In my case, this is not just a stylistic difference — it changes the final answer.

If you have seen this as well, do you have any ideas about the root cause (e.g. VL-specific preprocessing / vision token handling / speculative verification path differences for Qwen3-VL)?
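One way to narrow this down is to find where the two transcripts first diverge: divergence at the very first token points at prompt or vision preprocessing differences, while a mid-sequence divergence points at the verification/sampling path. A minimal sketch of such a check (hypothetical helper, not from this PR):

```python
# Hypothetical helper for localizing baseline-vs-EAGLE3 output
# mismatches; the function name and the example strings are illustrative.
def first_divergence(a: str, b: str) -> int:
    """Index of the first differing character, or -1 if identical.
    If one string is a prefix of the other, returns the prefix length."""
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            return i
    return -1 if len(a) == len(b) else n


print(first_divergence(
    "Looking at the image, we can see a brown suitcase",
    "Looking at the image, we can see:",
))  # 32 -- the transcripts agree up to "...we can see", then diverge
```

In the transcripts above the divergence appears within the first sentence, which suggests the two paths disagree almost immediately rather than drifting late in the sequence.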

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants