Add Qwen3VL Eagle3 Inference Support by ardenma · Pull Request #17276 · sgl-project/sglang

ardenma · 2026-01-18T00:20:01Z

Motivation

Add EAGLE3 support for Qwen3VL. I have not tried extensively to optimize benchmark performance; the accuracy/performance numbers at the bottom are only there to show that inference works.

Modifications

Adds a few things to the qwen3vl model:

- get_embed_and_head
- set_eagle3_layers_to_capture
- wiring of the aux hidden states into the logits_processor
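
As a rough illustration of what these two hooks do, here is a minimal sketch on a toy class. The PR diff is not reproduced on this page, so the class name `ToyEagle3Target`, the placeholder weights, the 36-layer count, and the default layer choice (one early, one middle, one late layer) are all assumptions for illustration, not the actual Qwen3VL implementation.

```python
from typing import List, Optional, Tuple


class ToyEagle3Target:
    """Toy stand-in for a target model; NOT the actual Qwen3VL class."""

    def __init__(self, num_hidden_layers: int):
        self.num_hidden_layers = num_hidden_layers
        # Placeholders for the real weight tensors (hypothetical names).
        self.embed_weight = "embed_tokens.weight"
        self.head_weight = "lm_head.weight"
        self.capture_aux_hidden_states = False
        self.layers_to_capture: List[int] = []

    def get_embed_and_head(self) -> Tuple[str, str]:
        # Exposes the target's embedding and LM head so a draft model can reuse them.
        return self.embed_weight, self.head_weight

    def set_eagle3_layers_to_capture(
        self, layer_ids: Optional[List[int]] = None
    ) -> None:
        # Turn on the control flag, then record which layers to capture.
        self.capture_aux_hidden_states = True
        if layer_ids is None:
            # Assumed default: one early, one middle, one late layer.
            n = self.num_hidden_layers
            self.layers_to_capture = [2, n // 2, n - 3]
        else:
            self.layers_to_capture = list(layer_ids)


model = ToyEagle3Target(num_hidden_layers=36)  # layer count is illustrative
model.set_eagle3_layers_to_capture()
print(model.layers_to_capture)
```

Passing an explicit list, for example `model.set_eagle3_layers_to_capture([1, 10, 20])`, overrides the assumed default in this sketch.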

Starts hidden state capture after the deepstack injection to avoid a mismatch with the SpecForge training code. SGLang normally captures hidden states after hidden_states + residual, whereas SpecForge/HF captures them after hidden_states + residual + deepstack. We could probably fix the timing so the two match, but capturing after the injection is less error-prone.
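
The timing difference described above can be sketched with a toy forward loop. This is only an illustration under simplifying assumptions: hidden states are single floats, each layer just adds 1.0 in place of hidden_states + residual, and the function name `toy_forward` and the deepstack values are hypothetical. It does not model SGLang's real layer indexing.

```python
from typing import Dict, Set


def toy_forward(
    num_layers: int,
    deepstack: Dict[int, float],
    capture_ids: Set[int],
    capture_after_deepstack: bool,
) -> Dict[int, float]:
    """Return the value captured at each requested layer."""
    hidden = 0.0
    captured: Dict[int, float] = {}
    for i in range(num_layers):
        hidden = hidden + 1.0  # stand-in for hidden_states + residual
        if i in capture_ids and not capture_after_deepstack:
            captured[i] = hidden  # capture before the injection
        hidden = hidden + deepstack.get(i, 0.0)  # deepstack injection
        if i in capture_ids and capture_after_deepstack:
            captured[i] = hidden  # capture after the injection
    return captured


deepstack = {0: 0.5, 1: 0.25, 2: 0.125}  # only early layers receive visual features
before = toy_forward(4, deepstack, {1}, capture_after_deepstack=False)
after = toy_forward(4, deepstack, {1}, capture_after_deepstack=True)
print(before, after)
```

In this toy the two captures differ by exactly the deepstack term of the captured layer, which is the kind of train/inference mismatch the change is meant to avoid.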

Accuracy Test

See sgl-project/SpecForge#431 for the specforge training implementation and results.

Benchmark & Profiling

Sglang command: python3 -m sglang.launch_server --model Qwen/Qwen3-VL-8B-Instruct --speculative-algorithm EAGLE3 --speculative-draft-model-path ../SpecForge/outputs/Qwen3-VL-8B-eagle3/epoch_11_step_140000 --speculative-num-steps 6 --speculative-eagle-topk 4 --speculative-num-draft-tokens 16 --mem-fraction-static 0.75 --cuda-graph-max-bs 1 --tp 1 --trust-remote-code --host 0.0.0.0 --port 30000 --dtype bfloat16
Benchmark command: python3 benchmarks/bench_eagle3.py --model-path Qwen/Qwen3-VL-8B-Instruct --port 30000 --config-list 1,0,0,0 1,3,1,4 --benchmark-list mmstar:100 --skip-launch-server

{
    "mmstar": [
        {
            "batch_size": 1,
            "steps": null,
            "topk": null,
            "num_draft_tokens": null,
            "metrics": [
                {
                    "latency": 108.34728950168937,
                    "output_throughput": 172.75928254493292,
                    "accept_length": 2.1631803998613197,
                    "accuracy": 0.24,
                    "num_questions": 100,
                    "num_valid_predictions": 56,
                    "categorical_performance": null
                }
            ],
            "num_samples": 100
        }
    ]
}
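
For reading the result above, a small sketch that derives the valid-prediction rate and the accuracy restricted to valid predictions. It assumes that "accuracy" is the number of correct answers divided by num_questions; that definition is not stated here, so treat the derived numbers as indicative only. The helper name `summarize` is hypothetical.

```python
import json

# Subset of the benchmark result shown above.
result = json.loads(
    """
    {"mmstar": [{"batch_size": 1, "metrics": [{
        "accept_length": 2.1631803998613197,
        "accuracy": 0.24,
        "num_questions": 100,
        "num_valid_predictions": 56
    }]}]}
    """
)


def summarize(metrics: dict) -> dict:
    n = metrics["num_questions"]
    valid = metrics["num_valid_predictions"]
    # Assumption: accuracy = correct / num_questions.
    correct = round(metrics["accuracy"] * n)
    return {
        "valid_rate": valid / n,
        "correct": correct,
        "accuracy_on_valid": correct / valid if valid else None,
    }


summary = summarize(result["mmstar"][0]["metrics"][0])
print(summary)
```

Under that assumption, a large share of the gap to a higher score comes from answers that could not be parsed as valid predictions rather than from wrong answers alone.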

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review Process

Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
After green CI and required approvals, ask Merge Oncalls to merge.

gemini-code-assist · 2026-01-18T00:20:16Z

Summary of Changes

Hello @ardenma, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates EAGLE3 speculative decoding capabilities into the Qwen3VL model within the SGLang framework. The changes primarily involve adapting the model's forward pass to manage auxiliary hidden states required by EAGLE3 and introducing new methods to control which intermediate layers' outputs are captured for this purpose. This enhancement aims to improve inference efficiency for Qwen3VL by leveraging speculative decoding.

Highlights

EAGLE3 Support for Qwen3VL: Introduces the necessary infrastructure to enable EAGLE3 speculative decoding for the Qwen3VL model.
Auxiliary Hidden State Handling: Modifies the model's forward pass to capture and pass auxiliary hidden states to the logits_processor, crucial for EAGLE3's operation.
New API Methods: Adds get_embed_and_head for retrieving embedding and head weights, and set_eagle3_layers_to_capture for dynamic configuration of layers whose hidden states are captured.
Benchmarking Results: Provides initial performance metrics for Qwen3-VL-8B-Instruct with EAGLE3 on the mmstar dataset, demonstrating functional inference.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request adds support for EAGLE3 speculative decoding to the Qwen3VL model. The changes correctly introduce the necessary mechanisms to capture and handle auxiliary hidden states from intermediate layers, which is a key requirement for EAGLE3. The implementation includes adding a control flag, modifying the forward pass to propagate the states, and adding helper methods for configuration. The code is well-structured and follows existing patterns in the codebase. I have one minor suggestion to improve code readability.

gemini-code-assist · 2026-01-18T00:21:41Z

+            if hasattr(self.config, "text_config") and self.config.text_config is not None:
+                num_layers = self.config.text_config.num_hidden_layers
+            else:
+                num_layers = self.config.num_hidden_layers


The logic to determine num_layers can be simplified by using a conditional expression to get the correct text_config first. This avoids the if/else block and makes the code more concise and easier to read.

text_config = self.config.text_config if hasattr(self.config, "text_config") and self.config.text_config is not None else self.config
num_layers = text_config.num_hidden_layers
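
A quick way to check that a simplification like this preserves behavior is to run both forms against the three config shapes that matter: a nested text_config, a text_config set to None, and no text_config attribute at all. The sketch below uses `SimpleNamespace` objects as stand-ins for real config classes, and the function names and the `getattr`-based variant are illustrative, not code from the PR.

```python
from types import SimpleNamespace


def num_layers_if_else(config) -> int:
    # Form used in the PR diff.
    if hasattr(config, "text_config") and config.text_config is not None:
        return config.text_config.num_hidden_layers
    return config.num_hidden_layers


def num_layers_resolved(config) -> int:
    # Resolve the text config first, then read the attribute once.
    text_config = getattr(config, "text_config", None)
    if text_config is None:
        text_config = config
    return text_config.num_hidden_layers


nested = SimpleNamespace(text_config=SimpleNamespace(num_hidden_layers=36))
none_text = SimpleNamespace(text_config=None, num_hidden_layers=28)
flat = SimpleNamespace(num_hidden_layers=24)

for cfg in (nested, none_text, flat):
    assert num_layers_if_else(cfg) == num_layers_resolved(cfg)
```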

JustinTong0323 · 2026-01-29T16:54:06Z

/tag-and-rerun-ci

liusy58 · 2026-01-30T07:22:23Z

Great work!

JustinTong0323 · 2026-01-31T14:27:25Z

@ardenma Great work! Could you help check this feedback? #17935 (comment) Thanks~

EanWang211123 · 2026-02-25T06:13:44Z

Hi, thanks for this PR.

I’ve recently been trying speculative decoding on Qwen3-VL as well, and I observed the same issue in both my own fork and your PR: after enabling speculative decoding, the output becomes different from the baseline output.

Interestingly, I did not observe this issue on Qwen2.5-VL under similar testing, so I wanted to ask:

Have you also seen this behavior on Qwen3-VL (output mismatch when speculative decoding is enabled)?

Environment

GPU: 4090D
Models: Qwen3-VL-8B-Instruct and Qwen3-VL-8B-Instruct-Eagle3

Launch Commands

Baseline

CUDA_VISIBLE_DEVICES=1 \
python -m sglang.launch_server \
  --model-path /models/Qwen3-VL-8B-Instruct \
  --tp-size 2 \
  --dtype bfloat16 \
  --mem-fraction-static 0.75 \
  --cuda-graph-max-bs 32 \
  --context-length 40960 \
  --port 30000

EAGLE3 (speculative decoding enabled)

CUDA_VISIBLE_DEVICES=1 \
python -m sglang.launch_server \
  --model-path /models/Qwen3-VL-8B-Instruct \
  --speculative-algorithm EAGLE3 \
  --speculative-draft-model-path /models/Qwen3-VL-8B-Instruct-Eagle3-bak/ \
  --tp-size 1 \
  --dtype bfloat16 \
  --mem-fraction-static 0.75 \
  --cuda-graph-max-bs 32 \
  --context-length 40960

Test Method

I used the test file from this PR diff:
https://github.com/sgl-project/sglang/pull/18387/changes#diff-37ee3735f2eae0c7f3bedf7931ccca202759474126de338a0dcfa6f6ef1e4c04

Result

Using a single MMStar sample, the outputs are completely inconsistent between baseline and EAGLE3 (including the final answer).

Baseline output (full)

Looking at the image, we can see a brown suitcase sitting inside what appears to be the trunk of a car. On top of the suitcase, there are various stickers. In front of the suitcase, there is an open magazine or book lying on the car's trunk floor.

Let's evaluate the options:

- **A: The suitcase is on the book.** — This is incorrect. The suitcase is behind the book/magazine, and the book is in front of the suitcase. The suitcase is not resting on top of the book.

- **B: The suitcase is beneath the cat.** — There is no cat visible in the image.

- **C: The suitcase is beneath the bed.** — There is no bed in the image.

- **D: The suitcase is beneath the book.** — This is incorrect. The book is in front of the suitcase, not above it. The suitcase is behind the book.

None of the options seem to describe the relationship correctly based on the image. However, if we must choose the *least incorrect* or the one that might be intended, we need to reconsider.

Actually, looking again, the book is lying *in front of* the suitcase, not *on top of* it. The suitcase is *behind* the book. So none of the options accurately describe the spatial relationship.

But let's re-examine option D: "The suitcase is beneath the book." This would mean the book is on top of the suitcase. In the image, the book is *in front* of the suitcase, not on top of it. So D is also incorrect.

Wait — perhaps there's a misinterpretation. The book is lying on the trunk floor, and the suitcase is sitting on the trunk floor behind it. So the suitcase is *behind* the book, not beneath it.

Given that, none of the options are correct. However, if we consider the book as being "in front of" the suitcase, and the suitcase is "behind" the book, then the suitcase is not beneath the book — it's behind it.

Perhaps the question is flawed, or perhaps we are meant to interpret "beneath" as "behind" in a loose sense, but that's not accurate.

Let me check the image again. The book is lying flat on the trunk floor, and the suitcase is sitting upright behind it. So the suitcase is not beneath the book — it's behind it.

Therefore, **none of the options are correct**

EAGLE3 output (full)

Looking at the image, we can see:

- A brown suitcase is sitting inside what appears to be the trunk of a car.
- A magazine (book) is lying on the floor of the trunk, partially underneath the suitcase.
- The suitcase is positioned on top of the magazine.

Therefore, the suitcase is beneath the book? No, that's incorrect. The suitcase is *on top of* the book.

Let's evaluate the options:

A: The suitcase is on the book. — This is incorrect. The suitcase is *on top of* the book, but the book is *underneath* the suitcase. The phrasing "on the book" is ambiguous, but in spatial relationships, "on" typically means directly above and resting on top of. However, looking at the options, this is the only one that describes the suitcase being in a position relative to the book.

B: The suitcase is beneath the cat. — There is no cat visible in the image.

C: The suitcase is beneath the bed. — There is no bed visible in the image.

D: The suitcase is beneath the book. — This is incorrect. The suitcase is *on top of* the book, not beneath it.

Actually, re-examining the image, the suitcase is resting on the magazine, which is lying on the floor. So the suitcase is *on* the magazine (the book). The book is *underneath* the suitcase.

Option A says "The suitcase is on the book." This is the most accurate description among the given choices, even if the wording is slightly imprecise. The suitcase is positioned on top of the book.

Therefore, the correct answer is **A: The suitcase is on the book.**

In my case, this is not just a stylistic difference — it changes the final answer.

If you have seen this as well, do you have any ideas about the root cause (e.g. VL-specific preprocessing / vision token handling / speculative verification path differences for Qwen3-VL)?
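
To narrow down where the two outputs start to differ, a small helper like the following may be useful. It is a character-level sketch with a hypothetical function name; a token-level comparison would need the model's tokenizer and is not shown.

```python
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Index of the first differing character, or None if the strings are equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One string is a strict prefix of the other.
        return min(len(a), len(b))
    return None


# Opening lines of the two outputs quoted above.
baseline = "Looking at the image, we can see a brown suitcase sitting inside"
eagle3 = "Looking at the image, we can see:\n\n- A brown suitcase is sitting inside"

idx = first_divergence(baseline, eagle3)
print(idx, repr(baseline[:idx]))
```

For the sample above, the shared prefix ends right after "we can see", so the outputs diverge within the first sentence rather than late in the generation.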

add eagle3 support for qwen3vl

25ef3f8

ardenma mentioned this pull request Jan 18, 2026

Add Qwen3VL Eagle3 Training Support sgl-project/SpecForge#431

Open

6 tasks

gemini-code-assist Bot reviewed Jan 18, 2026


ardenma changed the title ~~Add Qwen3VL Eagle3 Support~~ Add Qwen3VL Eagle3 Inference Support Jan 18, 2026

ardenma marked this pull request as ready for review January 18, 2026 00:45

lint

35cad09

JustinTong0323 mentioned this pull request Jan 29, 2026

[Bug] Qwen3-VL-30B-A3B (MoE) fails to start with speculative decoding: missing get_embed_and_head() / set_eagle3_layers_to_capture() in Qwen3VLMoeForConditionalGeneration (sglang v0.5.7) #17935

Closed

5 tasks

github-actions Bot added the run-ci label Jan 29, 2026

ardenma wants to merge 2 commits into sgl-project:main from reve-ai:arden/qwen3vl-eagle3-support
