feat: support Qwen3-next on npu device. #989

Merged
yingxudeng merged 24 commits into jd-opensource:main from JC-ut0:qwen-next
Mar 24, 2026
Conversation

@JC-ut0
Contributor

@JC-ut0 JC-ut0 commented Mar 4, 2026

  1. Support Qwen3-next on NPU devices and add a linear attention cache.
  2. Add the Triton kernel API, which depends on #891 (feat: adapt for CANN 8.5 and PyTorch 2.7.1 for npu device.) being merged.
  3. Modified from #945 (feat: support qwen3-next on npu device.) to resolve merge conflicts and bugs.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the 'Qwen next' model, involving extensive changes across the build system, environment setup, and core C++ components, including new layers, kernels, and model arguments. A critical security vulnerability has been identified where user-supplied data in RPC requests is validated using CHECK macros, creating a Denial of Service (DoS) attack vector by allowing malformed requests to crash worker processes. It is strongly recommended to replace these CHECK macros with proper error validation and return error statuses. Furthermore, a critical issue exists in the KV cache capacity estimation logic where variable names for key and value head dimensions are swapped, potentially leading to incorrect memory allocation and runtime failures.
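To make the head-dimension concern above concrete, here is a minimal sketch of a KV cache capacity estimate in which the key and value head dimensions are tracked as separately named fields. The `KvCacheConfig` struct and all function names are hypothetical and are not taken from xllm; the sketch covers only a standard key/value cache and ignores any extra state a linear attention cache would need.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical configuration; field names are illustrative, not xllm's.
struct KvCacheConfig {
  int64_t num_layers;
  int64_t num_kv_heads;
  int64_t key_head_dim;    // per-head size of the key vectors
  int64_t value_head_dim;  // per-head size of the value vectors
  int64_t dtype_bytes;     // e.g. 2 for fp16/bf16
};

// Bytes needed to cache the keys of one token across all layers.
int64_t KeyBytesPerToken(const KvCacheConfig& c) {
  return c.num_layers * c.num_kv_heads * c.key_head_dim * c.dtype_bytes;
}

// Bytes needed to cache the values of one token across all layers.
int64_t ValueBytesPerToken(const KvCacheConfig& c) {
  return c.num_layers * c.num_kv_heads * c.value_head_dim * c.dtype_bytes;
}

// Number of tokens that fit in the given memory budget.
// Returns 0 for a non-positive budget or a degenerate configuration.
int64_t MaxCacheableTokens(const KvCacheConfig& c, int64_t budget_bytes) {
  const int64_t per_token = KeyBytesPerToken(c) + ValueBytesPerToken(c);
  if (per_token <= 0 || budget_bytes <= 0) return 0;
  return budget_bytes / per_token;
}
```

When the key and value tensors are allocated separately, swapping the two dimensions sizes each tensor incorrectly even though their sum is unchanged, which is why the per-tensor helpers are kept apart here.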

Comment thread xllm/core/distributed_runtime/llm_engine.cpp Outdated
Comment thread xllm/core/distributed_runtime/worker_service.cpp Outdated
Comment thread xllm/core/distributed_runtime/worker_service.cpp
Comment thread xllm/core/distributed_runtime/worker_service.cpp Outdated
Comment thread xllm/core/distributed_runtime/worker_service.cpp
@yingxudeng yingxudeng marked this pull request as draft March 4, 2026 02:43
@JC-ut0 JC-ut0 changed the title [Draft] Support Qwen next [Draft] Support Qwen3-next on NPU device Mar 4, 2026
Comment thread xllm/core/layers/npu/fused_moe.cpp Outdated
Comment thread CMakeLists.txt Outdated
Comment thread CMakeLists.txt
@JC-ut0 JC-ut0 force-pushed the qwen-next branch 4 times, most recently from a3e3901 to 0bc39a0 March 5, 2026 09:11
@JC-ut0
Contributor Author

JC-ut0 commented Mar 5, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds support for the "Qwen3-next" model on NPU devices. A high-severity Denial of Service (DoS) vulnerability has been identified in the RPC handlers of the "WorkerService", where "CHECK" macros used for input validation can cause the worker process to abort on invalid input, allowing remote attackers to crash the worker. Additionally, two critical bugs were found in the cache allocation logic: a typo in the "SSM" cache shape definition and a copy-paste error when handling cache shapes in the worker service. These issues need to be addressed to ensure both correctness and security, specifically by replacing "CHECK" macros with graceful error handling.
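The recommendation above (validate request data and return an error rather than abort) could look roughly like the following sketch. Everything here is an assumption made for illustration: `CacheShapeRequest`, `Status`, `ValidateShape`, and `HandleAllocateCache` are hypothetical names, not the actual `WorkerService` types or RPC messages.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical request payload; stands in for the real RPC message.
struct CacheShapeRequest {
  std::vector<int64_t> key_shape;
  std::vector<int64_t> value_shape;
};

// Minimal status type: ok() is true when no error message is set.
struct Status {
  std::optional<std::string> error;
  bool ok() const { return !error.has_value(); }
  static Status Ok() { return Status{}; }
  static Status InvalidArgument(std::string msg) {
    return Status{std::move(msg)};
  }
};

// Rejects empty shapes and non-positive dimensions without aborting.
Status ValidateShape(const std::vector<int64_t>& shape,
                     const std::string& name) {
  if (shape.empty()) {
    return Status::InvalidArgument(name + " is empty");
  }
  for (int64_t dim : shape) {
    if (dim <= 0) {
      return Status::InvalidArgument(name + " has a non-positive dimension");
    }
  }
  return Status::Ok();
}

// Handler sketch: a malformed request yields an error status that the
// caller can turn into an RPC failure, so the worker process keeps running.
Status HandleAllocateCache(const CacheShapeRequest& request) {
  Status status = ValidateShape(request.key_shape, "key_shape");
  if (!status.ok()) return status;
  status = ValidateShape(request.value_shape, "value_shape");
  if (!status.ok()) return status;
  // The actual cache allocation would happen here.
  return Status::Ok();
}
```

Validating each shape by name also makes a copy-paste slip (checking the same field twice) easier to spot in review.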

Comment thread xllm/core/distributed_runtime/llm_engine.cpp Outdated
Comment thread xllm/core/distributed_runtime/worker_service.cpp Outdated
Comment thread xllm/core/distributed_runtime/worker_service.cpp Outdated
@JC-ut0 JC-ut0 changed the title [Draft] Support Qwen3-next on NPU device feat: Support Qwen3-next on NPU device Mar 10, 2026
@JC-ut0 JC-ut0 marked this pull request as ready for review March 10, 2026 01:45
@XuZhang99 XuZhang99 changed the title feat: Support Qwen3-next on NPU device feat: support Qwen3-next on npu device. Mar 12, 2026
@JC-ut0
Contributor Author

JC-ut0 commented Mar 14, 2026

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for the Qwen3-next model on NPU devices, which includes adding a linear attention cache. The changes are extensive, involving new model layers, kernels, and updates to the build system and data structures. My review identified a critical compilation error related to incorrect pointer access and a couple of high-severity issues where function signatures could lead to unexpected side effects by modifying input tensors. I have provided code suggestions to address these problems.
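The side-effect concern above can be illustrated with a signature that takes its input by const reference and returns a fresh output. This is only a sketch: `std::vector<float>` stands in for a tensor, `RmsNorm` is a hypothetical helper implementing the plain RMS normalization formula, and none of it is taken from the reviewed files.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Plain RMS normalization: y[i] = x[i] / sqrt(mean(x^2) + eps) * w[i].
// The input is taken by const reference, so the caller's data cannot be
// modified; the result is written to a newly allocated vector.
std::vector<float> RmsNorm(const std::vector<float>& input,
                           const std::vector<float>& weight,
                           float eps) {
  // Internal precondition on caller-side code, not on untrusted input.
  assert(!input.empty() && input.size() == weight.size());

  float sum_sq = 0.0f;
  for (float v : input) sum_sq += v * v;

  const float mean_sq = sum_sq / static_cast<float>(input.size());
  const float inv_rms = 1.0f / std::sqrt(mean_sq + eps);

  std::vector<float> output(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    output[i] = input[i] * inv_rms * weight[i];
  }
  return output;
}
```

With real tensor types that share storage when copied, a const reference alone may not prevent in-place writes, so the equivalent fix there is to avoid in-place operations on the argument or to clone it first.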

Comment thread xllm/core/distributed_runtime/comm_channel.cpp Outdated
Comment thread xllm/core/layers/common/qwen3_next_rms_norm.cpp
Comment thread xllm/core/layers/common/rms_norm_gated.cpp Outdated
Comment thread xllm/models/llm/qwen3_next.h Outdated
Comment thread xllm/models/llm/qwen3_next.h Outdated
Comment thread xllm/models/llm/qwen3_next.h Outdated
Comment thread xllm/core/kernels/CMakeLists.txt Outdated
Comment thread CMakeLists.txt Outdated
Comment thread xllm/core/framework/kv_cache/kv_cache.h
Comment thread xllm/core/framework/kv_cache/kv_cache.h Outdated
Comment thread xllm/core/distributed_runtime/llm_engine.cpp
Comment thread xllm/core/distributed_runtime/llm_engine.cpp
Comment thread xllm/core/layers/common/partial_rotary_embedding.cpp Outdated
Comment thread xllm/models/llm/qwen3_next.h Outdated
Comment thread xllm/models/llm/qwen3_next.h Outdated
Comment thread xllm/core/distributed_runtime/comm_channel.cpp Outdated
Comment thread xllm/core/distributed_runtime/comm_channel.cpp Outdated
Comment thread xllm/core/distributed_runtime/llm_engine.cpp
Comment thread xllm/core/distributed_runtime/llm_engine.cpp
Comment thread xllm/core/distributed_runtime/llm_engine.cpp
Comment thread xllm/core/distributed_runtime/worker_service.cpp
Comment thread xllm/core/distributed_runtime/worker_service.cpp Outdated
Comment thread xllm/core/framework/state_dict/utils.cpp
Comment thread xllm/core/layers/npu_torch/fused_moe.cpp
Comment thread xllm/core/distributed_runtime/comm_channel.cpp Outdated
Comment thread xllm/core/distributed_runtime/engine.h
Comment thread xllm/core/distributed_runtime/llm_engine.cpp Outdated
Comment thread xllm/core/distributed_runtime/worker_service.cpp Outdated
Comment thread xllm/core/framework/parallel_state/npu_process_group.cpp
Comment thread xllm/core/layers/common/attention_metadata_builder.cpp Outdated
Comment thread xllm/core/runtime/worker_impl.cpp
DragonFive previously approved these changes Mar 20, 2026
@yingxudeng
Collaborator

yingxudeng commented Mar 23, 2026

The CI/CD jobs keep hanging and then getting skipped. Re-triggering manually, hoping they run clean this time.

Collaborator

@yq33victor yq33victor left a comment

LGTM

@yingxudeng yingxudeng merged commit d135444 into jd-opensource:main Mar 24, 2026
49 of 81 checks passed
9 participants