feat: support qwen3-next on npu device. #945

Draft
liyu119 wants to merge 7 commits into jd-opensource:main from liyu119:feat-qwen3-next-pr

Conversation

@liyu119
Contributor

@liyu119 liyu119 commented Feb 26, 2026

  1. Support the qwen3-next model on NPU.
  2. Add a linear attention cache.
  3. Add a Triton kernel API, which depends on the Triton kernel ops merged into torch_npu_ops.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request adds support for the qwen3-next model on NPU devices, introducing new model architecture files, custom kernels, and updates to KV cache management for linear attention. The changes are extensive and well-structured. However, I've identified a few critical issues related to incorrect memory allocation for the new caches and a constructor signature mismatch that would lead to compilation failure. These issues need to be addressed to ensure correctness and allow the code to compile.

Comment on lines +290 to +291
```cpp
int64_t head_k_dim = args_.linear_value_head_dim();
int64_t head_v_dim = args_.linear_key_head_dim();
```
Contributor


critical

There appears to be a variable naming swap here. head_k_dim is being initialized with linear_value_head_dim, and head_v_dim with linear_key_head_dim. This is likely to cause incorrect calculations for linear_ssm_slot_size and linear_conv_slot_size, leading to memory allocation errors or incorrect behavior. Please swap the initializations to match the variable names.

Suggested change

```diff
-int64_t head_k_dim = args_.linear_value_head_dim();
-int64_t head_v_dim = args_.linear_key_head_dim();
+int64_t head_k_dim = args_.linear_key_head_dim();
+int64_t head_v_dim = args_.linear_value_head_dim();
```

```cpp
    args_.linear_key_head_dim() * n_local_linear_v_heads_, args_.linear_conv_kernel_dim() - 1});
kv_cache_shape.emplace_back(std::vector<int64_t>{
    kv_cache_cap.n_blocks, n_local_linear_v_heads_, args_.linear_key_head_dim(),
    args_.linear_key_head_dim()});
```
Contributor


critical

The shape for the SSM cache appears to be incorrect. Both the third and fourth dimensions are set to args_.linear_key_head_dim(). The SSM state typically has dimensions corresponding to key and value head dimensions (k_dim, v_dim). It should likely be args_.linear_value_head_dim() for the last dimension to correctly represent the state.

Suggested change

```diff
-    args_.linear_key_head_dim()});
+    args_.linear_value_head_dim()});
```
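To make the dimension mix-up concrete, here is a minimal sketch of the slot-size arithmetic implied by the shapes quoted above. The function names are illustrative, not identifiers from this PR; only the shape layout comes from the review snippets.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative slot sizes derived from the shapes quoted in the review:
//   conv cache slot ~ key_head_dim * n_v_heads * (conv_kernel_dim - 1)
//   ssm  cache slot ~ n_v_heads * key_head_dim * value_head_dim
// (function and parameter names are assumptions for this sketch).
int64_t linear_conv_slot_size(int64_t head_k_dim,
                              int64_t n_v_heads,
                              int64_t conv_kernel_dim) {
  return head_k_dim * n_v_heads * (conv_kernel_dim - 1);
}

int64_t linear_ssm_slot_size(int64_t n_v_heads,
                             int64_t head_k_dim,
                             int64_t head_v_dim) {
  return n_v_heads * head_k_dim * head_v_dim;
}
```

For a hypothetical config with head_k_dim = 128 and head_v_dim = 64, using key_head_dim for both SSM dimensions allocates 128 x 128 per head instead of 128 x 64, so the bug would silently double the SSM cache for that config while going unnoticed whenever the two dims happen to be equal.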

Comment on lines +34 to +37
```cpp
KVCache(torch::Tensor key_cache,
        torch::Tensor value_cache,
        torch::Tensor conv_cache,
        torch::Tensor ssm_cache);
```
Contributor


critical

This new constructor for KVCache takes four torch::Tensor arguments. However, it is being called with five arguments (key_cache, value_cache, index_cache, conv_cache, ssm_cache) in xllm/core/runtime/worker_impl.cpp on line 148. This will cause a compilation error. The constructor should be updated to accept all five tensors to correctly initialize all cache types. The implementation in kv_cache.cpp will also need to be updated to initialize all five members.

```cpp
KVCache(torch::Tensor key_cache,
        torch::Tensor value_cache,
        torch::Tensor index_cache,
        torch::Tensor conv_cache,
        torch::Tensor ssm_cache);
```
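A self-contained sketch of the five-tensor constructor the review asks for. `Tensor` below is a stand-in for `torch::Tensor` so the example compiles without libtorch; the member names mirror the snippets in this thread, but the layout is otherwise an assumption.

```cpp
#include <cassert>
#include <utility>

struct Tensor { int id = -1; };  // stand-in for torch::Tensor

class KVCache {
 public:
  // Five-tensor constructor matching the call site quoted from
  // worker_impl.cpp: all cache tensors, including the new conv and
  // ssm caches for linear attention, are stored as members.
  KVCache(Tensor key_cache, Tensor value_cache, Tensor index_cache,
          Tensor conv_cache, Tensor ssm_cache)
      : key_cache_(std::move(key_cache)),
        value_cache_(std::move(value_cache)),
        index_cache_(std::move(index_cache)),
        conv_cache_(std::move(conv_cache)),
        ssm_cache_(std::move(ssm_cache)) {}

  const Tensor& conv_cache() const { return conv_cache_; }
  const Tensor& ssm_cache() const { return ssm_cache_; }

 private:
  Tensor key_cache_, value_cache_, index_cache_, conv_cache_, ssm_cache_;
};
```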

```cpp
}
#endif
kv_caches_.emplace_back(key_cache, value_cache, index_cache);
kv_caches_.emplace_back(key_cache, value_cache, index_cache, conv_cache, ssm_cache);
```
Contributor


critical

This line attempts to construct a KVCache object with five arguments. However, there is no matching constructor defined for KVCache that accepts five tensors. The newly added constructor in kv_cache.h only takes four arguments. This will result in a compilation error. Please ensure the KVCache class has a constructor that matches this call.

@@ -0,0 +1,44 @@
/* Copyright 2025 The xLLM Authors. All Rights Reserved.
Collaborator


The copyright year should be 2026, not 2025.

Collaborator


Place this file in models/llm/npu.

Collaborator


Put this under models/llm/; it is the generic torch graph-building code, not an ATB graph.

@yingxudeng yingxudeng marked this pull request as draft February 26, 2026 12:35

```cpp
// qwen3 next
PROPERTY(bool, attn_output_gate) = true;
PROPERTY(int32_t, full_attention_interval) = 4;
```
Contributor

@JC-ut0 JC-ut0 Feb 27, 2026


The default value of full_attention_interval should be set to 1, so that other models that don't have this config still behave correctly.
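The reviewer's point can be sketched as follows. The helper name and the "every Nth decoder layer uses full attention" interpretation are assumptions inferred from the config name, not code from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumed meaning of full_attention_interval: every Nth decoder layer
// uses full attention, the rest use linear attention. With a default
// of 1, every layer is full attention, so models that never set this
// config keep their previous (all-full-attention) behavior.
bool is_full_attention_layer(int32_t layer_idx,
                             int32_t full_attention_interval) {
  return (layer_idx + 1) % full_attention_interval == 0;
}
```

With a default of 4, as currently written, a model that never reads this config but is routed through shared layer-selection logic would treat three out of four layers as linear attention, which is why the reviewer asks for 1.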

@yingxudeng yingxudeng changed the title feat: support qwen3-next on npu device feat: support qwen3-next on npu device. Feb 27, 2026
```cpp
  return padded_qkvz;
}
std::vector<torch::Tensor> valid_batches;
int64_t bs = attn_metadata.query_start_loc.size(0);
```
Contributor


```
qwen3_next_gated_delta_net.cpp:418:32: error: ‘const struct xllm::layer::AttentionMetadata’ has no member named ‘query_start_loc’
  418 | int64_t bs = attn_metadata.query_start_loc.size(0);
```

```cpp
torch::Tensor& weight,
bool& weight_is_loaded);

void load_merged_weight_v2(const StateDict& state_dict,
```
Contributor Author


```cpp
#define DEFINE_MERGED_WEIGHT_V2(name) \
```

```cpp
std::vector<torch::Tensor> valid_batches;
int64_t bs = attn_metadata.query_start_loc.size(0);
int64_t max_len = attn_metadata.max_query_len;
const auto& ori_seq_lens = attn_metadata.query_start_loc;
```
Contributor


```
qwen3_next_gated_delta_net.cpp:420:46: error: ‘const struct xllm::layer::AttentionMetadata’ has no member named ‘query_start_loc’
  420 | const auto& ori_seq_lens = attn_metadata.query_start_loc;
```

Comment thread on xllm/models/llm/qwen3_next.h (outdated)
```cpp
}

 private:
  layer::Qwen3NextDecoderLayer decoder_layer_{nullptr};
```
Contributor


‘Qwen3NextDecoderLayer’ in namespace ‘xllm::layer’ does not name a type

@yingxudeng
Collaborator

In the previous MoE PR, the file xllm/core/layers/npu/fused_moe.cpp was put in the wrong place; it should live at xllm/xllm/core/layers/npu_torch/fused_moe.cpp. I'll move it later.

```cpp
}
#endif
kv_caches_.emplace_back(key_cache, value_cache, index_cache);
kv_caches_.emplace_back(key_cache, value_cache, index_cache, conv_cache, ssm_cache);
```
Copy link
Copy Markdown
Contributor


Why are there five arguments here?
