
Conversation

@yingxudeng (Collaborator)

No description provided.

#elif defined(USE_CUDA)
cuda::act_and_mul(params.output, params.input, params.act_mode);
#else
LOG(FATAL) << "active not implemented";
Collaborator:

Remove torch::Tensor active_tensor(ActivationParams& params) and add params.output = npu::active(params.input, params.act_mode) here for the NPU device.

Collaborator (Author):

auto output = torch::empty(
    {batch_size,
     intermediate_size_ / parallel_args_.tp_group_->world_size()},
    gate_up.options());

This is a good modification. However, as written, the calling code still pre-allocates the output. NPU operators typically allocate their own output and return it, so this unavoidable difference still forces the external calling code to use an #if block to skip the allocation for the NPU case.

To standardize the external calling code, I would recommend aligning with the NPU behavior: allocate the output inside the operator wrapper/layer and return it. That gives a unified code structure for all external calls.
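
A minimal sketch of that proposal, assuming a single wrapper named active in the shared ops layer (the CUDA output shape and the wrapper signature are assumptions for illustration, not taken from the PR):

torch::Tensor active(ActivationParams& params) {
#if defined(USE_NPU)
  // NPU kernels allocate and return their own output tensor.
  return npu::active(params.input, params.act_mode);
#elif defined(USE_CUDA)
  // For CUDA, the allocation moves from the caller into the wrapper.
  // act_and_mul halves the last dimension (gate/up split); shape assumed here.
  auto output = torch::empty(
      {params.input.size(0), params.input.size(-1) / 2},
      params.input.options());
  cuda::act_and_mul(output, params.input, params.act_mode);
  return output;
#else
  LOG(FATAL) << "active not implemented";
  return {};
#endif
}

Callers would then reduce to a single call on every backend, with no per-device allocation logic.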

Collaborator:

So don't add the two functions active_tensor and fused_layernorm_tensor to ops_api.h, because no other platform will use such an API. Put them in npu_ops_api.h and call them directly in the NPU layer.
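
A rough sketch of that split; the signatures below are assumptions for illustration, not the PR's actual declarations:

// npu_ops_api.h -- NPU-only entry points, called directly from the NPU layers.
#pragma once
#include <string>
#include <torch/torch.h>

namespace npu {
// Signatures assumed for illustration only.
torch::Tensor active(const torch::Tensor& input, const std::string& act_mode);
torch::Tensor fused_layernorm(const torch::Tensor& input,
                              const torch::Tensor& weight,
                              double eps);
}  // namespace npu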

Collaborator (Author):

(screenshot omitted) Regarding the code snippet above: if we implement the changes as suggested, we would need to introduce #if directives here to skip the memory allocation, since the NPU operator handles it internally.

Could we instead consider moving the memory allocation logic for MLU and CUDA into their respective kernel wrappers? This would make the behavior more similar to PyTorch and allow us to unify the calling code here.

(PS: I haven't modified the CUDA or MLU code yet.)

#endif
}

torch::Tensor fused_layernorm_tensor(FusedLayerNormParams& params) {
Collaborator:

same as above

@yingxudeng (Collaborator, Author), Dec 2, 2025:

Similar to the previous comment.

yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch 7 times, most recently from 277c5fb to 1da759f, on December 5, 2025 at 13:16.
@XuZhang99 (Collaborator):

For the activation ops on NPU, revert this commit (refactor: standardize interface for active kernel execution.), and this is all you need to do:

#elif defined(USE_NPU)
  // make params.output become a null tensor
  params.output = torch::Tensor();
  params.output = npu::active(params.input, params.act_mode);
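
For context, combined with the CUDA branch quoted at the top of this thread, the dispatch would then read roughly as follows (a sketch; only the NPU branch is the new part):

#elif defined(USE_CUDA)
  cuda::act_and_mul(params.output, params.input, params.act_mode);
#elif defined(USE_NPU)
  // release the caller-provided buffer, then let the NPU op allocate its own output
  params.output = torch::Tensor();
  params.output = npu::active(params.input, params.act_mode);
#else
  LOG(FATAL) << "active not implemented";
#endif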

@yingxudeng (Collaborator, Author), Dec 5, 2025:

#474 (comment) Regarding GPU/MLU code, this has no performance impact, as it simply utilizes pre-allocated memory space. However, for NPU code, repeated calls to operations such as RMS normalization may lead to frequent memory allocations and immediate deallocations within the external framework. Could this pattern of repeatedly allocating and promptly discarding memory potentially affect performance on NPU architectures?

yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch from 28e6e79 to a0382bb on December 5, 2025 at 16:15.
@XuZhang99 (Collaborator), Dec 6, 2025:

> #474 (comment) Regarding GPU/MLU code, this has no performance impact, as it simply utilizes pre-allocated memory space. However, for NPU code, repeated calls to operations such as RMS normalization may lead to frequent memory allocations and immediate deallocations within the external framework. Could this pattern of repeatedly allocating and promptly discarding memory potentially affect performance on NPU architectures?

in dense_mlp.cpp:

torch::Tensor output;
if (Device::type != "npu") {
    output = torch::empty(
        {batch_size,
         intermediate_size_ / parallel_args_.tp_group_->world_size()},
        gate_up.options());
}

btw, you need to learn more about memory management in torch.

@yingxudeng (Collaborator, Author):

> #474 (comment) Regarding GPU/MLU code, this has no performance impact, as it simply utilizes pre-allocated memory space. However, for NPU code, repeated calls to operations such as RMS normalization may lead to frequent memory allocations and immediate deallocations within the external framework. Could this pattern of repeatedly allocating and promptly discarding memory potentially affect performance on NPU architectures?
>
> in dense_mlp.cpp:
>
>     torch::Tensor output;
>     if (Device::type != "npu") {
>         output = torch::empty(
>             {batch_size,
>              intermediate_size_ / parallel_args_.tp_group_->world_size()},
>             gate_up.options());
>     }
>
> btw, you need to learn more about memory management in torch.

Thank you for your review. Moving forward, I will replace the unavoidable #if defined macros with runtime checks based on Device::type. As for my description of memory allocation and deallocation, please feel free to disregard it; it was translated by an LLM and is not entirely accurate. In short, for NPU models I want to avoid having an external output = torch::empty.

yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch 3 times, most recently from 87bd6dc to 0dd081d, on December 8, 2025 at 06:18.

namespace atb {
atb::Tensor at_tensor_to_atb_tensor(const at::Tensor at_tensor) {
static std::map<at::ScalarType, aclDataType> dtype_map = {
Collaborator:

Use unordered_map, and maybe we can move dtype_map out of the function.
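
A sketch of that suggestion, assuming the file's existing ATen/ACL includes; the set of dtype entries and the helper name are illustrative, not the PR's actual mapping:

#include <unordered_map>

namespace atb {
namespace {
// Built once at file scope instead of inside at_tensor_to_atb_tensor.
// aclDataType values taken from acl_base.h; the exact set of entries is assumed.
const std::unordered_map<at::ScalarType, aclDataType> kDtypeMap = {
    {at::kFloat, ACL_FLOAT},
    {at::kHalf, ACL_FLOAT16},
    {at::kBFloat16, ACL_BF16},
    {at::kInt, ACL_INT32},
    {at::kLong, ACL_INT64},
};
}  // namespace

inline aclDataType to_acl_dtype(at::ScalarType dtype) {
  auto it = kDtypeMap.find(dtype);
  TORCH_CHECK(it != kDtypeMap.end(), "unsupported dtype for atb tensor");
  return it->second;
}
}  // namespace atb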

workspace_tensor = at::empty({workspace_size}, options.dtype(at::kByte));
workspace_ptr = const_cast<void*>(workspace_tensor.storage().data());
}
const c10::SmallVector<at::Tensor, N>& cpu_tensors =
Collaborator:

Change at:: and c10:: to torch::.
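
For the workspace line quoted above, the change would look roughly like this (torch::empty and torch::kByte are the public torch:: aliases; whether c10::SmallVector has a torch:: equivalent is not addressed here):

workspace_tensor = torch::empty({workspace_size}, options.dtype(torch::kByte));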

using namespace std;
namespace atb {
using ReshapeAndCacheParam = atb::infer::ReshapeAndCacheParam;
void _npu_reshape_and_cache(const at::Tensor& key,
Collaborator:

Why add _ at the beginning of the function name? That's Python style, not C++.
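
The rename itself just drops the leading underscore; the parameter list below is assumed for illustration (and written with torch::Tensor per the earlier comment), not taken from the PR:

// assumed signature, shown only to illustrate the naming convention
void npu_reshape_and_cache(const torch::Tensor& key,
                           const torch::Tensor& value,
                           torch::Tensor& key_cache,
                           torch::Tensor& value_cache,
                           const torch::Tensor& slot_mapping);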

yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch 6 times, most recently from 59842ae to 89df240, on December 9, 2025 at 09:19.
yq33victor previously approved these changes on Dec 9, 2025.

@yq33victor (Collaborator) left a comment:

LGTM

XuZhang99 previously approved these changes on Dec 9, 2025.
yingxudeng dismissed stale reviews from XuZhang99 and yq33victor via 56790e3 on December 9, 2025 at 12:28.
yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch from 89df240 to 56790e3 on December 9, 2025 at 12:28.
yq33victor previously approved these changes on Dec 9, 2025.
XuZhang99 previously approved these changes on Dec 9, 2025.
yingxudeng dismissed stale reviews from XuZhang99 and yq33victor via 7a42a95 on December 10, 2025 at 02:52.
yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch 2 times, most recently from 7a42a95 to cd27698, on December 10, 2025 at 02:56.
yingxudeng force-pushed the feat/npu_backend_torch_2_kernels branch from cd27698 to a3ac3eb on December 10, 2025 at 04:02.
@yq33victor (Collaborator) left a comment:

LGTM
