feat: add wrappers for ATB and ACLNN fused operators. #474
base: main
Conversation
```cpp
#elif defined(USE_CUDA)
  cuda::act_and_mul(params.output, params.input, params.act_mode);
#else
  LOG(FATAL) << "active not implemented";
```
Remove `torch::Tensor active_tensor(ActivationParams& params)` and add `params.output = npu::active(params.input, params.act_mode);` here for the NPU device.
```cpp
auto output = torch::empty(
    {batch_size,
     intermediate_size_ / parallel_args_.tp_group_->world_size()},
    gate_up.options());
```
This is a good modification. However, as described, the current code's output still allocates space preemptively. For NPU operators, they typically allocate their own space and return the result. This unavoidable difference still forces the external calling code to use an #if block to skip space allocation specifically for the NPU case.
To standardize the external calling code, I personally recommend aligning with the NPU's behavior: allocate the space within the operator wrapper/layer and then return it. This approach allows for a unified code structure for all external calls.
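A minimal sketch of that idea, assuming a hypothetical `active_tensor` wrapper and that `cuda::act_and_mul` keeps its current out-parameter signature; the output shape is only illustrative and the MLU branch is omitted:

```cpp
// Sketch: allocate the output inside the wrapper and return it, so the NPU,
// CUDA and MLU paths all present the same allocate-and-return contract.
torch::Tensor active_tensor(ActivationParams& params) {
#if defined(USE_NPU)
  // NPU operators already allocate and return their own result.
  return npu::active(params.input, params.act_mode);
#elif defined(USE_CUDA)
  // Assumed gated-activation shape: the fused act-and-mul halves the last dim.
  auto output = torch::empty({params.input.size(0), params.input.size(-1) / 2},
                             params.input.options());
  cuda::act_and_mul(output, params.input, params.act_mode);
  return output;
#else
  LOG(FATAL) << "active not implemented";
#endif
}
```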
So don't add the two functions `active_tensor` and `fused_layernorm_tensor` to ops_api.h, because no other platform will use such an API.
Put them in npu_ops_api.h and call them directly in the NPU layer.
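A rough sketch of what such an NPU-only header might declare; the namespace, include paths, and signatures are assumptions that mirror the two functions named above:

```cpp
// npu_ops_api.h -- NPU-only entry points, kept out of the cross-platform ops_api.h.
#pragma once

#include <torch/torch.h>

#include "param.h"  // assumed location of ActivationParams / FusedLayerNormParams

namespace xllm::kernel::npu {

// Both return a freshly allocated tensor, since the NPU operators allocate
// their own output instead of writing into a caller-provided buffer.
torch::Tensor active_tensor(ActivationParams& params);
torch::Tensor fused_layernorm_tensor(FusedLayerNormParams& params);

}  // namespace xllm::kernel::npu
```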
Regarding the code snippet above: if we implement the changes as suggested, we would need to introduce #if directives here to skip memory allocation, since the NPU operator handles this internally.
Could we instead consider moving the memory allocation logic for MLU and CUDA into their respective kernel wrappers? This would make the behavior more similar to PyTorch and allow us to unify the calling code here.
(PS: I haven't modified the CUDA or MLU code yet.)
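For illustration only, the calling code (e.g., in dense_mlp.cpp) could then shrink to something like this; the function name and namespace are hypothetical:

```cpp
// No pre-allocation and no #if at the call site: every backend's wrapper
// allocates and returns its own output tensor.
torch::Tensor output = kernel::act_and_mul(gate_up, act_mode);
```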
xllm/core/kernels/ops_api.cpp (outdated)
```cpp
#endif
}

torch::Tensor fused_layernorm_tensor(FusedLayerNormParams& params) {
```
same as above
Similar to the previous comment.
277c5fb to 1da759f (force-pushed)
For activation ops on NPU, revert this commit (refactor: standardize interface for active kernel execution.); all you need to do is:
```cpp
#elif defined(USE_NPU)
  // make params.output a null tensor
  params.output = torch::Tensor();
  params.output = npu::active(params.input, params.act_mode);
```
#474 (comment) Regarding GPU/MLU code, this has no performance impact, as it simply utilizes pre-allocated memory space. However, for NPU code, repeated calls to operations such as RMS normalization may lead to frequent memory allocations and immediate deallocations within the external framework. Could this pattern of repeatedly allocating and promptly discarding memory affect performance on NPU architectures?
28e6e79 to a0382bb (force-pushed)
In dense_mlp.cpp:
```cpp
torch::Tensor output;
if (Device::type != "npu") {
  output = torch::empty(
      {batch_size,
       intermediate_size_ / parallel_args_.tp_group_->world_size()},
      gate_up.options());
}
```
BTW, you need to learn more about memory management in torch.
Thank you for your review. Moving forward, I will replace the unavoidable
87bd6dc to 0dd081d (force-pushed)
xllm/core/kernels/npu/custom_functions_npu/operation_cache_compute.h (outdated; thread resolved)
```cpp
namespace atb {
atb::Tensor at_tensor_to_atb_tensor(const at::Tensor at_tensor) {
  static std::map<at::ScalarType, aclDataType> dtype_map = {
```
Use `unordered_map`, and maybe we can move `dtype_map` out of the function.
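A minimal sketch of that change, with the map hoisted to namespace scope and switched to `unordered_map`; the include path and the exact set of dtype entries are assumptions:

```cpp
#include <unordered_map>

#include <ATen/ATen.h>
#include <acl/acl_base.h>  // aclDataType (assumed header)

namespace atb {

// Built once at namespace scope instead of on every call to
// at_tensor_to_atb_tensor.
const std::unordered_map<at::ScalarType, aclDataType> kDtypeMap = {
    {at::ScalarType::Float, ACL_FLOAT},
    {at::ScalarType::Half, ACL_FLOAT16},
    {at::ScalarType::BFloat16, ACL_BF16},
    {at::ScalarType::Int, ACL_INT32},
    {at::ScalarType::Long, ACL_INT64},
    {at::ScalarType::Bool, ACL_BOOL},
};

}  // namespace atb
```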
```cpp
  workspace_tensor = at::empty({workspace_size}, options.dtype(at::kByte));
  workspace_ptr = const_cast<void*>(workspace_tensor.storage().data());
}
const c10::SmallVector<at::Tensor, N>& cpu_tensors =
```
Change `at::` and `c10::` to `torch::`.
```cpp
using namespace std;
namespace atb {
using ReshapeAndCacheParam = atb::infer::ReshapeAndCacheParam;
void _npu_reshape_and_cache(const at::Tensor& key,
```
Why add `_` at the beginning of the function name? That's Python style, not C++.
59842ae to 89df240 (force-pushed)
yq33victor left a comment:
LGTM
89df240 to 56790e3 (force-pushed)
7a42a95 to cd27698 (force-pushed)
cd27698 to a3ac3eb (force-pushed)
yq33victor left a comment:
LGTM
No description provided.