Avoid copying output from GPU to CPU for ASR runner #16235
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16235
Note: Links to docs will display an error until the docs builds have been completed.
❌ 20 New Failures, 1 Cancelled Job, 2 Unrelated Failures as of commit fad0cf2 with merge base df626bd.
NEW FAILURES - The following jobs have failed:
CANCELLED JOB - The following job was cancelled. Please retry:
FLAKY - The following job failed but was likely due to flakiness present on trunk:
UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Great -- simplifies quite a bit.
Three comments:
- Can you bring in the RAII cleanup pattern from #16060? We should ensure GPU tensors are properly cleaned up on error paths (see the usage sketch after this list):
struct TensorCleanup {
  std::vector<AOTITensorHandle>& tensors;
  ~TensorCleanup() {
    // Free any GPU tensors still registered when the guard goes out of scope.
    for (auto* handle : tensors) {
      if (handle != nullptr) {
        aoti_torch_delete_tensor_object(handle);
      }
    }
  }
};
- Can you add a simple clear_gpu_outputs option to free GPU memory when done with encoder-decoder loops?
- Even in the non-error case, isn't it leaking memory? How/when are the GPU tensors deleted?
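A minimal usage sketch for the guard above, assuming hypothetical helper names (create_gpu_output_tensor and num_outputs are illustrative placeholders, not the actual backend code):
// Illustrative only: how the RAII guard could protect GPU output tensors on
// error paths inside execute().
std::vector<AOTITensorHandle> outputs;
TensorCleanup cleanup{outputs};  // frees registered handles on every return path
for (size_t i = 0; i < num_outputs; ++i) {
  AOTITensorHandle handle = create_gpu_output_tensor(i);  // hypothetical allocation call
  if (handle == nullptr) {
    return Error::Internal;  // handles created so far are freed by ~TensorCleanup
  }
  outputs.push_back(handle);
}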
}

 private:
  mutable std::mutex skip_copy_method_mutex_;
Do you need the mutex at all?
We expect callers of ExecuTorch to be aware of thread safety, and we don't guarantee any thread safety within the internals of ET.
I see the same pattern in xnnpack, so I'd like to keep it consistent.
  mutable std::mutex weights_cache_mutex_;
@Gasoonjia can you comment on whether this still holds true after the slimtensor migration?
I don't think we need that, if the destroy() hook is reliable.
I think it is no longer the case if we migrate to slimtensor, since at that point the memory will be controlled at the tensor level. But for this case I think we can do something similar, e.g. making
Gasoonjia left a comment
LGTM, only some minor feedback. Thanks for optimizing the data pipeline!
  }
}
// Clean up output tensors
for (auto* handle : outputs)
I think we may need to keep the output tensors alive; otherwise the next round of inputs may fail to get the data.
OK, I think that's fair. Basically we don't want to clean up outputs when exiting execute(), and we'll probably rely on destroy() to clean them up.
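A rough sketch of that idea, assuming the delegate handle keeps a list of its live output handles (the live_outputs member and destroy_outputs helper are illustrative, not the actual AOTIDelegateHandle layout):
// Illustrative only: keep output handles alive across execute() calls and free
// them only when the delegate handle itself is destroyed.
void destroy_outputs(AOTIDelegateHandle* handle) {
  for (auto* out : handle->live_outputs) {  // hypothetical member
    if (out != nullptr) {
      aoti_torch_delete_tensor_object(out);
    }
  }
  handle->live_outputs.clear();
}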
 private:
  mutable std::mutex skip_copy_method_mutex_;
  std::string skip_copy_method_;
Shouldn't it be an array of strings, since we may want to skip the copy for multiple methods? It's OK to update this in the following PRs.
Yeah, the next PR will support a comma-separated string.
Summary
This PR adds the ability to skip copying GPU outputs back to CPU for specific methods in the CUDA backend, and enables conditional CUDA compilation for the ASR runner.
Changes
CUDA Backend (backends/cuda/runtime/cuda_backend.cpp)
- Changed the skip_copy_output_to_cpu_for_method backend option from a boolean to a string that accepts a method name
- The option can be set via set_option() after init() is called, allowing runtime configuration
- In execute(), the backend compares the configured method name against the handle's method name to decide whether to skip the GPU→CPU output copy

Usage example:
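A rough illustrative sketch of the resulting behavior (the "decoder" method name and the helper below are assumptions for illustration, not the actual backend code):
// Illustrative only: execute() compares the configured method name against the
// handle's method name; outputs of the matching method stay on GPU, while all
// other methods still copy their outputs back to CPU.
#include <string>

struct HandleLike {
  std::string method_name;  // mirrors the new method_name field on AOTIDelegateHandle
};

// skip_copy_method holds the value of skip_copy_output_to_cpu_for_method,
// e.g. "decoder" to keep an ASR decoder's outputs resident on GPU.
bool should_skip_copy(const HandleLike& handle, const std::string& skip_copy_method) {
  return !skip_copy_method.empty() && handle.method_name == skip_copy_method;
}
For example, should_skip_copy({"decoder"}, "decoder") returns true, so the decoder's outputs stay on GPU, while should_skip_copy({"encoder"}, "decoder") returns false and the encoder's outputs are copied back to CPU as before.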
AOTI Delegate Handle (backends/aoti/aoti_delegate_handle.h)
- Added a method_name field to AOTIDelegateHandle to track which method each delegate handle corresponds to

ASR Runner CMake (extension/asr/runner/CMakeLists.txt)
- When EXECUTORCH_BUILD_CUDA is enabled and CUDAToolkit is found, the CUDA_AVAILABLE compile definition is added

Motivation
When running multi-method models (e.g., prefill + decode for LLMs, or encoder + decoder for ASR), some methods benefit from keeping outputs on GPU to avoid unnecessary memory copies between methods. This change enables fine-grained control over which method(s) skip the GPU→CPU copy.
Test Plan
- skip_copy_output_to_cpu_for_method option works for the specified method
- Perf improvement compared with main: