Conversation

@larryliu0820 (Contributor) commented Dec 12, 2025

Summary

This PR adds the ability to skip copying GPU outputs back to CPU for specific methods in the CUDA backend, and enables conditional CUDA compilation for the ASR runner.

Changes

CUDA Backend (backends/cuda/runtime/cuda_backend.cpp)

  • Changed the skip_copy_output_to_cpu_for_method backend option from a boolean to a string that accepts a method name
  • The option can be set via set_option() after init() is called, allowing runtime configuration
  • During execute(), the backend compares the configured method name against the handle's method name to decide whether to skip the GPU→CPU output copy, as sketched below
  • Thread-safe access to the skip-copy method name via a mutex
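
A minimal sketch of that check inside execute(); the member names skip_copy_method_ and skip_copy_method_mutex_ come from the diff quoted in the review below, while the surrounding control flow is an assumption:

// Sketch: decide per call whether to skip the GPU→CPU output copy.
bool skip_copy = false;
{
  std::lock_guard<std::mutex> lock(skip_copy_method_mutex_);
  skip_copy = !skip_copy_method_.empty() &&
      skip_copy_method_ == handle->method_name;
}
if (!skip_copy) {
  // Default path: copy each GPU output tensor back to its CPU tensor.
}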

Usage example:

// Configure the CUDA backend (after init()) so that outputs of the
// "encode" method stay on the GPU.
BackendOptions<1> options;
options.set_option("skip_copy_output_to_cpu_for_method", "encode");
set_option("CudaBackend", options.view());

AOTI Delegate Handle (backends/aoti/aoti_delegate_handle.h)

  • Added method_name field to AOTIDelegateHandle to track which method each delegate handle corresponds to
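
For reference, a sketch of the struct with the new field (other members elided):

struct AOTIDelegateHandle {
  // ... existing members ...
  std::string method_name;  // which method this delegate handle serves
};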

ASR Runner CMake (extension/asr/runner/CMakeLists.txt)

  • Added conditional CUDA support: when EXECUTORCH_BUILD_CUDA is enabled and CUDAToolkit is found, the CUDA_AVAILABLE compile definition is added (sketched below)
  • This allows ASR runner code to conditionally compile CUDA-aware code paths behind #ifdef CUDA_AVAILABLE guards
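
A hedged sketch of what that conditional might look like; the target name extension_asr_runner is hypothetical:

# Possible shape of the logic in extension/asr/runner/CMakeLists.txt.
if(EXECUTORCH_BUILD_CUDA)
  find_package(CUDAToolkit)
  if(CUDAToolkit_FOUND)
    # Hypothetical target name.
    target_compile_definitions(extension_asr_runner PRIVATE CUDA_AVAILABLE)
  endif()
endif()

Runner sources can then guard CUDA-specific paths with #ifdef CUDA_AVAILABLE.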

Motivation

When running multi-method models (e.g., prefill + decode for LLMs, or encoder + decoder for ASR), some methods benefit from keeping outputs on GPU to avoid unnecessary memory copies between methods. This change enables fine-grained control over which method(s) skip the GPU→CPU copy.

Test Plan

  • Build with CUDA enabled
  • Verify skip_copy_output_to_cpu_for_method option works for specified method
  • Verify other methods still copy outputs to CPU by default

Perf improvement compared with main:

[Screenshot (2025-12-15): benchmark comparison showing the improvement over main]

pytorch-bot bot commented Dec 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16235

Note: Links to docs will display an error until the docs builds have been completed.

❌ 20 New Failures, 1 Cancelled Job, 2 Unrelated Failures

As of commit fad0cf2 with merge base df626bd:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Dec 12, 2025
@larryliu0820 larryliu0820 force-pushed the avoid_copy_output branch 2 times, most recently from f3e30a4 to d1df034, on December 13, 2025 00:09
@larryliu0820 larryliu0820 added the release notes: desktop label Dec 13, 2025
@larryliu0820 larryliu0820 changed the title from "Avoid copying output from GPU to CPU" to "Avoid copying output from GPU to CPU for ASR runner" Dec 13, 2025
@larryliu0820 larryliu0820 marked this pull request as ready for review December 13, 2025 00:22
@larryliu0820 larryliu0820 temporarily deployed to upload-benchmark-results December 13, 2025 01:25 — with GitHub Actions
@larryliu0820 larryliu0820 temporarily deployed to upload-benchmark-results December 14, 2025 01:04 — with GitHub Actions
@mergennachin mergennachin self-requested a review December 15, 2025 01:34
@mergennachin (Contributor) left a comment

Great -- simplifies quite a bit.

Three comments:

  1. Can you bring in the RAII cleanup pattern from #16060? We should ensure GPU tensors are properly cleaned up on error paths (see the usage sketch after this list).
  struct TensorCleanup {
      std::vector<AOTITensorHandle>& tensors;
      // Frees any remaining (non-null) handles when the guard goes out of
      // scope, including on early returns and error paths.
      ~TensorCleanup() {
          for (auto* handle : tensors) {
              if (handle != nullptr) {
                  aoti_torch_delete_tensor_object(handle);
              }
          }
      }
  };
  2. Can you add a simple clear_gpu_outputs option to free GPU memory when done with encoder-decoder loops?

  3. Even in the non-error case, isn't it leaking memory? How/when are the GPU tensors deleted?
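
A hypothetical use inside execute() (the vector name outputs matches the diff quoted later in this thread; everything else is a sketch):

// Hypothetical: attach the guard right after creating the output handles
// so they are released on any early return from execute().
std::vector<AOTITensorHandle> outputs;
TensorCleanup cleanup{outputs};
// ... run the delegate, fill `outputs`, optionally copy back to CPU ...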

}

private:
mutable std::mutex skip_copy_method_mutex_;
Contributor:

Do you need a mutex at all?

We expect callers of ExecuTorch to be aware of thread safety, and we don't guarantee any thread safety within the internals of ET.

Contributor Author:

I see this pattern in XNNPACK, so I'd like to keep it the same way:

mutable std::mutex weights_cache_mutex_;

@larryliu0820 (Contributor Author) commented:

  3. Even in the non-error case, isn't it leaking memory? How/when are the GPU tensors deleted?

Each CudaBackend instance keeps references to the GPU tensors (both encoder and decoder), and they are freed when the instance is destroyed.

@Gasoonjia can you comment on whether this still holds true after the SlimTensor migration?
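
A minimal sketch of that lifetime, under the assumption of a retained_outputs_ member (the name is hypothetical; AOTITensorHandle and aoti_torch_delete_tensor_object appear in the review above):

// Hypothetical member: GPU output handles retained across execute() calls.
std::vector<AOTITensorHandle> retained_outputs_;

// Handles are released only when the backend is torn down (simplified
// relative to the real destroy() hook).
void destroy() {
  for (auto* t : retained_outputs_) {
    if (t != nullptr) {
      aoti_torch_delete_tensor_object(t);
    }
  }
  retained_outputs_.clear();
}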

@larryliu0820 (Contributor Author) commented:

  2. Can you add a simple clear_gpu_outputs option to free GPU memory when done with encoder-decoder loops

I don't think we need that, if the destroy() hook is reliable.

@Gasoonjia (Contributor) commented:

  3. Even in the non-error case, isn't it leaking memory? How/when are the GPU tensors deleted?

  Each CudaBackend instance keeps references to the GPU tensors (both encoder and decoder), and they are freed when the instance is destroyed.

  @Gasoonjia can you comment on whether this still holds true after the SlimTensor migration?

I think it will no longer be the case once we migrate to SlimTensor, since memory will then be controlled at the tensor level.

But for this case we can do something similar, e.g. have CudaBackend keep a reference to the output slim tensor so it stays alive after processing and is freed in the destroy() function, making sure the pipeline still works.

@Gasoonjia (Contributor) left a comment

LGTM, only some minor feedback. Thanks for optimizing the data pipeline!

}
}
// Clean up output tensors
for (auto* handle : outputs) {
Contributor:

I think we need to keep the output tensors alive; otherwise the next round of input may fail to get the data.

Contributor Author:

OK, I think that's fair: basically we don't want to clean up the outputs when exiting execute(), and we'll probably rely on destroy() to clean them up.


private:
mutable std::mutex skip_copy_method_mutex_;
std::string skip_copy_method_;
Contributor:

Shouldn't it be an array of strings, since we may skip the copy for multiple methods? It's OK to update this in a follow-up PR.

Contributor Author:

Yeah, the next PR will support a comma-separated string.
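
A possible shape for that follow-up; parse_methods is a hypothetical helper, not code from this PR:

#include <sstream>
#include <string>
#include <vector>

// Hypothetical: split the option value "encode,decode" into method names.
std::vector<std::string> parse_methods(const std::string& csv) {
  std::vector<std::string> methods;
  std::stringstream ss(csv);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) {
      methods.push_back(item);
    }
  }
  return methods;
}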

@larryliu0820 larryliu0820 force-pushed the avoid_copy_output branch 2 times, most recently from fdbc552 to d53f33e, on December 16, 2025 18:11