Conversation

@larryliu0820 (Contributor) commented Dec 12, 2025

Summary

This PR adds the ability to skip copying GPU outputs back to CPU for specific methods in the CUDA backend, and enables conditional CUDA compilation for the ASR runner.

Changes

CUDA Backend (backends/cuda/runtime/cuda_backend.cpp)

  • Changed the skip_copy_output_to_cpu_for_method backend option from a boolean to a string that accepts a method name
  • The option can be set via set_option() after init() is called, allowing runtime configuration
  • During execute(), the backend compares the configured method name against the handle's method name to decide whether to skip the GPU→CPU output copy, as sketched below
  • Thread-safe access to the skip-copy method name via a mutex
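
A minimal sketch of that check inside execute(); the member names skip_copy_method_ and skip_copy_method_mutex_ come from the diff quoted in the review below, while the surrounding control flow is an assumption:

// Sketch: decide per call whether to skip the GPU→CPU output copy.
bool skip_copy = false;
{
  std::lock_guard<std::mutex> lock(skip_copy_method_mutex_);
  skip_copy = !skip_copy_method_.empty() &&
      skip_copy_method_ == handle->method_name;
}
if (!skip_copy) {
  // Default path: copy each GPU output tensor back to its CPU tensor.
}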

Usage example:

// Configure the CUDA backend (after init()) so that outputs of the
// "encode" method stay on the GPU.
BackendOptions<1> options;
options.set_option("skip_copy_output_to_cpu_for_method", "encode");
set_option("CudaBackend", options.view());

AOTI Delegate Handle (backends/aoti/aoti_delegate_handle.h)

  • Added method_name field to AOTIDelegateHandle to track which method each delegate handle corresponds to
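
For reference, a sketch of the struct with the new field (other members elided):

struct AOTIDelegateHandle {
  // ... existing members ...
  std::string method_name;  // which method this delegate handle serves
};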

ASR Runner CMake (extension/asr/runner/CMakeLists.txt)

  • Added conditional CUDA support: when EXECUTORCH_BUILD_CUDA is enabled and CUDAToolkit is found, the CUDA_AVAILABLE compile definition is added (sketched below)
  • This allows ASR runner code to conditionally compile CUDA-aware code paths behind #ifdef CUDA_AVAILABLE guards
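
A hedged sketch of what that conditional might look like; the target name extension_asr_runner is hypothetical:

# Possible shape of the logic in extension/asr/runner/CMakeLists.txt.
if(EXECUTORCH_BUILD_CUDA)
  find_package(CUDAToolkit)
  if(CUDAToolkit_FOUND)
    # Hypothetical target name.
    target_compile_definitions(extension_asr_runner PRIVATE CUDA_AVAILABLE)
  endif()
endif()

Runner sources can then guard CUDA-specific paths with #ifdef CUDA_AVAILABLE.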

Motivation

When running multi-method models (e.g., prefill + decode for LLMs, or encoder + decoder for ASR), some methods benefit from keeping outputs on GPU to avoid unnecessary memory copies between methods. This change enables fine-grained control over which method(s) skip the GPU→CPU copy.

Test Plan

  • Build with CUDA enabled
  • Verify skip_copy_output_to_cpu_for_method option works for specified method
  • Verify other methods still copy outputs to CPU by default

Perf improvement compared with main:

[Screenshot (2025-12-15): benchmark comparison showing the improvement over main]

pytorch-bot bot commented Dec 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/16235

Note: Links to docs will display an error until the docs builds have been completed.

❌ 20 New Failures, 1 Cancelled Job, 2 Unrelated Failures

As of commit fad0cf2 with merge base df626bd:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Dec 12, 2025
@larryliu0820 larryliu0820 force-pushed the avoid_copy_output branch 2 times, most recently from f3e30a4 to d1df034, on December 13, 2025 00:09
@larryliu0820 larryliu0820 added the release notes: desktop label Dec 13, 2025
@larryliu0820 larryliu0820 changed the title from "Avoid copying output from GPU to CPU" to "Avoid copying output from GPU to CPU for ASR runner" Dec 13, 2025
@larryliu0820 larryliu0820 marked this pull request as ready for review December 13, 2025 00:22
@larryliu0820 larryliu0820 temporarily deployed to upload-benchmark-results December 13, 2025 01:25 — with GitHub Actions
@larryliu0820 larryliu0820 temporarily deployed to upload-benchmark-results December 14, 2025 01:04 — with GitHub Actions
@mergennachin mergennachin self-requested a review December 15, 2025 01:34
@mergennachin (Contributor) left a comment

Great -- simplifies quite a bit.

Three comments:

  1. Can you bring in the RAII cleanup pattern from #16060? We should ensure GPU tensors are properly cleaned up on error paths (see the usage sketch after this list).
  struct TensorCleanup {
      std::vector<AOTITensorHandle>& tensors;
      // Frees any remaining (non-null) handles when the guard goes out of
      // scope, including on early returns and error paths.
      ~TensorCleanup() {
          for (auto* handle : tensors) {
              if (handle != nullptr) {
                  aoti_torch_delete_tensor_object(handle);
              }
          }
      }
  };
  2. Can you add a simple clear_gpu_outputs option to free GPU memory when done with encoder-decoder loops?

  3. Even in the non-error case, isn't it leaking memory? How/when are the GPU tensors deleted?
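
A hypothetical use inside execute() (the vector name outputs matches the diff quoted later in this thread; everything else is a sketch):

// Hypothetical: attach the guard right after creating the output handles
// so they are released on any early return from execute().
std::vector<AOTITensorHandle> outputs;
TensorCleanup cleanup{outputs};
// ... run the delegate, fill `outputs`, optionally copy back to CPU ...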

}

private:
mutable std::mutex skip_copy_method_mutex_;
Contributor:

Do you need a mutex at all?

We expect callers of ExecuTorch to be aware of thread safety, and we don't guarantee any thread safety within the internals of ET.

Contributor Author:

I see this pattern in XNNPACK, so I'd like to keep it the same way:

mutable std::mutex weights_cache_mutex_;

@larryliu0820 (Contributor Author) commented:

  3. Even in the non-error case, isn't it leaking memory? How/when are the GPU tensors deleted?

Each CudaBackend instance keeps references to the GPU tensors (both encoder and decoder), and they are freed when the instance is destroyed.

@Gasoonjia can you comment on whether this still holds true after the SlimTensor migration?
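
A minimal sketch of that lifetime, under the assumption of a retained_outputs_ member (the name is hypothetical; AOTITensorHandle and aoti_torch_delete_tensor_object appear in the review above):

// Hypothetical member: GPU output handles retained across execute() calls.
std::vector<AOTITensorHandle> retained_outputs_;

// Handles are released only when the backend is torn down (simplified
// relative to the real destroy() hook).
void destroy() {
  for (auto* t : retained_outputs_) {
    if (t != nullptr) {
      aoti_torch_delete_tensor_object(t);
    }
  }
  retained_outputs_.clear();
}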

@larryliu0820 (Contributor Author) commented:

  2. Can you add a simple clear_gpu_outputs option to free GPU memory when done with encoder-decoder loops

I don't think we need that, if the destroy() hook is reliable.

@Gasoonjia (Contributor) commented:

  3. Even in the non-error case, isn't it leaking memory? How/when are the GPU tensors deleted?

  Each CudaBackend instance keeps references to the GPU tensors (both encoder and decoder), and they are freed when the instance is destroyed.

  @Gasoonjia can you comment on whether this still holds true after the SlimTensor migration?

I think it will no longer be the case once we migrate to SlimTensor, since memory will then be controlled at the tensor level.

But for this case we can do something similar, e.g. have CudaBackend keep a reference to the output slim tensor so it stays alive after processing and is freed in the destroy() function, making sure the pipeline still works.

@Gasoonjia (Contributor) left a comment

LGTM, only some minor feedback. Thanks for optimizing the data pipeline!

}
}
// Clean up output tensors
for (auto* handle : outputs) {
Contributor:

I think we need to keep the output tensors alive; otherwise the next round of input may fail to get the data.

Contributor Author:

OK, I think that's fair: basically we don't want to clean up the outputs when exiting execute(), and we'll probably rely on destroy() to clean them up.


private:
mutable std::mutex skip_copy_method_mutex_;
std::string skip_copy_method_;
Contributor:

Shouldn't it be an array of strings, since we may skip the copy for multiple methods? It's OK to update this in a follow-up PR.

Contributor Author:

Yeah, the next PR will support a comma-separated string.
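
A possible shape for that follow-up; parse_methods is a hypothetical helper, not code from this PR:

#include <sstream>
#include <string>
#include <vector>

// Hypothetical: split the option value "encode,decode" into method names.
std::vector<std::string> parse_methods(const std::string& csv) {
  std::vector<std::string> methods;
  std::stringstream ss(csv);
  std::string item;
  while (std::getline(ss, item, ',')) {
    if (!item.empty()) {
      methods.push_back(item);
    }
  }
  return methods;
}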

@larryliu0820 larryliu0820 force-pushed the avoid_copy_output branch 2 times, most recently from fdbc552 to d53f33e, on December 16, 2025 18:11