[ExecuTorch][WebGPU] Add per-pass dispatch ordering + scratch buffer tests#20080
JulianCloudNTH wants to merge 14 commits into
🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20080

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Unclassified Failure

As of commit b1af951 with merge base ff2bf9c:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

This comment was automatically generated by Dr. CI and updates every 15 minutes.
…tests

Pull Request resolved: #20080

Native unit tests for two runtime enablers: per-pass compute-dispatch ordering (D107543258) and graph-owned scratch buffers (D107543259).

`test/native/test_dispatch_order.cpp` exercises multi-dispatch read-after-write ordering through a single `execute()` using dependency chains -- a single-input `add` self-chain and a heterogeneous `rms_norm` -> `add` chain, both lowered via `VulkanPartitioner` -- comparing GPU output to a torch-computed golden per element.

`test/native/test_scratch_buffer.cpp` is a white-box test of `WebGPUGraph::create_scratch_buffer` (no black-box consumer exists below the SDPA op): allocation + zero-size guard, copy round-trip, a compute Storage round-trip (its actual use), and a create/destroy lifecycle stress.

Authored with assistance from Claude.

ghstack-source-id: 390552321
@exported-using-ghexport

Differential Revision: [D107576199](https://our.internmc.facebook.com/intern/diff/D107576199/)
@claude review
Claude finished @JulianCloudNTH's task in 1m 36s

Code Review: Per-pass dispatch ordering + scratch buffer tests

Overall this is a well-structured PR adding native unit tests for two distinct runtime features. The code is clean, follows the existing patterns in …

Feedback

1. No timeout on async poll loops

```cpp
while (!cb.done) {
  webgpu_poll(instance, device);
}
```

If the GPU hangs or a callback is never delivered, this loops forever. Consider adding a simple iteration counter with a bail-out (e.g. 10k iterations → print FAIL and return false). Same pattern is used in the existing …

2. C++/Python case table sync is manual

The C++ … Not a blocker, just a maintenance risk as the test suite grows.

3. Short-circuit evaluation in test aggregation

```cpp
ok = tier1_alloc(ctx.device) && ok;
ok = tier2_roundtrip(ctx.instance, ctx.device, ctx.queue) && ok;
ok = tier3_compute(ctx.instance, ctx.device, ctx.queue) && ok;
ok = tier3_lifecycle(ctx.device) && ok;
```

Because …

4. Using …

5. …

```cpp
const size_t bytes =
    static_cast<size_t>(f.tellg()) / sizeof(float) * sizeof(float);
```

If the file size isn't a multiple of 4 bytes (corrupt file), this silently drops trailing bytes rather than flagging the issue. In a test utility, it might be worth asserting …

Positives

…

Verdict

Looks good. The issues above are all minor/nits: no correctness bugs or security concerns. Ship it.
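The bail-out suggested in feedback item 1 could look roughly like the sketch below. It is not code from this PR: `CallbackState`, `poll_until_done`, and the `poll` callable are hypothetical names introduced for illustration, and `poll` merely stands in for whatever the test calls on each iteration (the review snippet uses `webgpu_poll(instance, device)`). The 10,000 default mirrors the "10k iterations" figure in the review.

```cpp
#include <cstdio>
#include <functional>

// Hypothetical stand-in for the callback state that the poll loop waits on.
struct CallbackState {
  bool done = false;
};

// Polls until `state.done` becomes true or `max_iters` polls have been made.
// `poll` is an assumed wrapper around the real per-iteration poll call.
// Returns true if the callback completed, false if the loop bailed out.
bool poll_until_done(
    const CallbackState& state,
    const std::function<void()>& poll,
    int max_iters = 10000) {
  for (int i = 0; i < max_iters && !state.done; ++i) {
    poll();
  }
  if (!state.done) {
    std::printf("FAIL: callback not delivered after %d polls\n", max_iters);
    return false;
  }
  return true;
}
```

A caller would replace the bare `while (!cb.done)` loop with something like `if (!poll_until_done(cb, [&] { /* poll call */ })) return false;`, so a hung GPU turns into a reported test failure instead of a stuck job. An iteration cap is only a rough proxy for elapsed time; a wall-clock deadline would be an alternative design.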
Summary: Update the WebGPU backend README to reflect the current state of the backend:

- Add a Progress section listing milestones landed on `main` (pytorch#18808, pytorch#19963, pytorch#19964, pytorch#19981, pytorch#20036) and work in review (pytorch#20079, pytorch#20080), each linking its pull request.
- Update the operator support table to include `rms_norm` and refresh the planned/roadmap list toward end-to-end LLM inference.
- Update the directory structure to match the current layout.

Docs-only change; no code or build impact.

Differential Revision: D107742574
Stack from ghstack (oldest at bottom):
Native unit tests for two runtime enablers: per-pass compute-dispatch ordering (D107543258) and graph-owned scratch buffers (D107543259).
`test/native/test_dispatch_order.cpp` exercises multi-dispatch read-after-write ordering through a single `execute()` using dependency chains -- a single-input `add` self-chain and a heterogeneous `rms_norm` -> `add` chain, both lowered via `VulkanPartitioner` -- comparing GPU output to a torch-computed golden per element.

`test/native/test_scratch_buffer.cpp` is a white-box test of `WebGPUGraph::create_scratch_buffer` (no black-box consumer exists below the SDPA op): allocation + zero-size guard, copy round-trip, a compute Storage round-trip (its actual use), and a create/destroy lifecycle stress.

Authored with assistance from Claude.

@exported-using-ghexport
Differential Revision: D107576199
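For readers unfamiliar with the "golden per element" check mentioned in the description, a minimal sketch of such a comparison follows. It is illustrative only and not taken from the PR: the function name `matches_golden` and the `atol`/`rtol` tolerance values are assumptions, and the actual tests may use a different criterion.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Compares GPU output against a golden reference, element by element.
// atol and rtol are assumed tolerances, not values taken from the PR.
bool matches_golden(
    const std::vector<float>& gpu,
    const std::vector<float>& golden,
    float atol = 1e-5f,
    float rtol = 1e-4f) {
  if (gpu.size() != golden.size()) {
    std::printf("FAIL: size mismatch (%zu vs %zu)\n", gpu.size(), golden.size());
    return false;
  }
  for (std::size_t i = 0; i < gpu.size(); ++i) {
    const float diff = std::fabs(gpu[i] - golden[i]);
    const float bound = atol + rtol * std::fabs(golden[i]);
    // Written as !(diff <= bound) so that a NaN difference also fails.
    if (!(diff <= bound)) {
      std::printf("FAIL: element %zu: got %g, expected %g\n",
                  i, gpu[i], golden[i]);
      return false;
    }
  }
  return true;
}
```

Reporting the first mismatching index, rather than only a pass/fail flag, makes an ordering bug easier to spot: a read-after-write hazard would typically show up as stale values from an earlier dispatch at specific elements.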