[ExecuTorch][WebGPU] Add per-pass dispatch ordering + scratch buffer tests by JulianCloudNTH · Pull Request #20080 · pytorch/executorch

JulianCloudNTH · 2026-06-05T21:31:05Z

Stack from ghstack (oldest at bottom):

[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086
[ExecuTorch][WebGPU] SymInt live-scalar mechanism + et_vk.select_as_symint #20085
[ExecuTorch][WebGPU] Add update_cache tests (native numeric + export) #20084
[ExecuTorch][WebGPU] Add update_cache op (llama.update_cache) #20083
-> [ExecuTorch][WebGPU] Add per-pass dispatch ordering + scratch buffer tests #20080
[ExecuTorch][WebGPU] Switch native backend from wgpu-native to Dawn (Tint) + SwiftShader #20079
[ExecuTorch][WebGPU] Graph-owned scratch buffers for fused-op intermediates #20073
[ExecuTorch][WebGPU] Per-pass compute dispatch ordering for fused multi-dispatch ops #20072

Native unit tests for two runtime enablers: per-pass compute-dispatch ordering (D107543258) and graph-owned scratch buffers (D107543259).

`test/native/test_dispatch_order.cpp` exercises multi-dispatch read-after-write ordering through a single `execute()` using dependency chains (a single-input `add` self-chain and a heterogeneous `rms_norm` -> `add` chain, both lowered via `VulkanPartitioner`), comparing GPU output to a torch-computed golden per element.

`test/native/test_scratch_buffer.cpp` is a white-box test of `WebGPUGraph::create_scratch_buffer` (no black-box consumer exists below the SDPA op): allocation + zero-size guard, copy round-trip, a compute Storage round-trip (its actual use), and a create/destroy lifecycle stress.

Authored with assistance from Claude.
@exported-using-ghexport

Differential Revision: D107576199

[ghstack-poisoned]

pytorch-bot · 2026-06-05T21:31:08Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20080

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Unclassified Failure

As of commit b1af951 with merge base ff2bf9c:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

Test WebGPU Native (Dawn) / test-webgpu-native / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/pytorch/executorch/backends/webgpu/test/native/test_scratch_buffer.cpp:67:3: error: use of undeclared identifier 'webgpu_wait'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-05T21:32:03Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]


JulianCloudNTH · 2026-06-05T23:28:35Z

@claude review

claude · 2026-06-05T23:29:24Z

Claude finished @JulianCloudNTH's task in 1m 36s.

Code Review: Per-pass dispatch ordering + scratch buffer tests

Read changed files
Review test_dispatch_order.cpp
Review test_scratch_buffer.cpp
Review test_dispatch_order.py
Review CMakeLists.txt and shell script changes
Post review feedback

Overall this is a well-structured PR adding native unit tests for two distinct runtime features. The code is clean, follows the existing patterns in backends/webgpu/test/, and provides good coverage. A few observations below:

Feedback

1. No timeout on async poll loops — test_scratch_buffer.cpp:69

while (!cb.done) {
  webgpu_poll(instance, device);
}

If the GPU hangs or a callback is never delivered, this loops forever. Consider adding a simple iteration counter with a bail-out (e.g. 10k iterations → print FAIL and return false). Same pattern is used in the existing test_webgpu_native.cpp so this may be a pre-existing convention, but for a test that exercises lifecycle stress (50 graphs × 256 buffers), a hung test is worth guarding against.
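A bounded version of the poll loop could look roughly like the sketch below. `poll_until` and its parameters are hypothetical names, not part of the PR or of any WebGPU API; the poll step is passed in as a callable so the sketch stays self-contained, and the 10,000-iteration default simply mirrors the figure suggested above.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper (not from the PR): calls `poll` until `done` reports
// true or the iteration budget runs out. Returns true on completion and
// false on timeout, so the caller can print FAIL instead of hanging.
inline bool poll_until(
    const std::function<void()>& poll,
    const std::function<bool()>& done,
    std::size_t max_iters = 10000) {
  for (std::size_t i = 0; i < max_iters; ++i) {
    if (done()) {
      return true;
    }
    poll();
  }
  // One last check in case the final poll delivered the callback.
  return done();
}
```

In the test this would presumably be invoked as `poll_until([&] { webgpu_poll(instance, device); }, [&] { return cb.done; })`, with a `false` result treated as a test failure. An iteration cap is the simplest guard; a wall-clock deadline would be an alternative if poll latency varies widely.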

2. C++/Python case table sync is manual — test_dispatch_order.cpp:33 vs test_dispatch_order.py:54

The C++ kCases must mirror _CASES in the Python exporter. The comment notes this, but there's no compile-time or runtime check that the sets match (e.g. if a new case is added to Python but not C++, it silently passes). A lightweight guard: the C++ test could scan the directory for *.pte files and warn if any don't match kCases, or the Python exporter could write a manifest.

Not a blocker — just a maintenance risk as the test suite grows.

3. Short-circuit evaluation in test aggregation — test_scratch_buffer.cpp:249-252

ok = tier1_alloc(ctx.device) && ok;
ok = tier2_roundtrip(ctx.instance, ctx.device, ctx.queue) && ok;
ok = tier3_compute(ctx.instance, ctx.device, ctx.queue) && ok;
ok = tier3_lifecycle(ctx.device) && ok;

In each statement the tier function is the left operand of &&, so it is always called. Short-circuiting can only skip the read of ok on the right, which happens when the function returns false and the result is already false. Once ok is false, later lines still evaluate their function call first (for example tier2_roundtrip(...) && false), so all tiers always run regardless of earlier failures. This is fine but slightly non-obvious: a reader might expect a failure to short-circuit and skip the later tiers. Minor readability point.
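The difference between the two operand orders can be checked with a small stand-alone sketch. The `Tiers` struct and both functions are hypothetical stand-ins for the tier functions in the test, not code from the PR.

```cpp
// Hypothetical stand-ins for the tier functions: each call is counted and
// returns a fixed result.
struct Tiers {
  int calls = 0;
  bool fail() { ++calls; return false; }
  bool pass() { ++calls; return true; }
};

// Mirrors the aggregation pattern in the test: `ok = tier() && ok;`.
inline bool run_all(Tiers& t) {
  bool ok = true;
  ok = t.fail() && ok;  // the call is the left operand, so it always runs
  ok = t.pass() && ok;  // still runs even though ok is already false
  ok = t.pass() && ok;
  return ok;
}

// Contrast: `ok = ok && tier();` stops calling tiers after the first failure.
inline bool run_until_failure(Tiers& t) {
  bool ok = true;
  ok = ok && t.fail();
  ok = ok && t.pass();  // not called: ok is false, so && short-circuits
  ok = ok && t.pass();
  return ok;
}
```

With one failing tier followed by two passing ones, `run_all` makes three calls and `run_until_failure` makes one; both return `false`.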

4. VulkanPartitioner coupling in WebGPU test — test_dispatch_order.py:19

Using VulkanPartitioner for a WebGPU test is correct (WebGPU reuses Vulkan's FlatBuffer serialization, as documented in CMakeLists.txt:17-26), but a first-time reader of this test in backends/webgpu/ might be confused by the Vulkan import. The module docstring at line 13 could note this: "Lowered via VulkanPartitioner since WebGPU shares Vulkan's serialization format."

5. read_f32_bin silently truncates — test_dispatch_order.cpp:50-51

const size_t bytes =
    static_cast<size_t>(f.tellg()) / sizeof(float) * sizeof(float);

If the file size isn't a multiple of 4 bytes (corrupt file), this silently drops trailing bytes rather than flagging the issue. In a test utility, it might be worth asserting file_size % sizeof(float) == 0.
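A stricter reader along the lines suggested in item 5 might look like this. It is a sketch, not the PR's `read_f32_bin`: the name `read_f32_bin_strict` and the `std::optional` return type are assumptions made for illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <optional>
#include <string>
#include <vector>

// Hypothetical strict variant (not from the PR): reads a raw float32 file
// and returns std::nullopt if the file cannot be opened, cannot be read, or
// has a size that is not a multiple of sizeof(float).
inline std::optional<std::vector<float>> read_f32_bin_strict(
    const std::string& path) {
  std::ifstream f(path, std::ios::binary | std::ios::ate);
  if (!f) {
    return std::nullopt;
  }
  const std::streamoff size = f.tellg();
  if (size < 0 || static_cast<std::size_t>(size) % sizeof(float) != 0) {
    return std::nullopt;  // flag a truncated or corrupt file
  }
  std::vector<float> data(static_cast<std::size_t>(size) / sizeof(float));
  f.seekg(0, std::ios::beg);
  if (!data.empty() &&
      !f.read(reinterpret_cast<char*>(data.data()), size)) {
    return std::nullopt;
  }
  return data;
}
```

Returning an empty optional lets the caller report which file was malformed instead of comparing against a silently shortened golden.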

Positives

Tiered structure in test_scratch_buffer.cpp is excellent — progressively testing allocation → copy → compute → lifecycle stress, so failures pinpoint the broken layer.
Tolerance logic in test_dispatch_order.cpp:123 (pass if abs≤tol OR rel≤tol) handles near-zero goldens gracefully.
Build integration is clean — test_build_webgpu.sh properly gates rms_norm-dependent tests and the new tests follow the existing add_webgpu_native_test CMake pattern.
The export_dispatch_order_cases function with fixed torch.manual_seed(0) ensures reproducible goldens.
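The abs-or-rel tolerance rule mentioned in the positives can be illustrated with a short sketch. The exact formula in `test_dispatch_order.cpp` is not shown in this conversation, so `close_enough` is a hypothetical helper that only demonstrates the general idea.

```cpp
#include <cmath>

// Hypothetical helper (not from the PR): passes if either the absolute
// error or the relative error is within `tol`.
inline bool close_enough(float got, float want, float tol) {
  const float abs_err = std::fabs(got - want);
  if (abs_err <= tol) {
    return true;  // covers goldens at or near zero
  }
  const float denom = std::fabs(want);
  // The relative check only applies when the golden is non-zero.
  return denom > 0.0f && abs_err / denom <= tol;
}
```

The absolute branch keeps near-zero goldens from failing on a meaningless relative error, while the relative branch keeps large-magnitude values from failing on ordinary float32 rounding.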

Verdict

Looks good. The issues above are all minor/nits — no correctness bugs or security concerns. Ship it.

[ghstack-poisoned]


Summary: Update the WebGPU backend README to reflect the current state of the backend:

- Add a Progress section listing milestones landed on `main` (pytorch#18808, pytorch#19963, pytorch#19964, pytorch#19981, pytorch#20036) and work in review (pytorch#20079, pytorch#20080), each linking its pull request.
- Update the operator support table to include `rms_norm` and refresh the planned/roadmap list toward end-to-end LLM inference.
- Update the directory structure to match the current layout.

Docs-only change; no code or build impact.

Differential Revision: D107742574

[ghstack-poisoned]

Update

4e6c9b7

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 5, 2026 21:31

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 5, 2026

Update

4b3d035

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 5, 2026

Update

6fe8f7c

[ghstack-poisoned]

Update

673cbcd

[ghstack-poisoned]

Update

7ed33ce

[ghstack-poisoned]

Update

66456bb

[ghstack-poisoned]

Update

bf324dc

[ghstack-poisoned]

Update

261ac47

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 6, 2026

[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087

Open

Update

5ebbe35

[ghstack-poisoned]

Update

2bd8e95

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 8, 2026

Refresh backend README with progress timeline (#20115) #20115

Open

JulianCloudNTH added 4 commits June 8, 2026 22:26

Update

f8a0eba

[ghstack-poisoned]

Update

c139f63

[ghstack-poisoned]

Update

1f5f027

[ghstack-poisoned]

Update

b1af951

[ghstack-poisoned]

JulianCloudNTH wants to merge 14 commits into gh/JulianCloudNTH/15/base from gh/JulianCloudNTH/15/head