[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache by JulianCloudNTH · Pull Request #20087 · pytorch/executorch

JulianCloudNTH · 2026-06-06T07:15:01Z

Stack from ghstack (oldest at bottom):

-> [ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086
[ExecuTorch][WebGPU] SymInt live-scalar mechanism + et_vk.select_as_symint #20085
[ExecuTorch][WebGPU] Add update_cache tests (native numeric + export) #20084
[ExecuTorch][WebGPU] Add update_cache op (llama.update_cache) #20083
[ExecuTorch][WebGPU] Add per-pass dispatch ordering + scratch buffer tests #20080
[ExecuTorch][WebGPU] Switch native backend from wgpu-native to Dawn (Tint) + SwiftShader #20079
[ExecuTorch][WebGPU] Graph-owned scratch buffers for fused-op intermediates #20073
[ExecuTorch][WebGPU] Per-pass compute dispatch ordering for fused multi-dispatch ops #20072

Adds the WebGPU SDPA test coverage as its own diff, stacked on the SDPA op (which already carries the dynamic-input_pos consumption) and the SymInt mechanism below it: multi-step prefill->mt->decode replay, runtime-dynamic input_pos (autoregressive decode), and an in-graph mutable KV cache, each compared against a torch F.scaled_dot_product_attention golden.
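
To make the in-graph cache idea concrete, here is a minimal PyTorch-only sketch (not the PR's code) of a decode module that keeps its KV cache in `register_buffer` buffers and is checked step by step against an `F.scaled_dot_product_attention` golden. `TinyDecodeCache`, the tensor shapes, and the seed are hypothetical choices for illustration; the real `DecodeCacheModule` goes through export and the WebGPU backend, which this sketch does not attempt to reproduce.

```python
import torch
import torch.nn.functional as F


class TinyDecodeCache(torch.nn.Module):
    """Hypothetical stand-in for a decode module with an in-graph KV cache."""

    def __init__(self, max_seq: int, n_heads: int, head_dim: int):
        super().__init__()
        # Buffers live on the module, so the cache persists across forward() calls.
        self.register_buffer("k_cache", torch.zeros(1, n_heads, max_seq, head_dim))
        self.register_buffer("v_cache", torch.zeros(1, n_heads, max_seq, head_dim))

    def forward(self, q, k, v, input_pos: int):
        # q, k, v have shape [1, n_heads, 1, head_dim]: only the new token is fed.
        self.k_cache[:, :, input_pos] = k[:, :, 0]
        self.v_cache[:, :, input_pos] = v[:, :, 0]
        keys = self.k_cache[:, :, : input_pos + 1]
        vals = self.v_cache[:, :, : input_pos + 1]
        # The new token attends to every cached position, so no mask is needed.
        return F.scaled_dot_product_attention(q, keys, vals)


torch.manual_seed(0)
seq_len, n_heads, head_dim = 4, 2, 8
q_all = torch.randn(1, n_heads, seq_len, head_dim)
k_all = torch.randn(1, n_heads, seq_len, head_dim)
v_all = torch.randn(1, n_heads, seq_len, head_dim)

# Golden: causal attention over the whole sequence in a single call.
golden = F.scaled_dot_product_attention(q_all, k_all, v_all, is_causal=True)

# Autoregressive decode: one module instance, one token per step.
module = TinyDecodeCache(seq_len, n_heads, head_dim)
steps = []
for pos in range(seq_len):
    sl = slice(pos, pos + 1)
    steps.append(module(q_all[:, :, sl], k_all[:, :, sl], v_all[:, :, sl], pos))
decoded = torch.cat(steps, dim=2)
max_abs_err = (decoded - golden).abs().max().item()

# Static control: a fresh module at the last step has an empty cache,
# so its output should not match the golden row.
last = seq_len - 1
fresh = TinyDecodeCache(seq_len, n_heads, head_dim)
stale = fresh(q_all[:, :, last:], k_all[:, :, last:], v_all[:, :, last:], last)
diverges = not torch.allclose(stale, golden[:, :, last:], atol=1e-4)
```

The fresh-module control mirrors the reasoning in the description: if per-step results only match the golden when the same module instance is reused, the accumulated cache state is what makes them correct.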

- `test/ops/sdpa/test_sdpa.py`: `ReplaySeq`/`REPLAY_SEQS` + per-step replay export/golden; `DynamicSdpaModule` + `export_dynamic_decode` (one `.pte`, `input_pos` supplied at runtime as a SymInt); `DecodeCacheModule` + `export_incache_decode` (KV cache as `register_buffer` mutable buffers, so the cache persists in-graph and `forward()` feeds only the new token + `input_pos`).
- `test/test_webgpu_native.cpp`: `test_sdpa_replay`, `test_sdpa_dynamic_decode` (+ negative control: a pinned `input_pos` diverges), `test_sdpa_incache_decode` (+ static control: a fresh Module per step diverges, proving in-graph accumulation is real), `test_symint_roundtrip`, `test_resize_hook`; shared per-element tolerance `sdpa_within_tol` (abs 1e-4 OR rel 1e-3).
- `test/test_build_webgpu.sh`: export the replay / dynamic / in-graph-cache models for the native test.
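
The "abs 1e-4 OR rel 1e-3" tolerance can be read as the following per-element predicate. This is a Python sketch of one plausible interpretation, not the C++ `sdpa_within_tol` from the PR: the function names, the use of `<=`, and measuring the relative error against the expected value are all assumptions.

```python
from typing import Sequence


def within_tol(actual: float, expected: float,
               abs_tol: float = 1e-4, rel_tol: float = 1e-3) -> bool:
    """Pass if the error is small in absolute OR relative terms (assumed semantics)."""
    err = abs(actual - expected)
    return err <= abs_tol or err <= rel_tol * abs(expected)


def all_within_tol(actual: Sequence[float], expected: Sequence[float]) -> bool:
    """Apply the per-element check to two equally sized flat buffers."""
    return len(actual) == len(expected) and all(
        within_tol(a, e) for a, e in zip(actual, expected)
    )
```

Using OR rather than AND keeps the check meaningful at both ends of the range: the absolute bound covers outputs near zero, where a relative bound would be too strict, and the relative bound covers large magnitudes, where a fixed absolute bound would be.
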
@exported-using-ghexport

Differential Revision: D107595144

[ghstack-poisoned]

pytorch-bot · 2026-06-06T07:15:05Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20087

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 Unclassified Failure

As of commit 9656d0e with merge base ff2bf9c:

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

Test WebGPU Native (Dawn) / test-webgpu-native / linux-job (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)
/pytorch/executorch/backends/webgpu/test/native/test_scratch_buffer.cpp:67:3: error: use of undeclared identifier 'webgpu_wait'

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-06T07:15:43Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

[ghstack-poisoned]

…-graph KV cache

Pull Request resolved: #20087

Adds the WebGPU SDPA test coverage as its own diff, stacked on the SDPA op (which already carries the dynamic-`input_pos` consumption) and the SymInt mechanism below it: multi-step prefill->mt->decode replay, runtime-dynamic `input_pos` (autoregressive decode), and an in-graph mutable KV cache, each compared against a torch `F.scaled_dot_product_attention` golden.

- `test/ops/sdpa/test_sdpa.py`: `ReplaySeq`/`REPLAY_SEQS` + per-step replay export/golden; `DynamicSdpaModule` + `export_dynamic_decode` (one `.pte`, `input_pos` supplied at runtime as a SymInt); `DecodeCacheModule` + `export_incache_decode` (KV cache as `register_buffer` mutable buffers, so the cache persists in-graph and forward() feeds only the new token + `input_pos`).
- `test/test_webgpu_native.cpp`: `test_sdpa_replay`, `test_sdpa_dynamic_decode` (+ negative control: a pinned `input_pos` diverges), `test_sdpa_incache_decode` (+ static control: a fresh Module per step diverges, proving in-graph accumulation is real), `test_symint_roundtrip`, `test_resize_hook`; shared per-element tolerance `sdpa_within_tol` (abs 1e-4 OR rel 1e-3).
- `test/test_build_webgpu.sh`: export the replay / dynamic / in-graph-cache models for the native test.

ghstack-source-id: 391352764
@exported-using-ghexport

Differential Revision: [D107595144](https://our.internmc.facebook.com/intern/diff/D107595144/)

Update

a6ccd86

[ghstack-poisoned]

meta-cla Bot added the CLA Signed label Jun 6, 2026

JulianCloudNTH closed this Jun 6, 2026

JulianCloudNTH had a problem deploying to cherry-pick-bot June 6, 2026 07:16 — with GitHub Actions Failure

JulianCloudNTH reopened this Jun 9, 2026

Update

dc3123d

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 9, 2026

Update

493b5cc

[ghstack-poisoned]

Update

5c5a727

[ghstack-poisoned]

Update

9656d0e

[ghstack-poisoned]

JulianCloudNTH wants to merge 5 commits into gh/JulianCloudNTH/20/base from gh/JulianCloudNTH/20/head