[ExecuTorch][WebGPU] Add GPU timestamp-query profiling (WebGPUQueryPool) by JulianCloudNTH · Pull Request #20167 · pytorch/executorch

JulianCloudNTH · 2026-06-09T21:16:24Z

Stack from ghstack (oldest at bottom):

-> [ExecuTorch][WebGPU] Add GPU timestamp-query profiling (WebGPUQueryPool) #20167
[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086
[ExecuTorch][WebGPU] SymInt live-scalar mechanism + et_vk.select_as_symint #20085
[ExecuTorch][WebGPU] Add update_cache tests (native numeric + export) #20084
[ExecuTorch][WebGPU] Add update_cache op (llama.update_cache) #20083
[ExecuTorch][WebGPU] Add per-pass dispatch ordering + scratch buffer tests #20080
[ExecuTorch][WebGPU] Switch native backend from wgpu-native to Dawn (Tint) + SwiftShader #20079

Add a faithful port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) so that a benchmark can read true on-GPU per-kernel time, isolated from submit and readback latency. This is the basis for comparing the WGSL SDPA kernels against the Vulkan reference. Profiling is opt-in via the `WEBGPU_TIMESTAMP_QUERY` environment variable and off by default, so the production `execute()` path is byte-identical.
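As a rough illustration of the opt-in gate, here is a minimal sketch of an environment-variable check. The function names and the accepted values (any non-empty value other than "0" counts as enabled) are assumptions for illustration; the PR description only states that `WEBGPU_TIMESTAMP_QUERY` turns profiling on and that it is off by default.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical policy: a flag is "on" when its value is non-empty and not "0".
// The values the PR actually accepts are not stated in the description.
bool flag_value_enabled(const char* value) {
  return value != nullptr && value[0] != '\0' && std::strcmp(value, "0") != 0;
}

// Hypothetical helper: read the variable once and apply the policy above.
// An unset variable yields false, which matches "off by default".
bool timestamp_query_requested() {
  return flag_value_enabled(std::getenv("WEBGPU_TIMESTAMP_QUERY"));
}
```

Reading the flag once at setup and caching the result would keep the default path free of per-dispatch branching on the environment, which is one way to keep the unprofiled path unchanged.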

`WebGPUQueryPool` mirrors the Vulkan `ShaderDuration` data model and the ticks->ns conversion exactly. Three deviations are forced by the WebGPU API rather than chosen: (1) per-dispatch bracketing uses a compute-pass `timestampWrites` descriptor (beginning and end of pass), because WebGPU has no mid-encoder `writeTimestamp`; (2) results are read via `resolveQuerySet` followed by a buffer map, because there is no host-side equivalent of `vkGetQueryPoolResults`; and (3) the `TimestampQuery` capability is requested as an explicit device feature, failing open if the adapter lacks it. `WebGPUGraph::execute()` brackets each compute pass when the pool is active; the chained `update_cache`/QK/softmax/AV dispatches each carry a `kernel_name` label for attribution.
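To make the data flow concrete, the following sketch turns resolved timestamp values into labeled per-pass durations. It is not the PR's code: the struct and function names, the `ns_per_tick` parameter, and the layout of two query slots per compute pass (begin at index `2*i`, end at `2*i + 1`) are assumptions that follow from the begin/end-of-pass bracketing described above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical record, loosely modeled on the ShaderDuration idea.
struct DurationSketch {
  std::string kernel_name;
  uint64_t start_ns;
  uint64_t end_ns;
  uint64_t duration_ns;
};

// ticks: values read back from the mapped resolve buffer, two per pass.
// names: one kernel_name label per bracketed pass.
// ns_per_tick: assumed conversion factor; how the PR obtains it is not stated.
std::vector<DurationSketch> to_durations(
    const std::vector<uint64_t>& ticks,
    const std::vector<std::string>& names,
    double ns_per_tick) {
  std::vector<DurationSketch> out;
  // Only process passes that have both a label and a begin/end pair.
  const std::size_t n = std::min(names.size(), ticks.size() / 2);
  out.reserve(n);
  for (std::size_t i = 0; i < n; ++i) {
    // Note: very large tick values lose precision when converted to double.
    const auto start = static_cast<uint64_t>(ticks[2 * i] * ns_per_tick);
    const auto end = static_cast<uint64_t>(ticks[2 * i + 1] * ns_per_tick);
    // Guard against an end stamp that precedes the start stamp.
    const uint64_t duration = end >= start ? end - start : 0;
    out.push_back({names[i], start, end, duration});
  }
  return out;
}
```

Keeping the conversion separate from the readback means the same routine can be exercised with synthetic tick values, without a GPU device.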

Co-authored-with Claude.
@exported-using-ghexport

Differential Revision: D107678235

[ghstack-poisoned]

pytorch-bot · 2026-06-09T21:16:27Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20167

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

⏳ No Failures, 3 Pending

As of commit 8a99e7a with merge base af92b60:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

github-actions · 2026-06-09T21:17:12Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

Update

25c045f

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 9, 2026 21:16

meta-cla Bot added the "CLA Signed" label Jun 9, 2026

Update

8a99e7a

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 10, 2026

JulianCloudNTH wants to merge 2 commits into gh/JulianCloudNTH/21/base from gh/JulianCloudNTH/21/head
