[ROCm] Implement empty graph node support and safeguards for HIP command buffer#39417
Closed
phambinhfin wants to merge 2 commits intoopenxla:mainfrom
Closed
[ROCm] Implement empty graph node support and safeguards for HIP command buffer#39417phambinhfin wants to merge 2 commits intoopenxla:mainfrom
phambinhfin wants to merge 2 commits intoopenxla:mainfrom
Conversation
- Add hipGraphGetRootNodes to the ROCm driver wrapper - Add empty traced graph detection in RocmCommandBuffer::Trace() to prevent crashes when a captured stream produces no operations, matching the existing CUDA guard - Implement PrepareFinalization to insert an empty node into empty body graphs, preventing potential issues when graph conditionals are supported on HIP in the future - Add four new platform-agnostic test cases for CreateEmptyNode: EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and EmptyNodeAsDependencyBarrier
Contributor
|
FYI I'm working on removing that part from command buffers as a part of #39382, and |
ezhulenev
approved these changes
Mar 19, 2026
Contributor
Author
|
@beckerhe @dimitar-asenov Could you also take a look so this can be merged. |
dimitar-asenov
approved these changes
Mar 30, 2026
copybara-service Bot
pushed a commit
that referenced
this pull request
Mar 30, 2026
…or HIP command buffer Imported from GitHub PR #39417 ## 📋 Summary of Changes - Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs. - Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors. - Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation. - Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty. ## 🎯 Justification When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs. ## 🚀 Kind of Contribution 🐛 Bug Fix, ✨ New Feature, ✏️ Tests ## ✏️ Unit Tests Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`: - `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch - `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`) - `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation) - `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization ## ✏️ Execution Tests Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X: ``` [==========] Running 4 tests from 1 test suite. [ RUN ] GpuCommandBufferTest.EmptyNodeOnly [ OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeChain [ OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeWithKernel [ OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier [ OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms) [ PASSED ] 4 tests. Full suite: 8 passed, 0 failed, 35 skipped ``` Copybara import of the project: -- 97993f2 by scxfjiang <sc.xfjiang@gmail.com>: fix collective crash -- 64aada3 by Pham Binh <phambinh@amd.com>: Add empty graph safeguards and tests for HIP command buffer - Add hipGraphGetRootNodes to the ROCm driver wrapper - Add empty traced graph detection in RocmCommandBuffer::Trace() to prevent crashes when a captured stream produces no operations, matching the existing CUDA guard - Implement PrepareFinalization to insert an empty node into empty body graphs, preventing potential issues when graph conditionals are supported on HIP in the future - Add four new platform-agnostic test cases for CreateEmptyNode: EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and EmptyNodeAsDependencyBarrier Merging this change closes #39417 FUTURE_COPYBARA_INTEGRATE_REVIEW=#39417 from ROCm:ci_phambinh/rocm_implement_empty_node 64aada3 PiperOrigin-RevId: 891582170
copybara-service Bot
pushed a commit
to tensorflow/tensorflow
that referenced
this pull request
Mar 30, 2026
…or HIP command buffer Imported from GitHub PR openxla/xla#39417 ## 📋 Summary of Changes - Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs. - Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors. - Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation. - Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty. ## 🎯 Justification When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs. ## 🚀 Kind of Contribution 🐛 Bug Fix, ✨ New Feature, ✏️ Tests ## ✏️ Unit Tests Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`: - `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch - `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`) - `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation) - `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization ## ✏️ Execution Tests Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X: ``` [==========] Running 4 tests from 1 test suite. [ RUN ] GpuCommandBufferTest.EmptyNodeOnly [ OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeChain [ OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeWithKernel [ OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier [ OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms) [ PASSED ] 4 tests. Full suite: 8 passed, 0 failed, 35 skipped ``` Copybara import of the project: -- 97993f2bc987609f4a859ce1a516cabd54beb50d by scxfjiang <sc.xfjiang@gmail.com>: fix collective crash -- 64aada3096869ec527b148b967029202c0998c9b by Pham Binh <phambinh@amd.com>: Add empty graph safeguards and tests for HIP command buffer - Add hipGraphGetRootNodes to the ROCm driver wrapper - Add empty traced graph detection in RocmCommandBuffer::Trace() to prevent crashes when a captured stream produces no operations, matching the existing CUDA guard - Implement PrepareFinalization to insert an empty node into empty body graphs, preventing potential issues when graph conditionals are supported on HIP in the future - Add four new platform-agnostic test cases for CreateEmptyNode: EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and EmptyNodeAsDependencyBarrier Merging this change closes #39417 FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#39417 from ROCm:ci_phambinh/rocm_implement_empty_node 64aada3096869ec527b148b967029202c0998c9b PiperOrigin-RevId: 891582170
copybara-service Bot
pushed a commit
to tensorflow/tensorflow
that referenced
this pull request
Mar 30, 2026
…or HIP command buffer Imported from GitHub PR openxla/xla#39417 ## 📋 Summary of Changes - Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs. - Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors. - Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation. - Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty. ## 🎯 Justification When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs. ## 🚀 Kind of Contribution 🐛 Bug Fix, ✨ New Feature, ✏️ Tests ## ✏️ Unit Tests Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`: - `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch - `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`) - `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation) - `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization ## ✏️ Execution Tests Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X: ``` [==========] Running 4 tests from 1 test suite. [ RUN ] GpuCommandBufferTest.EmptyNodeOnly [ OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeChain [ OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeWithKernel [ OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms) [ RUN ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier [ OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms) [ PASSED ] 4 tests. Full suite: 8 passed, 0 failed, 35 skipped ``` Copybara import of the project: -- 97993f2bc987609f4a859ce1a516cabd54beb50d by scxfjiang <sc.xfjiang@gmail.com>: fix collective crash -- 64aada3096869ec527b148b967029202c0998c9b by Pham Binh <phambinh@amd.com>: Add empty graph safeguards and tests for HIP command buffer - Add hipGraphGetRootNodes to the ROCm driver wrapper - Add empty traced graph detection in RocmCommandBuffer::Trace() to prevent crashes when a captured stream produces no operations, matching the existing CUDA guard - Implement PrepareFinalization to insert an empty node into empty body graphs, preventing potential issues when graph conditionals are supported on HIP in the future - Add four new platform-agnostic test cases for CreateEmptyNode: EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and EmptyNodeAsDependencyBarrier Merging this change closes #39417 PiperOrigin-RevId: 891593519
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
📋 Summary of Changes
CreateEmptyNodefor the ROCm command buffer backend usinghipGraphAddEmptyNode, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs.SetPriorityto silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives usingStreamPriority::Highestto be captured into HIP graphs without errors.RocmCommandBuffer::Trace()to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation.PrepareFinalizationto insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty.🎯 Justification
When command buffers are enabled for collectives on ROCm, the thunk emitter creates
EmptyCmdnodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down toCreateEmptyNode, which was returningUnimplementedErroron ROCm, causing crashes. Additionally,SetPrioritywas returningUnimplementedError, which also crashed when collectives withStreamPriority::Highestwere captured into HIP graphs.🚀 Kind of Contribution
🐛 Bug Fix, ✨ New Feature, ✏️ Tests
✏️ Unit Tests
Four platform-agnostic test cases added to
gpu_command_buffer_test.cc:EmptyNodeWithKernel-- empty node as root with a dependent kernel launchEmptyNodeOnly-- command buffer with a single empty node (exercisesPrepareFinalization)EmptyNodeChain-- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation)EmptyNodeAsDependencyBarrier--kernel -> empty -> kernelordering, validating the core use case for collective synchronization✏️ Execution Tests
Built and ran
//xla/stream_executor/gpu:gpu_command_buffer_testwith--config=rocmon MI300X: