Skip to content

[ROCm] Implement empty graph node support and safeguards for HIP command buffer#39417

Closed
phambinhfin wants to merge 2 commits intoopenxla:mainfrom
ROCm:ci_phambinh/rocm_implement_empty_node
Closed

[ROCm] Implement empty graph node support and safeguards for HIP command buffer#39417
phambinhfin wants to merge 2 commits intoopenxla:mainfrom
ROCm:ci_phambinh/rocm_implement_empty_node

Conversation

@phambinhfin
Copy link
Copy Markdown
Contributor

📋 Summary of Changes

  • Implement CreateEmptyNode for the ROCm command buffer backend using hipGraphAddEmptyNode, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs.
  • Fix SetPriority to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using StreamPriority::Highest to be captured into HIP graphs without errors.
  • Add empty traced graph detection in RocmCommandBuffer::Trace() to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation.
  • Implement PrepareFinalization to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty.

🎯 Justification

When command buffers are enabled for collectives on ROCm, the thunk emitter creates EmptyCmd nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to CreateEmptyNode, which was returning UnimplementedError on ROCm, causing crashes. Additionally, SetPriority was returning UnimplementedError, which also crashed when collectives with StreamPriority::Highest were captured into HIP graphs.

🚀 Kind of Contribution

🐛 Bug Fix, ✨ New Feature, ✏️ Tests

✏️ Unit Tests

Four platform-agnostic test cases added to gpu_command_buffer_test.cc:

  • EmptyNodeWithKernel -- empty node as root with a dependent kernel launch
  • EmptyNodeOnly -- command buffer with a single empty node (exercises PrepareFinalization)
  • EmptyNodeChain -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation)
  • EmptyNodeAsDependencyBarrier -- kernel -> empty -> kernel ordering, validating the core use case for collective synchronization

✏️ Execution Tests

Built and ran //xla/stream_executor/gpu:gpu_command_buffer_test with --config=rocm on MI300X:

[==========] Running 4 tests from 1 test suite.
[ RUN      ] GpuCommandBufferTest.EmptyNodeOnly
[       OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeChain
[       OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeWithKernel
[       OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier
[       OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms)
[  PASSED  ] 4 tests.

Full suite: 8 passed, 0 failed, 35 skipped

ScXfjiang and others added 2 commits March 18, 2026 13:03
- Add hipGraphGetRootNodes to the ROCm driver wrapper
- Add empty traced graph detection in RocmCommandBuffer::Trace() to
  prevent crashes when a captured stream produces no operations,
  matching the existing CUDA guard
- Implement PrepareFinalization to insert an empty node into empty body
  graphs, preventing potential issues when graph conditionals are
  supported on HIP in the future
- Add four new platform-agnostic test cases for CreateEmptyNode:
  EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and
  EmptyNodeAsDependencyBarrier
@ezhulenev
Copy link
Copy Markdown
Contributor

FYI I'm working on removing that part from command buffers as a part of #39382, and WaitForStreams will be deleted after that. Though of course let's merge this one, as empty commands might be useful later for some other use cases.

@phambinhfin
Copy link
Copy Markdown
Contributor Author

@beckerhe @dimitar-asenov Could you also take a look so this can be merged.

copybara-service Bot pushed a commit that referenced this pull request Mar 30, 2026
…or HIP command buffer

Imported from GitHub PR #39417

## 📋 Summary of Changes

- Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs.
- Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors.
- Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation.
- Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty.

## 🎯 Justification

When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs.

## 🚀 Kind of Contribution

🐛 Bug Fix, ✨ New Feature, ✏️ Tests

## ✏️ Unit Tests

Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`:
- `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch
- `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`)
- `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation)
- `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization

## ✏️ Execution Tests

Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X:

```
[==========] Running 4 tests from 1 test suite.
[ RUN      ] GpuCommandBufferTest.EmptyNodeOnly
[       OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeChain
[       OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeWithKernel
[       OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier
[       OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms)
[  PASSED  ] 4 tests.

Full suite: 8 passed, 0 failed, 35 skipped
```

Copybara import of the project:

--
97993f2 by scxfjiang <sc.xfjiang@gmail.com>:

fix collective crash

--
64aada3 by Pham Binh <phambinh@amd.com>:

Add empty graph safeguards and tests for HIP command buffer

- Add hipGraphGetRootNodes to the ROCm driver wrapper
- Add empty traced graph detection in RocmCommandBuffer::Trace() to
  prevent crashes when a captured stream produces no operations,
  matching the existing CUDA guard
- Implement PrepareFinalization to insert an empty node into empty body
  graphs, preventing potential issues when graph conditionals are
  supported on HIP in the future
- Add four new platform-agnostic test cases for CreateEmptyNode:
  EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and
  EmptyNodeAsDependencyBarrier

Merging this change closes #39417

FUTURE_COPYBARA_INTEGRATE_REVIEW=#39417 from ROCm:ci_phambinh/rocm_implement_empty_node 64aada3
PiperOrigin-RevId: 891582170
copybara-service Bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 30, 2026
…or HIP command buffer

Imported from GitHub PR openxla/xla#39417

## 📋 Summary of Changes

- Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs.
- Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors.
- Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation.
- Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty.

## 🎯 Justification

When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs.

## 🚀 Kind of Contribution

🐛 Bug Fix, ✨ New Feature, ✏️ Tests

## ✏️ Unit Tests

Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`:
- `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch
- `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`)
- `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation)
- `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization

## ✏️ Execution Tests

Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X:

```
[==========] Running 4 tests from 1 test suite.
[ RUN      ] GpuCommandBufferTest.EmptyNodeOnly
[       OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeChain
[       OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeWithKernel
[       OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier
[       OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms)
[  PASSED  ] 4 tests.

Full suite: 8 passed, 0 failed, 35 skipped
```

Copybara import of the project:

--
97993f2bc987609f4a859ce1a516cabd54beb50d by scxfjiang <sc.xfjiang@gmail.com>:

fix collective crash

--
64aada3096869ec527b148b967029202c0998c9b by Pham Binh <phambinh@amd.com>:

Add empty graph safeguards and tests for HIP command buffer

- Add hipGraphGetRootNodes to the ROCm driver wrapper
- Add empty traced graph detection in RocmCommandBuffer::Trace() to
  prevent crashes when a captured stream produces no operations,
  matching the existing CUDA guard
- Implement PrepareFinalization to insert an empty node into empty body
  graphs, preventing potential issues when graph conditionals are
  supported on HIP in the future
- Add four new platform-agnostic test cases for CreateEmptyNode:
  EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and
  EmptyNodeAsDependencyBarrier

Merging this change closes #39417

FUTURE_COPYBARA_INTEGRATE_REVIEW=openxla/xla#39417 from ROCm:ci_phambinh/rocm_implement_empty_node 64aada3096869ec527b148b967029202c0998c9b
PiperOrigin-RevId: 891582170
@github-project-automation github-project-automation Bot moved this from Pending Google Review to Done in OpenXLA:GPU - AMD/ROCm Mar 30, 2026
copybara-service Bot pushed a commit to tensorflow/tensorflow that referenced this pull request Mar 30, 2026
…or HIP command buffer

Imported from GitHub PR openxla/xla#39417

## 📋 Summary of Changes

- Implement `CreateEmptyNode` for the ROCm command buffer backend using `hipGraphAddEmptyNode`, enabling empty graph nodes to serve as dependency synchronization points in HIP graphs.
- Fix `SetPriority` to silently succeed on HIP, since the HIP runtime does not expose a per-node priority API. This allows collectives using `StreamPriority::Highest` to be captured into HIP graphs without errors.
- Add empty traced graph detection in `RocmCommandBuffer::Trace()` to return a clear error when stream capture produces a graph with no operations, preventing opaque runtime crashes during graph instantiation.
- Implement `PrepareFinalization` to insert an empty node into empty body graphs before finalization, ensuring graph instantiation remains safe when conditional body graphs are empty.

## 🎯 Justification

When command buffers are enabled for collectives on ROCm, the thunk emitter creates `EmptyCmd` nodes for collective-done thunks (AllReduceDone, AllGatherDone, etc.) and WaitForStreams thunks that have control predecessors. These flow down to `CreateEmptyNode`, which was returning `UnimplementedError` on ROCm, causing crashes. Additionally, `SetPriority` was returning `UnimplementedError`, which also crashed when collectives with `StreamPriority::Highest` were captured into HIP graphs.

## 🚀 Kind of Contribution

🐛 Bug Fix, ✨ New Feature, ✏️ Tests

## ✏️ Unit Tests

Four platform-agnostic test cases added to `gpu_command_buffer_test.cc`:
- `EmptyNodeWithKernel` -- empty node as root with a dependent kernel launch
- `EmptyNodeOnly` -- command buffer with a single empty node (exercises `PrepareFinalization`)
- `EmptyNodeChain` -- chain of 3 empty nodes followed by a kernel (multi-hop dependency propagation)
- `EmptyNodeAsDependencyBarrier` -- `kernel -> empty -> kernel` ordering, validating the core use case for collective synchronization

## ✏️ Execution Tests

Built and ran `//xla/stream_executor/gpu:gpu_command_buffer_test` with `--config=rocm` on MI300X:

```
[==========] Running 4 tests from 1 test suite.
[ RUN      ] GpuCommandBufferTest.EmptyNodeOnly
[       OK ] GpuCommandBufferTest.EmptyNodeOnly (1314 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeChain
[       OK ] GpuCommandBufferTest.EmptyNodeChain (15 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeWithKernel
[       OK ] GpuCommandBufferTest.EmptyNodeWithKernel (10 ms)
[ RUN      ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier
[       OK ] GpuCommandBufferTest.EmptyNodeAsDependencyBarrier (9 ms)
[  PASSED  ] 4 tests.

Full suite: 8 passed, 0 failed, 35 skipped
```

Copybara import of the project:

--
97993f2bc987609f4a859ce1a516cabd54beb50d by scxfjiang <sc.xfjiang@gmail.com>:

fix collective crash

--
64aada3096869ec527b148b967029202c0998c9b by Pham Binh <phambinh@amd.com>:

Add empty graph safeguards and tests for HIP command buffer

- Add hipGraphGetRootNodes to the ROCm driver wrapper
- Add empty traced graph detection in RocmCommandBuffer::Trace() to
  prevent crashes when a captured stream produces no operations,
  matching the existing CUDA guard
- Implement PrepareFinalization to insert an empty node into empty body
  graphs, preventing potential issues when graph conditionals are
  supported on HIP in the future
- Add four new platform-agnostic test cases for CreateEmptyNode:
  EmptyNodeWithKernel, EmptyNodeOnly, EmptyNodeChain, and
  EmptyNodeAsDependencyBarrier

Merging this change closes #39417

PiperOrigin-RevId: 891593519
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AMD-GPU XLA on AMD GPU

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

5 participants