Conversation

@wujingyue (Collaborator)

No description provided.

@wujingyue (Collaborator, Author)

!test

@github-actions bot commented Jan 2, 2026

Review updated until commit b9b2c19

Description

  • Simplify the getCUDAStream implementation with try_emplace and remove the explicit device index

  • Add new AssignStreams pass for stream parallelism in host IR

  • Remove HostIrLowering option from C++ StreamTest and move end-to-end tests to Python

  • Clean up includes and remove unused dependencies across multiple files

Changes walkthrough

Relevant files:

Enhancement (7 files)

  • assign_streams.cpp: Implement AssignStreams pass for stream parallelism (+63/-0)
  • evaluator.cpp: Simplify getCUDAStream with try_emplace (+7/-14)
  • passes.cpp: Add AssignStreams to optimization passes (+2/-0)
  • allocate_and_deallocate.h: Remove unnecessary include (+0/-1)
  • assign_streams.h: Define AssignStreams optimization pass (+26/-0)
  • ir.h: Remove scheduler heuristic include (+0/-1)
  • communicator.h: Remove exceptions include (+0/-1)

Tests (3 files)

  • test_multidevice_lower_communication_cuda.cpp: Remove large memory size test cases (+1/-3)
  • test_stream.cpp: Remove HostIrLowering option and add documentation (+4/-6)
  • test_stream.py: Remove nvfuser_direct_test parameter from tests (+3/-3)

Documentation (1 file)

  • internal_nodes.h: Add documentation comment for Scope::insert (+1/-1)

Configuration changes (1 file)

  • CMakeLists.txt: Add new source files to build (+2/-1)

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review
Stream Creation Simplification

The getCUDAStream method has been simplified with try_emplace, and streams are no longer created with an explicit device index; they now come from the pool for the current device. While this simplifies the code, verify that it does not break multi-GPU scenarios where streams must be created on specific devices.

```cpp
c10::cuda::CUDAStream HostIrEvaluator::getCUDAStream(Stream* stream) {
  StreamKey stream_key = stream;
  // if stream points to an index, it represents the dynamic value of that index
  if (Val* index = stream->index(); index != nullptr) {
    auto value = expr_evaluator_.evaluate(index);
    NVF_ERROR(value.hasValue() && value.is<int64_t>());
    stream_key = value.as<int64_t>();
  }

  auto [it, inserted] =
      streams_.try_emplace(stream_key, c10::cuda::getStreamFromPool());
  return it->second;
}
```
Reduced Test Coverage

Test parameters were reduced from five values (2 MB, 8 MB, 32 MB, 128 MB, 256 MB) to three (2 MB, 8 MB, 32 MB), removing the 128 MB and 256 MB cases. This reduces coverage for larger message sizes; the removal should be justified, or the cases restored if they provide important coverage.

```cpp
INSTANTIATE_TEST_SUITE_P(
    ,
    LowerCollectiveCudaAndNcclTest,
    testing::Combine(
        testing::Values(
            2 * 1024 * 1024LL, // 2 MB
            8 * 1024 * 1024LL, // 8 MB
            32 * 1024 * 1024LL // 32 MB
            ),
        testing::Values(
            CommunicationProtocol::kMemcpy,
            CommunicationProtocol::kNccl,
```