Skip certain configs in LowerCollectiveCudaAndNcclTest #5743

wujingyue · 2026-01-02T02:07:04Z

... to speed up CI and local runs

The way forward could be to reduce warmup_iters and timing_iters and move this to benchmarks/cpp so it doesn't run by default.
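
To illustrate why reducing warmup_iters and timing_iters shortens the run, here is a minimal sketch of a warmup-then-time loop. It is not nvFuser's runBenchmark: the helper name `timeOperationMs` and its signature are hypothetical, and it measures host wall-clock time only, whereas a real GPU benchmark would also need stream synchronization.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not part of nvFuser): runs `op` warmup_iters times
// untimed, then timing_iters times timed, and returns the mean wall-clock
// milliseconds per timed iteration.
double timeOperationMs(
    const std::function<void()>& op,
    int warmup_iters,
    int timing_iters) {
  assert(warmup_iters >= 0 && timing_iters > 0);

  // Warmup: results are discarded, but the iterations still cost time.
  for (int i = 0; i < warmup_iters; ++i) {
    op();
  }

  const auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < timing_iters; ++i) {
    op();
  }
  const auto stop = std::chrono::steady_clock::now();

  const std::chrono::duration<double, std::milli> elapsed = stop - start;
  return elapsed.count() / timing_iters;
}
```

The operation runs warmup_iters + timing_iters times in total, so the cost of each test configuration grows with both counts.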

wujingyue · 2026-01-02T02:07:59Z

tests/cpp/test_multidevice_lower_communication_cuda.cpp


 TEST_P(LowerCollectiveCudaAndNcclTest, Allgather) {
-  const auto& [msg_size_bytes, protocol_enum] = GetParam();
-  const int64_t kMsgSize = msg_size_bytes / sizeof(float);


Not a real constant: https://google.github.io/styleguide/cppguide.html#Constant_Names
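
A small sketch of the distinction, as I read the linked style guide section: the k prefix is meant for values fixed for the duration of the program, while a const local derived from a runtime parameter follows ordinary variable naming. The names `kSkipThresholdBytes` and `numFloatElements` are made up for illustration and do not appear in the PR.

```cpp
#include <cassert>
#include <cstdint>

// Fixed for the whole program, so the k prefix applies.
constexpr int64_t kSkipThresholdBytes = 32LL * 1024 * 1024;

// Hypothetical function: `message_size` is const, but its value depends on
// a runtime argument, so it is named like an ordinary local variable.
int64_t numFloatElements(int64_t message_size_bytes) {
  assert(message_size_bytes >= 0);
  const int64_t message_size =
      message_size_bytes / static_cast<int64_t>(sizeof(float));
  return message_size;
}
```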

wujingyue · 2026-01-02T02:08:14Z

tests/cpp/test_multidevice_lower_communication_cuda.cpp

  at::Tensor runBenchmark(
      MultiDeviceExecutor& executor,
      const std::vector<c10::IValue>& inputs,
-      int64_t msg_size_bytes,


https://google.github.io/styleguide/cppguide.html#General_Naming_Rules

wujingyue · 2026-01-02T02:08:51Z

tests/cpp/test_multidevice_lower_communication_cuda.cpp

  }

+  if (message_size_bytes > 32LL * 1024 * 1024) {
+    GTEST_SKIP() << "Takes >30 seconds to run in CI: http://nv/e.)";


wujingyue · 2026-01-02T02:08:59Z

tests/cpp/test_multidevice_lower_communication_cuda.cpp

  }

+  if (message_size_bytes > 32LL * 1024 * 1024) {
+    GTEST_SKIP() << "Takes >5 seconds to run in CI: http://nv/e.)";


... and here

greptile-apps · 2026-01-02T02:09:28Z

Greptile Summary

This PR reduces CI runtime by skipping large message size tests (128MB and 256MB) in LowerCollectiveCudaAndNcclTest. The skip logic is added to both Allgather and Broadcast tests, filtering out configurations where message_size_bytes > 32MB.

Skips 2 out of 5 message sizes (128MB, 256MB) for both test cases
Each skipped configuration would test 4 protocols (kMemcpy, kNccl, kMultimem, kBatchedMemcpy)
Total: 16 test configurations skipped (2 tests × 2 sizes × 4 protocols)
Refactors variable names from msg_size_bytes/kMsgSize to message_size_bytes/message_size for consistency
Contains malformed URLs in skip messages that should be corrected
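
The count above can be checked with a short sketch. It assumes the five message sizes and four protocols listed in this summary, and reads "MB" as 1024 * 1024 bytes, matching the `32LL * 1024 * 1024` threshold in the diff. All names below are illustrative and not taken from the test file.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int64_t kMiB = 1024LL * 1024;
constexpr int64_t kMaxMessageSizeBytes = 32 * kMiB;

// Message sizes as listed in the summary: 2MB, 8MB, 32MB, 128MB, 256MB.
constexpr std::array<int64_t, 5> kMessageSizesBytes = {
    2 * kMiB, 8 * kMiB, 32 * kMiB, 128 * kMiB, 256 * kMiB};
constexpr int kNumProtocols = 4; // kMemcpy, kNccl, kMultimem, kBatchedMemcpy
constexpr int kNumTests = 2; // Allgather, Broadcast

// Mirrors the skip condition in the diff: strictly greater than 32MB.
bool isSkipped(int64_t message_size_bytes) {
  return message_size_bytes > kMaxMessageSizeBytes;
}

// Counts skipped (test, size, protocol) combinations.
int countSkippedConfigs() {
  int skipped_sizes = 0;
  for (const int64_t size : kMessageSizesBytes) {
    if (isSkipped(size)) {
      ++skipped_sizes;
    }
  }
  return kNumTests * skipped_sizes * kNumProtocols;
}
```

Because the comparison is strict, the 32MB configurations still run; only 128MB and 256MB are skipped.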

Confidence Score: 4/5

Safe to merge after fixing the malformed URLs in skip messages
The change achieves its goal of reducing CI time by skipping slow-running large message size tests. The logic is sound and properly preserves test coverage for smaller sizes (2MB, 8MB, 32MB). Variable renaming improves code consistency. Minor syntax issue with incomplete URLs in skip messages needs correction before merge.
Fix malformed URLs in tests/cpp/test_multidevice_lower_communication_cuda.cpp at lines 197 and 265

Important Files Changed

Filename	Overview
tests/cpp/test_multidevice_lower_communication_cuda.cpp	Skips large message size tests (>32MB) for both `Allgather` and `Broadcast` to improve CI time, renames `msg_size_bytes` to `message_size_bytes` for consistency. Contains malformed URLs in skip messages.

Sequence Diagram

sequenceDiagram
    participant Test as Test Harness
    participant Allgather as Allgather Test
    participant Broadcast as Broadcast Test
    participant Skip as GTEST_SKIP
    
    Test->>Allgather: Execute with params (message_size, protocol)
    Allgather->>Allgather: Check message_size_bytes > 32MB?
    alt message_size > 32MB
        Allgather->>Skip: Skip test (128MB, 256MB configs)
        Note over Skip: "Takes >30 seconds in CI"
    else message_size <= 32MB
        Allgather->>Allgather: Run test (2MB, 8MB, 32MB)
    end
    
    Test->>Broadcast: Execute with params (message_size, protocol)
    Broadcast->>Broadcast: Check message_size_bytes > 32MB?
    alt message_size > 32MB
        Broadcast->>Skip: Skip test (128MB, 256MB configs)
        Note over Skip: "Takes >5 seconds in CI"
    else message_size <= 32MB
        Broadcast->>Broadcast: Run test (2MB, 8MB, 32MB)
    end

greptile-apps

1 file reviewed, 2 comments

greptile-apps · 2026-01-02T02:09:27Z

tests/cpp/test_multidevice_lower_communication_cuda.cpp

  }

+  if (message_size_bytes > 32LL * 1024 * 1024) {
+    GTEST_SKIP() << "Takes >30 seconds to run in CI: http://nv/e.)";


syntax: malformed URL in skip message - http://nv/e.) appears incomplete

Suggested change

-    GTEST_SKIP() << "Takes >30 seconds to run in CI: http://nv/e.)";
+    GTEST_SKIP() << "Takes >30 seconds to run in CI";

greptile-apps · 2026-01-02T02:09:27Z

tests/cpp/test_multidevice_lower_communication_cuda.cpp

  }

+  if (message_size_bytes > 32LL * 1024 * 1024) {
+    GTEST_SKIP() << "Takes >5 seconds to run in CI: http://nv/e.)";


syntax: malformed URL in skip message - http://nv/e.) appears incomplete

Suggested change

-    GTEST_SKIP() << "Takes >5 seconds to run in CI: http://nv/e.)";
+    GTEST_SKIP() << "Takes >5 seconds to run in CI";

github-actions · 2026-01-02T02:09:56Z

Description

Rename variables for clarity: msg_size_bytes → message_size_bytes, kMsgSize → message_size
Add skip conditions for large message sizes (>32MB) in Allgather and Broadcast tests
Skip Allgather tests taking >30 seconds and Broadcast tests taking >5 seconds in CI
Minor formatting improvements to break long lines for better readability

PR Reviewer Guide

Here are some key observations to aid the review process:

🧪 PR contains tests

⚡ Recommended focus areas for review

Incomplete skip reason URL

The skip messages reference "http://nv/e.)" which appears to be an incomplete or placeholder URL. This should be replaced with a proper reference or documentation link explaining why these specific message size thresholds were chosen.

GTEST_SKIP() << "Takes >30 seconds to run in CI: http://nv/e.)";

Same skip threshold, different time estimates

Both Allgather and Broadcast tests use the same 32MB threshold, but their skip messages cite different runtimes (>30s vs >5s). It is worth verifying that a single threshold is appropriate for both operation types.

  if (message_size_bytes > 32LL * 1024 * 1024) {
    GTEST_SKIP() << "Takes >30 seconds to run in CI: http://nv/e.)";
  }

  // cudaMemcpyBatchAsync requires a non-default stream
  c10::cuda::CUDAStream stream =
      c10::cuda::getStreamFromPool(/*isHighPriority=*/false);
  c10::cuda::setCurrentCUDAStream(stream);

  EnableOptionsGuard guard;
  setupProtocolOptions(protocol_enum, guard);

  auto fusion = std::make_unique<Fusion>();
  FusionGuard fg(fusion.get());

  const auto num_devices = communicator_->size();
  TensorView* in = makeContigTensor(2);
  TensorView* out = set(in);
  fusion->addInput(in);
  fusion->addOutput(out);

  if (backend_type == CommunicatorBackend::kCuda) {
    out->setMemoryType(MemoryType::Symmetric);
  }

  auto mesh = DeviceMesh::createForNumDevices(num_devices);
  in->setDeviceMesh(mesh);
  out->setDeviceMesh(mesh);
  in->axis(0)->parallelize(ParallelType::DIDx);

  at::Tensor unsharded_tensor =
      at::randn({num_devices, message_size}, tensor_options_);
  at::Tensor in_tensor = shardTensor(unsharded_tensor, in);

  MultiDeviceExecutorParams params;
  params.lower.communicator_backend = backend_type;
  params.executor.use_allocation_cache = true;
  MultiDeviceExecutor executor(
      std::move(fusion), Communicator::getInstance(), params);

  // Run benchmark and validate correctness
  at::Tensor out_tensor = runBenchmark(
      executor,
      {in_tensor},
      message_size_bytes,
      backend_type,
      "Allgather/" + protocol_str,
      static_cast<float>(communicator_->size()));

  EXPECT_TRUE(at::allclose(out_tensor, unsharded_tensor));
}

TEST_P(LowerCollectiveCudaAndNcclTest, Broadcast) {
  const auto& [message_size_bytes, protocol_enum] = GetParam();
  const CommunicatorBackend backend_type = getBackend(protocol_enum);
  const std::string protocol_str = getProtocolString(protocol_enum);
  const int64_t message_size = message_size_bytes / sizeof(float);

  if (!communicator_->is_available() || communicator_->size() < 2) {
    GTEST_SKIP() << "This test needs at least 2 ranks.";
  }

  if (!isMulticastSupported() &&
      (protocol_enum == CommunicationProtocol::kMemcpy ||
       protocol_enum == CommunicationProtocol::kMultimem)) {
    GTEST_SKIP() << "Device does not support Multicast; skipping.";
  }

  if (message_size_bytes > 32LL * 1024 * 1024) {
    GTEST_SKIP() << "Takes >5 seconds to run in CI: http://nv/e.)";
  }

Skip certain configs in LowerCollectiveCudaAndNcclTest to speed up CI

7eeae41

wujingyue commented Jan 2, 2026

wujingyue requested a review from mdavis36 January 2, 2026 02:09

greptile-apps bot reviewed Jan 2, 2026

wujingyue changed the title ~~Skip certain configs in LowerCollectiveCudaAndNcclTest to speed up CI~~ Skip certain configs in LowerCollectiveCudaAndNcclTest Jan 2, 2026
