Skip to content

Use hints in CCL and X#432

Merged
mawad-amd merged 11 commits intomainfrom
muhaawad/hints
Mar 8, 2026
Merged

Use hints in CCL and X#432
mawad-amd merged 11 commits intomainfrom
muhaawad/hints

Conversation

@mawad-amd
Copy link
Collaborator

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

@mawad-amd mawad-amd requested review from BKP and neoblizz as code owners March 6, 2026 23:31
Copilot AI review requested due to automatic review settings March 6, 2026 23:31
@github-actions github-actions bot added in-progress We are working on it iris Iris project issue labels Mar 6, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds hint parameters to IRIS collective communication store operations in both ccl and x implementations to enable hint-aware behavior.

Changes:

  • Pass hint=(BLOCK_SIZE_M, BLOCK_SIZE_N) to iris.store(...) across CCL collectives.
  • Pass hint=(tile.block_m, tile.block_n) to iris.store(...) in X collectives.
  • Reformat one iris.store(...) call site while adding the new hint argument.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
iris/x/all_reduce.py Adds hint to iris.store call in all_reduce_two_shot.
iris/x/all_gather.py Adds hint to iris.store call in all_gather.
iris/ccl/all_to_all.py Adds hint to iris.store in both fast and masked paths.
iris/ccl/all_reduce.py Adds hint to multiple iris.store sites; reformats one call.
iris/ccl/all_gather.py Adds hint to iris.store for gather implementations.

You can also share your feedback on Copilot code review. Take the survey.

Copy link
Contributor

Copilot AI commented Mar 6, 2026

@mawad-amd I've opened a new pull request, #433, to work on those changes. Once the pull request is ready, I'll request review from you.

@mawad-amd
Copy link
Collaborator Author

@copilot Please investigate this bug: the Test x (all_gather) jobs are failing on PR 432. In tests/x/test_all_gather.py line 163, Iris x.all_gather output doesn't match PyTorch's all_gather for shape 128-64-64-32 (max difference 1.0). All 6 failing cases are that shape (dtype float16/float32/bfloat16); larger shapes pass. You have access to AMD GPUs, so you should be able to run the tests and debug locally. Focus on why this specific shape triggers the mismatch (e.g. hints, rank layout, or kernel path for small dims).

Copy link
Contributor

Copilot AI commented Mar 7, 2026

@mawad-amd I've opened a new pull request, #434, to work on those changes. Once the pull request is ready, I'll request review from you.

@mawad-amd
Copy link
Collaborator Author

@copilot I merged #434 Can you use gh to understand the comments there and suggest tests to make sure all APIs are tested against that behavior? Maybe the existing tests are enough but we need some new shapes in the parametrized tests

Copy link
Contributor

Copilot AI commented Mar 7, 2026

@mawad-amd I've opened a new pull request, #435, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits March 7, 2026 15:40
@mawad-amd
Copy link
Collaborator Author

@copilot Can you add additional tests for the other APIs we added hints to, in a similar fashion to what PR #435 did for x.all_gather?

PR #435 extended tests/x/test_all_gather.py with partial-width tile shapes that exercise the hint behavior:

  • (128, 128, 64, 32) — BLOCK_N < N/world_size (multi-block partial-width)
  • (256, 128, 64, 16) — minimum BLOCK_N=16 (16-bit vectorization path)

We added hints in this PR (432) to:

Please add analogous partial-width / edge-case test coverage for all_reduce and all_to_all (and any other CCL/X tests that exist for these ops), so we have the same kind of regression coverage for the hint usage there.

Copy link
Contributor

Copilot AI commented Mar 7, 2026

@mawad-amd I've opened a new pull request, #436, to work on those changes. Once the pull request is ready, I'll request review from you.

Copilot AI and others added 2 commits March 7, 2026 16:15
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
@mawad-amd
Copy link
Collaborator Author

@copilot Please address the following:

Other hints may need another look. CI is failing on CCL and X multi-rank tests.

Errors we saw:

  • Failing jobs: Test ccl (2/4/8 ranks) and Test x (2/4/8 ranks), git install.
  • All failures are in tests/ccl/test_all_gather.py for the 256-128-32-16 shape/block only (all dtypes), in both test_all_gather and test_all_gather_partitioned.
  • Rank 1: Iris output has a row of 0s where PyTorch has 1s — [1., 0., 0., ...] vs [1., 1., 1., ...] (max diff 1.0).
  • Rank 0: Iris output has rows of 0s where PyTorch has 2s — [2., 0., 0., ...] vs [2., 2., 2., ...] (max diff 2.0).

So for shape 256×128, block 32×16, Iris all_gather doesn't match PyTorch's all_gather_into_tensor; some elements stay zero. Likely a hint/block indexing or masking bug for this config — please check other hints for similar edge cases.

Copy link
Contributor

Copilot AI commented Mar 7, 2026

@mawad-amd I've opened a new pull request, #437, to work on those changes. Once the pull request is ready, I'll request review from you.

…cross all CCL/X collectives (#437)

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mawad-amd <112003944+mawad-amd@users.noreply.github.com>
@mawad-amd mawad-amd merged commit 2afde85 into main Mar 8, 2026
75 checks passed
@mawad-amd mawad-amd deleted the muhaawad/hints branch March 8, 2026 04:41
@mawad-amd mawad-amd restored the muhaawad/hints branch March 8, 2026 04:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

in-progress We are working on it iris Iris project issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants