Conversation
2c3fb2b to
b18372b
Compare
PR #1453 Review: "Address RDMA GPU locks"Summary of ChangesThis PR addresses GPU lock issues in RDMA by switching the macOS MLX dependency from the upstream release (ml-explore/mlx v0.30.6) to a custom fork (rltakashige/mlx-jaccl-fix-small-recv, branch address-rdma-gpu-locks) at commit d94f81a2. It also makes several code changes to improve process shutdown and move the distributed barrier. Note: The PR template (Motivation, Changes, Why It Works, Test Plan) is entirely empty. Files Changed (8)
Detailed Analysis1. MLX Fork Switch (pyproject.toml, uv.lock, nix/mlx.nix, flake.nix)
Concerns:
2. Barrier Move (generate.py)
Concern: 3. Process Shutdown (bootstrap.py, channels.py)
Concerns:
4. README.md
Overall AssessmentPositives:
Concerns:
|
Code Review — PR #1453: Address rdma gpu locksCI status: All checks passing (typecheck, aarch64-darwin, x86_64-linux, aarch64-linux) OverviewThis PR switches the macOS MLX dependency from the upstream PyPI release (v0.30.6) to a custom fork ( Critical Issues1. Merge conflicts — branch is stale and incompatible with The PR branch is based on an older version of
The PR branch has none of these. This must be rebased and the shutdown changes re-applied to the current 2. Empty PR description The Motivation, Changes, Why It Works, and Test Plan sections are all blank. For a PR that switches a core dependency to a personal fork, this needs significantly more documentation:
Significant Issues3. Personal fork as a production dependency # pyproject.toml
mlx = { git = "https://github.com/rltakashige/mlx-jaccl-fix-small-recv.git", branch = "address-rdma-gpu-locks", marker = "sys_platform == 'darwin'" }Pinning to a personal fork branch (not even a tag/commit) is a maintenance risk. If the fork is rebased, deleted, or the branch is renamed, builds will break. The
4. The diff removes the entire 5. Version pin removed for darwin MLX # Before
"mlx==0.30.6; sys_platform == 'darwin'",
# After
"mlx; sys_platform == 'darwin'",The version constraint is removed entirely for the main dependency line. While the git source in 6. Silent exception swallowing in shutdown # bootstrap.py finally block
try:
event_sender.close()
task_receiver.close()
except Exception:
pass # <-- silently swallows all errorsThis should at minimum log the exception at debug/warning level. Silent 7. event_sender.cancel_join() # calls buffer.cancel_join_thread()
task_receiver.cancel_join()Python's Minor Issues8. version = let v = "0.30.7.dev20260216+d94f81a2"; inThe 9. mlxPackage = builtins.head (builtins.filter (p: p.name == "mlx" && p.source ? git) uvLock.package);This will fail with an unhelpful error ( 10. except KeyboardInterrupt:
logger.info("Runner received interrupt, shutting down")
What's Good
VerdictNot ready to merge. The PR has merge conflicts with After rebasing and addressing the documentation gaps, the code changes themselves are reasonable and well-motivated. Review only — not a merge approval. |
|
Closing as this change is now done. |
Motivation
Changes
Why It Works
Test Plan
Manual Testing
Automated Testing