
Address rdma gpu locks #1453

Closed
rltakashige wants to merge 6 commits into main from leo/address-rdma-gpu-locks

@rltakashige
Collaborator

Motivation

Changes

Why It Works

Test Plan

Manual Testing

Automated Testing

@rltakashige force-pushed the leo/address-rdma-gpu-locks branch from 2c3fb2b to b18372b on February 16, 2026 19:15
@AlexCheema
Contributor

PR #1453 Review: "Address RDMA GPU locks"

Summary of Changes

This PR addresses GPU lock issues in RDMA by switching the macOS MLX dependency from the upstream release (ml-explore/mlx v0.30.6) to a custom fork (rltakashige/mlx-jaccl-fix-small-recv, branch address-rdma-gpu-locks) at commit d94f81a2. It also makes several code changes to improve process shutdown and move the distributed barrier.

Note: The PR template (Motivation, Changes, Why It Works, Test Plan) is entirely empty.


Files Changed (8)

  1. README.md — Add Nix run instructions, add Xcode prerequisite, whitespace fixes
  2. flake.nix — Filter MLX package by git source to handle duplicate uv.lock entries
  3. nix/mlx.nix — Point MLX build at custom fork (version 0.30.7.dev20260216+d94f81a2)
  4. pyproject.toml — Pin macOS MLX to custom git fork; remove version pin for darwin
  5. uv.lock — Updated lock file reflecting fork switch; removes mlx-metal package
  6. src/exo/worker/engines/mlx/generator/generate.py — Move mx_barrier(group) into prefill() function (before stream_generate) instead of in mlx_generate() caller
  7. src/exo/worker/runner/bootstrap.py — Add KeyboardInterrupt handler; switch from join() to cancel_join() for cleaner shutdown; add broad exception catch in finally
  8. src/exo/utils/channels.py — Add cancel_join() method to SendChannel and ReceiveChannel, wrapping multiprocessing buffer's cancel_join_thread()

Detailed Analysis

1. MLX Fork Switch (pyproject.toml, uv.lock, nix/mlx.nix, flake.nix)

  • macOS MLX dependency changed from PyPI release 0.30.6 to a git fork (rltakashige/mlx-jaccl-fix-small-recv @ d94f81a2)
  • Linux MLX remains pinned to 0.30.6 from PyPI (unchanged)
  • mlx-metal package is removed from the lock file
  • flake.nix adds && p.source ? git filter to disambiguate the two MLX entries in uv.lock

Concerns:

  • Using a personal fork rather than upstream is a maintenance risk. What specific commits in the fork fix the GPU locks? Is there an upstream PR?
  • The fork name "mlx-jaccl-fix-small-recv" suggests it fixes small receive operations in JACCL/RDMA — more context needed.
  • Removing the version pin for darwin MLX ("mlx; sys_platform == 'darwin'") means any version could resolve if the git source is unavailable.

2. Barrier Move (generate.py)

  • mx_barrier(group) and the "Ready to prefill" log moved from mlx_generate() into prefill(), just before stream_generate()
  • The group parameter is added to prefill()'s signature (optional, defaults to None)
  • This ensures the barrier happens closer to the actual distributed compute, which could prevent GPU lock contention by synchronizing nodes right before the prefill operation.

Concern: mx_barrier(None) is now called when group is None — need to verify that mx_barrier handles None gracefully (no-op) rather than crashing.
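
For reference, the behavior this concern asks about would look roughly like the sketch below. It is not the actual mx_barrier from exo: `barrier_if_distributed` and the `sync` callback are hypothetical names chosen for illustration, and the real function presumably synchronizes through MLX's distributed API rather than through a callback.

```python
from typing import Callable, Optional, TypeVar

G = TypeVar("G")


def barrier_if_distributed(group: Optional[G], sync: Callable[[G], None]) -> bool:
    """Run `sync(group)` only when a distributed group exists.

    Returns True if a barrier was executed and False for the
    single-node case, where the call is a no-op.
    """
    if group is None:
        # Single-node run: there are no peers to wait for, so skip the barrier.
        return False
    sync(group)
    return True
```

With a guard like this, prefill() can call the barrier unconditionally and single-node runs pay nothing for it.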

3. Process Shutdown (bootstrap.py, channels.py)

  • Added KeyboardInterrupt handler in runner entrypoint
  • Changed join() to cancel_join() in the finally block, which calls multiprocessing's cancel_join_thread() and prevents the runner process from hanging on shutdown
  • Added bare except Exception: pass around close() calls in finally block

Concerns:

  • cancel_join_thread() can cause data loss if there's unflushed data in the buffer. Is this acceptable for the runner shutdown path?
  • The broad except Exception: pass silently swallows errors — at minimum, these should be logged.

4. README.md

  • Adds Nix run instructions (nice addition)
  • Adds Xcode as a prerequisite (important for Metal toolchain)
  • Minor whitespace fixes

Overall Assessment

Positives:

  • The core idea (using a fork that fixes RDMA GPU locks) addresses a real problem in distributed inference
  • Moving the barrier closer to the actual distributed compute is sound
  • Cleaner shutdown handling prevents process hangs
  • README improvements are welcome

Concerns:

  1. Empty PR description — No motivation, no explanation of what the fork fixes, no test plan documented
  2. Personal fork dependency — Maintenance risk; should ideally reference an upstream MLX PR or provide the specific fix commits
  3. Silent exception swallowing in bootstrap.py finally block
  4. cancel_join_thread() data loss risk — May lose events in transit
  5. No automated tests for the shutdown path changes
  6. mx_barrier(None) behavior unverified

@AlexCheema
Contributor

Code Review — PR #1453: Address rdma gpu locks

CI status: All checks passing (typecheck, aarch64-darwin, x86_64-linux, aarch64-linux)
Mergeable: No — this PR has merge conflicts with main and cannot be merged as-is.


Overview

This PR switches the macOS MLX dependency from the upstream PyPI release (v0.30.6) to a custom fork (rltakashige/mlx-jaccl-fix-small-recv @ d94f81a2) to address GPU lock issues in RDMA/JACCL distributed inference. It also moves the mx_barrier call into the prefill() function, improves runner process shutdown handling, and adds minor README improvements.


Critical Issues

1. Merge conflicts — branch is stale and incompatible with main

The PR branch is based on an older version of main. GitHub reports mergeable: false / mergeable_state: dirty. Specifically, src/exo/worker/runner/bootstrap.py on main now has:

  • A cancel_receiver: MpReceiver[TaskId] parameter
  • A pipe_fifo_paths: tuple[str, str] | None parameter
  • FIFO setup logic for JACCL
  • cancel_receiver.close() and cancel_receiver.join() in the finally block

The PR branch has none of these. This must be rebased and the shutdown changes re-applied to the current main version of bootstrap.py.

2. Empty PR description

The Motivation, Changes, Why It Works, and Test Plan sections are all blank. For a PR that switches a core dependency to a personal fork, this needs significantly more documentation:

  • What specific GPU lock scenario does this fix?
  • What commits in the fork address the issue?
  • Is there an upstream MLX PR or issue tracking this?
  • What manual testing was performed and on what hardware?

Significant Issues

3. Personal fork as a production dependency

# pyproject.toml
mlx = { git = "https://github.com/rltakashige/mlx-jaccl-fix-small-recv.git", branch = "address-rdma-gpu-locks", marker = "sys_platform == 'darwin'" }

Pinning to a personal fork branch (not even a tag/commit) is a maintenance risk. If the fork is rebased, deleted, or the branch is renamed, builds will break. The uv.lock does pin to commit d94f81a2, which helps, but the pyproject.toml source points to a branch. Questions:

  • Is there an upstream ml-explore/mlx PR for these changes?
  • What is the plan to return to upstream MLX?
  • Can this be pinned to the commit SHA instead of a branch name?

4. mlx-metal package removed from lock file

The diff removes the entire mlx-metal package from uv.lock and strips macOS wheel entries from the mlx 0.30.6 registry package. The fork presumably bundles Metal support differently. This should be explicitly called out since mlx-metal is the Metal GPU backend — if the fork handles this internally, that is fine, but it needs documentation.

5. Version pin removed for darwin MLX

# Before
"mlx==0.30.6; sys_platform == 'darwin'",
# After
"mlx; sys_platform == 'darwin'",

The version constraint is removed entirely for the main dependency line. While the git source in [tool.uv.sources] controls resolution, removing the version pin means if the source override is ever removed or fails, any MLX version could resolve. Consider keeping a minimum version constraint: "mlx>=0.30.6; sys_platform == 'darwin'".

6. Silent exception swallowing in shutdown

# bootstrap.py finally block
try:
    event_sender.close()
    task_receiver.close()
except Exception:
    pass  # <-- silently swallows all errors

This should at minimum log the exception at debug/warning level. Silent except Exception: pass makes debugging shutdown issues very difficult.
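
One possible shape for this, as a sketch only: `close_all` and the logger name are hypothetical, and the only assumption about the channel objects is that they expose a close() method. Besides logging, it wraps each close() separately, so a failure in event_sender.close() no longer skips task_receiver.close().

```python
import logging
from typing import Iterable, Protocol

# Hypothetical logger name; the project may use a different logging setup.
logger = logging.getLogger("runner.shutdown")


class Closable(Protocol):
    def close(self) -> None: ...


def close_all(resources: Iterable[Closable]) -> int:
    """Close every resource, logging failures instead of hiding them.

    Returns the number of resources whose close() raised.
    """
    failures = 0
    for resource in resources:
        try:
            resource.close()
        except Exception:
            failures += 1
            # exc_info=True attaches the traceback to the log record.
            logger.warning("Error closing %r during shutdown", resource, exc_info=True)
    return failures
```

In the finally block this would be called as `close_all([event_sender, task_receiver])`.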

7. cancel_join_thread() may lose in-flight events

event_sender.cancel_join()   # calls buffer.cancel_join_thread()
task_receiver.cancel_join()

Python's multiprocessing documentation warns that cancel_join_thread() "is likely to cause enqueued data to be lost." For task_receiver this is likely fine (we are shutting down, no need for more tasks), but for event_sender this could mean a RunnerFailed event sent in the except block never reaches the parent process. The parent would then not know the runner crashed. Consider keeping join() for event_sender (with a timeout if possible) and only using cancel_join() for receivers.
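
A sketch of the asymmetric shutdown suggested here, written against the method names of multiprocessing.Queue (close(), join_thread(), cancel_join_thread()) rather than exo's SendChannel and ReceiveChannel wrappers. `shutdown_queues` and `QueueLike` are hypothetical names.

```python
from typing import Protocol


class QueueLike(Protocol):
    """The subset of multiprocessing.Queue used during shutdown."""

    def close(self) -> None: ...
    def join_thread(self) -> None: ...
    def cancel_join_thread(self) -> None: ...


def shutdown_queues(event_queue: QueueLike, task_queue: QueueLike) -> None:
    """Flush outgoing events, but do not wait on the incoming task queue."""
    # Receiving side: nothing here needs flushing, so make sure process
    # exit never blocks on this queue's feeder thread.
    task_queue.cancel_join_thread()
    task_queue.close()

    # Sending side: close() first (join_thread() requires it), then wait
    # for the feeder thread to push buffered events into the pipe.
    event_queue.close()
    event_queue.join_thread()
```

Note that join_thread() takes no timeout argument, so a bounded wait would need something extra, for example running the join in a helper thread and giving up after a deadline.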


Minor Issues

8. nix/mlx.nix version string

version = let v = "0.30.7.dev20260216+d94f81a2"; in

The +d94f81a2 local version identifier is good for traceability, but 0.30.7.dev20260216 is a development pre-release version. This should be noted as temporary.

9. flake.nix filter assumes git source exists

mlxPackage = builtins.head (builtins.filter (p: p.name == "mlx" && p.source ? git) uvLock.package);

This will fail with an unhelpful error (head on empty list) if no MLX package with a git source exists in uv.lock. When switching back to upstream, this filter must be reverted too.

10. KeyboardInterrupt placement

except KeyboardInterrupt:
    logger.info("Runner received interrupt, shutting down")

KeyboardInterrupt inherits from BaseException, not Exception, so it correctly will not be caught by the except Exception below. However, it is placed between ClosedResourceError and Exception — consider placing it after Exception for clarity, since convention is to handle Exception subclasses first, then BaseException subclasses.
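
The suggested ordering, as a self-contained sketch. The `ClosedResourceError` class below is a local stand-in for the real exception used in bootstrap.py, and `handle` is a hypothetical function that only reports which handler ran.

```python
class ClosedResourceError(Exception):
    """Local stand-in for the channel-closed exception used by the runner."""


def handle(exc: BaseException) -> str:
    """Raise `exc` and report which handler caught it."""
    try:
        raise exc
    except ClosedResourceError:
        # Most specific Exception subclass first.
        return "closed"
    except Exception:
        # Every other ordinary error.
        return "failed"
    except KeyboardInterrupt:
        # BaseException subclass last; `except Exception` never matches it,
        # so this ordering changes readability, not behavior.
        return "interrupted"
```

Either ordering behaves the same for these three cases, which is why this is listed as a minor issue.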


What's Good

  • The core idea is sound. Using a custom MLX build that fixes RDMA GPU locks is a valid approach for unblocking distributed inference.
  • Moving mx_barrier into prefill() is a clean refactor that keeps the synchronization closer to the distributed compute. I verified that mx_barrier correctly handles group=None as a no-op (early return in src/exo/worker/engines/mlx/utils_mlx.py:591-593).
  • cancel_join() for receivers is the right approach to prevent hung shutdown when multiprocessing queues block on join_thread().
  • README additions (Nix instructions, Xcode prerequisite) are helpful for new contributors.
  • The flake.nix disambiguation (p.source ? git) is a pragmatic solution to handle two MLX entries in uv.lock.

Verdict

Not ready to merge. The PR has merge conflicts with main that must be resolved first — particularly in bootstrap.py where main has added cancel_receiver and FIFO pipe support that this PR's changes need to account for. The empty PR description needs to be filled in with motivation, the specific GPU lock scenario being fixed, and test results. The dependency on a personal fork branch should be documented with a plan for upstreaming.

After rebasing and addressing the documentation gaps, the code changes themselves are reasonable and well-motivated.

Review only — not a merge approval.

@rltakashige
Collaborator Author

Closing as this change is now done.
