Address rdma gpu locks by rltakashige · Pull Request #1453 · exo-explore/exo

rltakashige · 2026-02-12T00:00:22Z

Motivation

Changes

Why It Works

Test Plan

Manual Testing

Automated Testing

AlexCheema · 2026-02-17T18:27:00Z

PR #1453 Review: "Address RDMA GPU locks"

Summary of Changes

This PR addresses GPU lock issues in RDMA by switching the macOS MLX dependency from the upstream release (ml-explore/mlx v0.30.6) to a custom fork (rltakashige/mlx-jaccl-fix-small-recv, branch address-rdma-gpu-locks) at commit d94f81a2. It also makes several code changes to improve process shutdown and move the distributed barrier.

Note: The PR template (Motivation, Changes, Why It Works, Test Plan) is entirely empty.

Files Changed (8)

README.md — Add Nix run instructions, add Xcode prerequisite, whitespace fixes
flake.nix — Filter MLX package by git source to handle duplicate uv.lock entries
nix/mlx.nix — Point MLX build at custom fork (version 0.30.7.dev20260216+d94f81a2)
pyproject.toml — Pin macOS MLX to custom git fork; remove version pin for darwin
uv.lock — Updated lock file reflecting fork switch; removes mlx-metal package
src/exo/worker/engines/mlx/generator/generate.py — Move mx_barrier(group) into prefill() function (before stream_generate) instead of in mlx_generate() caller
src/exo/worker/runner/bootstrap.py — Add KeyboardInterrupt handler; switch from join() to cancel_join() for cleaner shutdown; add broad exception catch in finally
src/exo/utils/channels.py — Add cancel_join() method to SendChannel and ReceiveChannel, wrapping multiprocessing buffer's cancel_join_thread()

Detailed Analysis

1. MLX Fork Switch (pyproject.toml, uv.lock, nix/mlx.nix, flake.nix)

macOS MLX dependency changed from PyPI release 0.30.6 to a git fork (rltakashige/mlx-jaccl-fix-small-recv @ d94f81a2)
Linux MLX remains pinned to 0.30.6 from PyPI (unchanged)
mlx-metal package is removed from the lock file
flake.nix adds && p.source ? git filter to disambiguate the two MLX entries in uv.lock

Concerns:

Using a personal fork rather than upstream is a maintenance risk. What specific commits in the fork fix the GPU locks? Is there an upstream PR?
The fork name "mlx-jaccl-fix-small-recv" suggests it fixes small receive operations in JACCL/RDMA — more context needed.
Removing the version pin for darwin MLX ("mlx; sys_platform == 'darwin'") means any version could resolve if the git source is unavailable.

2. Barrier Move (generate.py)

mx_barrier(group) and the "Ready to prefill" log moved from mlx_generate() into prefill(), just before stream_generate()
The group parameter is added to prefill()'s signature (optional, defaults to None)
This ensures the barrier happens closer to the actual distributed compute, which could prevent GPU lock contention by synchronizing nodes right before the prefill operation.

Concern: mx_barrier(None) is now called when group is None — need to verify that mx_barrier handles None gracefully (no-op) rather than crashing.
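A minimal sketch of the guard this concern asks about, assuming the barrier is meant to be a no-op outside distributed runs. The names barrier_if_distributed and prefill_sketch are hypothetical, and threading.Barrier stands in for an MLX distributed group so the snippet runs without MLX. It does not reproduce the actual mx_barrier implementation.

```python
import threading
from typing import Optional


def barrier_if_distributed(group: Optional[threading.Barrier]) -> bool:
    """Wait on the barrier only when a group exists; return True if we waited."""
    if group is None:
        # Single-node run: nothing to synchronize with, so return immediately.
        return False
    group.wait()
    return True


def prefill_sketch(prompt: str, group: Optional[threading.Barrier] = None) -> str:
    """Hypothetical prefill: synchronize right before the compute step."""
    waited = barrier_if_distributed(group)
    # Placeholder for the real distributed prefill work.
    return f"prefilled {len(prompt)} chars (synchronized={waited})"


print(prefill_sketch("hello"))  # no group, so no barrier wait
print(prefill_sketch("hello", threading.Barrier(1)))  # one-party barrier returns at once
```

With the early return in place, callers can pass the group through unconditionally and single-node runs skip the synchronization.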

3. Process Shutdown (bootstrap.py, channels.py)

Added KeyboardInterrupt handler in runner entrypoint
Changed join() → cancel_join() in the finally block, which calls multiprocessing's cancel_join_thread() — prevents the runner process from hanging on shutdown
Added a broad except Exception: pass around close() calls in finally block

Concerns:

cancel_join_thread() can cause data loss if there's unflushed data in the buffer. Is this acceptable for the runner shutdown path?
The broad except Exception: pass silently swallows errors — at minimum, these should be logged.

4. README.md

Adds Nix run instructions (nice addition)
Adds Xcode as a prerequisite (important for Metal toolchain)
Minor whitespace fixes

Overall Assessment

Positives:

The core idea (using a fork that fixes RDMA GPU locks) addresses a real problem in distributed inference
Moving the barrier closer to the actual distributed compute is sound
Cleaner shutdown handling prevents process hangs
README improvements are welcome

Concerns:

Empty PR description — No motivation, no explanation of what the fork fixes, no test plan documented
Personal fork dependency — Maintenance risk; should ideally reference an upstream MLX PR or provide the specific fix commits
Silent exception swallowing in bootstrap.py finally block
cancel_join_thread() data loss risk — May lose events in transit
No automated tests for the shutdown path changes
mx_barrier(None) behavior unverified

AlexCheema · 2026-02-17T23:36:02Z

Code Review — PR #1453: Address rdma gpu locks

CI status: All checks passing (typecheck, aarch64-darwin, x86_64-linux, aarch64-linux)
Mergeable: No — this PR has merge conflicts with main and cannot be merged as-is.

Overview

This PR switches the macOS MLX dependency from the upstream PyPI release (v0.30.6) to a custom fork (rltakashige/mlx-jaccl-fix-small-recv @ d94f81a2) to address GPU lock issues in RDMA/JACCL distributed inference. It also moves the mx_barrier call into the prefill() function, improves runner process shutdown handling, and adds minor README improvements.

Critical Issues

1. Merge conflicts — branch is stale and incompatible with main

The PR branch is based on an older version of main. GitHub reports mergeable: false / mergeable_state: dirty. Specifically, src/exo/worker/runner/bootstrap.py on main now has:

A cancel_receiver: MpReceiver[TaskId] parameter
A pipe_fifo_paths: tuple[str, str] | None parameter
FIFO setup logic for JACCL
cancel_receiver.close() and cancel_receiver.join() in the finally block

The PR branch has none of these. This must be rebased and the shutdown changes re-applied to the current main version of bootstrap.py.

2. Empty PR description

The Motivation, Changes, Why It Works, and Test Plan sections are all blank. For a PR that switches a core dependency to a personal fork, this needs significantly more documentation:

What specific GPU lock scenario does this fix?
What commits in the fork address the issue?
Is there an upstream MLX PR or issue tracking this?
What manual testing was performed and on what hardware?

Significant Issues

3. Personal fork as a production dependency

# pyproject.toml
mlx = { git = "https://github.com/rltakashige/mlx-jaccl-fix-small-recv.git", branch = "address-rdma-gpu-locks", marker = "sys_platform == 'darwin'" }

Pinning to a personal fork branch (not even a tag/commit) is a maintenance risk. If the fork is rebased, deleted, or the branch is renamed, builds will break. The uv.lock does pin to commit d94f81a2, which helps, but the pyproject.toml source points to a branch. Questions:

Is there an upstream ml-explore/mlx PR for these changes?
What is the plan to return to upstream MLX?
Can this be pinned to the commit SHA instead of a branch name?

4. mlx-metal package removed from lock file

The diff removes the entire mlx-metal package from uv.lock and strips macOS wheel entries from the mlx 0.30.6 registry package. The fork presumably bundles Metal support differently. This should be explicitly called out since mlx-metal is the Metal GPU backend — if the fork handles this internally, that is fine, but it needs documentation.

5. Version pin removed for darwin MLX

# Before
"mlx==0.30.6; sys_platform == 'darwin'",
# After
"mlx; sys_platform == 'darwin'",

The version constraint is removed entirely for the main dependency line. While the git source in [tool.uv.sources] controls resolution, removing the version pin means if the source override is ever removed or fails, any MLX version could resolve. Consider keeping a minimum version constraint: "mlx>=0.30.6; sys_platform == 'darwin'".

6. Silent exception swallowing in shutdown

# bootstrap.py finally block
try:
    event_sender.close()
    task_receiver.close()
except Exception:
    pass  # <-- silently swallows all errors

This should at minimum log the exception at debug/warning level. Silent except Exception: pass makes debugging shutdown issues very difficult.
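One way to act on this, sketched under assumptions: FlakyChannel and close_all are hypothetical stand-ins, not exo's channel classes. The sketch logs each failure with its traceback and closes each channel in its own try block, so a failure in the first close() does not skip the second, which the single try block quoted above would do.

```python
import logging

logger = logging.getLogger("runner.shutdown")


class FlakyChannel:
    """Hypothetical stand-in for a channel whose close() can fail."""

    def __init__(self, name: str, fail: bool = False) -> None:
        self.name = name
        self.fail = fail
        self.closed = False

    def close(self) -> None:
        if self.fail:
            raise OSError(f"{self.name}: pipe already closed")
        self.closed = True


def close_all(channels) -> list:
    """Close every channel, logging failures instead of discarding them."""
    failed = []
    for channel in channels:
        try:
            channel.close()
        except Exception:
            # Record the traceback so shutdown problems stay debuggable.
            logger.warning("Failed to close %s", channel.name, exc_info=True)
            failed.append(channel.name)
    return failed


channels = [FlakyChannel("event_sender", fail=True), FlakyChannel("task_receiver")]
print(close_all(channels))
```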

7. cancel_join_thread() may lose in-flight events

event_sender.cancel_join()   # calls buffer.cancel_join_thread()
task_receiver.cancel_join()

Python's documentation for multiprocessing.Queue.cancel_join_thread() warns that the method is likely to cause enqueued data to be lost. For task_receiver this is likely fine (we are shutting down, no need for more tasks), but for event_sender this could mean a RunnerFailed event sent in the except block never reaches the parent process. The parent would then not know the runner crashed. Consider keeping join() for event_sender (with a timeout if possible; the standard join_thread() itself takes none) and only using cancel_join() for receivers.
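The suggested split can be sketched with plain multiprocessing.Queue objects. This assumes exo's join() and cancel_join() wrap join_thread() and cancel_join_thread() respectively; shutdown_queues is a hypothetical helper, not code from the PR.

```python
import multiprocessing as mp


def shutdown_queues(event_queue, task_queue) -> None:
    """Flush outgoing events, but do not wait on the incoming task queue."""
    # Outgoing side: close, then wait for the feeder thread to flush
    # buffered items to the pipe so a final failure event is not dropped.
    event_queue.close()
    event_queue.join_thread()

    # Incoming side: pending tasks are irrelevant during shutdown, so let
    # the process exit without waiting for any flush.
    task_queue.close()
    task_queue.cancel_join_thread()


if __name__ == "__main__":
    events, tasks = mp.Queue(), mp.Queue()
    events.put("RunnerFailed")
    print(events.get(timeout=5))  # delivered before shutdown begins
    shutdown_queues(events, tasks)
```

join_thread() may only be called after close(), which is why each queue is closed first.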

Minor Issues

8. nix/mlx.nix version string

version = let v = "0.30.7.dev20260216+d94f81a2"; in

The +d94f81a2 local version identifier is good for traceability, but the .dev20260216 segment marks this as a development release, which under PEP 440 sorts before the final 0.30.7. This should be noted as temporary.

9. flake.nix filter assumes git source exists

mlxPackage = builtins.head (builtins.filter (p: p.name == "mlx" && p.source ? git) uvLock.package);

This will fail with an unhelpful error (head on empty list) if no MLX package with a git source exists in uv.lock. When switching back to upstream, this filter must be reverted too.
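To illustrate the failure mode, here is the same selection logic rendered in Python with a descriptive error. The PR's actual logic is the Nix expression above; select_git_mlx and the abbreviated lock data are hypothetical and only approximate the shape of uv.lock package entries.

```python
def select_git_mlx(packages: list) -> dict:
    """Return the single 'mlx' entry that has a git source, or fail clearly."""
    matches = [
        p for p in packages
        if p.get("name") == "mlx" and "git" in p.get("source", {})
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one git-sourced mlx package in uv.lock, "
            f"found {len(matches)}"
        )
    return matches[0]


# Hypothetical, abbreviated lock data: one registry entry and one git entry.
lock_packages = [
    {"name": "mlx", "source": {"registry": "https://pypi.org/simple"}},
    {
        "name": "mlx",
        "source": {"git": "https://github.com/rltakashige/mlx-jaccl-fix-small-recv.git"},
    },
]
print(select_git_mlx(lock_packages)["source"]["git"])
```

An explicit check like this turns the "head on empty list" case into a message that names the missing package and its expected source.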

10. KeyboardInterrupt placement

except KeyboardInterrupt:
    logger.info("Runner received interrupt, shutting down")

KeyboardInterrupt inherits from BaseException, not Exception, so it correctly will not be caught by the except Exception below. However, it is placed between ClosedResourceError and Exception — consider placing it after Exception for clarity, since convention is to handle Exception subclasses first, then BaseException subclasses.
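A small sketch of the suggested handler ordering. ConnectionError stands in for ClosedResourceError here, and classify is a hypothetical helper written only to show which handler fires.

```python
def classify(exc: BaseException) -> str:
    """Raise exc and report which handler catches it."""
    try:
        raise exc
    except ConnectionError:
        # Specific Exception subclass, handled first.
        return "closed"
    except Exception:
        # Catch-all for ordinary errors; never sees KeyboardInterrupt.
        return "error"
    except KeyboardInterrupt:
        # BaseException subclass, listed after the Exception handlers.
        return "interrupt"


print(classify(KeyboardInterrupt()))
print(classify(ValueError("boom")))
print(issubclass(KeyboardInterrupt, Exception))
```

Behavior is the same with either ordering, since the except Exception clause cannot match KeyboardInterrupt; the ordering is purely a readability choice.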

What's Good

The core idea is sound. Using a custom MLX build that fixes RDMA GPU locks is a valid approach for unblocking distributed inference.
Moving mx_barrier into prefill() is a clean refactor that keeps the synchronization closer to the distributed compute. I verified that mx_barrier correctly handles group=None as a no-op (early return in src/exo/worker/engines/mlx/utils_mlx.py:591-593).
cancel_join() for receivers is the right approach to prevent hung shutdown when multiprocessing queues block on join_thread().
README additions (Nix instructions, Xcode prerequisite) are helpful for new contributors.
The flake.nix disambiguation (p.source ? git) is a pragmatic solution to handle two MLX entries in uv.lock.

Verdict

Not ready to merge. The PR has merge conflicts with main that must be resolved first — particularly in bootstrap.py where main has added cancel_receiver and FIFO pipe support that this PR's changes need to account for. The empty PR description needs to be filled in with motivation, the specific GPU lock scenario being fixed, and test results. The dependency on a personal fork branch should be documented with a plan for upstreaming.

After rebasing and addressing the documentation gaps, the code changes themselves are reasonable and well-motivated.

Review only — not a merge approval.

rltakashige · 2026-02-20T13:16:55Z

Closing as this change is now done.

rltakashige and others added 4 commits February 11, 2026 22:53

Address GPU timeouts with custom fork

8eaaee7

exit cleanly

22227ef

Merge branch 'main' into leo/address-rdma-gpu-locks

659dbde

Use custom fork that resolves GPU locks

b18372b

rltakashige force-pushed the leo/address-rdma-gpu-locks branch from 2c3fb2b to b18372b February 16, 2026 19:15

rltakashige added 2 commits February 17, 2026 11:34

Add to nix flake

be2ff98

Add to nix flake

03b1e11

AlexCheema mentioned this pull request Feb 17, 2026

Use custom fork that resolves GPU locks #1489

Merged

rltakashige closed this Feb 20, 2026

rltakashige wants to merge 6 commits into main from leo/address-rdma-gpu-locks
