Fix JACCL GID selection on Apple Thunderbolt RDMA (errno 22 RTR fail) #3468

Open
danielkristofik wants to merge 1 commit into ml-explore:main from danielkristofik:fix/jaccl-gid-fallback-apple-tb

Closes #3467.

Proposed changes

Fix the "[jaccl] Changing queue pair to RTR failed with errno 22" error on Apple Thunderbolt RDMA.

Connection::info() in rdma.cpp selects the local GID by scanning the GID table for an IPv4-mapped IPv6 GID (::ffff:x.x.x.x, the RoCE v2 format). Apple Thunderbolt RDMA exposes only link-local IPv6 GIDs (fe80::...), so the filter never matches and gid is left uninitialized. The garbage value is then sent to the peer over the side channel, and the kernel rejects the QP RTR transition with EINVAL (errno 22).
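To illustrate why the filter skips every Apple TB entry, here is a minimal sketch of the two GID shapes involved. It is not the code in rdma.cpp: `Gid` is a hypothetical stand-in for the 16 raw bytes of an `ibv_gid`, and the helper names `is_ipv4_mapped`, `is_link_local` and `check_gid_kinds` are made up for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the 16 raw bytes of an ibv_gid.
using Gid = std::array<std::uint8_t, 16>;

// True for ::ffff:a.b.c.d, i.e. bytes 0..9 are zero and bytes 10..11 are 0xff.
bool is_ipv4_mapped(const Gid& g) {
  for (std::size_t i = 0; i < 10; ++i) {
    if (g[i] != 0) return false;
  }
  return g[10] == 0xff && g[11] == 0xff;
}

// True for addresses in fe80::/10 (IPv6 link-local).
bool is_link_local(const Gid& g) {
  return g[0] == 0xfe && (g[1] & 0xc0) == 0x80;
}

// Usage: a RoCE v2 style GID passes the filter, a link-local GID does not.
void check_gid_kinds() {
  Gid roce{};  // ::ffff:192.168.0.1
  roce[10] = 0xff; roce[11] = 0xff;
  roce[12] = 192; roce[13] = 168; roce[14] = 0; roce[15] = 1;

  Gid tb{};  // fe80::1
  tb[0] = 0xfe; tb[1] = 0x80; tb[15] = 1;

  assert(is_ipv4_mapped(roce) && !is_link_local(roce));
  assert(is_link_local(tb) && !is_ipv4_mapped(tb));
}
```

A table that contains only entries like `tb` never satisfies an IPv4-mapped check, which is the situation described above.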

This regression was introduced by #3412 (Jaccl refactor) when the pre-existing hardcoded query_gid(ctx, 1, 1, &gid) was replaced with the filter loop. See full root-cause analysis in the linked issue.

Changes

mlx/distributed/jaccl/lib/jaccl/rdma.cpp:

  • Zero-initialize gid to avoid undefined behavior when no GID matches.
  • After the IPv4-mapped scan, fall back to the first non-zero GID, preferring index 1. This matches the pre-refactor behavior; index 0 on Apple TB is typically derived from a non-RDMA interface and routes elsewhere, which surfaces as errno 60 (ETIMEDOUT) instead of errno 22.

The IPv4-mapped path remains the preferred match, so RoCE v2 setups are unaffected.
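The selection order described above can be sketched roughly as follows. This is an illustration over an in-memory table, not the patch itself: the real code reads entries from the device, and `Gid`, `select_gid_index`, `link_local`, `ipv4_mapped` and `check_selection` are hypothetical names used only here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the 16 raw bytes of an ibv_gid.
using Gid = std::array<std::uint8_t, 16>;

bool is_zero(const Gid& g) {
  for (std::uint8_t b : g) {
    if (b != 0) return false;
  }
  return true;
}

// True for ::ffff:a.b.c.d (the form the existing filter looks for).
bool is_ipv4_mapped(const Gid& g) {
  for (std::size_t i = 0; i < 10; ++i) {
    if (g[i] != 0) return false;
  }
  return g[10] == 0xff && g[11] == 0xff;
}

// Returns the index of the GID to use, or -1 if the table has no usable entry.
int select_gid_index(const std::vector<Gid>& table) {
  // 1. Preferred: the first IPv4-mapped GID (RoCE v2 setups).
  for (std::size_t i = 0; i < table.size(); ++i) {
    if (is_ipv4_mapped(table[i])) return static_cast<int>(i);
  }
  // 2. Fallback: index 1 if it is populated (pre-refactor behavior).
  if (table.size() > 1 && !is_zero(table[1])) return 1;
  // 3. Last resort: the first non-zero entry.
  for (std::size_t i = 0; i < table.size(); ++i) {
    if (!is_zero(table[i])) return static_cast<int>(i);
  }
  return -1;
}

// Test helpers that build example GIDs.
Gid link_local(std::uint8_t last) {
  Gid g{};
  g[0] = 0xfe; g[1] = 0x80; g[15] = last;
  return g;
}

Gid ipv4_mapped(std::uint8_t a, std::uint8_t b, std::uint8_t c, std::uint8_t d) {
  Gid g{};
  g[10] = 0xff; g[11] = 0xff;
  g[12] = a; g[13] = b; g[14] = c; g[15] = d;
  return g;
}

// Usage: an Apple TB style table picks index 1, a RoCE v2 table picks the mapped entry.
void check_selection() {
  std::vector<Gid> apple_tb = {link_local(0xaa), link_local(0xbb)};
  assert(select_gid_index(apple_tb) == 1);

  std::vector<Gid> roce = {link_local(0x01), Gid{}, ipv4_mapped(10, 0, 0, 1)};
  assert(select_gid_index(roce) == 2);

  assert(select_gid_index(std::vector<Gid>{}) == -1);
}
```

Returning -1 for an empty or all-zero table is a choice made for this sketch; it lets a caller report a clear error instead of sending a zeroed GID to the peer.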

Test plan

  • 2× Mac Studio M4 Max, Thunderbolt 5 mesh, macOS 26.4.1
  • Before patch: mlx.distributed.init(backend="jaccl") fails with errno 22 at the RTR transition
  • After patch (preferring index 0 in fallback): errno 60 ETIMEDOUT (wrong port — index 0 is non-RDMA on Apple)
  • After patch (preferring index 1 in fallback): JACCL init succeeds, all_sum benchmark runs, sustained tensor-parallel inference (Qwen3.6-27B-4bit) works end-to-end

Checklist

  • I have read the CONTRIBUTING document
  • I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have updated the necessary documentation (if needed)

Note: no unit tests added — the GID selection is platform-dependent (Apple TB vs RoCE v2 hardware) and the tests in mlx/distributed/jaccl/lib/examples/ are runtime benchmarks rather than unit tests. Happy to add a mock-based unit test if reviewers prefer.

Connection::info() in rdma.cpp scans the GID table for IPv4-mapped IPv6
GIDs (::ffff:x.x.x.x, RoCE v2 format). Apple Thunderbolt RDMA exposes only
link-local IPv6 GIDs (fe80::...) — the filter never matches and gid is left
uninitialized. The garbage value is then sent to the peer via the side
channel, causing the kernel to reject the QP RTR transition with EINVAL
(errno 22).

Initialize gid to zero, factor the GID selection into a try_gid() helper,
and add a fallback that prefers index 1 (the actual rdma_enX port GID on
Apple TB; index 0 is typically derived from a non-RDMA interface and
routes elsewhere, surfacing as errno 60 ETIMEDOUT). The IPv4-mapped path
remains the preferred match, so RoCE v2 setups are unaffected.

Tested on 2x Mac Studio M4 Max + Thunderbolt 5 mesh, macOS 26.4.1.
Distributed init succeeds, sustained tensor-parallel inference works.
Drifter4242 added a commit to Drifter4242/mlx-jaccl-fix-small-recv that referenced this pull request on May 1, 2026
Port of ml-explore/mlx PR ml-explore#3468 by danielkristofik.

The GID scan loop in Connection::info() only accepts IPv4-mapped IPv6 GIDs
(::ffff:x.x.x.x). Apple Thunderbolt RDMA exposes only link-local IPv6 GIDs
(fe80::...), so the loop never matches, leaving gid uninitialized. The garbage
value causes errno=22 EINVAL on queue pair RTR transition.

Fix: zero-initialize gid, prefer IPv4-mapped first (preserves RoCE v2), then
fall back to index 1 (the actual RDMA port GID on Apple TB).

Development

Successfully merging this pull request may close these issues.

[BUG] JACCL "Changing queue pair to RTR failed with errno 22" on Apple Thunderbolt RDMA — GID selection regression in #3412