feat(cluster): jaccl DistributedGroup bootstrap #195
Wire up the jaccl backend initialization step that runs on both ranks immediately after the Thunderbolt SE handshake completes.

- Add DistributedGroupBootstrap: sets MLX_RANK, MLX_JACCL_COORDINATOR, and MLX_IBV_DEVICES, writes the bridge100 topology JSON to /tmp, then calls DistributedGroup.initialize(strict: false) and surfaces failures as typed DistributedGroupBootstrapError rather than silent nil.
- Add ClusterMsgType.jacclBootstrap (0x07) + JacclBootstrapPayload (port + sessionID): rank 0 sends this to rank 1 over the existing ThunderboltLink control channel after its session health becomes non-unavailable; rank 1 receives it through the inferenceHandler path (jacclBootstrap now routed there alongside inferenceStep/Token).
- ClusterDiscovery: startAsRank0/startAsRank1 now accept ownIP, poll for session readiness, exchange the jacclBootstrap frame, and store the resulting DistributedGroup in _distributedGroup. Exposed via public var distributedGroup: DistributedGroup? for PR 4b.
- 15 unit tests: env-var round-trips, topology JSON structure, frame encode/decode, DistributedGroup singleton API surface. Real two-process jaccl init is out of scope for single-process CI (needs TB5 + RDMA).
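The jacclBootstrap exchange described above could be sketched roughly as follows. This is an illustration, not the code in this PR: only the message type value 0x07 and the payload fields (port, sessionID) come from the description. The wire layout (one type byte, a big-endian UInt16 port, then the session ID as UTF-8), the field types, and the names JacclBootstrapPayloadSketch, encodeJacclBootstrapFrame, and decodeJacclBootstrapFrame are all assumptions.

```swift
import Foundation

// Hypothetical stand-in for the PR's JacclBootstrapPayload; field types are assumed.
struct JacclBootstrapPayloadSketch: Equatable {
    var port: UInt16
    var sessionID: String
}

// 0x07 is the ClusterMsgType.jacclBootstrap value given in the description.
let jacclBootstrapType: UInt8 = 0x07

// Assumed layout: [type: 1 byte][port: 2 bytes, big-endian][sessionID: UTF-8 bytes].
func encodeJacclBootstrapFrame(_ payload: JacclBootstrapPayloadSketch) -> Data {
    var frame = Data([jacclBootstrapType])
    frame.append(UInt8(payload.port >> 8))
    frame.append(UInt8(payload.port & 0xFF))
    frame.append(contentsOf: Array(payload.sessionID.utf8))
    return frame
}

// Returns nil for frames that are too short, carry another type, or hold invalid UTF-8.
func decodeJacclBootstrapFrame(_ frame: Data) -> JacclBootstrapPayloadSketch? {
    let bytes = [UInt8](frame)
    guard bytes.count >= 3, bytes[0] == jacclBootstrapType else { return nil }
    let port = (UInt16(bytes[1]) << 8) | UInt16(bytes[2])
    guard let sessionID = String(bytes: bytes[3...], encoding: .utf8) else { return nil }
    return JacclBootstrapPayloadSketch(port: port, sessionID: sessionID)
}
```

Under these assumptions, rank 0 would send the result of encodeJacclBootstrapFrame over the control channel and rank 1 would call decodeJacclBootstrapFrame on receipt, treating nil as a malformed frame.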
Benchmark Results
- 1-provider-streaming: 1 providers, 1 users, 30 requests, concurrency=5, streaming=true. Assertion Report: FAIL
- 1-provider-non-streaming: 1 providers, 1 users, 20 requests, concurrency=5, streaming=false. Assertion Report: FAIL
- 7-provider-multi-model: 7 providers, 5 users, 50 requests, concurrency=10, streaming=true. Assertion Report: FAIL
- 3-provider-high-concurrency: 3 providers, 10 users, 60 requests, concurrency=20, streaming=true. Assertion Report: FAIL
- 1-provider-queue-saturation: 1 providers, 10 users, 40 requests, concurrency=15, streaming=true. Assertion Report: FAIL
- 3-provider-20-users: 3 providers, 20 users, 60 requests, concurrency=10, streaming=true. Assertion Report: FAIL
- 1-provider-scaling: 1 providers, 5 users, 30 requests, concurrency=10, streaming=true. Assertion Report: FAIL
- 3-provider-scaling: 3 providers, 5 users, 30 requests, concurrency=10, streaming=true. Assertion Report: FAIL
- 5-provider-scaling: 5 providers, 5 users, 30 requests, concurrency=10, streaming=true. Assertion Report: FAIL
- 3-provider-heavy-100conc-10kb: 3 providers, 20 users, 100 requests, concurrency=100, streaming=true. Assertion Report: FAIL
Summary
- Add DistributedGroupBootstrap: sets the three jaccl env vars (MLX_RANK, MLX_JACCL_COORDINATOR, MLX_IBV_DEVICES), writes the bridge100 topology JSON to /tmp/darkbloom-jaccl-topology-<session>.json, and calls DistributedGroup.initialize(strict: false) with typed error propagation.
- Add ClusterMsgType.jacclBootstrap (0x07) and JacclBootstrapPayload to ClusterControlMessage.swift; rank 0 sends its chosen port + sessionID to rank 1 over the established ThunderboltLink control channel.
- ClusterDiscovery now calls startAsRank0(ownIP:peerIP:) / startAsRank1(ownIP:peerIP:), polls for session readiness, exchanges the jacclBootstrap frame, runs DistributedGroupBootstrap.bootstrap(...), and stores the resulting group in _distributedGroup (exposed as public var distributedGroup: DistributedGroup?).
- ClusterSession.ClusterPeer.handleConnection routes jacclBootstrap frames through the inferenceHandler alongside inferenceStep/inferenceToken.
- Add 15 unit tests in DistributedGroupBootstrapTests.swift covering env-var round-trips, topology JSON structure, JacclBootstrapPayload encode/decode, and DistributedGroup singleton API surface. Real two-process jaccl init is explicitly out of scope (requires TB5 + RDMA enabled in macOS Recovery).

Out of scope (tracked in subsequent PRs)
- LlamaModelTP loading (PR 4b)

Test plan
- swift build in provider-swift/: passes
- swift test --filter "DistributedGroupBootstrap": 15/15 pass
- darkbloom serve --rdma-enabled on both ranks; verify jaccl DistributedGroup ready appears in logs on both ranks after cable plug-in
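A single-process check of the kind the unit tests cover (env-var round-trips, topology JSON structure) might look like the sketch below. The env var names and the topology file name pattern come from this PR. Everything else is an assumption made for illustration: the values assigned to the variables, the "interfaces" JSON key, the helper names, and the BootstrapSketchError type, which only stands in for DistributedGroupBootstrapError. The real bridge100 topology schema is not shown in this PR.

```swift
import Foundation

// Hypothetical typed error, standing in for DistributedGroupBootstrapError.
enum BootstrapSketchError: Error, Equatable {
    case envVarNotSet(String)
    case topologyWriteFailed(String)
}

// Sets one variable and reads it back, throwing instead of returning a silent nil.
func setAndVerify(_ name: String, _ value: String) throws {
    guard setenv(name, value, 1) == 0,
          let raw = getenv(name), String(cString: raw) == value else {
        throw BootstrapSketchError.envVarNotSet(name)
    }
}

// Exports the three variables named in this PR. The value formats are assumptions.
func exportJacclEnvironment(rank: Int, coordinator: String, devices: String) throws {
    try setAndVerify("MLX_RANK", String(rank))
    try setAndVerify("MLX_JACCL_COORDINATOR", coordinator)
    try setAndVerify("MLX_IBV_DEVICES", devices)
}

// Writes a placeholder topology file; the JSON shape here is illustrative only.
func writeTopology(sessionID: String, interfaces: [String],
                   directory: String = "/tmp") throws -> URL {
    let url = URL(fileURLWithPath: directory)
        .appendingPathComponent("darkbloom-jaccl-topology-\(sessionID).json")
    do {
        let body = try JSONSerialization.data(withJSONObject: ["interfaces": interfaces],
                                              options: [.sortedKeys])
        try body.write(to: url, options: .atomic)
    } catch {
        throw BootstrapSketchError.topologyWriteFailed(url.path)
    }
    return url
}
```

A test can call exportJacclEnvironment, read the variables back with getenv, then parse the file written by writeTopology and assert on its keys. None of that requires a second process or RDMA hardware.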