feat(cluster): jaccl DistributedGroup bootstrap #195

Open
anupsv wants to merge 1 commit into feat/tp-default-dispatch from feat/jaccl-bootstrap

@anupsv anupsv commented May 21, 2026

⚠️ Stacked PR — merge #194 first. This branch is based on feat/tp-default-dispatch. Until that PR lands on master, this diff will include #194's commits.


Summary

  • Adds DistributedGroupBootstrap — sets the three jaccl env vars (MLX_RANK, MLX_JACCL_COORDINATOR, MLX_IBV_DEVICES), writes the bridge100 topology JSON to /tmp/darkbloom-jaccl-topology-<session>.json, and calls DistributedGroup.initialize(strict: false) with typed error propagation.
  • Adds ClusterMsgType.jacclBootstrap (0x07) and JacclBootstrapPayload to ClusterControlMessage.swift; rank 0 sends its chosen port + sessionID to rank 1 over the established ThunderboltLink control channel.
  • ClusterDiscovery now calls startAsRank0(ownIP:peerIP:) / startAsRank1(ownIP:peerIP:), polls for session readiness, exchanges the jacclBootstrap frame, runs DistributedGroupBootstrap.bootstrap(...), and stores the resulting group in _distributedGroup (exposed as public var distributedGroup: DistributedGroup?).
  • ClusterSession.ClusterPeer.handleConnection routes jacclBootstrap frames through the inferenceHandler alongside inferenceStep/inferenceToken.
  • 15 unit tests in DistributedGroupBootstrapTests.swift covering env-var round-trips, topology JSON structure, JacclBootstrapPayload encode/decode, and DistributedGroup singleton API surface. Real two-process jaccl init is explicitly out of scope (requires TB5 + RDMA enabled in macOS Recovery).
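
As a rough illustration of the first bullet, the environment setup and topology file could look something like the sketch below. `BootstrapInputs`, `BootstrapSketchError`, `writeTopology`, and `applyEnvironment` are hypothetical names, and the value formats (an `ip:port` coordinator string, `MLX_IBV_DEVICES` pointing at the topology file) and the JSON contents are assumptions; only the three variable names and the file path pattern come from the description above.

```swift
import Foundation

// Hypothetical inputs; names and value formats are assumptions, not the PR's API.
struct BootstrapInputs {
    let rank: Int
    let coordinatorIP: String
    let port: UInt16
    let sessionID: String
}

// Typed failures, so callers see an error instead of a silent nil.
enum BootstrapSketchError: Error, Equatable {
    case setenvFailed(String)
    case topologyWriteFailed(String)
}

// File path pattern taken from the PR description.
func topologyPath(sessionID: String) -> String {
    "/tmp/darkbloom-jaccl-topology-\(sessionID).json"
}

// Writes a placeholder topology document and returns its path.
// The real bridge100 topology schema is not shown in the PR; this shape is made up.
func writeTopology(_ inputs: BootstrapInputs) throws -> String {
    let path = topologyPath(sessionID: inputs.sessionID)
    let placeholder: [String: Any] = ["interface": "bridge100", "worldSize": 2]
    do {
        let data = try JSONSerialization.data(withJSONObject: placeholder,
                                              options: [.sortedKeys])
        try data.write(to: URL(fileURLWithPath: path), options: .atomic)
    } catch {
        throw BootstrapSketchError.topologyWriteFailed(path)
    }
    return path
}

// Sets the three variables named in the PR description.
// Assumption: MLX_IBV_DEVICES carries the path of the topology file.
func applyEnvironment(_ inputs: BootstrapInputs, topologyFile: String) throws {
    let vars: [(String, String)] = [
        ("MLX_RANK", String(inputs.rank)),
        ("MLX_JACCL_COORDINATOR", "\(inputs.coordinatorIP):\(inputs.port)"),
        ("MLX_IBV_DEVICES", topologyFile),
    ]
    for (name, value) in vars {
        // setenv returns 0 on success; overwrite any existing value.
        guard setenv(name, value, 1) == 0 else {
            throw BootstrapSketchError.setenvFailed(name)
        }
    }
}
```

The call to DistributedGroup.initialize(strict: false) is left out of the sketch because it needs the real jaccl backend; in the flow described above it would run after both helpers succeed.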

Out of scope (tracked in subsequent PRs)

  • LlamaModelTP loading (PR 4b)
  • TP decode loop on rank 0 + rank 1 serve loop (PR 4b)
  • PP decode loop (PR 4c)
  • Provider request routing (PR 4d)

Test plan

  • swift build in provider-swift/ — passes
  • swift test --filter "DistributedGroupBootstrap" — 15/15 pass
  • Manual: two-Mac TB5 rig, darkbloom serve --rdma-enabled on both; verify jaccl DistributedGroup ready appears in logs on both ranks after cable plug-in


Wire up the jaccl backend initialization step that runs on both ranks
immediately after the Thunderbolt SE handshake completes.

- Add DistributedGroupBootstrap: sets MLX_RANK, MLX_JACCL_COORDINATOR,
  and MLX_IBV_DEVICES, writes the bridge100 topology JSON to /tmp, then
  calls DistributedGroup.initialize(strict: false) and surfaces failures
  as typed DistributedGroupBootstrapError rather than silent nil.

- Add ClusterMsgType.jacclBootstrap (0x07) + JacclBootstrapPayload
  (port + sessionID): rank 0 sends this to rank 1 over the existing
  ThunderboltLink control channel after its session health becomes
  non-unavailable; rank 1 receives it through the inferenceHandler
  path (jacclBootstrap now routed there alongside inferenceStep/Token).

- ClusterDiscovery: startAsRank0/startAsRank1 now accept ownIP, poll
  for session readiness, exchange the jacclBootstrap frame, and store
  the resulting DistributedGroup in _distributedGroup. Exposed via
  public var distributedGroup: DistributedGroup? for PR 4b.

- 15 unit tests: env-var round-trips, topology JSON structure, frame
  encode/decode, DistributedGroup singleton API surface. Real two-process
  jaccl init is out of scope for single-process CI (needs TB5 + RDMA).
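
To make the jacclBootstrap frame concrete, here is a minimal sketch of how a port + sessionID payload could sit behind the 0x07 type byte and be decoded on rank 1. The wire layout (1-byte type, big-endian UInt16 port, UTF-8 sessionID) and the `JacclBootstrapPayloadSketch` type are assumptions for illustration; the actual layout in ClusterControlMessage.swift is not shown on this page.

```swift
import Foundation

// Message type value from the PR description (ClusterMsgType.jacclBootstrap).
let jacclBootstrapType: UInt8 = 0x07

// Hypothetical stand-in for JacclBootstrapPayload; the wire layout is assumed.
struct JacclBootstrapPayloadSketch: Equatable {
    let port: UInt16
    let sessionID: String

    // Assumed layout: [type: 1 byte][port: 2 bytes, big-endian][sessionID: UTF-8]
    func encode() -> Data {
        var frame = Data([jacclBootstrapType])
        frame.append(UInt8(port >> 8))
        frame.append(UInt8(port & 0xFF))
        frame.append(contentsOf: sessionID.utf8)
        return frame
    }

    // Returns nil for a short frame, a wrong type byte, or invalid UTF-8.
    static func decode(_ frame: Data) -> JacclBootstrapPayloadSketch? {
        let bytes = [UInt8](frame)
        guard bytes.count >= 3, bytes[0] == jacclBootstrapType else { return nil }
        let port = UInt16(bytes[1]) << 8 | UInt16(bytes[2])
        guard let sessionID = String(bytes: bytes[3...], encoding: .utf8) else {
            return nil
        }
        return JacclBootstrapPayloadSketch(port: port, sessionID: sessionID)
    }
}
```

A round-trip check in the spirit of the encode/decode unit tests would encode a payload on the rank 0 side, decode it on the rank 1 side, and compare the two values for equality.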

vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| d-inference | Ready | Preview | May 21, 2026 6:07pm |
| d-inference-console-ui-dev | Ready | Preview | May 21, 2026 6:07pm |
| d-inference-landing | Ready | Preview | May 21, 2026 6:07pm |

@github-actions

Benchmark Results

Runner: macos-15 (M1 Virtual) | Date: 2026-05-21 18:08 UTC

1-provider-streaming

1 providers, 1 users, 30 requests, concurrency=5, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 30 |
| Success | 4 |
| Errors | 26 |
| Total Duration | 4.031s |
| Throughput | 1.0 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 4 | 1.841s | 1.841s | 1.841s | 1.841s |
| parse | 4 | 134µs | 180µs | 304µs | 304µs |
| reserve | 4 | 5ms | 6ms | 8ms | 8ms |
| route | 4 | 729µs | 801µs | 843µs | 843µs |
| coordinator_to_provider | 4 | 1.825s | 1.826s | 1.827s | 1.827s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=133.5µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=304µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=5.1435ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=7.864ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

1-provider-non-streaming

1 providers, 1 users, 20 requests, concurrency=5, streaming=false

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 20 |
| Success | 4 |
| Errors | 16 |
| Total Duration | 2.346s |
| Throughput | 1.7 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 4 | 2.343s | 2.344s | 2.346s | 2.346s |
| parse | 4 | 28µs | 29µs | 48µs | 48µs |
| reserve | 4 | 5ms | 5ms | 6ms | 6ms |
| route | 4 | 538µs | 567µs | 577µs | 577µs |
| coordinator_to_provider | 4 | 1.766s | 1.766s | 1.768s | 1.768s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=27.5µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=48µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=4.5125ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=6.163ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

7-provider-multi-model

7 providers, 5 users, 50 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 4 | 0.5 GB |
| mlx-community/gemma-3-270m-4bit | 3 | 0.2 GB |

| Metric | Value |
|---|---|
| Total Requests | 50 |
| Success | 50 |
| Errors | 0 |
| Total Duration | 9.7s |
| Throughput | 5.2 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 50 | 557ms | 4ms | 3.297s | 3.304s |
| parse | 49 | 17µs | 16µs | 27µs | 44µs |
| reserve | 49 | 1ms | 1ms | 3ms | 4ms |
| route | 49 | 443µs | 410µs | 707µs | 858µs |
| coordinator_to_provider | 50 | 503ms | 1ms | 3.278s | 3.297s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=16.795µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=27µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=1.33702ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=2.644ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

3-provider-high-concurrency

3 providers, 10 users, 60 requests, concurrency=20, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 60 |
| Success | 12 |
| Errors | 48 |
| Total Duration | 3.355s |
| Throughput | 3.6 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 12 | 2.141s | 2.136s | 2.169s | 2.169s |
| parse | 12 | 25µs | 14µs | 142µs | 142µs |
| reserve | 12 | 2ms | 3ms | 4ms | 4ms |
| route | 12 | 1ms | 1ms | 1ms | 1ms |
| coordinator_to_provider | 12 | 2.133s | 2.129s | 2.162s | 2.162s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=25.333µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=142µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.491ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=3.518ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

1-provider-queue-saturation

1 providers, 10 users, 40 requests, concurrency=15, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 40 |
| Success | 4 |
| Errors | 36 |
| Total Duration | 3.294s |
| Throughput | 1.2 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 4 | 2.29s | 2.29s | 2.29s | 2.29s |
| parse | 4 | 33µs | 19µs | 82µs | 82µs |
| reserve | 4 | 8ms | 8ms | 8ms | 8ms |
| route | 4 | 1ms | 1ms | 1ms | 1ms |
| coordinator_to_provider | 4 | 2.277s | 2.278s | 2.278s | 2.278s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=33.25µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=82µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=8.001ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=8.159ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

3-provider-20-users

3 providers, 20 users, 60 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 60 |
| Success | 60 |
| Errors | 0 |
| Total Duration | 6.026s |
| Throughput | 10.0 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 60 | 367ms | 5ms | 2.186s | 2.186s |
| parse | 60 | 19µs | 17µs | 38µs | 47µs |
| reserve | 60 | 1ms | 1ms | 2ms | 3ms |
| route | 60 | 475µs | 434µs | 770µs | 978µs |
| coordinator_to_provider | 60 | 364ms | 2ms | 2.178s | 2.18s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=19.183µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=38µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=1.369616ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=2.411ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

1-provider-scaling

1 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 30 |
| Success | 4 |
| Errors | 26 |
| Total Duration | 3.058s |
| Throughput | 1.3 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 4 | 2.16s | 2.16s | 2.161s | 2.161s |
| parse | 4 | 19µs | 22µs | 22µs | 22µs |
| reserve | 4 | 2ms | 2ms | 2ms | 2ms |
| route | 4 | 530µs | 589µs | 718µs | 718µs |
| coordinator_to_provider | 4 | 2.154s | 2.154s | 2.154s | 2.154s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=19.25µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=22µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.09425ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=2.284ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

3-provider-scaling

3 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 30 |
| Success | 30 |
| Errors | 0 |
| Total Duration | 3.744s |
| Throughput | 8.0 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 30 | 709ms | 7ms | 2.134s | 2.134s |
| parse | 30 | 16µs | 13µs | 35µs | 58µs |
| reserve | 30 | 2ms | 1ms | 4ms | 4ms |
| route | 30 | 446µs | 433µs | 646µs | 828µs |
| coordinator_to_provider | 30 | 705ms | 4ms | 2.127s | 2.128s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=15.566µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=35µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=1.568233ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=3.786ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

5-provider-scaling

5 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 5 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 30 |
| Success | 30 |
| Errors | 0 |
| Total Duration | 4.146s |
| Throughput | 7.2 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 30 | 718ms | 5ms | 2.17s | 2.17s |
| parse | 30 | 15µs | 13µs | 28µs | 33µs |
| reserve | 30 | 2ms | 1ms | 3ms | 5ms |
| route | 30 | 475µs | 429µs | 777µs | 881µs |
| coordinator_to_provider | 30 | 714ms | 2ms | 2.163s | 2.165s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=15.3µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=28µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=1.633633ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=3.002ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

3-provider-heavy-100conc-10kb

3 providers, 20 users, 100 requests, concurrency=100, streaming=true

| Model | Providers | RAM |
|---|---|---|
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
|---|---|
| Total Requests | 100 |
| Success | 12 |
| Errors | 88 |
| Total Duration | 3.504s |
| Throughput | 3.4 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
|---|---|---|---|---|---|
| total_e2e | 12 | 2.197s | 2.189s | 2.243s | 2.243s |
| parse | 12 | 92µs | 91µs | 193µs | 193µs |
| reserve | 12 | 8ms | 8ms | 10ms | 10ms |
| route | 12 | 18ms | 18ms | 19ms | 19ms |
| coordinator_to_provider | 12 | 2.162s | 2.153s | 2.208s | 2.208s |

Assertion Report: FAIL

| Assertion | Result | Detail |
|---|---|---|
| parse:mean<=1ms | PASS | mean=92.416µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=193µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=8.42125ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=10.033ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |
