
feat(cluster): TP decode loop (rank 0 + rank 1) #196

Open
anupsv wants to merge 1 commit into feat/jaccl-bootstrap from feat/tp-decode-loop

@anupsv anupsv commented May 21, 2026

⚠️ Stacked PR — merge #195 first. This branch is based on feat/jaccl-bootstrap. Until that PR lands on master, this diff will include #195's commits.


What

Implements PR 4b of the tensor-parallel inference stack: the rank-0 decode loop and rank-1 serve loop for LlamaModelTP.

Changes

Protocol

  • ClusterControlMessage: three new message types — promptTokens (0x08), stepToken (0x09), sessionStop (0x0A) — each with a Codable, Sendable payload struct carrying a request uid
  • ClusterPeer.serve: jacclBootstrap frames now route to a dedicated bootstrapHandler parameter; inference frames go to inferenceHandler. This replaces the stopgap routing from PR 4a with a clean separation.
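
For reviewers who want a concrete picture of the new message types, here is a minimal sketch. The raw values (0x08, 0x09, 0x0A) and the promptTokens field names (uid, tokens, maxTokens) come from this PR; the field types, the one-byte-tag-plus-JSON framing, and the encodeFrame/decodeFrame helpers are assumptions made for illustration and may not match the real wire format.

```swift
import Foundation

// Raw values are taken from the PR description.
enum ClusterMsgType: UInt8 {
    case promptTokens = 0x08
    case stepToken = 0x09
    case sessionStop = 0x0A
}

// Field names follow the PR's test list; the field types are assumed.
struct PromptTokensPayload: Codable, Sendable, Equatable {
    let uid: String
    let tokens: [Int]
    let maxTokens: Int
}

// Hypothetical framing: one type byte followed by the JSON-encoded payload.
func encodeFrame<P: Encodable>(_ type: ClusterMsgType, _ payload: P) throws -> Data {
    var frame = Data([type.rawValue])
    frame.append(try JSONEncoder().encode(payload))
    return frame
}

// Returns nil when the frame is empty or the type byte is unknown.
func decodeFrame<P: Decodable>(_ frame: Data, as _: P.Type) throws -> (ClusterMsgType, P)? {
    guard let tag = frame.first, let type = ClusterMsgType(rawValue: tag) else { return nil }
    // Copy the slice so the decoder sees a zero-based Data value.
    let body = Data(frame.dropFirst())
    return (type, try JSONDecoder().decode(P.self, from: body))
}
```

Carrying the uid in every payload lets the receiving rank tie each frame to a request without keeping separate per-connection state.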

Engine (rank 0) — TensorParallelEngine

  • Actor-isolated; holds LlamaModelTP via nonisolated(unsafe) for Swift 6 compatibility
  • generate(promptTokens:maxTokens:eosTokenIDs:) -> AsyncStream<Int>: sends promptTokens to rank 1, prefills, greedily samples, loops stepToken per token, sends sessionStop at end
  • Greedy-only (no temperature/top-p); singleton group degenerates to standard single-rank inference (allreduce is a no-op)
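
The rank-0 flow described above can be sketched as a free function. This is not the TensorParallelEngine code: forward is a hypothetical stand-in for the model that recomputes logits from the full context (no KV cache), notifyPeer stands in for the session send, and the choice to send stepToken for every non-EOS token is a guess at the ordering.

```swift
// Control messages sent to the peer rank (payloads simplified for the sketch).
enum ControlEvent: Equatable, Sendable {
    case promptTokens([Int])
    case stepToken(Int)
    case sessionStop
}

// Greedy sampling: index of the largest logit (0 for an empty array).
func argmax(_ logits: [Float]) -> Int {
    var best = 0
    for i in logits.indices where logits[i] > logits[best] { best = i }
    return best
}

func greedyGenerate(
    promptTokens: [Int],
    maxTokens: Int,
    eosTokenIDs: Set<Int>,
    forward: @escaping @Sendable ([Int]) -> [Float],            // hypothetical model call
    notifyPeer: @escaping @Sendable (ControlEvent) async -> Void // hypothetical session send
) -> AsyncStream<Int> {
    AsyncStream { continuation in
        let task = Task {
            await notifyPeer(.promptTokens(promptTokens))
            var context = promptTokens
            var produced = 0
            while produced < maxTokens && !Task.isCancelled {
                let token = argmax(forward(context))
                continuation.yield(token)
                produced += 1
                // The EOS token is yielded, then generation stops.
                if eosTokenIDs.contains(token) { break }
                await notifyPeer(.stepToken(token))
                context.append(token)
            }
            await notifyPeer(.sessionStop)
            continuation.finish()
        }
        // Stop decoding if the consumer drops the stream.
        continuation.onTermination = { _ in task.cancel() }
    }
}
```

Because sessionStop is sent before the stream finishes, a consumer that has drained the stream can assume the peer was already told to clear its state.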

Server (rank 1) — TensorParallelServer

  • Actor-isolated; handleFrame(_ data: Data) dispatches on ClusterMsgType
  • promptTokens → reset KV cache + prefill (discards logits)
  • stepToken → decode step (discards logits; rank 0 samples)
  • sessionStop → clear cache
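
A rough sketch of the rank-1 dispatch, for illustration only. The framing (one type byte followed by JSON), the payload structs, and the array standing in for the KV cache are all assumptions; the real TensorParallelServer runs the model and discards the logits.

```swift
import Foundation

// Same raw values as the PR; renamed to make clear this is a sketch type.
enum ServerMsgType: UInt8 {
    case promptTokens = 0x08
    case stepToken = 0x09
    case sessionStop = 0x0A
}

// Hypothetical payload shapes.
struct PromptFrame: Codable { let uid: String; let tokens: [Int] }
struct StepFrame: Codable { let uid: String; let token: Int }

enum RouteError: Error { case emptyFrame, unknownType(UInt8) }

actor FrameRouter {
    // Stand-in for the KV cache: the tokens processed so far.
    private(set) var cache: [Int] = []

    func handleFrame(_ data: Data) throws {
        guard let tag = data.first else { throw RouteError.emptyFrame }
        guard let kind = ServerMsgType(rawValue: tag) else { throw RouteError.unknownType(tag) }
        let body = Data(data.dropFirst())
        switch kind {
        case .promptTokens:
            // Reset the cache, then "prefill" with the prompt.
            let prompt = try JSONDecoder().decode(PromptFrame.self, from: body)
            cache = prompt.tokens
        case .stepToken:
            // One decode step with the token rank 0 sampled.
            let step = try JSONDecoder().decode(StepFrame.self, from: body)
            cache.append(step.token)
        case .sessionStop:
            cache.removeAll()
        }
    }
}
```

Keeping the handler on an actor means frames are processed one at a time, so the cache cannot be mutated by two frames concurrently.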

Model loading — ClusterModelLoader

  • Reads config.json, decodes LlamaConfiguration, calls MLX.DistributedGroup() (reads jaccl env vars set during bootstrap), constructs LlamaModelTP, loads weights via MLXLMCommon.loadWeights
  • Separate from LLMModelFactory because the factory doesn't thread DistributedGroup through its pipeline
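
The first loader step (read config.json and decode it) might look roughly like this. ModelConfigSketch and loadConfig are hypothetical names covering only three common Llama config keys; the real loader decodes LlamaConfiguration and goes on to build the distributed group and load weights, which this sketch leaves out.

```swift
import Foundation

// Hypothetical subset of a Llama config.json; the real type is LlamaConfiguration.
struct ModelConfigSketch: Codable, Equatable {
    let hiddenSize: Int
    let numAttentionHeads: Int
    let numHiddenLayers: Int
}

// Reads <directory>/config.json and decodes the fields above.
// Keys not listed in the struct are ignored by JSONDecoder.
func loadConfig(fromModelDirectory directory: URL) throws -> ModelConfigSketch {
    let url = directory.appendingPathComponent("config.json")
    let data = try Data(contentsOf: url)
    let decoder = JSONDecoder()
    // Maps snake_case keys such as hidden_size to hiddenSize.
    decoder.keyDecodingStrategy = .convertFromSnakeCase
    return try decoder.decode(ModelConfigSketch.self, from: data)
}
```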

Wiring — ClusterDiscovery

  • After jaccl bootstrap completes on either rank, attempts to build the engine (rank 0) or server (rank 1) if modelDirectory is set
  • setModelDirectory(_:) public setter; currentEngine() / currentServer() accessors for the provider serve loop (PR 4d)

Swift 6 Sendable

  • UncheckedSendableLLMModel (public struct, @unchecked Sendable) wraps any LLMModel so a single owner can hand it across an actor boundary without relying on the sending parameter modifier
  • All engine/server inits accept UncheckedSendableLLMModel; internal storage uses nonisolated(unsafe) let
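
The wrapper pattern can be sketched generically. ToyModel, UncheckedSendableBox, and Owner are hypothetical names; the PR's wrapper is specific to LLMModel. The pattern is only sound if the caller gives up its own reference after the handoff, which the compiler does not check.

```swift
// A non-Sendable class with mutable state, standing in for the model.
final class ToyModel {
    private var calls = 0
    func step() -> Int {
        calls += 1
        return calls
    }
}

// Opts out of Sendable checking; correctness rests on single ownership.
struct UncheckedSendableBox<Value>: @unchecked Sendable {
    let value: Value
    init(_ value: Value) { self.value = value }
}

actor Owner {
    // Mirrors the PR's storage choice; all access still goes through the actor.
    nonisolated(unsafe) private let model: ToyModel

    init(_ box: UncheckedSendableBox<ToyModel>) {
        self.model = box.value
    }

    func step() -> Int { model.step() }
}
```

After the box is unwrapped in the initializer, every call to the model is serialized by the actor, so the unchecked part is limited to the moment of transfer.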

Tests (TensorParallelDecodeTests.swift — 17 tests, all passing)

  • Raw values for the three new ClusterMsgType cases
  • JSON round-trip for all three payload structs
  • TensorParallelEngine and TensorParallelServer construct without error on singleton group
  • generate() produces ≤ maxTokens tokens and completes cleanly
  • generate() is deterministic (same weights + same prompt → same output)
  • generate() stops at EOS token after exactly 1 token
  • Frame sequence: promptTokens → stepToken × N → sessionStop
  • promptTokens frame content (uid, tokens, maxTokens)
  • handleFrame routing for all three server-side frame types
  • ClusterPeer.serve signature compilation check (bootstrapHandler present)

Out of scope

  • LlamaModelTPQ (quantized variant)
  • Pipeline-parallel decode (PR 4c)
  • Failure modes / heartbeat / request routing to the engine (PR 4d)
  • Sampling variants other than greedy


Add tensor-parallel inference engine and server:

- ClusterControlMessage: add promptTokens (0x08), stepToken (0x09),
  sessionStop (0x0A) message types with Codable payload structs
- ClusterSession/ClusterPeer: split jacclBootstrap into its own
  bootstrapHandler parameter, separate from inferenceHandler
- TensorParallelInference: implement TensorParallelEngine (rank 0,
  greedy decode loop with AsyncStream) and TensorParallelServer
  (rank 1 frame handler); add ClusterSessionSendable protocol for
  testability; expose UncheckedSendableLLMModel for Swift 6 safety
- ClusterModelLoader: load LlamaModelTP from a model directory using
  the jaccl DistributedGroup from the process environment
- ClusterDiscovery: wire LlamaModelTP construction into the bootstrap
  completion path; expose currentEngine()/currentServer() accessors
- TensorParallelDecodeTests: 17 tests covering message types, payload
  round-trips, engine/server construction, generate() semantics
  (maxTokens, EOS, determinism, frame sequence), and handleFrame routing

vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
d-inference Ready Ready Preview May 21, 2026 6:33pm
d-inference-console-ui-dev Ready Ready Preview May 21, 2026 6:33pm
d-inference-landing Ready Ready Preview May 21, 2026 6:33pm


@github-actions

Benchmark Results

Runner: macos-15 (M1 Virtual) | Date: 2026-05-21 18:35 UTC

1-provider-streaming

1 providers, 1 users, 30 requests, concurrency=5, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 1 0.5 GB

Metric Value
Total Requests 30
Success 4
Errors 26
Total Duration 3.76s
Throughput 1.1 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 4 1.691s 1.691s 1.691s 1.691s
parse 4 18µs 15µs 32µs 32µs
reserve 4 3ms 3ms 4ms 4ms
route 4 383µs 415µs 445µs 445µs
coordinator_to_provider 4 1.685s 1.685s 1.686s 1.686s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=17.5µs (threshold=1ms)
parse:p95<=5ms PASS p95=32µs (threshold=5ms)
reserve:mean<=50ms PASS mean=2.5965ms (threshold=50ms)
reserve:p95<=200ms PASS p95=3.841ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

1-provider-non-streaming

1 providers, 1 users, 20 requests, concurrency=5, streaming=false

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 1 0.5 GB

Metric Value
Total Requests 20
Success 4
Errors 16
Total Duration 2.377s
Throughput 1.7 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 4 2.374s 2.375s 2.376s 2.376s
parse 4 96µs 91µs 196µs 196µs
reserve 4 7ms 8ms 9ms 9ms
route 4 1ms 1ms 3ms 3ms
coordinator_to_provider 4 1.716s 1.717s 1.718s 1.718s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=96.25µs (threshold=1ms)
parse:p95<=5ms PASS p95=196µs (threshold=5ms)
reserve:mean<=50ms PASS mean=7.15ms (threshold=50ms)
reserve:p95<=200ms PASS p95=8.904ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

7-provider-multi-model

7 providers, 5 users, 50 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 4 0.5 GB
mlx-community/gemma-3-270m-4bit 3 0.2 GB

Metric Value
Total Requests 50
Success 50
Errors 0
Total Duration 10.533s
Throughput 4.7 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 50 642ms 4ms 3.837s 3.843s
parse 48 18µs 17µs 47µs 53µs
reserve 48 1ms 1ms 3ms 4ms
route 48 436µs 406µs 702µs 790µs
coordinator_to_provider 50 538ms 1ms 3.815s 3.838s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=18.375µs (threshold=1ms)
parse:p95<=5ms PASS p95=47µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.45227ms (threshold=50ms)
reserve:p95<=200ms PASS p95=3.01ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

3-provider-high-concurrency

3 providers, 10 users, 60 requests, concurrency=20, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 3 0.5 GB

Metric Value
Total Requests 60
Success 12
Errors 48
Total Duration 3.602s
Throughput 3.3 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 12 2.153s 2.149s 2.17s 2.17s
parse 12 31µs 18µs 188µs 188µs
reserve 12 3ms 3ms 4ms 4ms
route 12 609µs 607µs 954µs 954µs
coordinator_to_provider 12 2.144s 2.141s 2.161s 2.161s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=31.083µs (threshold=1ms)
parse:p95<=5ms PASS p95=188µs (threshold=5ms)
reserve:mean<=50ms PASS mean=2.85625ms (threshold=50ms)
reserve:p95<=200ms PASS p95=3.816ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

1-provider-queue-saturation

1 providers, 10 users, 40 requests, concurrency=15, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 1 0.5 GB

Metric Value
Total Requests 40
Success 5
Errors 35
Total Duration 3.096s
Throughput 1.6 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 5 2.223s 2.04s 2.952s 2.952s
parse 5 13µs 12µs 22µs 22µs
reserve 5 2ms 2ms 2ms 2ms
route 5 588ms 1ms 2.94s 2.94s
queue_wait 1 2.94s 2.94s 2.94s 2.94s
dispatch 1 103µs 103µs 103µs 103µs
coordinator_to_provider 5 1.629s 2.034s 2.035s 2.035s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=13.2µs (threshold=1ms)
parse:p95<=5ms PASS p95=22µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.9704ms (threshold=50ms)
reserve:p95<=200ms PASS p95=2.251ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:mean<=5ms PASS mean=103µs (threshold=5ms)
dispatch:p95<=50ms PASS p95=103µs (threshold=50ms)

3-provider-20-users

3 providers, 20 users, 60 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 3 0.5 GB

Metric Value
Total Requests 60
Success 60
Errors 0
Total Duration 5.462s
Throughput 11.0 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 60 367ms 5ms 2.192s 2.192s
parse 60 15µs 14µs 26µs 36µs
reserve 60 1ms 1ms 3ms 3ms
route 60 452µs 413µs 794µs 976µs
coordinator_to_provider 60 364ms 3ms 2.185s 2.186s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=15.15µs (threshold=1ms)
parse:p95<=5ms PASS p95=26µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.224283ms (threshold=50ms)
reserve:p95<=200ms PASS p95=2.808ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

1-provider-scaling

1 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 1 0.5 GB

Metric Value
Total Requests 30
Success 4
Errors 26
Total Duration 2.936s
Throughput 1.4 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 4 2.079s 2.079s 2.079s 2.079s
parse 4 26µs 29µs 30µs 30µs
reserve 4 5ms 5ms 5ms 5ms
route 4 777µs 818µs 840µs 840µs
coordinator_to_provider 4 2.07s 2.07s 2.07s 2.07s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=25.75µs (threshold=1ms)
parse:p95<=5ms PASS p95=30µs (threshold=5ms)
reserve:mean<=50ms PASS mean=4.61075ms (threshold=50ms)
reserve:p95<=200ms PASS p95=4.788ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

3-provider-scaling

3 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 3 0.5 GB

Metric Value
Total Requests 30
Success 30
Errors 0
Total Duration 3.896s
Throughput 7.7 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 30 683ms 6ms 2.06s 2.06s
parse 30 18µs 13µs 46µs 113µs
reserve 30 2ms 1ms 4ms 5ms
route 30 450µs 419µs 649µs 678µs
coordinator_to_provider 30 679ms 4ms 2.054s 2.055s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=18.266µs (threshold=1ms)
parse:p95<=5ms PASS p95=46µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.8116ms (threshold=50ms)
reserve:p95<=200ms PASS p95=4.169ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

5-provider-scaling

5 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 5 0.5 GB

Metric Value
Total Requests 30
Success 30
Errors 0
Total Duration 4.003s
Throughput 7.5 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 30 735ms 4ms 2.22s 2.22s
parse 30 18µs 15µs 25µs 97µs
reserve 30 1ms 1ms 2ms 3ms
route 30 445µs 450µs 622µs 685µs
coordinator_to_provider 30 731ms 1ms 2.214s 2.215s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=18.033µs (threshold=1ms)
parse:p95<=5ms PASS p95=25µs (threshold=5ms)
reserve:mean<=50ms PASS mean=1.3902ms (threshold=50ms)
reserve:p95<=200ms PASS p95=2.494ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

3-provider-heavy-100conc-10kb

3 providers, 20 users, 100 requests, concurrency=100, streaming=true

Model Providers RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit 3 0.5 GB

Metric Value
Total Requests 100
Success 12
Errors 88
Total Duration 3.347s
Throughput 3.6 req/s

Latency Decomposition

Segment Count Mean P50 P95 Max
total_e2e 12 2.187s 2.173s 2.215s 2.215s
parse 12 116µs 101µs 272µs 272µs
reserve 12 9ms 8ms 10ms 10ms
route 12 20ms 20ms 21ms 21ms
coordinator_to_provider 12 2.146s 2.132s 2.183s 2.183s

Assertion Report: FAIL

Assertion Result Detail
parse:mean<=1ms PASS mean=116.416µs (threshold=1ms)
parse:p95<=5ms PASS p95=272µs (threshold=5ms)
reserve:mean<=50ms PASS mean=8.638916ms (threshold=50ms)
reserve:p95<=200ms PASS p95=10.12ms (threshold=200ms)
encrypt:present FAIL no data for segment encrypt
dispatch:present FAIL no data for segment dispatch

@anupsv anupsv mentioned this pull request May 22, 2026
4 tasks