feat(cluster): TP decode loop (rank 0 + rank 1) by anupsv · Pull Request #196 · Layr-Labs/d-inference

anupsv · 2026-05-21T18:33:27Z

⚠️ Stacked PR — merge #195 first. This branch is based on feat/jaccl-bootstrap. Until that PR lands on master, this diff will include #195's commits.

What

Implements PR 4b of the tensor-parallel inference stack: the rank-0 decode loop and rank-1 serve loop for LlamaModelTP.

Changes

Protocol

ClusterControlMessage: three new message types — promptTokens (0x08), stepToken (0x09), sessionStop (0x0A) — each with a Codable, Sendable payload struct carrying a request uid
ClusterPeer.serve: jacclBootstrap frames now route to a dedicated bootstrapHandler parameter; inference frames go to inferenceHandler. This cleanly separates the two paths and replaces the stopgap from PR 4a.
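
As a rough illustration of the three message types, here is a minimal sketch. Only the raw values (0x08, 0x09, 0x0A) and the promptTokens fields uid, tokens, and maxTokens come from this description. The String type for uid, the remaining field names, the "one type byte followed by a JSON payload" framing, and the helper names encodeFrame/decodeFrame are all assumptions made for the example and may not match the real ClusterControlMessage.

```swift
import Foundation

// Raw values are taken from the PR description.
enum ClusterMsgType: UInt8 {
    case promptTokens = 0x08
    case stepToken = 0x09
    case sessionStop = 0x0A
}

// Field types and names below are assumptions, apart from uid/tokens/maxTokens.
struct PromptTokensPayload: Codable, Sendable, Equatable {
    let uid: String
    let tokens: [Int]
    let maxTokens: Int
}

struct StepTokenPayload: Codable, Sendable, Equatable {
    let uid: String
    let token: Int
}

struct SessionStopPayload: Codable, Sendable, Equatable {
    let uid: String
}

// Assumed framing: one message-type byte followed by the JSON-encoded payload.
func encodeFrame<P: Encodable>(_ type: ClusterMsgType, _ payload: P) throws -> Data {
    var frame = Data([type.rawValue])
    frame.append(try JSONEncoder().encode(payload))
    return frame
}

// Returns nil when the frame is empty or the type byte is not one of the three cases.
func decodeFrame<P: Decodable>(_ frame: Data, as payloadType: P.Type) throws -> (ClusterMsgType, P)? {
    guard let byte = frame.first, let msgType = ClusterMsgType(rawValue: byte) else { return nil }
    let payload = try JSONDecoder().decode(payloadType, from: Data(frame.dropFirst()))
    return (msgType, payload)
}
```

Carrying the uid in every payload lets the receiver tie a stepToken or sessionStop back to the request that opened the session.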

Engine (rank 0) — TensorParallelEngine

Actor-isolated; holds LlamaModelTP via nonisolated(unsafe) for Swift 6 compatibility
generate(promptTokens:maxTokens:eosTokenIDs:) -> AsyncStream<Int>: sends promptTokens to rank 1, prefills, greedily samples, loops stepToken per token, sends sessionStop at end
Greedy-only (no temperature/top-p); singleton group degenerates to standard single-rank inference (allreduce is a no-op)
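
The decode loop described above can be sketched roughly as follows. This is not the TensorParallelEngine code: the model is replaced by a hypothetical forward closure that returns logits, the transport by a hypothetical send closure, and Frame is a stand-in for the real control messages. Two behaviours are assumptions: the EOS token is emitted before stopping, and stepToken is sent only when another decode step follows.

```swift
// Stand-in for the control messages sent to rank 1 (hypothetical type).
enum Frame: Equatable, Sendable {
    case promptTokens([Int])
    case stepToken(Int)
    case sessionStop
}

// Greedy sampling: index of the largest logit (first index wins on ties).
func argmax(_ logits: [Float]) -> Int {
    var best = 0
    for i in logits.indices where logits[i] > logits[best] { best = i }
    return best
}

// Synchronous core of the loop: prefill, then sample and step until EOS or maxTokens.
func greedyDecode(
    promptTokens: [Int],
    maxTokens: Int,
    eosTokenIDs: Set<Int>,
    forward: ([Int]) -> [Float],   // assumed: runs the model and returns last-position logits
    send: (Frame) -> Void,         // assumed: delivers a frame to rank 1
    emit: (Int) -> Void            // receives each sampled token
) {
    send(.promptTokens(promptTokens))
    var logits = forward(promptTokens)   // prefill
    var produced = 0
    while produced < maxTokens {
        let token = argmax(logits)
        emit(token)
        produced += 1
        if eosTokenIDs.contains(token) || produced == maxTokens { break }
        send(.stepToken(token))          // rank 1 must run the same decode step
        logits = forward([token])
    }
    send(.sessionStop)
}

// Thin AsyncStream wrapper around the synchronous loop.
func generate(
    promptTokens: [Int],
    maxTokens: Int,
    eosTokenIDs: Set<Int>,
    forward: @escaping @Sendable ([Int]) -> [Float],
    send: @escaping @Sendable (Frame) -> Void
) -> AsyncStream<Int> {
    AsyncStream { continuation in
        Task {
            greedyDecode(
                promptTokens: promptTokens, maxTokens: maxTokens, eosTokenIDs: eosTokenIDs,
                forward: forward, send: send,
                emit: { token in _ = continuation.yield(token) })
            continuation.finish()
        }
    }
}
```

Because sampling is a plain argmax, the same weights and prompt always produce the same tokens, which is the property the determinism test relies on.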

Server (rank 1) — TensorParallelServer

Actor-isolated; handleFrame(_ data: Data) dispatches on ClusterMsgType
promptTokens → reset KV cache + prefill (discards logits)
stepToken → decode step (discards logits; rank 0 samples)
sessionStop → clear cache
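
The dispatch above might look roughly like the sketch below. To keep the example synchronous it uses a struct rather than an actor, models the KV cache as a plain token history, and replaces the sharded forward pass with a hypothetical forward closure whose result is discarded. The frame layout (type byte plus JSON payload) and the payload field names other than uid, tokens, and maxTokens are assumptions.

```swift
import Foundation

enum ClusterMsgType: UInt8 {
    case promptTokens = 0x08
    case stepToken = 0x09
    case sessionStop = 0x0A
}

// Payload shapes are assumptions made for this sketch.
struct PromptTokensPayload: Codable {
    let uid: String
    let tokens: [Int]
    let maxTokens: Int
}

struct StepTokenPayload: Codable {
    let uid: String
    let token: Int
}

struct ServerSketch {
    // Stand-in for the KV cache: the tokens the model has consumed so far.
    private(set) var cachedTokens: [Int] = []
    // Stand-in for the forward pass; rank 1 discards the logits, so it returns nothing.
    let forward: ([Int]) -> Void

    init(forward: @escaping ([Int]) -> Void) {
        self.forward = forward
    }

    mutating func handleFrame(_ data: Data) throws {
        // Ignore empty frames and unknown message types.
        guard let byte = data.first, let msgType = ClusterMsgType(rawValue: byte) else { return }
        let body = Data(data.dropFirst())
        switch msgType {
        case .promptTokens:
            let payload = try JSONDecoder().decode(PromptTokensPayload.self, from: body)
            cachedTokens = []              // reset the cache for the new request
            forward(payload.tokens)        // prefill
            cachedTokens = payload.tokens
        case .stepToken:
            let payload = try JSONDecoder().decode(StepTokenPayload.self, from: body)
            forward([payload.token])       // one decode step; rank 0 does the sampling
            cachedTokens.append(payload.token)
        case .sessionStop:
            cachedTokens.removeAll()       // clear the cache
        }
    }
}
```

Rank 1 never samples: it only has to run the same forward passes as rank 0 so that the collective operations in the sharded layers stay in step.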

Model loading — ClusterModelLoader

Reads config.json, decodes LlamaConfiguration, calls MLX.DistributedGroup() (reads jaccl env vars set during bootstrap), constructs LlamaModelTP, loads weights via MLXLMCommon.loadWeights
Separate from LLMModelFactory because the factory doesn't thread DistributedGroup through its pipeline

Wiring — ClusterDiscovery

After jaccl bootstrap completes on either rank, attempts to build the engine (rank 0) or server (rank 1) if modelDirectory is set
setModelDirectory(_:) public setter; currentEngine() / currentServer() accessors for the provider serve loop (PR 4d)

Swift 6 Sendable

UncheckedSendableLLMModel (public struct, @unchecked Sendable) wraps any LLMModel so that it can be transferred across actor boundaries under a single-owner convention, without relying on sending
All engine/server inits accept UncheckedSendableLLMModel; internal storage uses nonisolated(unsafe) let
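
A minimal sketch of the wrapper pattern, assuming Swift 5.10 or later for nonisolated(unsafe). FakeModel, UncheckedSendableBox, and Owner are hypothetical names standing in for an LLMModel, UncheckedSendableLLMModel, and the engine/server actors; the real types may differ.

```swift
// A non-Sendable reference type standing in for a model (hypothetical).
final class FakeModel {
    var weights: [Float] = [0.5, 1.5]
}

// The wrapper asserts Sendable without compiler verification.
// It is only sound if exactly one owner uses the wrapped value after the hand-off.
struct UncheckedSendableBox<Value>: @unchecked Sendable {
    let value: Value
    init(_ value: Value) { self.value = value }
}

actor Owner {
    // nonisolated(unsafe) opts this property out of isolation checking,
    // so the non-Sendable model can be stored and read without diagnostics.
    nonisolated(unsafe) private let model: FakeModel

    init(_ box: UncheckedSendableBox<FakeModel>) {
        self.model = box.value
    }

    // Actor-isolated access: callers must await.
    func weightSum() -> Float {
        model.weights.reduce(0, +)
    }

    // Synchronous access is possible only because of nonisolated(unsafe).
    nonisolated var weightCount: Int {
        model.weights.count
    }
}
```

The compiler does not check any of this, so the single-owner rule has to be upheld by convention: the code that creates the box should not touch the model again after passing it to the actor.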

Tests (`TensorParallelDecodeTests.swift` — 17 tests, all passing)

Raw values for the three new ClusterMsgType cases
JSON round-trip for all three payload structs
TensorParallelEngine and TensorParallelServer construct without error on singleton group
generate() produces ≤ maxTokens tokens and completes cleanly
generate() is deterministic (same weights + same prompt → same output)
generate() stops at EOS token after exactly 1 token
Frame sequence: promptTokens → stepToken × N → sessionStop
promptTokens frame content (uid, tokens, maxTokens)
handleFrame routing for all three server-side frame types
ClusterPeer.serve signature compilation check (bootstrapHandler present)

Out of scope

LlamaModelTPQ (quantized variant)
Pipeline-parallel decode (PR 4c)
Failure modes / heartbeat / request routing to the engine (PR 4d)
Sampling variants other than greedy


Add tensor-parallel inference engine and server:
- ClusterControlMessage: add promptTokens (0x08), stepToken (0x09), sessionStop (0x0A) message types with Codable payload structs
- ClusterSession/ClusterPeer: split jacclBootstrap into its own bootstrapHandler parameter, separate from inferenceHandler
- TensorParallelInference: implement TensorParallelEngine (rank 0, greedy decode loop with AsyncStream) and TensorParallelServer (rank 1 frame handler); add ClusterSessionSendable protocol for testability; expose UncheckedSendableLLMModel for Swift 6 safety
- ClusterModelLoader: load LlamaModelTP from a model directory using the jaccl DistributedGroup from the process environment
- ClusterDiscovery: wire LlamaModelTP construction into the bootstrap completion path; expose currentEngine()/currentServer() accessors
- TensorParallelDecodeTests: 17 tests covering message types, payload round-trips, engine/server construction, generate() semantics (maxTokens, EOS, determinism, frame sequence), and handleFrame routing

vercel · 2026-05-21T18:33:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
d-inference	Ready	Preview	May 21, 2026 6:33pm
d-inference-console-ui-dev	Ready	Preview	May 21, 2026 6:33pm
d-inference-landing	Ready	Preview	May 21, 2026 6:33pm

github-actions · 2026-05-21T18:37:43Z

Benchmark Results

Runner: macos-15 (M1 Virtual) | Date: 2026-05-21 18:35 UTC

1-provider-streaming

1 providers, 1 users, 30 requests, concurrency=5, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	30
Success	4
Errors	26
Total Duration	3.76s
Throughput	1.1 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	4	1.691s	1.691s	1.691s	1.691s
parse	4	18µs	15µs	32µs	32µs
reserve	4	3ms	3ms	4ms	4ms
route	4	383µs	415µs	445µs	445µs
coordinator_to_provider	4	1.685s	1.685s	1.686s	1.686s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=17.5µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=32µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.5965ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=3.841ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

1-provider-non-streaming

1 providers, 1 users, 20 requests, concurrency=5, streaming=false

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	20
Success	4
Errors	16
Total Duration	2.377s
Throughput	1.7 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	4	2.374s	2.375s	2.376s	2.376s
parse	4	96µs	91µs	196µs	196µs
reserve	4	7ms	8ms	9ms	9ms
route	4	1ms	1ms	3ms	3ms
coordinator_to_provider	4	1.716s	1.717s	1.718s	1.718s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=96.25µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=196µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=7.15ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=8.904ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

7-provider-multi-model

7 providers, 5 users, 50 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	4	0.5 GB
mlx-community/gemma-3-270m-4bit	3	0.2 GB

Metric	Value
Total Requests	50
Success	50
Errors	0
Total Duration	10.533s
Throughput	4.7 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	50	642ms	4ms	3.837s	3.843s
parse	48	18µs	17µs	47µs	53µs
reserve	48	1ms	1ms	3ms	4ms
route	48	436µs	406µs	702µs	790µs
coordinator_to_provider	50	538ms	1ms	3.815s	3.838s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=18.375µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=47µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.45227ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=3.01ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

3-provider-high-concurrency

3 providers, 10 users, 60 requests, concurrency=20, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	60
Success	12
Errors	48
Total Duration	3.602s
Throughput	3.3 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	12	2.153s	2.149s	2.17s	2.17s
parse	12	31µs	18µs	188µs	188µs
reserve	12	3ms	3ms	4ms	4ms
route	12	609µs	607µs	954µs	954µs
coordinator_to_provider	12	2.144s	2.141s	2.161s	2.161s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=31.083µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=188µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.85625ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=3.816ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

1-provider-queue-saturation

1 providers, 10 users, 40 requests, concurrency=15, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	40
Success	5
Errors	35
Total Duration	3.096s
Throughput	1.6 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	5	2.223s	2.04s	2.952s	2.952s
parse	5	13µs	12µs	22µs	22µs
reserve	5	2ms	2ms	2ms	2ms
route	5	588ms	1ms	2.94s	2.94s
queue_wait	1	2.94s	2.94s	2.94s	2.94s
dispatch	1	103µs	103µs	103µs	103µs
coordinator_to_provider	5	1.629s	2.034s	2.035s	2.035s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=13.2µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=22µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.9704ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=2.251ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:mean<=5ms	PASS	mean=103µs (threshold=5ms)
dispatch:p95<=50ms	PASS	p95=103µs (threshold=50ms)

3-provider-20-users

3 providers, 20 users, 60 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	60
Success	60
Errors	0
Total Duration	5.462s
Throughput	11.0 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	60	367ms	5ms	2.192s	2.192s
parse	60	15µs	14µs	26µs	36µs
reserve	60	1ms	1ms	3ms	3ms
route	60	452µs	413µs	794µs	976µs
coordinator_to_provider	60	364ms	3ms	2.185s	2.186s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=15.15µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=26µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.224283ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=2.808ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

1-provider-scaling

1 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	30
Success	4
Errors	26
Total Duration	2.936s
Throughput	1.4 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	4	2.079s	2.079s	2.079s	2.079s
parse	4	26µs	29µs	30µs	30µs
reserve	4	5ms	5ms	5ms	5ms
route	4	777µs	818µs	840µs	840µs
coordinator_to_provider	4	2.07s	2.07s	2.07s	2.07s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=25.75µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=30µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=4.61075ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=4.788ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

3-provider-scaling

3 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	30
Success	30
Errors	0
Total Duration	3.896s
Throughput	7.7 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	30	683ms	6ms	2.06s	2.06s
parse	30	18µs	13µs	46µs	113µs
reserve	30	2ms	1ms	4ms	5ms
route	30	450µs	419µs	649µs	678µs
coordinator_to_provider	30	679ms	4ms	2.054s	2.055s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=18.266µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=46µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.8116ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=4.169ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

5-provider-scaling

5 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	5	0.5 GB

Metric	Value
Total Requests	30
Success	30
Errors	0
Total Duration	4.003s
Throughput	7.5 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	30	735ms	4ms	2.22s	2.22s
parse	30	18µs	15µs	25µs	97µs
reserve	30	1ms	1ms	2ms	3ms
route	30	445µs	450µs	622µs	685µs
coordinator_to_provider	30	731ms	1ms	2.214s	2.215s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=18.033µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=25µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.3902ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=2.494ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

3-provider-heavy-100conc-10kb

3 providers, 20 users, 100 requests, concurrency=100, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	100
Success	12
Errors	88
Total Duration	3.347s
Throughput	3.6 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	12	2.187s	2.173s	2.215s	2.215s
parse	12	116µs	101µs	272µs	272µs
reserve	12	9ms	8ms	10ms	10ms
route	12	20ms	20ms	21ms	21ms
coordinator_to_provider	12	2.146s	2.132s	2.183s	2.183s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=116.416µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=272µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=8.638916ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=10.12ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

vercel Bot deployed to Preview – d-inference May 21, 2026 18:33

vercel Bot deployed to Preview – d-inference-console-ui-dev May 21, 2026 18:33

anupsv mentioned this pull request May 22, 2026

feat(cluster): PP decode loop #197 (Open, 4 tasks)

anupsv wants to merge 1 commit into feat/jaccl-bootstrap from feat/tp-decode-loop
