feat(cluster): PP decode loop by anupsv · Pull Request #197 · Layr-Labs/d-inference

anupsv · 2026-05-21T18:43:52Z

⚠️ Stacked PR — merge #196 first. This branch is based on feat/tp-decode-loop. Until that PR lands on master, this diff will include #196's commits.

Summary

Implements greedy pipeline-parallel (PP) inference over ThunderboltLink with AES-256-GCM sealed activation tensors, as the fallback path when TP (jaccl) is unavailable
New ppActivation / ppToken / ppSessionEnd frame types (0x0B–0x0D) with JSON-encoded Codable payloads; distinct from the legacy inferenceStep/inferenceToken raw-bytes protocol
EncryptedPipelineEngine (rank 0) and EncryptedPipelineServer (rank 1) fully rewritten with the new frame types, PPClusterSession protocol for mock-injectable testing, 3-retry logic, and actor-safe KV cache slicing
ClusterModelLoader.loadLlamaModel for single-rank LlamaModel load (no DistributedGroup); ClusterDiscovery wires PP engine/server as jaccl-bootstrap-failure fallback on both ranks
17 unit tests in PipelineParallelDecodeTests.swift; all TP tests still pass
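For reviewers unfamiliar with the new frame layout, here is a minimal sketch of a type byte followed by a JSON-encoded Codable payload. Everything in it is an assumption for illustration: the PR only states the 0x0B–0x0D range and the payload type names, so the byte-to-case mapping, the payload fields (requestID, position), and the encodeFrame/decodeFrame helpers are hypothetical and not the PR's actual code.

```swift
import Foundation

// Assumed mapping: the PR lists the three frame types and the range 0x0B–0x0D,
// but does not spell out which byte belongs to which case.
enum PPFrameType: UInt8 {
    case ppActivation = 0x0B
    case ppToken = 0x0C
    case ppSessionEnd = 0x0D
}

// Hypothetical payload shape; the real PPActivationPayload fields are not shown in the PR.
struct PPActivationPayload: Codable, Equatable {
    var requestID: String
    var position: Int
    var sealedActivation: Data  // JSONEncoder writes Data as base64 by default
}

enum FrameError: Error { case unknownType }

/// Encodes a frame as one type byte followed by the JSON payload.
func encodeFrame<P: Encodable>(_ type: PPFrameType, _ payload: P) throws -> Data {
    var frame = Data([type.rawValue])
    frame.append(try JSONEncoder().encode(payload))
    return frame
}

/// Splits a frame back into its type byte and decoded payload.
func decodeFrame<P: Decodable>(_ frame: Data, as payloadType: P.Type) throws -> (PPFrameType, P) {
    guard let first = frame.first, let type = PPFrameType(rawValue: first) else {
        throw FrameError.unknownType
    }
    // Copy the tail so the decoder sees a zero-based Data value.
    let body = Data(frame.dropFirst())
    return (type, try JSONDecoder().decode(payloadType, from: body))
}
```

A test in the spirit of the JSON round-trip tests would encode a payload, decode it again, and compare the two values for equality.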

Architecture notes

Double encryption kept: sealedActivation in ppActivation is AES-GCM sealed by TensorCrypto (inner), then the whole frame is AES-GCM sealed again by ClusterFrame.encode/decode (outer). This matches the design established by inferenceStep/inferenceToken and is intentional (defence-in-depth). TensorCrypto's docstring does not flag the outer layer as redundant.
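The two layers can be sketched with CryptoKit as below. This is not the TensorCrypto or ClusterFrame code: sealGCM and openGCM are hypothetical helpers, and using one key for both layers is an assumption made for brevity (the PR does not say whether the layers share a key). The CryptoKit calls themselves (AES.GCM.seal, AES.GCM.SealedBox(combined:), AES.GCM.open) are the standard API.

```swift
import CryptoKit
import Foundation

enum SealError: Error { case missingCombined }

/// Seals `plaintext` with AES-GCM and returns nonce || ciphertext || tag.
func sealGCM(_ plaintext: Data, using key: SymmetricKey) throws -> Data {
    let box = try AES.GCM.seal(plaintext, using: key)
    // `combined` is nil only for non-default nonce sizes.
    guard let combined = box.combined else { throw SealError.missingCombined }
    return combined
}

/// Opens data produced by `sealGCM`; throws if the authentication tag does not verify.
func openGCM(_ sealed: Data, using key: SymmetricKey) throws -> Data {
    let box = try AES.GCM.SealedBox(combined: sealed)
    return try AES.GCM.open(box, using: key)
}

// Assumption: a single 256-bit key for both layers, purely for illustration.
let sessionKey = SymmetricKey(size: .bits256)
let activation = Data([0x01, 0x02, 0x03, 0x04])

let innerSealed = try sealGCM(activation, using: sessionKey)   // inner layer (TensorCrypto-style)
let outerSealed = try sealGCM(innerSealed, using: sessionKey)  // outer layer (ClusterFrame-style)

// The receiver peels the layers in reverse order.
let recovered = try openGCM(try openGCM(outerSealed, using: sessionKey), using: sessionKey)
```

With the default 12-byte nonce and 16-byte tag, each layer adds 28 bytes, so the doubly sealed buffer is 56 bytes larger than the plaintext activation.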

KV cache slicing: EncryptedPipelineEngine init takes model.newCache(parameters: nil).prefix(splitLayer). EncryptedPipelineServer takes .suffix(numLayers - splitLayer). Both are verified in tests.
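A small sketch of the slicing arithmetic, assuming the cache is an array with one entry per layer. LayerCache and splitCaches are hypothetical stand-ins, not types from this PR; the split at numLayers / 2 is the default named in this description.

```swift
// Hypothetical stand-in for one layer's KV cache entry.
struct LayerCache: Equatable {
    let layerIndex: Int
}

/// Splits a full per-layer cache list at `splitLayer`:
/// rank 0 keeps the first `splitLayer` entries, rank 1 keeps the rest.
func splitCaches(_ caches: [LayerCache], splitLayer: Int) -> (rank0: [LayerCache], rank1: [LayerCache]) {
    precondition((0...caches.count).contains(splitLayer), "splitLayer out of range")
    let rank0 = Array(caches.prefix(splitLayer))
    let rank1 = Array(caches.suffix(caches.count - splitLayer))
    return (rank0, rank1)
}

let numLayers = 16
let allCaches = (0..<numLayers).map(LayerCache.init(layerIndex:))
let (rank0Caches, rank1Caches) = splitCaches(allCaches, splitLayer: numLayers / 2)
```

The two slices are disjoint and together cover every layer exactly once, which is the property a slicing test would want to check.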

PPClusterSession protocol: extends ClusterSessionSendable (from PR 4b) with receiveInferenceFrame() async throws -> Data and currentSessionKey() async throws -> SymmetricKey. ClusterSession conforms; tests use MockPPSession. The actor-isolation conformance mirrors the existing TP pattern exactly.
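To illustrate the mock-injection idea, here is a rough sketch of how such a protocol and a test double could look. Only the two method signatures come from this description. The ClusterSessionSendable parent is left out because its requirements are not shown here, and this MockPPSession is a hypothetical re-creation, not the one in the test file.

```swift
import CryptoKit
import Foundation

// Sketch only: the real protocol extends ClusterSessionSendable, omitted here.
protocol PPClusterSession: Sendable {
    func receiveInferenceFrame() async throws -> Data
    func currentSessionKey() async throws -> SymmetricKey
}

enum MockSessionError: Error { case noMoreFrames }

// Hypothetical test double that replays pre-queued frames in order.
actor MockPPSession: PPClusterSession {
    private var queued: [Data]
    private let key: SymmetricKey

    init(frames: [Data], key: SymmetricKey = SymmetricKey(size: .bits256)) {
        self.queued = frames
        self.key = key
    }

    func receiveInferenceFrame() async throws -> Data {
        guard !queued.isEmpty else { throw MockSessionError.noMoreFrames }
        return queued.removeFirst()
    }

    func currentSessionKey() async throws -> SymmetricKey { key }
}
```

Because the engine would depend only on the protocol, a test can hand it this actor instead of a live ClusterSession and assert on the exact frame sequence without a Thunderbolt link.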

ClusterModelLoader: loadLlamaModel(modelDirectory:) added alongside the existing load(modelDirectory:) (for LlamaModelTP). No removal of TP loader; PP just doesn't need the jaccl group.

Split point: ClusterDiscovery uses numLayers / 2 as the default splitLayer when building PP engine/server from the jaccl-failure fallback path. This is informational-only for now; PR 4d can expose this as a config knob.

Test plan

swift build passes clean (one pre-existing warning in ClusterCommand.swift, not introduced by this PR)
swift test --filter "PipelineParallelDecode" — 17/17 pass
swift test --filter "TensorParallelDecode" — 17/17 pass (no regression)
Two-Mac Thunderbolt smoke test (tracked for PR 4d)

Implements greedy pipeline-parallel inference over ThunderboltLink with AES-256-GCM sealed activation tensors. Rank 0 runs layers 0..splitLayer, seals the activation, and exchanges it with rank 1 (which runs splitLayer..N + norm + lm_head + argmax) via new ppActivation/ppToken/ppSessionEnd frame types (0x0B–0x0D). Wires EncryptedPipelineEngine/Server into ClusterDiscovery as the fallback when jaccl bootstrap fails.

- ClusterControlMessage: add ppActivation, ppToken, ppSessionEnd with JSON-encoded Codable payloads (PPActivationPayload, PPTokenPayload, PPSessionEndPayload); update frame-encoding comment table
- EncryptedPipelineInference: rewrite EncryptedPipelineEngine to use new frame types; introduce PPClusterSession protocol (extends ClusterSessionSendable with receiveInferenceFrame + currentSessionKey) for mock-injectable testing; rewrite EncryptedPipelineServer.handleRequest to dispatch on ppActivation/ppSessionEnd with per-request cache reset
- ClusterModelLoader: add loadLlamaModel(modelDirectory:) for PP (no jaccl DistributedGroup required, uses plain LlamaModel + callPartial)
- ClusterDiscovery: add _ppEngine/_ppServer; add tryBuildRank0PPEngine/tryBuildRank1PPServer; currentPPEngine()/currentPPServer() accessors; jaccl bootstrap failure on either rank now falls back to PP instead of aborting; ClusterPeer.serve routes ppActivation/ppToken/ppSessionEnd through inferenceHandler
- Tests: PipelineParallelDecodeTests.swift: 17 tests covering raw values, JSON round-trips, engine construction, generate() loop, frame sequence, EOS stopping, ppSessionEnd as final frame, sealed activation decryptability, and KV cache slicing for both ranks
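The greedy loop described above (argmax on the final logits, stop at EOS) can be sketched independently of the cluster plumbing. This is a simplified single-process illustration: greedyGenerate, its step closure, and the choice to exclude the EOS token from the returned tokens are assumptions, not the behaviour of generate() in this PR.

```swift
/// Returns the index of the largest logit (the greedy choice), or nil for empty input.
func argmax(_ logits: [Float]) -> Int? {
    guard var best = logits.indices.first else { return nil }
    for i in logits.indices where logits[i] > logits[best] { best = i }
    return best
}

/// Greedy generation: ask `step` for next-token logits, pick the argmax,
/// and stop at `eosToken` or after `maxTokens` generated tokens.
func greedyGenerate(prompt: [Int], eosToken: Int, maxTokens: Int,
                    step: ([Int]) -> [Float]) -> [Int] {
    var tokens = prompt
    var generated: [Int] = []
    while generated.count < maxTokens {
        guard let next = argmax(step(tokens)) else { break }
        if next == eosToken { break }  // assumption: EOS is not returned to the caller
        generated.append(next)
        tokens.append(next)
    }
    return generated
}
```

In the two-rank setup, the step closure corresponds to one round trip: rank 0 runs its layers and sends the sealed activation, and rank 1 finishes the forward pass and returns the chosen token.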

vercel · 2026-05-21T18:43:57Z

Project	Deployment	Actions	Updated (UTC)
d-inference	Ready	Preview	May 21, 2026 6:44pm
d-inference-console-ui-dev	Ready	Preview	May 21, 2026 6:44pm
d-inference-landing	Ready	Preview	May 21, 2026 6:44pm

github-actions · 2026-05-21T18:48:05Z

Benchmark Results

Runner: macos-15 (M1 Virtual) | Date: 2026-05-21 18:46 UTC

1-provider-streaming

1 providers, 1 users, 30 requests, concurrency=5, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	30
Success	4
Errors	26
Total Duration	3.852s
Throughput	1.0 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	4	1.746s	1.746s	1.746s	1.746s
parse	4	17µs	16µs	30µs	30µs
reserve	4	3ms	3ms	4ms	4ms
route	4	415µs	426µs	430µs	430µs
coordinator_to_provider	4	1.739s	1.739s	1.74s	1.74s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=16.5µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=30µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.663ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=3.941ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

1-provider-non-streaming

1 providers, 1 users, 20 requests, concurrency=5, streaming=false

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	20
Success	4
Errors	16
Total Duration	2.323s
Throughput	1.7 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	4	2.32s	2.321s	2.323s	2.323s
parse	4	41µs	61µs	63µs	63µs
reserve	4	6ms	6ms	8ms	8ms
route	4	733µs	755µs	803µs	803µs
coordinator_to_provider	4	1.726s	1.727s	1.728s	1.728s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=40.5µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=63µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=5.55025ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=7.624ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

7-provider-multi-model

7 providers, 5 users, 50 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	4	0.5 GB
mlx-community/gemma-3-270m-4bit	3	0.2 GB

Metric	Value
Total Requests	50
Success	50
Errors	0
Total Duration	10.615s
Throughput	4.7 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	50	619ms	4ms	3.66s	3.699s
parse	50	17µs	17µs	28µs	57µs
reserve	50	1ms	1ms	2ms	3ms
route	50	0s	0s	1ms	1ms
coordinator_to_provider	50	615ms	1ms	3.65s	3.693s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=17.42µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=28µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.39302ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=2.371ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

3-provider-high-concurrency

3 providers, 10 users, 60 requests, concurrency=20, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	60
Success	12
Errors	48
Total Duration	3.536s
Throughput	3.4 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	12	2.172s	2.164s	2.211s	2.211s
parse	12	13µs	12µs	23µs	23µs
reserve	12	3ms	2ms	4ms	4ms
route	12	519µs	471µs	853µs	853µs
coordinator_to_provider	12	2.164s	2.157s	2.205s	2.205s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=13.166µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=23µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.5155ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=3.996ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

1-provider-queue-saturation

1 providers, 10 users, 40 requests, concurrency=15, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	40
Success	4
Errors	36
Total Duration	3.166s
Throughput	1.3 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	4	2.257s	2.257s	2.257s	2.257s
parse	4	13µs	15µs	22µs	22µs
reserve	4	2ms	2ms	2ms	2ms
route	4	502µs	474µs	644µs	644µs
coordinator_to_provider	4	2.251s	2.251s	2.252s	2.252s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=12.75µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=22µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.2715ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=2.43ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

3-provider-20-users

3 providers, 20 users, 60 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	60
Success	60
Errors	0
Total Duration	5.424s
Throughput	11.1 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	60	348ms	5ms	2.088s	2.088s
parse	60	27µs	16µs	65µs	375µs
reserve	60	1ms	1ms	3ms	4ms
route	60	452µs	418µs	726µs	964µs
coordinator_to_provider	60	345ms	2ms	2.08s	2.081s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=26.883µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=65µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.4206ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=3.389ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

1-provider-scaling

1 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	1	0.5 GB

Metric	Value
Total Requests	30
Success	4
Errors	26
Total Duration	2.711s
Throughput	1.5 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	4	2.008s	2.008s	2.008s	2.008s
parse	4	15µs	12µs	26µs	26µs
reserve	4	2ms	2ms	3ms	3ms
route	4	480µs	507µs	529µs	529µs
coordinator_to_provider	4	2.003s	2.003s	2.003s	2.003s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=14.75µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=26µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.41675ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=2.828ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

3-provider-scaling

3 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	30
Success	30
Errors	0
Total Duration	3.945s
Throughput	7.6 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	30	715ms	7ms	2.156s	2.156s
parse	30	21µs	17µs	51µs	54µs
reserve	30	2ms	1ms	4ms	5ms
route	30	512µs	503µs	843µs	870µs
coordinator_to_provider	30	709ms	4ms	2.144s	2.145s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=21.166µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=51µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=2.007733ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=4.242ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

5-provider-scaling

5 providers, 5 users, 30 requests, concurrency=10, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	5	0.5 GB

Metric	Value
Total Requests	30
Success	30
Errors	0
Total Duration	4.124s
Throughput	7.3 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	30	713ms	6ms	2.143s	2.143s
parse	30	30µs	19µs	105µs	236µs
reserve	30	2ms	2ms	4ms	4ms
route	30	1ms	0s	1ms	1ms
coordinator_to_provider	30	708ms	2ms	2.133s	2.137s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=30µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=105µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=1.8233ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=3.555ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

3-provider-heavy-100conc-10kb

3 providers, 20 users, 100 requests, concurrency=100, streaming=true

Model	Providers	RAM
mlx-community/Qwen3.5-0.8B-MLX-4bit	3	0.5 GB

Metric	Value
Total Requests	100
Success	12
Errors	88
Total Duration	2.789s
Throughput	4.3 req/s

Latency Decomposition

Segment	Count	Mean	P50	P95	Max
total_e2e	12	1.943s	1.945s	1.948s	1.948s
parse	12	165µs	133µs	426µs	426µs
reserve	12	9ms	9ms	9ms	9ms
route	12	17ms	17ms	17ms	17ms
coordinator_to_provider	12	1.9s	1.903s	1.904s	1.904s

Assertion Report: FAIL

Assertion	Result	Detail
parse:mean<=1ms	PASS	mean=165.25µs (threshold=1ms)
parse:p95<=5ms	PASS	p95=426µs (threshold=5ms)
reserve:mean<=50ms	PASS	mean=8.719166ms (threshold=50ms)
reserve:p95<=200ms	PASS	p95=9.283ms (threshold=200ms)
encrypt:present	FAIL	no data for segment encrypt
dispatch:present	FAIL	no data for segment dispatch

anupsv mentioned this pull request May 22, 2026

feat(cluster): provider request routing + rank-1 coordinator opt-out + failure modes #198

Open

6 tasks

anupsv wants to merge 1 commit into feat/tp-decode-loop from feat/pp-decode-loop