feat(cluster): PP decode loop #197

Open
anupsv wants to merge 1 commit into feat/tp-decode-loop from feat/pp-decode-loop

@anupsv anupsv commented May 21, 2026

⚠️ Stacked PR — merge #196 first. This branch is based on feat/tp-decode-loop. Until that PR lands on master, this diff will include #196's commits.


Summary

  • Implements greedy pipeline-parallel (PP) inference over ThunderboltLink with AES-256-GCM sealed activation tensors, as the fallback path when tensor-parallel (TP) inference via jaccl is unavailable
  • Adds ppActivation / ppToken / ppSessionEnd frame types (0x0B–0x0D) with JSON-encoded Codable payloads, distinct from the legacy inferenceStep/inferenceToken raw-bytes protocol
  • Rewrites EncryptedPipelineEngine (rank 0) and EncryptedPipelineServer (rank 1) around the new frame types, adding a PPClusterSession protocol for mock-injectable testing, 3-retry logic, and actor-safe KV cache slicing
  • Adds ClusterModelLoader.loadLlamaModel for loading a single-rank LlamaModel (no DistributedGroup); ClusterDiscovery wires the PP engine/server as the jaccl-bootstrap-failure fallback on both ranks
  • Adds 17 unit tests in PipelineParallelDecodeTests.swift; all TP tests still pass
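
For reviewers unfamiliar with the frame layout, here is a minimal sketch of a type byte plus JSON-encoded Codable payload. Only the three frame names and the 0x0B–0x0D range come from this PR; the value-to-name mapping, the payload fields, the one-byte header, and the encodeFrame/decodeFrame helpers are assumptions for illustration and may not match ClusterControlMessage.

```swift
import Foundation

// Hypothetical stand-in for the PR's frame types. The exact mapping of
// names to 0x0B, 0x0C, 0x0D is assumed from the order they are listed in.
enum PPFrameType: UInt8 {
    case ppActivation = 0x0B
    case ppToken = 0x0C
    case ppSessionEnd = 0x0D
}

// Hypothetical payload shape; the real PPActivationPayload fields may differ.
struct PPActivationPayload: Codable, Equatable {
    let requestID: String
    let position: Int
    let sealedActivation: Data   // AES-GCM ciphertext in the real system
}

// Assumed layout: one type byte followed by the JSON body.
func encodeFrame<T: Encodable>(_ type: PPFrameType, _ payload: T) throws -> Data {
    var frame = Data([type.rawValue])
    frame.append(try JSONEncoder().encode(payload))
    return frame
}

// Returns nil when the frame is empty or the type byte is unknown.
func decodeFrame<T: Decodable>(_ frame: Data, payload: T.Type) throws -> (PPFrameType, T)? {
    guard let first = frame.first, let type = PPFrameType(rawValue: first) else { return nil }
    // Copy the slice so the decoder sees zero-based indices.
    let body = Data(frame.dropFirst())
    return (type, try JSONDecoder().decode(payload, from: body))
}
```

JSONEncoder encodes Data as base64 by default, so the sealed bytes survive the JSON round-trip without a custom encoding strategy.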

Architecture notes

Double encryption kept: sealedActivation in ppActivation is AES-GCM sealed by TensorCrypto (inner), then the whole frame is AES-GCM sealed again by ClusterFrame.encode/decode (outer). This matches the design established by inferenceStep/inferenceToken and is intentional (defence-in-depth). TensorCrypto's docstring does not flag the outer layer as redundant.

KV cache slicing: EncryptedPipelineEngine's init takes model.newCache(parameters: nil).prefix(splitLayer), and EncryptedPipelineServer's takes the matching .suffix(numLayers - splitLayer). Both are verified in tests.
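
A minimal sketch of the slicing rule above, using a stand-in cache type. LayerCache, newCache(numLayers:), and sliceCaches are hypothetical names; the real caches come from model.newCache(parameters: nil).

```swift
// Hypothetical stand-in for a per-layer KV cache.
struct LayerCache: Equatable {
    let layerIndex: Int
}

// Builds one cache entry per layer, mimicking model.newCache(parameters: nil).
func newCache(numLayers: Int) -> [LayerCache] {
    (0..<numLayers).map { LayerCache(layerIndex: $0) }
}

// Rank 0 keeps the first splitLayer entries; rank 1 keeps the remaining
// numLayers - splitLayer entries, so the two slices never overlap.
func sliceCaches(_ caches: [LayerCache], splitLayer: Int) -> (rank0: [LayerCache], rank1: [LayerCache]) {
    precondition((0...caches.count).contains(splitLayer), "splitLayer out of range")
    let rank0 = Array(caches.prefix(splitLayer))
    let rank1 = Array(caches.suffix(caches.count - splitLayer))
    return (rank0, rank1)
}
```

Concatenating the two slices reproduces the full cache list, which is the property the slicing tests would be expected to check.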

PPClusterSession protocol: extends ClusterSessionSendable (from PR 4b) with receiveInferenceFrame() async throws -> Data and currentSessionKey() async throws -> SymmetricKey. ClusterSession conforms; tests use MockPPSession. The actor-isolation conformance mirrors the existing TP pattern exactly.
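
To illustrate the mock-injection pattern, here is a simplified sketch. PPSessionSketch is a hypothetical, reduced version of PPClusterSession (it omits currentSessionKey()), and the one-byte frames, the send method name, the mock's scripted replies, and the choice to exclude the EOS token from the output are all assumptions rather than the behaviour of EncryptedPipelineEngine.

```swift
import Foundation

// Simplified, hypothetical session protocol for illustration only.
protocol PPSessionSketch {
    func send(_ frame: Data) async throws
    func receiveInferenceFrame() async throws -> Data
}

enum MockError: Error { case exhausted }

// Mock that records every sent frame and replays scripted token replies.
actor MockPPSession: PPSessionSketch {
    private(set) var sent: [Data] = []
    private var replies: [Data]

    init(replyTokens: [UInt8]) {
        // Assumed reply layout: ppToken type byte followed by one token byte.
        replies = replyTokens.map { Data([0x0C, $0]) }
    }

    func send(_ frame: Data) async throws { sent.append(frame) }

    func receiveInferenceFrame() async throws -> Data {
        guard !replies.isEmpty else { throw MockError.exhausted }
        return replies.removeFirst()
    }
}

// Greedy loop: send one activation frame, read back one token, stop at EOS
// or maxTokens, then send the session-end frame as the final frame.
func generate<S: PPSessionSketch>(session: S, eos: UInt8, maxTokens: Int) async throws -> [UInt8] {
    var tokens: [UInt8] = []
    while tokens.count < maxTokens {
        try await session.send(Data([0x0B]))      // assumed ppActivation type byte
        let reply = try await session.receiveInferenceFrame()
        guard reply.count == 2, reply[reply.startIndex] == 0x0C else { break }
        let token = reply[reply.startIndex + 1]
        if token == eos { break }
        tokens.append(token)
    }
    try await session.send(Data([0x0D]))          // assumed ppSessionEnd type byte
    return tokens
}
```

Because the loop depends only on the protocol, a test can assert on the recorded frame sequence without a Thunderbolt link or a loaded model.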

ClusterModelLoader: loadLlamaModel(modelDirectory:) added alongside the existing load(modelDirectory:) (for LlamaModelTP). No removal of TP loader; PP just doesn't need the jaccl group.

Split point: ClusterDiscovery uses numLayers / 2 as the default splitLayer when building the PP engine/server on the jaccl-failure fallback path. This default is not configurable yet; PR 4d can expose it as a config knob.
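
A small sketch of what the default implies for layer ownership, assuming rank 0 runs layers below splitLayer and rank 1 runs the rest. defaultLayerRanges is a hypothetical helper, not a function in this PR.

```swift
// Hypothetical helper mirroring the default split: splitLayer = numLayers / 2.
func defaultLayerRanges(numLayers: Int) -> (rank0: Range<Int>, rank1: Range<Int>) {
    precondition(numLayers >= 2, "need at least one layer per rank")
    let splitLayer = numLayers / 2      // integer division rounds down
    return (0..<splitLayer, splitLayer..<numLayers)
}
```

With an odd layer count the division rounds down, so rank 1 would carry the extra layer in addition to the norm, lm_head, and argmax work.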

Test plan

  • swift build passes clean (one pre-existing warning in ClusterCommand.swift, not introduced by this PR)
  • swift test --filter "PipelineParallelDecode" — 17/17 pass
  • swift test --filter "TensorParallelDecode" — 17/17 pass (no regression)
  • Two-Mac Thunderbolt smoke test (tracked for PR 4d)


Implements greedy pipeline-parallel inference over ThunderboltLink with
AES-256-GCM sealed activation tensors. Rank 0 runs layers 0..splitLayer,
seals the activation, and exchanges it with rank 1 (which runs
splitLayer..N + norm + lm_head + argmax) via new ppActivation/ppToken/
ppSessionEnd frame types (0x0B–0x0D). Wires EncryptedPipelineEngine/
Server into ClusterDiscovery as the fallback when jaccl bootstrap fails.

- ClusterControlMessage: add ppActivation, ppToken, ppSessionEnd with
  JSON-encoded Codable payloads (PPActivationPayload, PPTokenPayload,
  PPSessionEndPayload); update frame-encoding comment table
- EncryptedPipelineInference: rewrite EncryptedPipelineEngine to use new
  frame types; introduce PPClusterSession protocol (extends
  ClusterSessionSendable with receiveInferenceFrame + currentSessionKey)
  for mock-injectable testing; rewrite EncryptedPipelineServer.handleRequest
  to dispatch on ppActivation/ppSessionEnd with per-request cache reset
- ClusterModelLoader: add loadLlamaModel(modelDirectory:) for PP (no jaccl
  DistributedGroup required, uses plain LlamaModel + callPartial)
- ClusterDiscovery: add _ppEngine/_ppServer; add tryBuildRank0PPEngine/
  tryBuildRank1PPServer; currentPPEngine()/currentPPServer() accessors;
  jaccl bootstrap failure on either rank now falls back to PP instead of
  aborting; ClusterPeer.serve routes ppActivation/ppToken/ppSessionEnd
  through inferenceHandler
- Tests: PipelineParallelDecodeTests.swift — 17 tests covering raw values,
  JSON round-trips, engine construction, generate() loop, frame sequence,
  EOS stopping, ppSessionEnd as final frame, sealed activation
  decryptability, and KV cache slicing for both ranks
vercel Bot commented May 21, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| d-inference | Ready | Preview | May 21, 2026 6:44pm |
| d-inference-console-ui-dev | Ready | Preview | May 21, 2026 6:44pm |
| d-inference-landing | Ready | Preview | May 21, 2026 6:44pm |

@github-actions

Benchmark Results

Runner: macos-15 (M1 Virtual) | Date: 2026-05-21 18:46 UTC

1-provider-streaming

1 providers, 1 users, 30 requests, concurrency=5, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 30 |
| Success | 4 |
| Errors | 26 |
| Total Duration | 3.852s |
| Throughput | 1.0 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 4 | 1.746s | 1.746s | 1.746s | 1.746s |
| parse | 4 | 17µs | 16µs | 30µs | 30µs |
| reserve | 4 | 3ms | 3ms | 4ms | 4ms |
| route | 4 | 415µs | 426µs | 430µs | 430µs |
| coordinator_to_provider | 4 | 1.739s | 1.739s | 1.74s | 1.74s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=16.5µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=30µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.663ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=3.941ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

1-provider-non-streaming

1 providers, 1 users, 20 requests, concurrency=5, streaming=false

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 20 |
| Success | 4 |
| Errors | 16 |
| Total Duration | 2.323s |
| Throughput | 1.7 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 4 | 2.32s | 2.321s | 2.323s | 2.323s |
| parse | 4 | 41µs | 61µs | 63µs | 63µs |
| reserve | 4 | 6ms | 6ms | 8ms | 8ms |
| route | 4 | 733µs | 755µs | 803µs | 803µs |
| coordinator_to_provider | 4 | 1.726s | 1.727s | 1.728s | 1.728s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=40.5µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=63µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=5.55025ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=7.624ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

7-provider-multi-model

7 providers, 5 users, 50 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 4 | 0.5 GB |
| mlx-community/gemma-3-270m-4bit | 3 | 0.2 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 50 |
| Success | 50 |
| Errors | 0 |
| Total Duration | 10.615s |
| Throughput | 4.7 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 50 | 619ms | 4ms | 3.66s | 3.699s |
| parse | 50 | 17µs | 17µs | 28µs | 57µs |
| reserve | 50 | 1ms | 1ms | 2ms | 3ms |
| route | 50 | 0s | 0s | 1ms | 1ms |
| coordinator_to_provider | 50 | 615ms | 1ms | 3.65s | 3.693s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=17.42µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=28µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=1.39302ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=2.371ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

3-provider-high-concurrency

3 providers, 10 users, 60 requests, concurrency=20, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 60 |
| Success | 12 |
| Errors | 48 |
| Total Duration | 3.536s |
| Throughput | 3.4 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 12 | 2.172s | 2.164s | 2.211s | 2.211s |
| parse | 12 | 13µs | 12µs | 23µs | 23µs |
| reserve | 12 | 3ms | 2ms | 4ms | 4ms |
| route | 12 | 519µs | 471µs | 853µs | 853µs |
| coordinator_to_provider | 12 | 2.164s | 2.157s | 2.205s | 2.205s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=13.166µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=23µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.5155ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=3.996ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

1-provider-queue-saturation

1 providers, 10 users, 40 requests, concurrency=15, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 40 |
| Success | 4 |
| Errors | 36 |
| Total Duration | 3.166s |
| Throughput | 1.3 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 4 | 2.257s | 2.257s | 2.257s | 2.257s |
| parse | 4 | 13µs | 15µs | 22µs | 22µs |
| reserve | 4 | 2ms | 2ms | 2ms | 2ms |
| route | 4 | 502µs | 474µs | 644µs | 644µs |
| coordinator_to_provider | 4 | 2.251s | 2.251s | 2.252s | 2.252s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=12.75µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=22µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.2715ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=2.43ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

3-provider-20-users

3 providers, 20 users, 60 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 60 |
| Success | 60 |
| Errors | 0 |
| Total Duration | 5.424s |
| Throughput | 11.1 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 60 | 348ms | 5ms | 2.088s | 2.088s |
| parse | 60 | 27µs | 16µs | 65µs | 375µs |
| reserve | 60 | 1ms | 1ms | 3ms | 4ms |
| route | 60 | 452µs | 418µs | 726µs | 964µs |
| coordinator_to_provider | 60 | 345ms | 2ms | 2.08s | 2.081s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=26.883µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=65µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=1.4206ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=3.389ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

1-provider-scaling

1 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 1 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 30 |
| Success | 4 |
| Errors | 26 |
| Total Duration | 2.711s |
| Throughput | 1.5 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 4 | 2.008s | 2.008s | 2.008s | 2.008s |
| parse | 4 | 15µs | 12µs | 26µs | 26µs |
| reserve | 4 | 2ms | 2ms | 3ms | 3ms |
| route | 4 | 480µs | 507µs | 529µs | 529µs |
| coordinator_to_provider | 4 | 2.003s | 2.003s | 2.003s | 2.003s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=14.75µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=26µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.41675ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=2.828ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

3-provider-scaling

3 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 30 |
| Success | 30 |
| Errors | 0 |
| Total Duration | 3.945s |
| Throughput | 7.6 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 30 | 715ms | 7ms | 2.156s | 2.156s |
| parse | 30 | 21µs | 17µs | 51µs | 54µs |
| reserve | 30 | 2ms | 1ms | 4ms | 5ms |
| route | 30 | 512µs | 503µs | 843µs | 870µs |
| coordinator_to_provider | 30 | 709ms | 4ms | 2.144s | 2.145s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=21.166µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=51µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=2.007733ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=4.242ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

5-provider-scaling

5 providers, 5 users, 30 requests, concurrency=10, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 5 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 30 |
| Success | 30 |
| Errors | 0 |
| Total Duration | 4.124s |
| Throughput | 7.3 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 30 | 713ms | 6ms | 2.143s | 2.143s |
| parse | 30 | 30µs | 19µs | 105µs | 236µs |
| reserve | 30 | 2ms | 2ms | 4ms | 4ms |
| route | 30 | 1ms | 0s | 1ms | 1ms |
| coordinator_to_provider | 30 | 708ms | 2ms | 2.133s | 2.137s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=30µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=105µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=1.8233ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=3.555ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |

3-provider-heavy-100conc-10kb

3 providers, 20 users, 100 requests, concurrency=100, streaming=true

| Model | Providers | RAM |
| --- | --- | --- |
| mlx-community/Qwen3.5-0.8B-MLX-4bit | 3 | 0.5 GB |

| Metric | Value |
| --- | --- |
| Total Requests | 100 |
| Success | 12 |
| Errors | 88 |
| Total Duration | 2.789s |
| Throughput | 4.3 req/s |

Latency Decomposition

| Segment | Count | Mean | P50 | P95 | Max |
| --- | --- | --- | --- | --- | --- |
| total_e2e | 12 | 1.943s | 1.945s | 1.948s | 1.948s |
| parse | 12 | 165µs | 133µs | 426µs | 426µs |
| reserve | 12 | 9ms | 9ms | 9ms | 9ms |
| route | 12 | 17ms | 17ms | 17ms | 17ms |
| coordinator_to_provider | 12 | 1.9s | 1.903s | 1.904s | 1.904s |

Assertion Report: FAIL

| Assertion | Result | Detail |
| --- | --- | --- |
| parse:mean<=1ms | PASS | mean=165.25µs (threshold=1ms) |
| parse:p95<=5ms | PASS | p95=426µs (threshold=5ms) |
| reserve:mean<=50ms | PASS | mean=8.719166ms (threshold=50ms) |
| reserve:p95<=200ms | PASS | p95=9.283ms (threshold=200ms) |
| encrypt:present | FAIL | no data for segment encrypt |
| dispatch:present | FAIL | no data for segment dispatch |
