Implements greedy pipeline-parallel inference over ThunderboltLink with
AES-256-GCM sealed activation tensors. Rank 0 runs layers 0..splitLayer,
seals the activation, and exchanges it with rank 1 (which runs
splitLayer..N + norm + lm_head + argmax) via new ppActivation/ppToken/
ppSessionEnd frame types (0x0B–0x0D). Wires EncryptedPipelineEngine/
Server into ClusterDiscovery as the fallback when jaccl bootstrap fails.
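
The exchange described above can be sketched as a single-process loop. This is a rough illustration only: `SketchFrame`, `generate`, and the `rank1` closure are hypothetical stand-ins rather than the PR's types. The direction of `ppToken` (rank 1 back to rank 0) and the choice to leave the EOS token out of the returned tokens are both assumptions.

```swift
// Hypothetical in-process stand-in for the rank 0 / rank 1 exchange.
enum SketchFrame: Equatable {
    case ppActivation(step: Int)
    case ppToken(Int)
    case ppSessionEnd
}

// Greedy decode loop as seen from rank 0. The `rank1` closure replaces the
// remote side (remaining layers + norm + lm_head + argmax) for illustration.
func generate(maxTokens: Int, eosToken: Int, rank1: (Int) -> Int)
    -> (tokens: [Int], frames: [SketchFrame]) {
    var tokens: [Int] = []
    var frames: [SketchFrame] = []
    for step in 0..<maxTokens {
        frames.append(.ppActivation(step: step))  // rank 0 -> rank 1 (assumed direction)
        let token = rank1(step)
        frames.append(.ppToken(token))            // rank 1 -> rank 0 (assumed direction)
        if token == eosToken { break }            // stop on EOS; EOS is not returned (assumption)
        tokens.append(token)
    }
    frames.append(.ppSessionEnd)                  // session end is always the final frame
    return (tokens, frames)
}
```

Under these assumptions, every generated token costs one `ppActivation`/`ppToken` pair, and each session closes with exactly one `ppSessionEnd`.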
- ClusterControlMessage: add ppActivation, ppToken, ppSessionEnd with
JSON-encoded Codable payloads (PPActivationPayload, PPTokenPayload,
PPSessionEndPayload); update frame-encoding comment table
- EncryptedPipelineInference: rewrite EncryptedPipelineEngine to use new
frame types; introduce PPClusterSession protocol (extends
ClusterSessionSendable with receiveInferenceFrame + currentSessionKey)
for mock-injectable testing; rewrite EncryptedPipelineServer.handleRequest
to dispatch on ppActivation/ppSessionEnd with per-request cache reset
- ClusterModelLoader: add loadLlamaModel(modelDirectory:) for PP (no jaccl
DistributedGroup required, uses plain LlamaModel + callPartial)
- ClusterDiscovery: add _ppEngine/_ppServer; add tryBuildRank0PPEngine/
tryBuildRank1PPServer; currentPPEngine()/currentPPServer() accessors;
jaccl bootstrap failure on either rank now falls back to PP instead of
aborting; ClusterPeer.serve routes ppActivation/ppToken/ppSessionEnd
through inferenceHandler
- Tests: PipelineParallelDecodeTests.swift — 17 tests covering raw values,
JSON round-trips, engine construction, generate() loop, frame sequence,
EOS stopping, ppSessionEnd as final frame, sealed activation
decryptability, and KV cache slicing for both ranks
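
To make the new frame types concrete, here is a minimal sketch of a JSON-encoded Codable payload behind a one-byte frame type. The payload fields, the one-byte prefix layout, and the names `PPFrameType`, `SketchActivationPayload`, `encodeFrame`, and `decodeFrame` are all hypothetical. The assignment of 0x0B, 0x0C, and 0x0D to the three cases is assumed from listing order, and the outer `ClusterFrame` sealing is omitted.

```swift
import Foundation

// Hypothetical mirror of the new frame types. The source gives only the range
// 0x0B to 0x0D; the per-case mapping below is assumed from listing order.
enum PPFrameType: UInt8 {
    case ppActivation = 0x0B
    case ppToken = 0x0C
    case ppSessionEnd = 0x0D
}

// Hypothetical payload shape; the real PPActivationPayload fields are not shown here.
struct SketchActivationPayload: Codable, Equatable {
    var sessionID: String
    var step: Int
    var sealedActivation: Data  // opaque sealed bytes; JSONEncoder writes Data as base64
}

enum SketchFrameError: Error { case unknownFrameType }

// Prefix the JSON-encoded payload with the one-byte frame type (assumed layout).
func encodeFrame<P: Encodable>(_ type: PPFrameType, _ payload: P) throws -> Data {
    var frame = Data([type.rawValue])
    frame.append(try JSONEncoder().encode(payload))
    return frame
}

// Split a frame back into its type byte and decoded payload.
func decodeFrame<P: Decodable>(_ frame: Data, _ payloadType: P.Type) throws -> (PPFrameType, P) {
    guard let first = frame.first, let type = PPFrameType(rawValue: first) else {
        throw SketchFrameError.unknownFrameType
    }
    // Copy the remainder into a fresh Data so its indices start at zero.
    let body = Data(frame.dropFirst())
    return (type, try JSONDecoder().decode(payloadType, from: body))
}
```

A round-trip test in the spirit of the ones listed above would encode a payload, decode it again, and compare both the frame type and the payload for equality.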
Summary
- `ppActivation`/`ppToken`/`ppSessionEnd` frame types (0x0B–0x0D) with JSON-encoded Codable payloads; distinct from the legacy `inferenceStep`/`inferenceToken` raw-bytes protocol
- `EncryptedPipelineEngine` (rank 0) and `EncryptedPipelineServer` (rank 1) fully rewritten with the new frame types, `PPClusterSession` protocol for mock-injectable testing, 3-retry logic, and actor-safe KV cache slicing
- `ClusterModelLoader.loadLlamaModel` for single-rank `LlamaModel` load (no DistributedGroup); `ClusterDiscovery` wires PP engine/server as jaccl-bootstrap-failure fallback on both ranks
- 17 tests in `PipelineParallelDecodeTests.swift`; all TP tests still pass

Architecture notes
Double encryption kept:
`sealedActivation` in `ppActivation` is AES-GCM sealed by `TensorCrypto` (inner), then the whole frame is AES-GCM sealed again by `ClusterFrame.encode/decode` (outer). This matches the design established by `inferenceStep`/`inferenceToken` and is intentional (defence-in-depth). `TensorCrypto`'s docstring does not flag the outer layer as redundant.

KV cache slicing:
`EncryptedPipelineEngine` init takes `model.newCache(parameters: nil).prefix(splitLayer)`. `EncryptedPipelineServer` takes `.suffix(numLayers - splitLayer)`. Both are verified in tests.

`PPClusterSession` protocol:
Extends `ClusterSessionSendable` (from PR 4b) with `receiveInferenceFrame() async throws -> Data` and `currentSessionKey() async throws -> SymmetricKey`. `ClusterSession` conforms; tests use `MockPPSession`. The actor-isolation conformance mirrors the existing TP pattern exactly.

ClusterModelLoader:
`loadLlamaModel(modelDirectory:)` added alongside the existing `load(modelDirectory:)` (for `LlamaModelTP`). No removal of TP loader; PP just doesn't need the jaccl group.

Split point:
`ClusterDiscovery` uses `numLayers / 2` as the default `splitLayer` when building PP engine/server from the jaccl-failure fallback path. This is informational-only for now; PR 4d can expose this as a config knob.

Test plan
- `swift build` passes clean (one pre-existing warning in `ClusterCommand.swift`, not introduced by this PR)
- `swift test --filter "PipelineParallelDecode"`: 17/17 pass
- `swift test --filter "TensorParallelDecode"`: 17/17 pass (no regression)
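
For reviewers, the KV cache slicing and default split point from the architecture notes can be sketched with plain arrays. `SketchLayerCache` and `splitCaches` are hypothetical stand-ins for the per-layer cache objects returned by `model.newCache(parameters: nil)`, and the layer count of 32 is an arbitrary value chosen for illustration.

```swift
// Hypothetical stand-in for one per-layer KV cache entry.
struct SketchLayerCache: Equatable {
    let layerIndex: Int
}

// Rank 0 keeps the first `splitLayer` caches; rank 1 keeps the remaining ones.
func splitCaches(_ caches: [SketchLayerCache], splitLayer: Int)
    -> (rank0: [SketchLayerCache], rank1: [SketchLayerCache]) {
    precondition((0...caches.count).contains(splitLayer), "splitLayer out of range")
    let rank0 = Array(caches.prefix(splitLayer))
    let rank1 = Array(caches.suffix(caches.count - splitLayer))
    return (rank0, rank1)
}

let numLayers = 32                   // assumed layer count, for illustration only
let caches = (0..<numLayers).map(SketchLayerCache.init(layerIndex:))
let splitLayer = numLayers / 2       // default split point described in the notes
let (rank0, rank1) = splitCaches(caches, splitLayer: splitLayer)
```

With an odd layer count, integer division gives rank 1 the extra layer. The two slices never overlap, and together they cover every layer exactly once.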