Source side remote view transport #163
Draft
zhou-yuhan wants to merge 12 commits into
…duckDB (e.g. with python>=3.11)
…_mapped_source; 2) reuse StridedBlock cache; 3) add direct gather path for CPU mem in strided load
…head for CPU view export
…HUGEPAGE for CPU export
…ifecycle management
Motivation
Remote TP-sliced loads/updates were still hitting the wrong slow path: when a requested
view_id was not already routable, the destination daemon fell back to canonical transport
and reconstructed the TP slice locally. That preserved correctness, but caused
destination-side read amplification, strided repack, and repeated reconstruction across daemons.
We also found lifecycle gaps in the first source-side upgrade path:
What this PR does
Suppose daemon B wants to fetch tensor views stored on daemon A.
by daemon A, instead of reconstructing from canonical bytes locally.
(artifact_id, view_id, device)
pending / ready / draining state
BeginReplicaFetch / EndReplicaFetch so the source daemon tracks real data-plane use:
active_fetches protection for in-flight transfers
not pinned-allocation timeout.
and fallback paths.
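The lifecycle described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `ViewLifecycle`, its method names, and the release-on-last-fetch policy are assumptions made for the example. Only the `(artifact_id, view_id, device)` key, the pending / ready / draining states, the `BeginReplicaFetch` / `EndReplicaFetch` pairing, and the `active_fetches` counter come from the description.

```python
import enum
import threading
from dataclasses import dataclass


class ViewState(enum.Enum):
    PENDING = "pending"
    READY = "ready"
    DRAINING = "draining"


@dataclass
class ViewEntry:
    state: ViewState = ViewState.PENDING
    active_fetches: int = 0  # in-flight data-plane transfers


class ViewLifecycle:
    """Hypothetical source-side tracker keyed by (artifact_id, view_id, device)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def register(self, key):
        # A newly exported view starts out pending.
        with self._lock:
            self._entries.setdefault(key, ViewEntry())

    def mark_ready(self, key):
        with self._lock:
            entry = self._entries[key]
            if entry.state is ViewState.PENDING:
                entry.state = ViewState.READY

    def begin_replica_fetch(self, key):
        # Stand-in for BeginReplicaFetch: only ready views accept new fetches.
        with self._lock:
            entry = self._entries.get(key)
            if entry is None or entry.state is not ViewState.READY:
                return False
            entry.active_fetches += 1
            return True

    def end_replica_fetch(self, key):
        # Stand-in for EndReplicaFetch: returns True if this call released the view.
        with self._lock:
            entry = self._entries[key]
            if entry.active_fetches <= 0:
                raise RuntimeError("end_replica_fetch without a matching begin")
            entry.active_fetches -= 1
            if entry.state is ViewState.DRAINING and entry.active_fetches == 0:
                del self._entries[key]
                return True
            return False

    def drain(self, key):
        # Stop admitting fetches; release now only if nothing is in flight.
        with self._lock:
            entry = self._entries[key]
            entry.state = ViewState.DRAINING
            if entry.active_fetches == 0:
                del self._entries[key]
                return True
            return False

    def state(self, key):
        with self._lock:
            entry = self._entries.get(key)
            return entry.state if entry else None
```

In this sketch a view is released by the last `end_replica_fetch` after `drain`, rather than by a timer, which is one way to read "active_fetches protection for in-flight transfers, not pinned-allocation timeout".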
Test results
Unit tests
Added focused tests for the new lifecycle manager:
(artifact_id, view_id, device)
SGLang integration
Validated with the remote relay benchmark harness on two workers.
Remote load (load_weight_remote, tensorcast relay):
qwen3-14b / qwen3-32b
tp=1,2,4
trial=3
Examples:
qwen3-14b tp=4: 26s -> 5s -> 5s
qwen3-32b tp=4: 23s -> 11s -> 12s
Remote update (update_weight_remote, tensorcast relay):
qwen3-14b / qwen3-32b
tp=1,2,4
trial=3
Examples:
qwen3-32b tp=2: 40s / 34s / 38s
qwen3-32b tp=4: 44s / 43s / 43s
Summary
This PR makes source-side remote view transport production-safe: