feat: replace mx-source/mx-target with unified mx loader #147
nicolasnoble wants to merge 7 commits into main from
Conversation
Consolidated the duplicated project context across CLAUDE.md, Copilot instructions, and Cursor rules into concise pointers to new standalone reference docs. Extracted architecture details into docs/ARCHITECTURE.md and deployment/configuration info into docs/DEPLOYMENT.md. Updated CONTRIBUTING.md, README.md, and example READMEs to match the new structure. Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
- Fix copy-paste artifact in CONTRIBUTING.md ("The Dynamo project" -> "The ModelExpress project")
- Add ARCHITECTURE.md and DEPLOYMENT.md to the docs/ directory tree in ARCHITECTURE.md
- Add missing min_free_space_bytes field to the ServerConfig YAML example in ARCHITECTURE.md to match DEPLOYMENT.md and the actual config struct
- Add REDIS_URL to the P2P environment variables table in DEPLOYMENT.md, consistent with CONTRIBUTING.md and the K8s examples
Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
The "Enable P2P Transfers via NIXL" PR (#135) introduced significant new functionality without updating all documentation files per the project's documentation rules. This commit closes the remaining gaps:

ARCHITECTURE.md:
- Fix __init__.py description (no longer claims auto-registration on import)
- Add --worker-cls usage to vllm_worker.py description
- Add check_session_changed() and close() to MxClient methods table
- Fix NixlTransferManager: correct __init__ params, add get_registered_descriptors(), fix destroy() -> shutdown(), update receive_from_source signature
- Correct MxTargetModelLoader base class to DummyModelLoader
- Document transfer-time coalescing of contiguous regions
- Document _raw_tensor_registry and _nixl_managers globals
- Add MODEL_NAME and MX_SERVER_ADDRESS environment variables

DEPLOYMENT.md:
- Add MODEL_NAME and MX_SERVER_ADDRESS to P2P env var table
- Add --worker-cls usage note for vLLM instances

CONTRIBUTING.md:
- Add Python 3.10+ to prerequisites
- Add pip install -e for Python client dev setup
- Add pytest and generate_proto.sh to Available Commands table

Also includes pre-commit auto-fixes: trailing whitespace cleanup and missing final newlines in several files.

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
- "not complete" -> "incomplete" in CLAUDE.md, copilot-instructions.md, and rust.mdc
- "Read files" -> "Always read files" in copilot-instructions.md and rust.mdc to match CLAUDE.md
- Vary repeated "For" sentence openers in README.md

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
Replace the two separate loaders (mx-source, mx-target) with a single MxModelLoader registered as --load-format mx. It auto-detects whether to load model weights from disk or receive them via RDMA from an existing source, eliminating the need to manually label nodes.

- Add MxModelLoader with one-shot _detect_source() check against MX server: if ready source exists, receive via RDMA; otherwise load from disk. Both paths register with NIXL and publish metadata so future nodes can discover this one
- Remove MxSourceModelLoader and MxTargetModelLoader classes
- Extract shared helpers (_collect_cuda_tensors, _init_nixl_manager, _log_tensor_summary, _publish_metadata_and_ready) from duplicated code
- Delete vllm-target.yaml, rename vllm-source.yaml to vllm.yaml with --load-format mx
- Add unit tests for shared helpers and detection logic
- Update documentation (ARCHITECTURE.md, DEPLOYMENT.md, CONTRIBUTING.md, K8s README)

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
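The auto-detection flow described in this commit message can be sketched roughly as follows. This is a hedged illustration only: `MxClient`'s real query API, the response fields, and the helper names below are stand-ins, not the actual ModelExpress interfaces.

```python
# Sketch of the unified loader's decision flow. The client object and its
# get_ready() method are illustrative stand-ins for the real MX server API.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadySource:
    session_id: str
    worker_id: int


def detect_source(client, model_name: str, device_id: int) -> Optional[ReadySource]:
    """One-shot check: is a ready source already registered for this model?"""
    resp = client.get_ready(model_name, device_id)  # assumed query method
    if resp and resp.found:
        return ReadySource(resp.session_id, resp.worker_id)
    return None  # no ready source -> fall back to loading from disk


def load(client, model_name: str, device_id: int) -> str:
    source = detect_source(client, model_name, device_id)
    if source is not None:
        path = "rdma"  # receive weights from the peer via NIXL
    else:
        path = "disk"  # cold start: read weights from local storage
    # Either way, the loader registers tensors and publishes metadata so
    # future nodes can discover this one as a source.
    return path
```

The key property is that the decision is made once, from server state, so no manual source/target node labeling is needed.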
Walkthrough

This PR reorganizes project documentation to emphasize development practices and deployment guides, consolidates the Python vLLM loader implementation from a dual-source/target approach into a unified auto-detecting loader, and simplifies Kubernetes deployment patterns and naming conventions.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
end
A -- "RDMA via NIXL" --> B
A --> S
B --> S
Should the arrow be opposite?
-B --> S
+B <-- S
target_device = torch.device(load_device)
logger.debug(f"Target device: {target_device}")
device_id = _get_worker_rank(target_device)
model_name = os.environ.get("MODEL_NAME", "")
We can (and probably should?) get model name from model_config
if not model_name:
    logger.debug(f"[Worker {device_id}] MODEL_NAME not set, defaulting to source")
    return None
Model name is always provided through model_config
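The reviewer's point above — that vLLM always supplies the model name through `ModelConfig` — can be sketched like this. The `ModelConfig` class here is a minimal stand-in for `vllm.config.ModelConfig` (which stores the model name/path in its `model` field), and `resolve_model_name` is a hypothetical helper, not code from the PR.

```python
# Illustrative sketch: prefer the name vLLM already passes in via
# ModelConfig over the MODEL_NAME env var. ModelConfig below is a
# stand-in dataclass, not the real vLLM class.
import os
from dataclasses import dataclass


@dataclass
class ModelConfig:
    model: str  # vLLM keeps the model name/path in the `model` field


def resolve_model_name(model_config: ModelConfig) -> str:
    # Env var kept only as an explicit override; model_config is the
    # default, so a missing MODEL_NAME no longer forces a fallback path.
    return os.environ.get("MODEL_NAME") or model_config.model
```

With this shape, the `if not model_name: return None` fallback shown above becomes unnecessary, since the config always carries a name.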
transfer_retries = 120
transfer_retry_delay = 30
cached_session_id = None
Can we pull these constants out instead of leaving them as local vars?
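One way to act on this suggestion is to hoist the retry knobs to module level so they are discoverable and tunable in one place. Constant names and the polling helper below are illustrative (only the values come from the diff above), not the PR's actual code.

```python
# Sketch of hoisting the retry knobs out of local scope into
# module-level constants (names are illustrative; values match the
# locals shown in the diff).
import time

TRANSFER_RETRIES = 120        # total attempts while waiting on the peer
TRANSFER_RETRY_DELAY_S = 30   # seconds between attempts (~1 h worst case)


def wait_for_transfer(check_ready) -> bool:
    """Poll check_ready() until it succeeds or the retry budget runs out."""
    for _ in range(TRANSFER_RETRIES):
        if check_ready():
            return True
        time.sleep(TRANSFER_RETRY_DELAY_S)
    return False
```

Module-level constants also make it easy to later source these from env vars or config without touching the loop body.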
def _receive_from_peer(
    self,
    model: nn.Module,
    device: torch.device,
model_config: ModelConfig,
target_device: torch.device,
device_id: int,
model_name: str,
do we need to pass this param if we can get it from model_config?
logger.info(f"[Worker {device_id}] Processing weights (FP8 transformation)...")
process_weights_after_loading(model, model_config, target_device)
logger.info(f"[Worker {device_id}] Weight processing complete")
nit suggestion:
-logger.info(f"[Worker {device_id}] Processing weights (FP8 transformation)...")
-process_weights_after_loading(model, model_config, target_device)
-logger.info(f"[Worker {device_id}] Weight processing complete")
+process_weights_after_loading(model, model_config, target_device)
"""Load weights from disk, then register + publish."""
logger.info(f"[Worker {device_id}] Loading weights from disk...")
self._disk_loader.load_weights(model, model_config)
logger.info(f"[Worker {device_id}] Weights loaded from disk")
nit suggestion:
-logger.info(f"[Worker {device_id}] Weights loaded from disk")
logger.info(f"[Worker {device_id}] Processing weights (FP8 transformation)...")
process_weights_after_loading(model, model_config, target_device)
logger.info(f"[Worker {device_id}] Weight processing complete")
nit suggestion:
-logger.info(f"[Worker {device_id}] Processing weights (FP8 transformation)...")
-process_weights_after_loading(model, model_config, target_device)
-logger.info(f"[Worker {device_id}] Weight processing complete")
+process_weights_after_loading(model, model_config, target_device)
Several questions: if I start a source node with TP8 and then start a second node with exactly the same model but TP4, how will this be resolved? Or what if the source node is TP4 and the second node is PP4? Does the metadata contain enough info for such cases? Will it detect that it cannot fetch from the source?
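The compatibility check these questions point at could look roughly like this. Nothing here exists in the PR — the metadata fields and the helper are hypothetical, illustrating only the shape of a check that would let a mismatched node fall back to disk.

```python
# Hypothetical sketch: published metadata would need to carry the
# source's parallelism layout so that, e.g., a TP4 or PP4 node refuses
# a TP8 source instead of receiving incompatible per-rank tensors.
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardLayout:
    tensor_parallel: int
    pipeline_parallel: int


def can_fetch_from_source(source: ShardLayout, local: ShardLayout) -> bool:
    # Per-rank tensor shapes only line up when both layouts match
    # exactly; any mismatch must be detected and resolved by loading
    # the weights from disk instead.
    return source == local
```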
self._receive_from_peer(model, target_device, device_id, model_name, source_worker)

# Register with NIXL + publish so future nodes can discover us
self._register_and_publish(model, target_device, device_id, model_name)
In this scenario we already have a source with the same info, which this target loaded from. Is it necessary to register and publish again?
# Conflicts:
#   examples/p2p_transfer_k8s/deploy/vllm-source.yaml
#   examples/p2p_transfer_k8s/deploy/vllm-target.yaml
#   examples/p2p_transfer_k8s/vllm-source.yaml
#   examples/p2p_transfer_k8s/vllm.yaml
#   modelexpress_client/python/modelexpress/vllm_loader.py
MxModelLoader was missing download_model() and load_weights() from the BaseModelLoader ABC, causing a TypeError crash at instantiation on k8s. Both methods delegate to the internal disk loader (DefaultModelLoader).

- Add conftest.py to mock vLLM modules for local test execution
- Add regression tests for abstract method completeness and delegation

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
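The "abstract method completeness" regression test mentioned above could be sketched like this. The `BaseModelLoader` and `MxModelLoader` classes below are minimal stand-ins (the real abstract surface of vLLM's loader ABC differs), illustrating how `__abstractmethods__` catches a missing override in CI instead of with a TypeError on the cluster.

```python
# Sketch: assert the concrete loader implements every abstract method of
# its base class. The classes here are stand-ins, not the real vLLM or
# ModelExpress classes.
from abc import ABC, abstractmethod


class BaseModelLoader(ABC):
    @abstractmethod
    def download_model(self, model_config): ...

    @abstractmethod
    def load_weights(self, model, model_config): ...

    @abstractmethod
    def load_model(self, model_config): ...


class MxModelLoader(BaseModelLoader):
    # Both previously-missing methods delegate to the disk loader in the
    # real fix; here they just return markers for the test.
    def download_model(self, model_config): return "downloaded"
    def load_weights(self, model, model_config): return "loaded"
    def load_model(self, model_config): return "model"


def unimplemented_abstract_methods(cls) -> set:
    """Names of abstract methods the class has not overridden."""
    return set(getattr(cls, "__abstractmethods__", set()))
```

An instantiation check (`MxModelLoader()`) in a unit test achieves the same thing, since Python raises TypeError for any class with a non-empty `__abstractmethods__`.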
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
examples/p2p_transfer_k8s/deploy/persistence/vllm-target-redis.yaml (2)
121-128: ⚠️ Potential issue | 🔴 Critical

Switch `mx-target` to unified `mx` to prevent boot failure.

Line 127 still uses `--load-format mx-target`. With unified loader registration, this can fail at startup due to an unsupported load format.

🔧 Proposed fix

-          # Start vLLM with mx-target loader
-          # mx-target: Creates dummy weights, receives RAW tensors via RDMA,
-          #            THEN runs FP8 processing (identical to source)
-          # This ensures weight_scale_inv is transferred BEFORE being processed
+          # Start vLLM with unified mx loader (auto-detect source vs disk)
           python3 -m vllm.entrypoints.openai.api_server \
             --model ${MODEL_NAME} \
-            --load-format mx-target \
+            --load-format mx \
             --tensor-parallel-size 8 \
             --enable-expert-parallel &

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@examples/p2p_transfer_k8s/deploy/persistence/vllm-target-redis.yaml` around lines 121 - 128, The startup uses the vLLM OpenAI API server entrypoint (python3 -m vllm.entrypoints.openai.api_server) with the flag --load-format set to the deprecated/unsupported "mx-target"; change that flag value to the unified loader name "mx" so the server uses the registered unified loader (replace --load-format mx-target with --load-format mx in the command that starts the server).
132-143: ⚠️ Potential issue | 🟠 Major

Add timeout and process-liveness checks to readiness wait loop.
Lines 137-143 loop forever on health failures. If the vLLM process exits early (or never becomes healthy), the pod can hang indefinitely instead of failing fast.
🔧 Proposed fix
 echo "Waiting for vLLM to be ready..."
-python3 -c '
-import time
-import urllib.request
-while True:
-    try:
-        urllib.request.urlopen("http://localhost:8000/health", timeout=3)
-        break
-    except Exception:
-        time.sleep(3)
-'
+python3 - "$VLLM_PID" <<'PY'
+import os
+import sys
+import time
+import urllib.request
+
+pid = int(sys.argv[1])
+deadline = time.time() + 900  # 15 min max wait
+
+while True:
+    if time.time() > deadline:
+        raise SystemExit("Timed out waiting for vLLM health endpoint")
+    try:
+        urllib.request.urlopen("http://localhost:8000/health", timeout=3)
+        break
+    except Exception:
+        try:
+            os.kill(pid, 0)  # still alive?
+        except OSError:
+            raise SystemExit("vLLM process exited before becoming healthy")
+        time.sleep(3)
+PY
Verify each finding against the current code and only fix it if needed. In `@examples/p2p_transfer_k8s/deploy/persistence/vllm-target-redis.yaml` around lines 132 - 143, The current python3 -c health-wait loop that polls "http://localhost:8000/health" in a while True can hang forever; update that block to enforce a total timeout (e.g., configurable seconds) and to perform a process-liveness check so it fails fast if vLLM exits (instead of looping indefinitely). Concretely, modify the python3 -c script in the readiness block (the while True loop that calls urllib.request.urlopen) to track elapsed time and raise SystemExit/non-zero when the timeout is exceeded, and also verify the target process (e.g., PID 1 or the vLLM process) is still running between attempts—exit non-zero if the process is gone. Ensure the script prints a clear error and returns non-zero so the pod fails readiness if health never becomes OK or the process dies.
🧹 Nitpick comments (3)
modelexpress_client/python/modelexpress/vllm_loader.py (2)
225-229: Use `logging.exception` for cleaner stack trace logging.

The static analysis correctly identifies that when logging an exception in an except block, `logging.exception()` automatically includes the traceback without needing manual `traceback.format_exc()`.

♻️ Suggested fix

 except Exception as e:
-    import traceback
-    logger.error(f"[Worker {device_id}] EXCEPTION publishing metadata: {e}")
-    logger.error(f"[Worker {device_id}] Traceback: {traceback.format_exc()}")
+    logger.exception(f"[Worker {device_id}] EXCEPTION publishing metadata: {e}")
     raise
Verify each finding against the current code and only fix it if needed. In `@modelexpress_client/python/modelexpress/vllm_loader.py` around lines 225 - 229, Replace the two logger.error calls inside the exception handler that currently log e and traceback.format_exc() with a single logger.exception call so the stack trace is automatically included; specifically, in the except Exception as e block (the handler referencing device_id in modelexpress.vllm_loader.py) call logger.exception with a clear message like "[Worker {device_id}] EXCEPTION publishing metadata" to capture the exception and traceback cleanly.
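A minimal, runnable demonstration of why the suggested fix works: `logger.exception()` attaches the current traceback automatically, so the manual `traceback.format_exc()` call and its import can be dropped. The helper names below are for the demo only.

```python
# Demo: logger.exception() includes the traceback in the log output on
# its own, replacing the logger.error(...) + format_exc() pair.
import io
import logging


def publish_with_logging(logger: logging.Logger, device_id: int) -> None:
    try:
        raise RuntimeError("publish failed")  # stand-in for the real failure
    except Exception:
        # One call captures message + traceback together.
        logger.exception(f"[Worker {device_id}] EXCEPTION publishing metadata")


def captured_log_output(device_id: int) -> str:
    """Run the demo against an in-memory stream and return what was logged."""
    stream = io.StringIO()
    logger = logging.getLogger(f"demo-{device_id}")
    logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(logging.DEBUG)
    publish_with_logging(logger, device_id)
    return stream.getvalue()
```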
458-463: Silent exception swallowing loses diagnostic information.

The `try/except/pass` block silently ignores errors when capturing the session ID. Consider logging at debug level to aid troubleshooting.

♻️ Suggested fix

 try:
     ready_resp = self._mx_client.get_ready(model_name, device_id)
     if ready_resp.found:
         cached_session_id = ready_resp.session_id
-except Exception:
-    pass
+except Exception as e:
+    logger.debug(f"[Worker {device_id}] Could not capture session ID: {e}")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelexpress_client/python/modelexpress/vllm_loader.py` around lines 458 - 463, The try/except around self._mx_client.get_ready is swallowing exceptions and losing diagnostics; update the except to capture the exception (except Exception as e) and log it at debug level using the module/class logger (e.g., self._logger.debug or the existing logger) including context like model_name, device_id and the exception, while preserving the behavior of not rethrowing so cached_session_id is only set when ready_resp.found.

modelexpress_client/python/tests/test_vllm_loader.py (1)
279-280: Simplify the assertion for the `publish_ready` call.

The dual-check (`call_kwargs[1]["model_name"]` or `call_kwargs.kwargs.get("model_name")`) is fragile and depends on how the mock captures arguments. Consider using a cleaner approach.

♻️ Suggested fix

-call_kwargs = mx_client.publish_ready.call_args
-assert call_kwargs[1]["model_name"] == "test-model" or call_kwargs.kwargs.get("model_name") == "test-model"
+mx_client.publish_ready.assert_called_once()
+_, kwargs = mx_client.publish_ready.call_args
+assert kwargs["model_name"] == "test-model"
Verify each finding against the current code and only fix it if needed. In `@modelexpress_client/python/tests/test_vllm_loader.py` around lines 279 - 280, Replace the fragile dual-access assertion with a single, clear check using the mock's call_args tuple unpacking or assert_called_with: capture the kwargs via "_, kwargs = mx_client.publish_ready.call_args" (or directly use "mx_client.publish_ready.assert_called_with(model_name='test-model')" ) and assert "kwargs['model_name'] == 'test-model'"; this targets the publish_ready mock reliably and removes the brittle indexing approach.
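A runnable illustration of the suggested assertion pattern: unpacking `call_args` into `(args, kwargs)` works regardless of how the call was made, instead of indexing into the mock's internal tuple. `publish_ready` here is a stand-in mock, not the real ModelExpress client method.

```python
# Demo of unittest.mock call_args unpacking, as suggested in the review.
from unittest.mock import MagicMock

mx_client = MagicMock()

# Code under test would invoke the method with keyword arguments:
mx_client.publish_ready(model_name="test-model", device_id=0)

# Robust assertions: exactly one call, then unpack kwargs explicitly.
mx_client.publish_ready.assert_called_once()
_, kwargs = mx_client.publish_ready.call_args
```

`call_args` is a `Call` object that iterates as `(args, kwargs)`, so the unpacking above is stable across mock versions, unlike positional indexing with `call_args[1]`.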
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/p2p_transfer_k8s/model-download.yaml`:
- Line 10: Update the usage/deploy instructions to include a Redis flush safety
step before redeploying vLLM: add a note immediately after the "Deploy
deploy/vllm.yaml" guidance (the comment line "# 4. Deploy deploy/vllm.yaml")
instructing operators to run redis-cli FLUSHALL to clear Redis metadata before
deleting/redeploying vLLM instances; ensure the note clearly states this is
required to avoid stale metadata during redeploy flows.
---
Outside diff comments:
In `@examples/p2p_transfer_k8s/deploy/persistence/vllm-target-redis.yaml`:
- Around line 121-128: The startup uses the vLLM OpenAI API server entrypoint
(python3 -m vllm.entrypoints.openai.api_server) with the flag --load-format set
to the deprecated/unsupported "mx-target"; change that flag value to the unified
loader name "mx" so the server uses the registered unified loader (replace
--load-format mx-target with --load-format mx in the command that starts the
server).
- Around line 132-143: The current python3 -c health-wait loop that polls
"http://localhost:8000/health" in a while True can hang forever; update that
block to enforce a total timeout (e.g., configurable seconds) and to perform a
process-liveness check so it fails fast if vLLM exits (instead of looping
indefinitely). Concretely, modify the python3 -c script in the readiness block
(the while True loop that calls urllib.request.urlopen) to track elapsed time
and raise SystemExit/non-zero when the timeout is exceeded, and also verify the
target process (e.g., PID 1 or the vLLM process) is still running between
attempts—exit non-zero if the process is gone. Ensure the script prints a clear
error and returns non-zero so the pod fails readiness if health never becomes OK
or the process dies.
---
Nitpick comments:
In `@modelexpress_client/python/modelexpress/vllm_loader.py`:
- Around line 225-229: Replace the two logger.error calls inside the exception
handler that currently log e and traceback.format_exc() with a single
logger.exception call so the stack trace is automatically included;
specifically, in the except Exception as e block (the handler referencing
device_id in modelexpress.vllm_loader.py) call logger.exception with a clear
message like "[Worker {device_id}] EXCEPTION publishing metadata" to capture the
exception and traceback cleanly.
- Around line 458-463: The try/except around self._mx_client.get_ready is
swallowing exceptions and losing diagnostics; update the except to capture the
exception (except Exception as e) and log it at debug level using the
module/class logger (e.g., self._logger.debug or the existing logger) including
context like model_name, device_id and the exception, while preserving the
behavior of not rethrowing so cached_session_id is only set when
ready_resp.found.
In `@modelexpress_client/python/tests/test_vllm_loader.py`:
- Around line 279-280: Replace the fragile dual-access assertion with a single,
clear check using the mock's call_args tuple unpacking or assert_called_with:
capture the kwargs via "_, kwargs = mx_client.publish_ready.call_args" (or
directly use
"mx_client.publish_ready.assert_called_with(model_name='test-model')" ) and
assert "kwargs['model_name'] == 'test-model'"; this targets the publish_ready
mock reliably and removes the brittle indexing approach.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (27)
- .cursor/rules/rust.mdc
- .dockerignore
- .github/copilot-instructions.md
- .github/dco.yml
- .github/workflows/ci.yml
- CLAUDE.md
- CODE_OF_CONDUCT.md
- CONTRIBUTING.md
- README.md
- SECURITY.md
- docs/ARCHITECTURE.md
- docs/DEPLOYMENT.md
- examples/p2p_transfer_k8s/README.md
- examples/p2p_transfer_k8s/deploy/persistence/vllm-target-redis.yaml
- examples/p2p_transfer_k8s/deploy/vllm-target.yaml
- examples/p2p_transfer_k8s/deploy/vllm.yaml
- examples/p2p_transfer_k8s/model-download.yaml
- helm/README.md
- helm/deploy.sh
- modelexpress_client/python/modelexpress/__init__.py
- modelexpress_client/python/modelexpress/types.py
- modelexpress_client/python/modelexpress/vllm_loader.py
- modelexpress_client/python/modelexpress/vllm_worker.py
- modelexpress_client/python/tests/__init__.py
- modelexpress_client/python/tests/conftest.py
- modelexpress_client/python/tests/test_vllm_loader.py
- modelexpress_common/proto/p2p.proto
💤 Files with no reviewable changes (3)
- modelexpress_client/python/tests/__init__.py
- examples/p2p_transfer_k8s/deploy/vllm-target.yaml
- modelexpress_client/python/modelexpress/types.py
  # 2. kubectl apply -f model-download.yaml
  # 3. Wait for job to complete: kubectl wait --for=condition=complete job/model-download --timeout=1h
- # 4. Deploy vllm-source.yaml and vllm-target.yaml
+ # 4. Deploy deploy/vllm.yaml
Add the Redis flush redeploy step to usage instructions.
Line 10 updates deploy guidance, but the redeploy safety step is still missing. Please add a Redis flush note before redeploy flows to avoid stale metadata.
Suggested doc patch
-# 4. Deploy deploy/vllm.yaml
+# 4. If redeploying vLLM, flush Redis first:
+# kubectl exec -it <redis-pod> -- redis-cli FLUSHALL
+# 5. Deploy deploy/vllm.yaml

As per coding guidelines "When deploying to Kubernetes, flush Redis with redis-cli FLUSHALL before deleting and redeploying vLLM instances to clear stale metadata".
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-# 4. Deploy deploy/vllm.yaml
+# 4. If redeploying vLLM, flush Redis first:
+#    kubectl exec -it <redis-pod> -- redis-cli FLUSHALL
+# 5. Deploy deploy/vllm.yaml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@examples/p2p_transfer_k8s/model-download.yaml` at line 10, Update the
usage/deploy instructions to include a Redis flush safety step before
redeploying vLLM: add a note immediately after the "Deploy deploy/vllm.yaml"
guidance (the comment line "# 4. Deploy deploy/vllm.yaml") instructing
operators to run redis-cli FLUSHALL to clear Redis metadata before
deleting/redeploying vLLM instances; ensure the note clearly states this is
required to avoid stale metadata during redeploy flows.
Replace the two separate loaders (mx-source, mx-target) with a single
MxModelLoader registered as --load-format mx. It auto-detects whether to
load model weights from disk or receive them via RDMA from an existing
source, eliminating the need to manually label nodes.
- Add MxModelLoader with one-shot _detect_source() check against MX server: if ready source exists, receive via RDMA; otherwise load from disk. Both paths register with NIXL and publish metadata so future nodes can discover this one
- Remove MxSourceModelLoader and MxTargetModelLoader classes
- Extract shared helpers (_collect_cuda_tensors, _init_nixl_manager, _log_tensor_summary, _publish_metadata_and_ready) from duplicated code
- Delete vllm-target.yaml, rename vllm-source.yaml to vllm.yaml with --load-format mx
- Add unit tests for shared helpers and detection logic
- Update documentation (ARCHITECTURE.md, DEPLOYMENT.md, CONTRIBUTING.md, K8s README)
PR built on top of "docs: overhaul AI instructions and extract reference documentation" (#146) to avoid documentation drift.
Summary by CodeRabbit
Release Notes
Documentation
New Features
Chores