Merged
4 changes: 4 additions & 0 deletions .gitignore
@@ -52,3 +52,7 @@ models.db
**/__pycache__/
docs/FEEDBACK.md
docs/optimization-comparison.md
docs/tmp

# Python
modelexpress_client/python/modelexpress.egg-info
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -125,7 +125,7 @@ Cache directory resolution order: `MODEL_EXPRESS_CACHE_DIRECTORY` -> `HF_HUB_CAC
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_EXPRESS_URL` | `localhost:8001` | gRPC server address |
| `MX_REGISTER_LOADERS` | `1` | Auto-register mx-source/mx-target vLLM loaders |
| `MX_REGISTER_LOADERS` | `1` | Auto-register the mx vLLM loader |
| `MX_CONTIGUOUS_REG` | `0` | Enable contiguous region registration (experimental) |
| `MX_EXPECTED_WORKERS` | `8` | Number of GPU workers to wait for |
| `MX_SYNC_PUBLISH` | `1` | Source: wait for all workers before publishing |
39 changes: 16 additions & 23 deletions docs/ARCHITECTURE.md
@@ -33,11 +33,11 @@ graph TD
end

subgraph "P2P Transfer Mode"
subgraph "Node A - Source"
A[vLLM + MxSourceModelLoader]
subgraph "Node A"
A[vLLM + MxModelLoader]
end
subgraph "Node B - Target"
B[vLLM + MxTargetModelLoader]
subgraph "Node B"
B[vLLM + MxModelLoader]
end
A -->|gRPC metadata| S2[ModelExpress Server]
B -->|gRPC metadata| S2
@@ -108,7 +108,7 @@ ModelExpress/
│ ├── __init__.py # Package init, vLLM loader auto-registration
│ ├── client.py # MxClient gRPC client
│ ├── nixl_transfer.py # NixlTransferManager
│ ├── vllm_loader.py # MxSourceModelLoader, MxTargetModelLoader
│ ├── vllm_loader.py # MxModelLoader
│ ├── vllm_worker.py # ModelExpressWorker (custom vLLM worker)
│ ├── types.py # TensorDescriptor, WorkerMetadata dataclasses
│ ├── p2p_pb2.py # Generated protobuf stubs
@@ -158,8 +158,7 @@ ModelExpress/
│ ├── p2p_transfer_k8s/ # GPU-to-GPU weight transfer example
│ │ ├── README.md
│ │ ├── Dockerfile.client # vLLM + ModelExpress integration
│ │ ├── vllm-source.yaml
│ │ ├── vllm-target.yaml
│ │ ├── vllm.yaml
│ │ ├── modelexpress-server.yaml # gRPC server + Redis sidecar
│ │ └── model-download.yaml
│ └── aggregated_k8s/ # Dynamo integration example
@@ -406,10 +405,10 @@ Loading precedence: CLI args > environment variables > config file > defaults.

| Module | Purpose |
|--------|---------|
| `__init__.py` | Package init, exports `register_modelexpress_loaders()` for callers to register `mx-source`/`mx-target` loaders with vLLM |
| `__init__.py` | Package init, exports `register_modelexpress_loaders()` for callers to register the `mx` loader with vLLM |
| `client.py` | `MxClient` - gRPC client with session management, ready polling |
| `nixl_transfer.py` | `NixlTransferManager` - NIXL agent lifecycle, tensor registration, RDMA transfers |
| `vllm_loader.py` | `MxSourceModelLoader`, `MxTargetModelLoader` - custom vLLM model loaders |
| `vllm_loader.py` | `MxModelLoader` - custom vLLM model loader |
| `vllm_worker.py` | `ModelExpressWorker` - custom vLLM worker class (use `--worker-cls=modelexpress.vllm_worker.ModelExpressWorker`) |
| `types.py` | `TensorDescriptor`, `WorkerMetadata`, `GetMetadataResponse` dataclasses |
| `p2p_pb2.py` / `p2p_pb2_grpc.py` | Generated protobuf/gRPC stubs |
@@ -439,22 +438,17 @@ Manages a NIXL agent and RDMA transfers for a single GPU worker:
| `receive_from_source(source_metadata, source_tensors, ...)` | Execute RDMA read transfer with optional coalescing |
| `shutdown()` | Clean up NIXL agent and resources |

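The `receive_from_source` row mentions optional coalescing of contiguous regions. A minimal sketch of that merging idea, assuming byte-addressed descriptors — `TensorDesc` and `coalesce` are illustrative names for this sketch, not ModelExpress's actual API:

```python
from dataclasses import dataclass

@dataclass
class TensorDesc:
    # Hypothetical stand-in for a registered-tensor descriptor.
    addr: int  # device pointer (bytes)
    size: int  # length in bytes

def coalesce(descs: list[TensorDesc]) -> list[TensorDesc]:
    """Merge descriptors whose regions sit back-to-back in memory,
    so a single RDMA read can cover several tensors at once."""
    merged: list[TensorDesc] = []
    for d in sorted(descs, key=lambda d: d.addr):
        if merged and merged[-1].addr + merged[-1].size == d.addr:
            # Adjacent region: extend the previous descriptor in place.
            merged[-1] = TensorDesc(merged[-1].addr, merged[-1].size + d.size)
        else:
            merged.append(TensorDesc(d.addr, d.size))
    return merged
```

Fewer, larger transfers is the whole point of the coalescing option: per-transfer setup cost is amortized across adjacent tensors.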
### vLLM Loaders
### vLLM Loader

**MxSourceModelLoader** (extends `DefaultModelLoader`):
1. Load weights from disk
2. Register raw tensors with NIXL (BEFORE FP8 processing)
3. Publish metadata to ModelExpress server
**MxModelLoader** (extends `BaseModelLoader`):
1. Query MX server to detect whether a ready source exists for this model/rank
2. If source found: create dummy tensors, receive via RDMA, register + publish
3. If no source: load weights from disk, register + publish
4. Run `process_weights_after_loading()` (FP8 transform)
5. Signal readiness after warmup

**MxTargetModelLoader** (extends `DummyModelLoader`):
1. Create dummy tensors matching source layout
2. Wait for source readiness (via `MxClient.wait_for_ready()`)
3. Receive weights via RDMA transfer (with transfer-time coalescing of contiguous regions)
4. Run `process_weights_after_loading()` (same FP8 transform)
Detection is a one-shot check with no retry loop. If the source is still warming up, this node loads from disk and becomes a source itself. Both paths register with NIXL and publish metadata, so future nodes can discover this one.
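The one-shot check above can be sketched as a single decision with no retry loop. `StubMxClient` and `choose_load_path` are hypothetical stand-ins for illustration; the real `MxClient` queries the ModelExpress server over gRPC:

```python
class StubMxClient:
    # Hypothetical stand-in for MxClient, backed by an in-memory set
    # instead of a gRPC call to the ModelExpress server.
    def __init__(self, ready_sources: set[str]):
        self._ready = ready_sources

    def source_ready(self, model: str) -> bool:
        return model in self._ready

def choose_load_path(client: StubMxClient, model: str) -> str:
    """One-shot check, no retry: receive via RDMA if a ready source
    exists; otherwise load from disk (and publish as a source later)."""
    return "rdma" if client.source_ready(model) else "disk"
```

A node that hits the `"disk"` branch still registers with NIXL and publishes metadata, so the next node's check can find it.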

Both loaders require the `MODEL_NAME` environment variable to identify the model for coordination.
The loader reads the model name from vLLM's `model_config.model` (set via the `--model` CLI argument). Shared helpers (`_collect_cuda_tensors`, `_init_nixl_manager`, `_log_tensor_summary`, `_publish_metadata_and_ready`) are module-level functions used by the loader.

Module-level globals `_raw_tensor_registry` and `_nixl_managers` in `vllm_loader.py` bridge loaders and clients - vLLM's loader API doesn't expose loader instances after `load_model()` returns, so source loaders store state in these dicts (keyed by device ID) for the MxClient to access.
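A toy sketch of that bridging pattern, using plain strings in place of tensors and NIXL managers — the `stash`/`fetch` helper names are illustrative, not the module's actual functions:

```python
# Module-level dicts survive after load_model() returns, even though
# the loader instance itself is no longer reachable through vLLM's API.
_raw_tensor_registry: dict[int, dict[str, object]] = {}
_nixl_managers: dict[int, object] = {}

def stash(device_id: int, tensors: dict[str, object], manager: object) -> None:
    """Called from inside the loader: park per-device state by device ID."""
    _raw_tensor_registry[device_id] = tensors
    _nixl_managers[device_id] = manager

def fetch(device_id: int) -> tuple[dict[str, object], object]:
    """Called later by the client side to pick the state back up."""
    return _raw_tensor_registry[device_id], _nixl_managers[device_id]
```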

@@ -550,8 +544,7 @@ graph TD

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | (none) | Model identifier for P2P coordination (e.g., `deepseek-ai/DeepSeek-V3`) |
| `MX_REGISTER_LOADERS` | `1` | Auto-register mx-source/mx-target loaders with vLLM |
| `MX_REGISTER_LOADERS` | `1` | Auto-register the mx loader with vLLM |
| `MODEL_EXPRESS_URL` | `localhost:8001` | gRPC server address |
| `MX_SERVER_ADDRESS` | `localhost:8001` | Backward-compat alias for `MODEL_EXPRESS_URL` |
| `MX_CONTIGUOUS_REG` | `0` | Enable contiguous region registration (experimental) |
25 changes: 11 additions & 14 deletions docs/DEPLOYMENT.md
@@ -219,17 +219,16 @@ See [`../examples/aggregated_k8s/README.md`](../examples/aggregated_k8s/README.m

## P2P GPU Weight Transfers

ModelExpress supports GPU-to-GPU model weight transfers between vLLM instances using NVIDIA NIXL over RDMA. One "source" vLLM instance loads model weights from disk and transfers them directly to "target" instances via GPU memory.
ModelExpress supports GPU-to-GPU model weight transfers between vLLM instances using NVIDIA NIXL over RDMA. Use `--load-format mx`, which auto-detects whether to load from disk or receive via RDMA.

### P2P Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_NAME` | (none) | Model identifier for P2P coordination (e.g., `deepseek-ai/DeepSeek-V3`) |
| `REDIS_URL` | `redis://localhost:6379` | Redis connection URL for P2P state storage |
| `MODEL_EXPRESS_URL` | `localhost:8001` | gRPC server address |
| `MX_SERVER_ADDRESS` | `localhost:8001` | Backward-compat alias for `MODEL_EXPRESS_URL` |
| `MX_REGISTER_LOADERS` | `1` | Auto-register mx-source/mx-target loaders with vLLM |
| `MX_REGISTER_LOADERS` | `1` | Auto-register the mx loader with vLLM |
| `MX_CONTIGUOUS_REG` | `0` | Contiguous region registration (experimental) |
| `MX_EXPECTED_WORKERS` | `8` | Number of GPU workers to wait for |
| `MX_SYNC_PUBLISH` | `1` | Source: wait for all workers before publishing |
@@ -254,20 +253,19 @@ vLLM instances must use the custom worker class for loader registration in spawn

### P2P Kubernetes Deployment

Deploy multiple identical instances - the first one loads from disk and subsequent ones receive via RDMA:

```bash
NAMESPACE=my-namespace

# Deploy server + Redis
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/modelexpress-server.yaml

# Deploy source vLLM instance (loads real weights)
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/vllm-source.yaml

# Deploy target vLLM instance (receives via RDMA)
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/vllm-target.yaml
# Deploy vLLM instances with mx loader
kubectl -n $NAMESPACE apply -f examples/p2p_transfer_k8s/vllm.yaml

# Monitor
watch kubectl -n $NAMESPACE get pods -l 'app in (mx-source, mx-target)'
watch kubectl -n $NAMESPACE get pods -l app=mx-vllm
```

See [`../examples/p2p_transfer_k8s/README.md`](../examples/p2p_transfer_k8s/README.md) for the full P2P transfer guide including architecture, prerequisites, and performance expectations.
@@ -278,18 +276,17 @@ See [`../examples/p2p_transfer_k8s/README.md`](../examples/p2p_transfer_k8s/READ
# Stream server logs
kubectl -n $NAMESPACE logs -f deploy/modelexpress-server

# Stream source/target logs
kubectl -n $NAMESPACE logs -f deploy/mx-source
kubectl -n $NAMESPACE logs -f deploy/mx-target
# Stream vLLM instance logs
kubectl -n $NAMESPACE logs -f deploy/mx-vllm

# Check Redis state (P2P metadata)
kubectl -n $NAMESPACE exec deploy/modelexpress-server -c redis -- redis-cli KEYS '*'

# Flush Redis (clear stale metadata - do this on redeploy)
kubectl -n $NAMESPACE exec deploy/modelexpress-server -c redis -- redis-cli FLUSHALL

# Test inference on target
kubectl -n $NAMESPACE exec deploy/mx-target -- curl -s http://localhost:8000/v1/completions \
# Test inference
kubectl -n $NAMESPACE exec deploy/mx-vllm -- curl -s http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-ai/DeepSeek-V3", "prompt": "Hello", "max_tokens": 10}'
```
16 changes: 1 addition & 15 deletions examples/p2p_transfer_k8s/Dockerfile.client
@@ -1,7 +1,7 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

FROM vllm/vllm-openai:v0.12.0
FROM vllm/vllm-openai:v0.17.1

# Copy and install ModelExpress Python client
# Proto stubs (p2p_pb2.py, p2p_pb2_grpc.py) are checked-in source artifacts;
@@ -12,19 +12,5 @@ RUN pip uninstall -y modelexpress 2>/dev/null || true && \
rm -rf "$(python3 -c 'import site; print(site.getsitepackages()[0])')"/modelexpress* && \
pip install .

# Patch vLLM's model loader __init__.py to include ModelExpress loaders
# This ensures the loaders are available in ALL processes (including workers)
RUN LOADER_INIT="$(python3 -c 'import site; print(site.getsitepackages()[0])')/vllm/model_executor/model_loader/__init__.py" && \
echo '' >> $LOADER_INIT && \
echo '# ModelExpress custom loaders for FP8 models (DeepSeek-V3, etc.)' >> $LOADER_INIT && \
echo 'import os as _os' >> $LOADER_INIT && \
echo 'if _os.environ.get("MX_REGISTER_LOADERS", "0") == "1":' >> $LOADER_INIT && \
echo ' try:' >> $LOADER_INIT && \
echo ' from modelexpress.vllm_loader import MxSourceModelLoader, MxTargetModelLoader' >> $LOADER_INIT && \
echo ' _LOAD_FORMAT_TO_MODEL_LOADER["mx-source"] = MxSourceModelLoader' >> $LOADER_INIT && \
echo ' _LOAD_FORMAT_TO_MODEL_LOADER["mx-target"] = MxTargetModelLoader' >> $LOADER_INIT && \
echo ' except ImportError:' >> $LOADER_INIT && \
echo ' pass' >> $LOADER_INIT

# Back to default workdir
WORKDIR /workspace
53 changes: 16 additions & 37 deletions examples/p2p_transfer_k8s/README.md
@@ -6,11 +6,11 @@ This example demonstrates how to set up ModelExpress for P2P GPU weight transfer

```mermaid
graph TD
subgraph "Node A - Source"
A[vLLM + MxSourceModelLoader<br/>- Loads weights from disk<br/>- Registers tensors with NIXL<br/>- Publishes metadata via MxClient<br/>- Publishes ready flag]
subgraph "Node A"
A[vLLM + MxModelLoader<br/>- Loads weights from disk<br/>- Registers tensors with NIXL<br/>- Publishes metadata via MxClient<br/>- Publishes ready flag]
end
subgraph "Node B - Target"
B[vLLM + MxTargetModelLoader<br/>- Starts with dummy weights<br/>- Waits for source ready flag<br/>- Receives weights via NIXL<br/>- Runs FP8 processing<br/>- Serves inference]
subgraph "Node B"
B[vLLM + MxModelLoader<br/>- Detects source via MX server<br/>- Receives weights via NIXL<br/>- Runs FP8 processing<br/>- Serves inference]
end
A -- "RDMA via NIXL" --> B
A --> S
@@ -20,7 +20,7 @@ graph TD

### Key Design Points

1. **Custom vLLM Loaders**: NIXL transfer logic runs inside vLLM via `--load-format mx-source` / `--load-format mx-target`
1. **Custom vLLM Loader**: NIXL transfer logic runs inside vLLM via `--load-format mx`
2. **MxClient**: All gRPC communication goes through `MxClient` (workers never access Redis directly)
3. **FP8 Support**: Raw tensors (including `weight_scale_inv`) transfer BEFORE FP8 processing
4. **Tensor Parallelism**: Full TP support with rank-matched transfers (one NIXL agent per GPU)
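Design point 4 can be sketched as pairing workers by tensor-parallel rank, under the assumption that source and target run the same TP degree — `match_ranks` is an illustrative helper for this sketch, not part of ModelExpress:

```python
def match_ranks(
    source_workers: dict[int, str],  # TP rank -> source worker id
    target_workers: dict[int, str],  # TP rank -> target worker id
) -> list[tuple[str, str]]:
    """Pair each target GPU worker with the source worker holding the
    same TP shard: rank i transfers only to rank i."""
    if source_workers.keys() != target_workers.keys():
        raise ValueError("source and target must expose identical TP ranks")
    return [(source_workers[r], target_workers[r]) for r in sorted(source_workers)]
```

Rank matching is what makes the "one NIXL agent per GPU" layout work: each shard moves between the two GPUs that hold the same slice of the model.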
@@ -43,40 +43,20 @@ kubectl create secret generic hf-token-secret --from-literal=HF_TOKEN=<your-toke
### 2. Deploy ModelExpress Server

```bash
kubectl apply -f modelexpress-server.yaml
kubectl apply -f deploy/modelexpress-server.yaml
```

### 3. Deploy Source vLLM Instance
### 3. Deploy vLLM Instances

This instance loads real weights from HuggingFace and becomes the source:
Deploy identical instances - the first loads from disk, subsequent ones receive via RDMA:

```bash
kubectl apply -f vllm-source.yaml
kubectl apply -f deploy/vllm.yaml
# Scale up for more replicas
kubectl scale deployment/mx-vllm --replicas=2
```

Wait for it to be ready and client to publish metadata:

```bash
kubectl logs deployment/mx-source -c client -f
```

### 4. Deploy Target vLLM Instance

This instance starts with dummy weights and receives real weights via P2P:

```bash
kubectl apply -f vllm-target.yaml
```

The client will automatically:
1. Wait for vLLM ZMQ sockets to be ready
2. Query ModelExpress server for the model
3. Find source metadata and receive weights via NIXL RDMA
4. Publish its own metadata (becomes another source)

```bash
kubectl logs deployment/mx-target -c client -f
```
The `mx` loader checks the MX server on startup. If a ready source exists, it receives via RDMA. Otherwise it loads from disk and becomes a source for future nodes.

## Environment Variables

@@ -106,14 +86,13 @@ With TP=4, this creates sockets: `/tmp/mx/vllm-0.sock`, `/tmp/mx/vllm-1.sock`, e
### Check ZMQ sockets are created

```bash
kubectl exec -it deployment/mx-source -c vllm -- ls -la /tmp/mx/
kubectl exec -it deployment/mx-vllm -c vllm -- ls -la /tmp/mx/
```

### Check client logs

```bash
kubectl logs deployment/mx-source -c client
kubectl logs deployment/mx-target -c client
kubectl logs deployment/mx-vllm -c client
```

### Check Redis connectivity
@@ -125,13 +104,13 @@ kubectl exec -it deployment/modelexpress-server -- redis-cli -h modelexpress-red
### Verify InfiniBand is working

```bash
kubectl exec -it deployment/mx-source -c client -- ibstat
kubectl exec -it deployment/mx-vllm -c client -- ibstat
```

### Check UCX configuration

```bash
kubectl exec -it deployment/mx-source -c client -- ucx_info -d
kubectl exec -it deployment/mx-vllm -c client -- ucx_info -d
```

## License
7 changes: 5 additions & 2 deletions examples/p2p_transfer_k8s/deploy/modelexpress-server.yaml
@@ -37,7 +37,7 @@ spec:
spec:
containers:
- name: modelexpress-server
image: nvcr.io/nvidian/dynamo-dev/modelexpress-server:k8s-backend
image: nvcr.io/nvidian/dynamo-dev/modelexpress-server:latest
imagePullPolicy: Always
ports:
- containerPort: 8001
@@ -46,7 +46,10 @@ spec:
value: "memory"
resources:
requests:
cpu: 4
cpu: "4"
memory: 8Gi
limits:
cpu: "4"
memory: 8Gi
livenessProbe:
tcpSocket:
12 changes: 7 additions & 5 deletions examples/p2p_transfer_k8s/deploy/persistence/README.md
@@ -45,8 +45,8 @@ so it recovers state without sources needing to re-publish.
| `redis-standalone.yaml` | Standalone Redis deployment with PVC persistence |
| `crd-modelmetadata.yaml` | Custom Resource Definition for model metadata |
| `rbac-modelmetadata.yaml` | RBAC roles for CRD access |
| `vllm-source-redis.yaml` | vLLM source configured for Redis backend |
| `vllm-target-redis.yaml` | vLLM target configured for Redis backend |
| `vllm-redis.yaml` | vLLM instance configured for Redis backend |
| `vllm-kubernetes.yaml` | vLLM instance configured for Kubernetes CRD backend |

## Usage

@@ -59,9 +59,8 @@ kubectl apply -f persistence/redis-standalone.yaml
# 2. Deploy MX server with Redis write-through
kubectl apply -f persistence/modelexpress-server-redis.yaml

# 3. Deploy vLLM source/target (use Redis-aware YAMLs)
kubectl apply -f persistence/vllm-source-redis.yaml
kubectl apply -f persistence/vllm-target-redis.yaml
# 3. Deploy vLLM instances (mx loader auto-detects source/target role)
kubectl apply -f persistence/vllm-redis.yaml
```

Set `MX_METADATA_BACKEND=redis` and `REDIS_URL=redis://redis:6379` on the server.
@@ -75,6 +74,9 @@ kubectl apply -f persistence/rbac-modelmetadata.yaml

# 2. Deploy MX server with K8s CRD backend
kubectl apply -f persistence/modelexpress-server-kubernetes.yaml

# 3. Deploy vLLM instances (mx loader auto-detects source/target role)
kubectl apply -f persistence/vllm-kubernetes.yaml
```

Set `MX_METADATA_BACKEND=kubernetes` on the server. Requires a ServiceAccount with
@@ -41,8 +41,8 @@ spec:
serviceAccountName: modelexpress
containers:
- name: modelexpress-server
image: nvcr.io/nvidian/dynamo-dev/modelexpress-server:k8s-backend
imagePullPolicy: Always
image: nvcr.io/nvidian/dynamo-dev/modelexpress-server:latest
imagePullPolicy: IfNotPresent
ports:
- containerPort: 8001
env:
@@ -54,7 +54,10 @@ spec:
fieldPath: metadata.namespace
resources:
requests:
cpu: 4
cpu: "4"
memory: 8Gi
limits:
cpu: "4"
memory: 8Gi
livenessProbe:
tcpSocket: