High-throughput, zero-copy GPU-to-GPU weight transfer for distributed reinforcement learning.
Authors: John Drab, Nathan Kotni, Neha Pradeep, Sakthi Karimanal, Vrushali Harane
This repository contains our implementation of asynchronous RL weight broadcasting using UCCL RDMA, comparing:
- UCCL Collectives
- NCCL Collectives (baseline)
- UCCL P2P (one-sided RDMA writes)
We design a fully distributed reinforcement learning pipeline that integrates RDMA weight transfer, hierarchical collectives, and a routing-table–based load balancing algorithm to achieve 40+ GB/s GPU-to-GPU transfer throughput across multi-node, multi-GPU clusters.
Modern reinforcement learning systems require frequent, asymmetric, one-to-many weight broadcasts from a central learner to many rollout workers. These transfers often exceed hundreds of gigabytes per update, and typical CPU-mediated or RPC-based methods introduce:
- extra memory copies
- high latency
- reduced throughput
- policy lag → slower convergence
RDMA enables direct GPU-to-GPU transfers with no CPU involvement.
UCCL P2P, in particular, supports asynchronous one-sided RDMA writes, making it a natural fit for RL workloads where worker nodes operate at different rates.
This project evaluates UCCL’s suitability for RL and introduces:
- A multi-node RL architecture integrating NCCL + UCCL
- A routing algorithm for optimal parameter distribution
- Zero-copy RDMA weight transfer
- A scalable RL train → transfer → inference pipeline
Key features:
- Zero-copy transfer: direct writes into remote GPU memory using UCCL P2P
- Hybrid communication stack:
  - NCCL for fast intra-node collectives
  - UCCL for inter-node RDMA collectives
  - Gloo for rank discovery + metadata exchange
- Routing table: load-balanced parameter-to-GPU mapping to maximize parallel transfer bandwidth
- Support for both DDP and FSDP across multi-node GPU clusters
- Sustained 40+ GB/s aggregate RDMA bandwidth
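As a rough illustration of how the control and collective backends fit together, here is a minimal sketch of the process-group setup, assuming torchrun-style environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`); the group layout is illustrative, not the repository's actual code:

```python
import os
import torch
import torch.distributed as dist

# Gloo handles rank discovery and metadata exchange (control path).
dist.init_process_group(backend="gloo")
rank, world_size = dist.get_rank(), dist.get_world_size()

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# One NCCL subgroup per node for fast intra-node collectives (data path).
# new_group() must be called by every rank for every group being created.
gpus_per_node = torch.cuda.device_count()
nccl_group = None
for node in range(world_size // gpus_per_node):
    ranks = list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
    group = dist.new_group(ranks=ranks, backend="nccl")
    if rank in ranks:
        nccl_group = group

# Example: intra-node gradient AllReduce over the NCCL subgroup.
grad = torch.ones(4, device="cuda")
dist.all_reduce(grad, group=nccl_group)
```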
Controller Node
├── Collects metadata from all workers
├── Computes routing table (parameter → sender GPU)
└── Broadcasts RDMA endpoints + routing table
Trainer Nodes (DDP/FSDP)
├── Perform PPO updates
├── Intra-node gradient sync via NCCL
└── RDMA-write updated weights to inference GPUs (UCCL P2P)
Inference Nodes
├── Receive weights via one-sided RDMA
└── Perform rollout generation
Control path: Gloo
Data path: UCCL RDMA
┌─────────────────────────────┐
│ Controller Node │
│ - Gathers metadata │
│ - Computes routing table │
│ - Distributes RDMA info │
└──────────────┬──────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Training Nodes │
│ (DDP/FSDP) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ - PPO updates │ │
│ │ - NCCL intra-node gradient AllReduce │ │
│ │ - UCCL P2P → RDMA-write weights to inference GPUs │ │
│ └────────────────────────────────────────────────────────┘ │
└───────────────────────────────┬──────────────────────────────┘
│ RDMA (one-sided)
▼
┌──────────────────────────────────────────────┐
│ Inference Nodes │
│ - Receive weights via one-sided RDMA │
│ - Perform rollouts │
└──────────────────────────────────────────────┘
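A rough sketch of the controller's control-path exchange over the Gloo group: `gather_object` and `broadcast_object_list` are real torch.distributed APIs, while the metadata fields and the placeholder routing computation are illustrative assumptions, not the repository's actual code.

```python
import torch.distributed as dist

CONTROLLER = 0  # rank of the controller node

def exchange_control_plane(local_meta):
    """Gather per-rank metadata on the controller, compute the routing
    table there, and broadcast it back to every rank over Gloo."""
    rank, world = dist.get_rank(), dist.get_world_size()

    # 1) Controller gathers metadata (e.g., RDMA endpoints, parameter sizes).
    gathered = [None] * world if rank == CONTROLLER else None
    dist.gather_object(local_meta, gathered, dst=CONTROLLER)

    # 2) Controller computes routing info and broadcasts it to everyone.
    if rank == CONTROLLER:
        # Placeholder: real code would compute the byte-balanced routing
        # table and pack the RDMA endpoint info here.
        payload = [{"routing": {}, "endpoints": [m["endpoint"] for m in gathered]}]
    else:
        payload = [None]
    dist.broadcast_object_list(payload, src=CONTROLLER)
    return payload[0]

# Usage (after dist.init_process_group(backend="gloo")):
# table = exchange_control_plane({"endpoint": my_rdma_endpoint, "params": param_sizes})
```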
- RDMA-capable NICs (RoCEv2)
- GPUs with GPUDirect RDMA
- PyTorch w/ NCCL
- UCCL built with ROCm
- Python ≥ 3.9
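A quick way to sanity-check the software prerequisites from Python (a convenience sketch, not part of the repository):

```python
import torch
import torch.distributed as dist

assert torch.cuda.is_available(), "no GPU visible to PyTorch"
assert dist.is_nccl_available(), "PyTorch built without NCCL support"
assert dist.is_gloo_available(), "PyTorch built without Gloo support"

# torch.version.hip is set on ROCm builds; torch.version.cuda on CUDA builds.
print("device:", torch.cuda.get_device_name(0))
print("ROCm:", torch.version.hip, "| CUDA:", torch.version.cuda)
```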
These variables configure both NCCL and UCCL for RDMA-based GPU networking.
# --- Networking interfaces (replace with your cluster NIC) ---
export NCCL_SOCKET_IFNAME="enp49s0f1np1"
export GLOO_SOCKET_IFNAME="enp49s0f1np1"
# --- NCCL RDMA settings ---
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA="^bnxt_re6"
export NCCL_IB_GID_INDEX=3
export NCCL_MIN_NCHANNELS=4 # Improves hybrid NCCL + UCCL performance
export NCCL_MAX_NCHANNELS=8
export NCCL_NSOCKS_PERTHREAD=2
export NCCL_SOCKET_NTHREADS=1
export NCCL_DEBUG=INFO
# --- UCCL RDMA settings ---
export UCCL_IB_GID_INDEX=3
# --- CPU threading (recommended for multi-node runs) ---
export OMP_NUM_THREADS=1
# --- Set to the IP of node 0 (master node) ---
export MASTER_ADDR=<MASTER_NODE_IP>

Note: Replace `enp49s0f1np1` with the NIC interface on your cluster. The `NCCL_MIN_NCHANNELS` and `NCCL_MAX_NCHANNELS` settings significantly speed up our hybrid NCCL + UCCL setup.
export MASTER_ADDR=<IP_OF_NODE_0>

git clone https://github.com/uccl-project/uccl.git --recursive
cd uccl && bash build_and_install.sh rocm6 p2p 3.10

source setup_rdma_env.sh
cd uccl_p2p_broadcast_apis/

# GPT-2: 4 nodes x 8 GPUs (set NODE_RANK to 0..3, one value per node)
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR --master_port=29501 \
  uccl_p2p_apis_for_rl_weights_gpt2.py \
  --num_training=24 --num_inference=8 \
  --rl_iterations=10 --ppo_epochs=4 --rollouts_per_gpu=8

# Qwen2.5: 2 nodes x 8 GPUs
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR --master_port=29501 \
  uccl_p2p_apis_for_rl_weights_qwen2.5.py \
  --model_name Qwen/Qwen2.5-0.5B \
  --num_training=8 --num_inference=8 \
  --dtype bfloat16

| Trainers | Transfer Time | Aggregate BW |
|---|---|---|
| 8 | 0.055 s | 39.49 GB/s |
| 16 | 0.084 s | 38.27 GB/s |
| 24 | 0.078 s | 37.52 GB/s |
| 32 | 0.064 s | 27.48 GB/s |
- UCCL Collective ≈ NCCL in DDP/FSDP training.
- UCCL P2P scales linearly with sender count up to ~24 GPUs.
- Sustains 40+ GB/s aggregate GPU-to-GPU transfer throughput.
- Zero CPU involvement reduces policy lag and improves RL stability.
UCCL P2P uses a greedy byte-balanced routing strategy:
- Controller gathers parameter metadata
- For each `(parameter, inference_gpu)` pair, assign the training GPU with the lowest cumulative assigned bytes (see the sketch below)
- Broadcast routing table to all GPUs
- Trainers perform RDMA writes with no runtime coordination
This ensures:
- maximal parallelism
- balanced NIC usage
- minimized makespan
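A minimal sketch of the greedy byte-balanced assignment described above; `param_bytes`, `training_gpus`, and `inference_gpus` are illustrative inputs, not the repository's actual data structures:

```python
from collections import defaultdict

def build_routing_table(param_bytes, training_gpus, inference_gpus):
    """Map each (parameter, inference_gpu) pair to the training GPU with
    the lowest cumulative assigned bytes, balancing load across senders."""
    load = defaultdict(int)  # cumulative bytes assigned to each training GPU
    table = {}               # (param_name, inference_gpu) -> sender GPU
    for name, nbytes in param_bytes.items():
        for inf_gpu in inference_gpus:
            sender = min(training_gpus, key=lambda g: load[g])
            table[(name, inf_gpu)] = sender
            load[sender] += nbytes
    return table

# Example: two trainers, one inference GPU, three parameters (sizes in bytes).
print(build_routing_table({"wte": 8, "ln_f": 1, "head": 8}, [0, 1], [4]))
```

Because the table is computed once by the controller and broadcast ahead of time, every trainer knows its assignments up front and can issue its RDMA writes without any runtime coordination.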
Future work:
- FSDP-aware RDMA transfers (DTensor integration)
- Multi-NIC, topology-aware routing
- Scaling to 70B+ parameter models
- Integration with end-to-end RL frameworks (e.g., RLlib, vLLM-RL)
- Inference-stage overlapping of computation + transfers
See our full paper for complete citations.
Key references include:
- Journey to 2-Second RL Weight Transfer
- UCCL transport layer (Zhou et al., 2025)