High-throughput, zero-copy GPU-to-GPU weight transfer for distributed reinforcement learning.
Authors: John Drab, Nathan Kotni, Neha Pradeep, Sakthi Karimanal, Vrushali Harane
This repository contains our implementation of asynchronous RL weight broadcasting using UCCL RDMA, comparing:
- UCCL Collectives
- NCCL Collectives (baseline)
- UCCL P2P (one-sided RDMA writes)
We design a fully distributed reinforcement learning pipeline that integrates RDMA weight transfer, hierarchical collectives, and a routing-table–based load balancing algorithm to achieve 40+ GB/s GPU-to-GPU transfer throughput across multi-node, multi-GPU clusters.
Modern reinforcement learning systems require frequent, asymmetric, one-to-many weight broadcasts from a central learner to many rollout workers. These transfers often exceed hundreds of gigabytes per update, and typical CPU-mediated or RPC-based methods introduce:
- extra memory copies
- high latency
- reduced throughput
- policy lag → slower convergence
RDMA enables direct GPU-to-GPU transfers with no CPU involvement.
UCCL P2P, in particular, supports asynchronous one-sided RDMA writes, making it a natural fit for RL workloads where worker nodes operate at different rates.
This project evaluates UCCL’s suitability for RL and introduces:
- A multi-node RL architecture integrating NCCL + UCCL
- A routing algorithm for optimal parameter distribution
- Zero-copy RDMA weight transfer
- A scalable RL train → transfer → inference pipeline
Key features:
- Zero-copy transfer: direct writes into remote GPU memory using UCCL P2P
- Hybrid communication stack:
  - NCCL for fast intra-node collectives
  - UCCL for inter-node RDMA collectives
  - Gloo for rank discovery + metadata exchange
- Routing table: load-balanced parameter-to-GPU mapping to maximize parallel transfer bandwidth
- Support for both DDP and FSDP across multi-node GPU clusters
- Sustained 40+ GB/s aggregate RDMA bandwidth
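As a rough illustration of how the control and collective backends fit together, here is a minimal sketch of the process-group setup, assuming torchrun-style environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`); the group layout is illustrative, not the repository's actual code:

```python
import os
import torch
import torch.distributed as dist

# Gloo handles rank discovery and metadata exchange (control path).
dist.init_process_group(backend="gloo")
rank, world_size = dist.get_rank(), dist.get_world_size()

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# One NCCL subgroup per node for fast intra-node collectives (data path).
# new_group() must be called by every rank for every group being created.
gpus_per_node = torch.cuda.device_count()
nccl_group = None
for node in range(world_size // gpus_per_node):
    ranks = list(range(node * gpus_per_node, (node + 1) * gpus_per_node))
    group = dist.new_group(ranks=ranks, backend="nccl")
    if rank in ranks:
        nccl_group = group

# Example: intra-node gradient AllReduce over the NCCL subgroup.
grad = torch.ones(4, device="cuda")
dist.all_reduce(grad, group=nccl_group)
```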
Controller Node
├── Collects metadata from all workers
├── Computes routing table (parameter → sender GPU)
└── Broadcasts RDMA endpoints + routing table
Trainer Nodes (DDP/FSDP)
├── Perform PPO updates
├── Intra-node gradient sync via NCCL
└── RDMA-write updated weights to inference GPUs (UCCL P2P)
Inference Nodes
├── Receive weights via one-sided RDMA
└── Perform rollout generation
Control path: Gloo
Data path: UCCL RDMA
┌─────────────────────────────┐
│ Controller Node │
│ - Gathers metadata │
│ - Computes routing table │
│ - Distributes RDMA info │
└──────────────┬──────────────┘
│
▼
┌──────────────────────────────────────────────────────────────┐
│ Training Nodes │
│ (DDP/FSDP) │
│ ┌────────────────────────────────────────────────────────┐ │
│ │ - PPO updates │ │
│ │ - NCCL intra-node gradient AllReduce │ │
│ │ - UCCL P2P → RDMA-write weights to inference GPUs │ │
│ └────────────────────────────────────────────────────────┘ │
└───────────────────────────────┬──────────────────────────────┘
│ RDMA (one-sided)
▼
┌──────────────────────────────────────────────┐
│ Inference Nodes │
│ - Receive weights via one-sided RDMA │
│ - Perform rollouts │
└──────────────────────────────────────────────┘
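A rough sketch of the controller's control-path exchange over the Gloo group: `gather_object` and `broadcast_object_list` are real torch.distributed APIs, while the metadata fields and the placeholder routing computation are illustrative assumptions, not the repository's actual code.

```python
import torch.distributed as dist

CONTROLLER = 0  # rank of the controller node

def exchange_control_plane(local_meta):
    """Gather per-rank metadata on the controller, compute the routing
    table there, and broadcast it back to every rank over Gloo."""
    rank, world = dist.get_rank(), dist.get_world_size()

    # 1) Controller gathers metadata (e.g., RDMA endpoints, parameter sizes).
    gathered = [None] * world if rank == CONTROLLER else None
    dist.gather_object(local_meta, gathered, dst=CONTROLLER)

    # 2) Controller computes routing info and broadcasts it to everyone.
    if rank == CONTROLLER:
        # Placeholder: real code would compute the byte-balanced routing
        # table and pack the RDMA endpoint info here.
        payload = [{"routing": {}, "endpoints": [m["endpoint"] for m in gathered]}]
    else:
        payload = [None]
    dist.broadcast_object_list(payload, src=CONTROLLER)
    return payload[0]

# Usage (after dist.init_process_group(backend="gloo")):
# table = exchange_control_plane({"endpoint": my_rdma_endpoint, "params": param_sizes})
```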
- RDMA-capable NICs (RoCEv2)
- GPUs with GPUDirect RDMA
- PyTorch w/ NCCL
- UCCL built with ROCm
- Python ≥ 3.9
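A quick way to sanity-check the software prerequisites from Python (a convenience sketch, not part of the repository):

```python
import torch
import torch.distributed as dist

assert torch.cuda.is_available(), "no GPU visible to PyTorch"
assert dist.is_nccl_available(), "PyTorch built without NCCL support"
assert dist.is_gloo_available(), "PyTorch built without Gloo support"

# torch.version.hip is set on ROCm builds; torch.version.cuda on CUDA builds.
print("device:", torch.cuda.get_device_name(0))
print("ROCm:", torch.version.hip, "| CUDA:", torch.version.cuda)
```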
These variables configure both NCCL and UCCL for RDMA-based GPU networking.
# --- Networking interfaces (replace with your cluster NIC) ---
export NCCL_SOCKET_IFNAME="enp49s0f1np1"
export GLOO_SOCKET_IFNAME="enp49s0f1np1"
# --- NCCL RDMA settings ---
export NCCL_IB_DISABLE=0
export NCCL_IB_HCA="^bnxt_re6"
export NCCL_IB_GID_INDEX=3
export NCCL_MIN_NCHANNELS=4 # Improves hybrid NCCL + UCCL performance
export NCCL_MAX_NCHANNELS=8
export NCCL_NSOCKS_PERTHREAD=2
export NCCL_SOCKET_NTHREADS=1
export NCCL_DEBUG=INFO
# --- UCCL RDMA settings ---
export UCCL_IB_GID_INDEX=3
# --- CPU threading (recommended for multi-node runs) ---
export OMP_NUM_THREADS=1
# --- Set to the IP of node 0 (master node) ---
export MASTER_ADDR=<MASTER_NODE_IP>

Note: Replace `enp49s0f1np1` with the NIC interface on your cluster. The `NCCL_MIN_NCHANNELS` and `NCCL_MAX_NCHANNELS` settings significantly speed up our hybrid NCCL + UCCL setup.
export MASTER_ADDR=<IP_OF_NODE_0>

git clone https://github.com/uccl-project/uccl.git --recursive
cd uccl && bash build_and_install.sh rocm6 p2p 3.10

source setup_rdma_env.sh
cd uccl_p2p_broadcast_apis/

# GPT-2: 4 nodes x 8 GPUs (set NODE_RANK to 0..3, one value per node)
torchrun --nproc_per_node=8 --nnodes=4 --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR --master_port=29501 \
  uccl_p2p_apis_for_rl_weights_gpt2.py \
  --num_training=24 --num_inference=8 \
  --rl_iterations=10 --ppo_epochs=4 --rollouts_per_gpu=8

# Qwen2.5: 2 nodes x 8 GPUs
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR --master_port=29501 \
  uccl_p2p_apis_for_rl_weights_qwen2.5.py \
  --model_name Qwen/Qwen2.5-0.5B \
  --num_training=8 --num_inference=8 \
  --dtype bfloat16

| Trainers | Transfer Time | Aggregate BW |
|---|---|---|
| 8 | 0.055 s | 39.49 GB/s |
| 16 | 0.084 s | 38.27 GB/s |
| 24 | 0.078 s | 37.52 GB/s |
| 32 | 0.064 s | 27.48 GB/s |
- UCCL Collective ≈ NCCL in DDP/FSDP training.
- UCCL P2P scales linearly with sender count up to ~24 GPUs.
- Sustains 40+ GB/s aggregate GPU-to-GPU transfer throughput.
- Zero CPU involvement reduces policy lag and improves RL stability.
UCCL P2P uses a greedy byte-balanced routing strategy:
- Controller gathers parameter metadata
- For each `(parameter, inference_gpu)` pair, assign the training GPU with the lowest cumulative assigned bytes (see the sketch below)
- Broadcast routing table to all GPUs
- Trainers perform RDMA writes with no runtime coordination
This ensures:
- maximal parallelism
- balanced NIC usage
- minimized makespan
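A minimal sketch of the greedy byte-balanced assignment described above; `param_bytes`, `training_gpus`, and `inference_gpus` are illustrative inputs, not the repository's actual data structures:

```python
from collections import defaultdict

def build_routing_table(param_bytes, training_gpus, inference_gpus):
    """Map each (parameter, inference_gpu) pair to the training GPU with
    the lowest cumulative assigned bytes, balancing load across senders."""
    load = defaultdict(int)  # cumulative bytes assigned to each training GPU
    table = {}               # (param_name, inference_gpu) -> sender GPU
    for name, nbytes in param_bytes.items():
        for inf_gpu in inference_gpus:
            sender = min(training_gpus, key=lambda g: load[g])
            table[(name, inf_gpu)] = sender
            load[sender] += nbytes
    return table

# Example: two trainers, one inference GPU, three parameters (sizes in bytes).
print(build_routing_table({"wte": 8, "ln_f": 1, "head": 8}, [0, 1], [4]))
```

Because the table is computed once by the controller and broadcast ahead of time, every trainer knows its assignments up front and can issue its RDMA writes without any runtime coordination.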
Future work:
- FSDP-aware RDMA transfers (DTensor integration)
- Multi-NIC, topology-aware routing
- Scaling to 70B+ parameter models
- Integration with end-to-end RL frameworks (e.g., RLlib, vLLM-RL)
- Inference-stage overlapping of computation + transfers
See our full paper for complete citations.
Key references include:
- Journey to 2-Second RL Weight Transfer
- UCCL transport layer (Zhou et al., 2025)