
Conversation

@kip-cxj
Contributor

@kip-cxj kip-cxj commented Dec 16, 2025

Motivation

Add a stateless communication group, to enable more flexible creation of communication groups and to resolve compatibility issues with other programs that also use torch.distributed.
vLLM is currently supported; SGLang is not yet, because this feature depends on pyhccl being added to SGLang.
If the current approach is acceptable, we will provide the SGLang version soon.

@kip-cxj kip-cxj force-pushed the main branch 2 times, most recently from 1b27b3f to f989a80 on December 17, 2025 07:12
@kip-cxj kip-cxj changed the title draft: add collective communication for npu draft: add stateless communication for npu Dec 30, 2025
@x1314aq

x1314aq commented Jan 7, 2026

@weixiao-huang @HubertZhang pls review this PR

Tested on both NPU and CUDA:

| Model | Device Info | device_type | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|---|
| Qwen3-8b | 8xNvidia-A100 TP4 | cuda | 0.01s | 1.28s (1.46GiB) | 7.81s (1.72GiB) |
| Qwen3-8b | 8xAscend-A3 TP4 | npu | 0.02s | 1.37s (1.59GiB) | 2.02s (1.47GiB) |

Tested the same model using the default torch.distributed module:

| Model | Device Info | device_type | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|---|
| Qwen3-8b | 8xNvidia-A100 TP4 | torch | 0.01s | 1.15s (1.46GiB) | 7.68s (1.71GiB) |
| Qwen3-8b | 8xAscend-A3 TP4 | torch | 0.03s | 1.44s (1.59GiB) | 3.83s (1.46GiB) |

@kip-cxj kip-cxj changed the title draft: add stateless communication for npu feat: Replace torch.distributed with StatelessProcessGroup Jan 8, 2026
@kip-cxj kip-cxj changed the title feat: Replace torch.distributed with StatelessProcessGroup feat: add StatelessProcessGroup to extend collective library Jan 8, 2026
@weixiao-huang
Collaborator

It seems this PR makes us depend on vLLM, which is heavy and not an elegant approach. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

@hanhan-networking

> It seems this PR makes us depend on vLLM, which is heavy and not an elegant approach. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

The default communication path is still torch.distributed; StatelessProcessGroup is only needed when communicating across resources. Without support for this, it can't be merged into verl 😆, since the disaggregated training/inference architecture wouldn't be supported.
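For reference, a rough sketch of what the cross-resource path could look like with vLLM's StatelessProcessGroup (the import path and the create/broadcast_obj signatures follow my reading of vLLM and may differ across versions; host/port are placeholders agreed on out of band):

from vllm.distributed import StatelessProcessGroup

# Trainer ranks and inference ranks join one rendezvous that is independent
# of either program's default torch.distributed world.
group = StatelessProcessGroup.create(
    host="10.0.0.1",      # placeholder rendezvous address
    port=29600,           # placeholder rendezvous port
    rank=rank,            # global rank across both programs
    world_size=world_size,
)

# CPU/object collectives go through the TCPStore-backed group, without
# touching either program's existing process groups.
metas = group.broadcast_obj(metas if rank == 0 else None, src=0)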

@HubertZhang
Collaborator

Would it be better to design a protocol DistributedLib and pass a dist: DistributedLib into ps? The current import-based approach doesn't feel sufficiently isolated.
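A minimal sketch of this protocol-based injection (DistributedLib and its method set below are hypothetical, just the subset ps.py would need):

from typing import Any, Protocol

class DistributedLib(Protocol):
    # Hypothetical protocol: only the torch.distributed surface used by ps.py.
    def init_process_group(self, **kwargs: Any) -> None: ...
    def broadcast(self, tensor: Any, src: int = 0) -> None: ...
    def barrier(self) -> None: ...

class ParameterServer:
    def __init__(self, dist: DistributedLib):
        # The concrete backend (torch.distributed or a stateless wrapper) is
        # injected, so ps.py itself never imports vLLM.
        self.dist = dist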

@x1314aq

x1314aq commented Jan 12, 2026

> Would it be better to design a protocol DistributedLib and pass a dist: DistributedLib into ps? The current import-based approach doesn't feel sufficiently isolated.

Added a dist_wrapper.py file that holds the dist-handling logic, so ps.py and update.py can use the dist module directly via from dist_wrapper import dist without caring about the concrete implementation.

@x1314aq

x1314aq commented Jan 12, 2026

> It seems this PR makes us depend on vLLM, which is heavy and not an elegant approach. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

In most cases the logic stays the same as before; vLLM is only needed when the custom distributed module is requested. It does not change the fact that checkpoint-engine remains a lightweight component.

@x1314aq

x1314aq commented Jan 13, 2026

> Would it be better to design a protocol DistributedLib and pass a dist: DistributedLib into ps? The current import-based approach doesn't feel sufficiently isolated.

> Added a dist_wrapper.py file that holds the dist-handling logic, so ps.py and update.py can use the dist module directly via from dist_wrapper import dist without caring about the concrete implementation.

I tried the dist_wrapper.py wrapper locally and ran into problems, so it now wraps torch.distributed inside distributed/base.py instead. Usage-wise, you only need to replace import torch.distributed as dist with import checkpoint_engine.distributed as dist:

# import torch.distributed as dist   # previous usage
import checkpoint_engine.distributed as dist  # drop-in replacement

dist.init_process_group()
dist.all_reduce(...)
dist.xxxx()  # any other torch.distributed API used by ps.py / update.py

If the custom distributed module is needed, just pass custom_dist=True to ps.
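A rough sketch of the wrapper idea (the module layout and backend switch below are my guess at the intent, not the actual code in this PR):

# checkpoint_engine/distributed/base.py (hypothetical sketch, re-exported by __init__.py)
import torch.distributed as _torch_dist

# Default backend: plain torch.distributed, so the import swap is a no-op.
_backend = _torch_dist

def set_backend(backend) -> None:
    # Called when the parameter server is created with custom_dist=True,
    # e.g. to install a StatelessProcessGroup-based implementation.
    global _backend
    _backend = backend

def __getattr__(name: str):
    # PEP 562 module-level __getattr__: dist.all_reduce, dist.barrier, ...
    # are forwarded to whichever backend is active.
    return getattr(_backend, name)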

@x1314aq

x1314aq commented Jan 15, 2026

@weixiao-huang @HubertZhang

If there are no more review comments, can this be merged?

@HubertZhang
Collaborator

HubertZhang commented Jan 15, 2026

By the way, should we abstract StatelessProcessGroup rather than dist? Using the wrapped high-level methods directly in ps looks much more convenient. I imagine the sub-group part might be a bit more complex, but everything else should be much simpler:

# dist/vllm.py
from vllm.distributed import StatelessProcessGroup
VLLMStatelessProcessGroup = StatelessProcessGroup

# ps.py
class ParameterServer:
    def __init__(self, group):
        self.group = group
        ...
    def gather(self):
        self.group.broadcast(self.metas)
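As a usage sketch of that abstraction (the torch-backed wrapper class below is hypothetical), ps could then be wired like:

# hypothetical wiring in the caller
if custom_dist:
    group = VLLMStatelessProcessGroup.create(
        host=host, port=port, rank=rank, world_size=world_size
    )
else:
    group = TorchProcessGroupWrapper(rank=rank, world_size=world_size)  # hypothetical wrapper

ps = ParameterServer(group)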
