
Conversation

@kip-cxj
Contributor

@kip-cxj kip-cxj commented Dec 16, 2025

Motivation

Add a stateless communication group, to enable more flexible creation of communication groups and to resolve compatibility issues with other programs that also use torch.distributed.
vLLM is currently supported; SGLang is not yet, because this feature depends on pyhccl being added to SGLang.
If the current approach is acceptable, we will provide the SGLang version soon.

@kip-cxj kip-cxj force-pushed the main branch 2 times, most recently from 1b27b3f to f989a80 on December 17, 2025 07:12
@kip-cxj kip-cxj changed the title draft: add collective communication for npu draft: add stateless communication for npu Dec 30, 2025
@x1314aq

x1314aq commented Jan 7, 2026

@weixiao-huang @HubertZhang pls review this PR

Tested on both NPU and CUDA:

| Model | Device Info | device_type | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|---|
| Qwen3-8b | 8xNvidia-A100 TP4 | cuda | 0.01s | 1.28s (1.46GiB) | 7.81s (1.72GiB) |
| Qwen3-8b | 8xAscend-A3 TP4 | npu | 0.02s | 1.37s (1.59GiB) | 2.02s (1.47GiB) |

Tested the same model using the default torch.distributed module:

| Model | Device Info | device_type | GatherMetas | Update (Broadcast) | Update (P2P) |
|---|---|---|---|---|---|
| Qwen3-8b | 8xNvidia-A100 TP4 | torch | 0.01s | 1.15s (1.46GiB) | 7.68s (1.71GiB) |
| Qwen3-8b | 8xAscend-A3 TP4 | torch | 0.03s | 1.44s (1.59GiB) | 3.83s (1.46GiB) |

@kip-cxj kip-cxj changed the title draft: add stateless communication for npu feat: Replace torch.distributed with StatelessProcessGroup Jan 8, 2026
@kip-cxj kip-cxj changed the title feat: Replace torch.distributed with StatelessProcessGroup feat: add StatelessProcessGroup to extend collective library Jan 8, 2026
@weixiao-huang
Collaborator

It seems this PR makes us depend on vLLM, which is heavy and not an elegant approach. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

@hanhan-networking

> It seems this PR makes us depend on vLLM, which is heavy and not an elegant approach. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

The default communication path is still torch.distributed; StatelessProcessGroup is only needed when communicating across resources. Without support for this, it can't be merged into verl 😆, since the disaggregated training/inference architecture wouldn't be supported.
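For reference, a rough sketch of what the cross-resource path could look like with vLLM's StatelessProcessGroup (the import path and the create/broadcast_obj signatures follow my reading of vLLM and may differ across versions; host/port are placeholders agreed on out of band):

from vllm.distributed import StatelessProcessGroup

# Trainer ranks and inference ranks join one rendezvous that is independent
# of either program's default torch.distributed world.
group = StatelessProcessGroup.create(
    host="10.0.0.1",      # placeholder rendezvous address
    port=29600,           # placeholder rendezvous port
    rank=rank,            # global rank across both programs
    world_size=world_size,
)

# CPU/object collectives go through the TCPStore-backed group, without
# touching either program's existing process groups.
metas = group.broadcast_obj(metas if rank == 0 else None, src=0)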

@HubertZhang
Collaborator

Would it be better to design a protocol DistributedLib and pass a dist: DistributedLib into ps? The current import-based approach doesn't feel sufficiently isolated.
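A minimal sketch of this protocol-based injection (DistributedLib and its method set below are hypothetical, just the subset ps.py would need):

from typing import Any, Protocol

class DistributedLib(Protocol):
    # Hypothetical protocol: only the torch.distributed surface used by ps.py.
    def init_process_group(self, **kwargs: Any) -> None: ...
    def broadcast(self, tensor: Any, src: int = 0) -> None: ...
    def barrier(self) -> None: ...

class ParameterServer:
    def __init__(self, dist: DistributedLib):
        # The concrete backend (torch.distributed or a stateless wrapper) is
        # injected, so ps.py itself never imports vLLM.
        self.dist = dist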

@x1314aq

x1314aq commented Jan 12, 2026

> Would it be better to design a protocol DistributedLib and pass a dist: DistributedLib into ps? The current import-based approach doesn't feel sufficiently isolated.

Added a dist_wrapper.py file that holds the dist-handling logic, so ps.py and update.py can use the dist module directly via from dist_wrapper import dist without caring about the concrete implementation.

@x1314aq

x1314aq commented Jan 12, 2026

> It seems this PR makes us depend on vLLM, which is heavy and not an elegant approach. I think ps.py should be a lightweight component that does not depend on other heavy frameworks.

In most cases the logic stays the same as before; vLLM is only needed when the custom distributed module is requested. It does not change the fact that checkpoint-engine remains a lightweight component.

@x1314aq

x1314aq commented Jan 13, 2026

> Would it be better to design a protocol DistributedLib and pass a dist: DistributedLib into ps? The current import-based approach doesn't feel sufficiently isolated.

> Added a dist_wrapper.py file that holds the dist-handling logic, so ps.py and update.py can use the dist module directly via from dist_wrapper import dist without caring about the concrete implementation.

I tried the dist_wrapper.py wrapper locally and ran into problems, so it now wraps torch.distributed inside distributed/base.py instead. Usage-wise, you only need to replace import torch.distributed as dist with import checkpoint_engine.distributed as dist:

# import torch.distributed as dist   # previous usage
import checkpoint_engine.distributed as dist  # drop-in replacement

dist.init_process_group()
dist.all_reduce(...)
dist.xxxx()  # any other torch.distributed API used by ps.py / update.py

If the custom distributed module is needed, just pass custom_dist=True to ps.
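A rough sketch of the wrapper idea (the module layout and backend switch below are my guess at the intent, not the actual code in this PR):

# checkpoint_engine/distributed/base.py (hypothetical sketch, re-exported by __init__.py)
import torch.distributed as _torch_dist

# Default backend: plain torch.distributed, so the import swap is a no-op.
_backend = _torch_dist

def set_backend(backend) -> None:
    # Called when the parameter server is created with custom_dist=True,
    # e.g. to install a StatelessProcessGroup-based implementation.
    global _backend
    _backend = backend

def __getattr__(name: str):
    # PEP 562 module-level __getattr__: dist.all_reduce, dist.barrier, ...
    # are forwarded to whichever backend is active.
    return getattr(_backend, name)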

@x1314aq

x1314aq commented Jan 15, 2026

@weixiao-huang @HubertZhang

If there are no more review comments, can this be merged?

@HubertZhang
Collaborator

HubertZhang commented Jan 15, 2026

By the way, should we abstract StatelessProcessGroup rather than dist? Using the wrapped high-level methods directly in ps looks much more convenient. I imagine the sub-group part might be a bit more complex, but everything else should be much simpler:

# dist/vllm.py
from vllm.distributed import StatelessProcessGroup
VLLMStatelessProcessGroup = StatelessProcessGroup

# ps.py
class ParameterServer:
    def __init__(self, group):
        self.group = group
        ...
    def gather(self):
        self.group.broadcast(self.metas)
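As a usage sketch of that abstraction (the torch-backed wrapper class below is hypothetical), ps could then be wired like:

# hypothetical wiring in the caller
if custom_dist:
    group = VLLMStatelessProcessGroup.create(
        host=host, port=port, rank=rank, world_size=world_size
    )
else:
    group = TorchProcessGroupWrapper(rank=rank, world_size=world_size)  # hypothetical wrapper

ps = ParameterServer(group)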
