feat: add StatelessProcessGroup to extend collective library #66
Conversation
Force-pushed from 1b27b3f to f989a80.
@weixiao-huang @HubertZhang Please review this PR. Test it on both NPU and CUDA, and test the same model using the default torch.distributed module.
It seems this PR has to depend on vLLM, which is heavy and not an elegant approach, I think.
The default communication method is still torch.distributed; StatelessProcessGroup is only needed when communicating across resources. If this isn't supported, it can't be merged into verl 😆, since the disaggregated training/inference architecture wouldn't be supported.
Would it be better to design a protocol DistributedLib and pass a `dist: DistributedLib` into ps? The current import-based approach doesn't feel sufficiently isolated.
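A minimal sketch of what that injected abstraction could look like; the protocol's method set and the `ParameterServer` constructor here are assumptions for illustration, not the actual checkpoint-engine API:

```python
from typing import Protocol

import torch


class DistributedLib(Protocol):
    """Hypothetical protocol covering the collective calls ps relies on."""

    def init_process_group(self, *args, **kwargs) -> None: ...
    def all_reduce(self, tensor: torch.Tensor, *args, **kwargs) -> None: ...
    def broadcast(self, tensor: torch.Tensor, src: int, *args, **kwargs) -> None: ...


class ParameterServer:
    # The backend is injected rather than imported, so torch.distributed
    # and the vLLM-backed implementation stay fully isolated from ps.
    def __init__(self, dist: DistributedLib) -> None:
        self._dist = dist
```

With this shape, either `torch.distributed` itself or a StatelessProcessGroup-backed implementation could be passed in without changing ps.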
Added a new `checkpoint_engine.distributed` module.
In most cases, the logic remains consistent with before; a dependency on vLLM is only needed when the custom distributed module is required. This does not change the fact that checkpoint-engine is a lightweight component.
I tried it locally with:

```python
# import torch.distributed as dist
import checkpoint_engine.distributed as dist

dist.init_process_group()
dist.all_reduce()
dist.xxxx()
```

If you need to use …
2. Cache the UUID in the inference engine (see the sketch below).
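A minimal sketch of that caching idea, assuming a hypothetical in-engine cache keyed by the group's UUID so repeated update requests reuse an already-initialized communicator; the names here are illustrative, not the actual checkpoint-engine API:

```python
from typing import Callable

# Hypothetical cache: group UUID -> initialized communicator handle.
_comm_cache: dict[str, object] = {}


def get_or_create_comm(group_uuid: str, create_fn: Callable[[], object]) -> object:
    """Reuse the communicator for a known group UUID instead of
    re-initializing it on every weight-update request."""
    if group_uuid not in _comm_cache:
        _comm_cache[group_uuid] = create_fn()
    return _comm_cache[group_uuid]
```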
If there are no more review comments, can this be merged?
By the way, should we abstract StatelessProcessGroup rather than dist? Calling well-encapsulated high-level methods directly in ps looks much more convenient. I imagine the sub-group part might be a bit more complex, but everything else should be much simpler.
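A rough sketch of that alternative, assuming a hypothetical wrapper that owns the group and a communicator bound to it, so ps calls high-level methods directly:

```python
import torch


class CollectiveGroup:
    """Hypothetical high-level wrapper used directly inside ps."""

    def __init__(self, group, comm) -> None:
        self._group = group  # e.g. a StatelessProcessGroup
        self._comm = comm    # e.g. a PyNcclCommunicator bound to the group

    def broadcast(self, tensor: torch.Tensor, src: int) -> None:
        self._comm.broadcast(tensor, src=src)

    def new_subgroup(self, ranks: list[int]) -> "CollectiveGroup":
        # The sub-group path is the part expected to be trickier;
        # deliberately left unimplemented in this sketch.
        raise NotImplementedError
```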
Motivation
Add a stateless communication group, to enable more flexible creation of communication groups and to resolve compatibility issues with other programs that also use torch.distributed. vLLM is currently supported, while SGLang does not yet support pyhccl; this feature depends on adding pyhccl to SGLang. If the current approach is acceptable, we will provide an SGLang version soon.
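For reference, a minimal sketch of why a stateless group sidesteps those compatibility issues, using vLLM's StatelessProcessGroup (host, port, rank, and world size are placeholder values):

```python
import torch
from vllm.distributed.utils import StatelessProcessGroup
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator

# Unlike torch.distributed.init_process_group, this does not touch the
# global default process group, so it can coexist with another program
# that has already initialized torch.distributed.
group = StatelessProcessGroup.create(
    host="127.0.0.1", port=29500, rank=0, world_size=2
)
comm = PyNcclCommunicator(group, device=torch.device("cuda:0"))
```

On NPU, a pyhccl-based communicator would play the PyNcclCommunicator role here, which is the piece SGLang still lacks.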