Completes bfloat16 dtype for collective api in eager mode#45844
sljlp merged 7 commits into PaddlePaddle:develop
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
Without this line it errors out; after adding it, it runs 🤔 Judging from yesterday's test results, it seems to work fine.
No, without this line it will definitely error out. As far as I can tell, Gloo doesn't support bf16 internally, so I'm curious why this can pass the tests.
Is it possible that Paddle's bf16 tensor actually holds uint16 under the hood? All signs suggest it doesn't really use bf16 on the host; since the data is actually uint16, it can run.
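A minimal sketch of why a uint16 carrier works: bfloat16 is exactly the high 16 bits of an IEEE float32, so the bits round-trip losslessly through a 16-bit integer. This is a pure-Python illustration of the bit layout, not Paddle's actual implementation:

```python
import struct

def float32_to_bf16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its high 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # fits in a uint16

def bf16_bits_to_float32(bits: int) -> float:
    """Reinterpret 16 stored bits as the high half of a float32."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return x

v = 3.140625  # exactly representable in bfloat16 (7 fraction bits)
b = float32_to_bf16_bits(v)
assert 0 <= b <= 0xFFFF
assert bf16_bits_to_float32(b) == v
```

Because the payload is just 16 raw bits, a backend that treats the buffer as uint16 can move bf16 data around correctly, as long as no arithmetic (e.g. a reduction) is done on it in that type.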
Then what's the difference between NCCL's native bfloat16 support and just transferring the data as uint16?
NCCL seems to check the CUDA version to decide whether bf16 can be used, whereas Gloo may just use uint16 directly?
As the code below shows, they use uint16 to represent bf16 for some reason 🤔
Paddle/paddle/phi/common/bfloat16.h
Lines 74 to 79 in 75528ad
And it seems that we cannot use `to_tensor` or `cast` to get a uint16 tensor.
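As a hypothetical workaround outside of Paddle's API, the bf16 bit patterns can be extracted from a float32 buffer with a NumPy view; this sketch assumes a little-endian host, where the high half of each float32 is the second uint16:

```python
import numpy as np

# Hypothetical illustration (not a Paddle API): reinterpret the raw bits
# of a float32 array and keep the high 16 bits, i.e. the bf16 payload.
a = np.array([1.0, -2.0, 3.140625], dtype=np.float32)

# Little-endian assumption: each float32 splits into (low, high) uint16.
bf16_bits = a.view(np.uint16).reshape(-1, 2)[:, 1]
print(bf16_bits)  # uint16 array holding the bf16 bit patterns
```

On a big-endian host the index would be `[:, 0]` instead; a portable version would branch on `sys.byteorder`.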
This issue mentions the uint16 problem: #34927
* Support both use_calc_stream and sync_op in send recv APIs (#46023)
* Support both use_calc_stream and sync_op in allgather API (#46295)
* Support both use_calc_stream and sync_op in collective communication API (#46761)
* Move group and all reduce from collective to communication (#45848)
* Completes bfloat16 dtype for collective api in eager mode (#45844)
* Fix collective APIs cannot be recognized when building docs (#46962)

Co-authored-by: LiYuRio <63526175+LiYuRio@users.noreply.github.com>
PR types
New features
PR changes
OPs
Describe
This PR completes the basic functionality of the communication framework, adding support for a rich set of data types in communication operations.