
Abnormal behavior of iluvatar_gpu on the bincount operator #2358

@PlumBlossomMaid


While using an Iluvatar BI-V150S card on AI Studio, I ran the following code:

import paddle
input = paddle.ones([100],dtype=paddle.int64)
# On CPU, the third line of code (the bincount call below) returns almost instantly.
paddle.bincount(x=input,minlength=100) # !!!
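
For reference, the result this call should produce can be stated without any device: a length-100 count vector whose entry at index 1 is 100 and whose other entries are 0. Below is a minimal pure-Python sketch of the bincount semantics, useful for checking the CPU output. The helper name `reference_bincount` is hypothetical and is not part of Paddle.

```python
def reference_bincount(values, minlength=0):
    """Count occurrences of each non-negative integer in `values`.

    Hypothetical reference helper, not a Paddle API.
    """
    if any(v < 0 for v in values):
        raise ValueError("bincount expects non-negative integers")
    # Output length is max(largest value + 1, minlength).
    size = max(max(values, default=-1) + 1, minlength)
    counts = [0] * size
    for v in values:
        counts[v] += 1
    return counts


# Same input as the repro: one hundred ones, minlength=100.
expected = reference_bincount([1] * 100, minlength=100)
print(len(expected), expected[1], sum(expected))  # prints "100 100 100"
```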

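When the bincount operator runs, the accelerator shows some utilization and some device memory is allocated, but the call hangs and never returns.
Interrupting it with Ctrl + C at that point produces the following error: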

>>> paddle.bincount(x=input,minlength=100)
^C

--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::pybind::eager_api_bincount(_object*, _object*, _object*)
1   bincount_ad_func(paddle::Tensor const&, paddle::optional<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor>, paddle::optional<paddle::Tensor*>)
2   paddle::experimental::bincount(paddle::Tensor const&, paddle::optional<paddle::Tensor> const&, paddle::experimental::ScalarBase<paddle::Tensor> const&, paddle::optional<paddle::Tensor*>)
3   void phi::BincountCUDAInner<phi::CustomContext, long, long>(phi::CustomContext const&, phi::DenseTensor const&, paddle::optional<phi::DenseTensor> const&, long, phi::DenseTensor*)
4   void phi::Copy<phi::CustomContext>(phi::CustomContext const&, phi::DenseTensor const&, phi::Place, bool, phi::DenseTensor*)
5   phi::memory_utils::Copy(phi::Place const&, void*, phi::Place const&, void const*, unsigned long, void*)
6   phi::MemoryUtils::Copy(phi::Place const&, void*, phi::Place const&, void const*, unsigned long, void*)
7   void paddle::memory::Copy<phi::Place, phi::Place>(phi::Place, void*, phi::Place, void const*, unsigned long, void*)
8   void paddle::memory::Copy<phi::CPUPlace, phi::CustomPlace>(phi::CPUPlace, void*, phi::CustomPlace, void const*, unsigned long, void*)
9   phi::CustomDevice::MemoryCopyD2H(unsigned long, void*, void const*, unsigned long, phi::stream::Stream const*)
10  phi::CustomDevice::SynchronizeStream(unsigned long, void*)
11  SyncStream(C_Device_st*, C_Stream_st*)

----------------------
Error Message Summary:
----------------------
FatalError: `Termination signal` is detected by the operating system.
  [TimeInfo: *** Aborted at 1768708118 (unix time) try "date -d @1768708118" if you are using GNU date ***]
  [SignalInfo: *** SIGTERM (@0x3e8000200df) received by PID 131295 (TID 0x7f727ecdd780) from PID 131295 ***]

Terminated

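Judging from the traceback, the call is stuck in stream synchronization (SyncStream) during the device-to-host copy issued from phi::BincountCUDAInner, so the bincount operator's implementation on this backend really does seem to have a problem...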
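
To reproduce the hang without blocking an interactive session, the snippet can be run in a child process with a timeout. This is a minimal sketch under a few assumptions: `run_with_timeout` and `REPRO` are hypothetical names, the 60-second limit is arbitrary, and the stuck child is assumed to be killable once the timeout fires.

```python
import subprocess
import sys


def run_with_timeout(code, timeout_s=30.0):
    """Run `code` in a fresh Python process.

    Returns (finished, returncode, stdout); `finished` is False when the
    child did not exit within `timeout_s` seconds.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        return False, None, ""
    return True, proc.returncode, proc.stdout


# Same calls as the snippet in this issue, wrapped for the child process.
REPRO = """
import paddle
x = paddle.ones([100], dtype=paddle.int64)
print(paddle.bincount(x=x, minlength=100))
"""
```

Calling `run_with_timeout(REPRO, timeout_s=60.0)` on the affected machine should return `finished == False` if the hang reproduces. A `finished == True` result with a non-zero `returncode` means the child failed for another reason, such as a missing paddle installation.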
