Skip to content

Broadcast hangs on cyt_rdma #202

@lawirz

Description

@lawirz

I observed similar behaviour with other collectives, but thus far only reproduced it with broadcast, so the title may be misleading. I will add comments of similar behaviour with other collectives here later

Calling Broadcast with 4MB hangs on the second rank.

Rank 0

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
rank: 0 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:93.336
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier
host measured durationUs:252146

ACCL base functionality test completed successfully!

-- STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	1
                Host writes sent: 	0
                 Card reads sent: 	0
                Card writes sent: 	0
                 Sync reads sent: 	5
                Sync writes sent: 	0
                     Page faults: 	0


 -- �[31m�[1mNET STATS�[0m�[0m QSFP0

RX pkgs: 738
TX pkgs: 1030
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 654
ROCE TX pkgs: 1028
IBV RX pkgs: 646
IBV TX pkgs: 66566
PSN drop cnt: 0
Retrans cnt: 384
TCP session cnt: 0
STRM down: 0


stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 92386
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fe00000, Size: 64
calling offload: 7fc95fe00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7fc95fc00000, Size: 64
calling offload: 7fc95fc00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f800000, Size: 4194304
calling offload: 7fc95f800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f400000, Size: 4194304
calling offload: 7fc95f400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95f000000, Size: 4194304
calling offload: 7fc95f000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 0 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fc95fe00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fc95fc00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7fc95ec00000, Size: 4194304
Broadcasting data from 0...
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95ec00000
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7fc95fe00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7fc95fc00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fe00000
Free user buffer from cProc cPid:0, buffer_size:64,7fc95fc00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f800000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f400000
Free user buffer from cProc cPid:0, buffer_size:4194304,7fc95f000000

Rank 1

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '5' '-c' '1048576' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:1048576 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 1] rank 1 size 2 alveo-u55c-08.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.92
10.253.74.96
Initializing QP connections...
Exchanging QP...
Local rank 1 receiving remote QP from remote rank 0
Local rank 1 sending local QP to remote rank 0
Queue Pair: id: 0
Local Queue: local: QPN 0x000001, PSN 0x4eed7e, VADDR 00007f2da6c00000, SIZE 00200000, IP 0x0afd4a60,
Remote Queue: remote: QPN 0x000002, PSN 0xcc1ea0, VADDR 00007fc964a00000, SIZE 00200000, IP 0x0afd4a5c,
rank: 1 FPGA IP: afd4a60
Rendezvous Protocol
sw nop time [us]:86.834
hw nop time [ns]:940
Start bcast test with root 0 ...
Repetition 0
Pass accl barrier

stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 90744
UID: 500207
[Wed Jun 19 10:50:52 2024 GMT]
HOST: alveo-u55c-08.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 3009117246 at 0x0
CCLO source commit (first 24b): b35b7c
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5e00000, Size: 64
calling offload: 7f2da5e00000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f2da5c00000, Size: 64
calling offload: 7f2da5c00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5800000, Size: 4194304
calling offload: 7f2da5800000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5400000, Size: 4194304
calling offload: 7f2da5400000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da5000000, Size: 4194304
calling offload: 7f2da5000000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 1 	 number of ranks: 2
> rank 0 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.96:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 1 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f2da5e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f2da5c00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f2da4c00000, Size: 4194304
Getting broadcast data from 0...

Running smaller Broadcast operations even if above Rendezvous-threshhold works. When I ran with 128 elements(which is above the threshhold), I broke a machine, though(successive bitstream flashing failed), but this might just have been bad luck.

The other collective I experienced issues with is allreduce, there I get hangs too, but this might be completly unrelated.

Generally, the errors seem to occur, at certain sizes or after a certain amount of repetitions. It might just be a delay after which the machine hangs, as I got hangs in instances, where there isn't even an ACCL collective running. This happened in conjunction with allreduce, and I have trouble reproducing it.

I'm running it on the 200-allreduce-hangs... branch, but I had the same behaviour on the 196 merge commit. I'm fairly confident everything worked before the merge of the 196-fix, but I can try to verify it. I certainly was able to run almost all collectives on HW, sometime before I entered the 196 issue merge.

Everything works in Simulator, in a variety of scenarios.

Metadata

Metadata

Assignees

Labels

bugSomething isn't working

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions