Gather wrong order #198

@lawirz

This issue concerns the branch to resolve issue 196: https://github.com/Xilinx/ACCL/tree/196-reduceallreduce-issues-on-cyt_rdma

On two-node setups running on cyt_rdma, gather sometimes swaps the output of the first rank and the second rank. The error is not observed in the emulator setup. In hardware, it happens in only around 50% of runs.

Allgather, on the other hand, does not produce erroneous behaviour.

It only occurred after recompiling test/host/Coyote/test.cpp. The binary compiled from the previous version worked when run with a new bitstream.
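
The error pattern in the rank 0 log below (the first 24 items and the last 24 items are exchanged) looks like whole per-rank blocks landing in the wrong slot rather than individual elements being corrupted. A minimal sketch of a checker that makes that distinction explicit follows. It assumes the test pattern is that rank r contributes the values r*count + i, which is inferred from the log and not confirmed against test.cpp. `classify_gather` and `GatherLayout` are hypothetical names for illustration, not part of ACCL.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Outcome of comparing a gathered buffer against the expected layout.
enum class GatherLayout { Correct, BlocksPermuted, Corrupted };

// Assumption: rank r contributes r*count + i for i in [0, count), so the
// root expects 0, 1, ..., ranks*count - 1 in its receive buffer.
GatherLayout classify_gather(const std::vector<float>& result,
                             std::size_t ranks, std::size_t count) {
    assert(ranks > 0 && count > 0 && result.size() == ranks * count);

    std::vector<bool> seen(ranks, false);
    bool in_order = true;

    for (std::size_t slot = 0; slot < ranks; ++slot) {
        // Guess which rank's block landed in this slot from its first element.
        const float first = result[slot * count];
        if (!(first >= 0.0f) || first >= static_cast<float>(ranks * count)) {
            return GatherLayout::Corrupted;
        }
        const std::size_t src = static_cast<std::size_t>(first) / count;

        // The whole block must match that rank's contribution exactly.
        for (std::size_t i = 0; i < count; ++i) {
            if (result[slot * count + i] != static_cast<float>(src * count + i)) {
                return GatherLayout::Corrupted;
            }
        }
        // Each rank's block may appear only once.
        if (seen[src]) {
            return GatherLayout::Corrupted;
        }
        seen[src] = true;
        in_order = in_order && (src == slot);
    }
    return in_order ? GatherLayout::Correct : GatherLayout::BlocksPermuted;
}
```

Calling `classify_gather(result, 2, 24)` on the root's receive buffer would report `BlocksPermuted` for the failure shown in the log, and `Corrupted` if the data were damaged in some other way.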

Rank 0

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '7' '-c' '24' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:24 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 0] rank 0 size 2 alveo-u55c-04.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.80
10.253.74.92
Initializing QP connections...
Exchanging QP...
Local rank 0 sending local QP to remote rank 1
Local rank 0 receiving remote QP from remote rank 1
Queue Pair: id: 1
Local Queue: local: QPN 0x000002, PSN 0x2aec2a, VADDR 00007f1980200000, SIZE 00200000, IP 0x0afd4a50,
Remote Queue: remote: QPN 0x000001, PSN 0x6f7034, VADDR 00007f431ce00000, SIZE 00200000, IP 0x0afd4a5c,
rank: 0 FPGA IP: afd4a50
Rendezvous Protocol
sw nop time [us]:92.656
hw nop time [ns]:940
Start gather test with root 0...
Repetition 0
Pass accl barrier
host measured durationUs:42.371
1th item is incorrect! (24.000000 != 0.000000)
2th item is incorrect! (25.000000 != 1.000000)
3th item is incorrect! (26.000000 != 2.000000)
4th item is incorrect! (27.000000 != 3.000000)
5th item is incorrect! (28.000000 != 4.000000)
6th item is incorrect! (29.000000 != 5.000000)
7th item is incorrect! (30.000000 != 6.000000)
8th item is incorrect! (31.000000 != 7.000000)
9th item is incorrect! (32.000000 != 8.000000)
10th item is incorrect! (33.000000 != 9.000000)
11th item is incorrect! (34.000000 != 10.000000)
12th item is incorrect! (35.000000 != 11.000000)
13th item is incorrect! (36.000000 != 12.000000)
14th item is incorrect! (37.000000 != 13.000000)
15th item is incorrect! (38.000000 != 14.000000)
16th item is incorrect! (39.000000 != 15.000000)
17th item is incorrect! (40.000000 != 16.000000)
18th item is incorrect! (41.000000 != 17.000000)
19th item is incorrect! (42.000000 != 18.000000)
20th item is incorrect! (43.000000 != 19.000000)
21th item is incorrect! (44.000000 != 20.000000)
22th item is incorrect! (45.000000 != 21.000000)
23th item is incorrect! (46.000000 != 22.000000)
24th item is incorrect! (47.000000 != 23.000000)
1th item is incorrect! (0.000000 != 24.000000)
2th item is incorrect! (1.000000 != 25.000000)
3th item is incorrect! (2.000000 != 26.000000)
4th item is incorrect! (3.000000 != 27.000000)
5th item is incorrect! (4.000000 != 28.000000)
6th item is incorrect! (5.000000 != 29.000000)
7th item is incorrect! (6.000000 != 30.000000)
8th item is incorrect! (7.000000 != 31.000000)
9th item is incorrect! (8.000000 != 32.000000)
10th item is incorrect! (9.000000 != 33.000000)
11th item is incorrect! (10.000000 != 34.000000)
12th item is incorrect! (11.000000 != 35.000000)
13th item is incorrect! (12.000000 != 36.000000)
14th item is incorrect! (13.000000 != 37.000000)
15th item is incorrect! (14.000000 != 38.000000)
16th item is incorrect! (15.000000 != 39.000000)
17th item is incorrect! (16.000000 != 40.000000)
18th item is incorrect! (17.000000 != 41.000000)
19th item is incorrect! (18.000000 != 42.000000)
20th item is incorrect! (19.000000 != 43.000000)
21th item is incorrect! (20.000000 != 44.000000)
22th item is incorrect! (21.000000 != 45.000000)
23th item is incorrect! (22.000000 != 46.000000)
24th item is incorrect! (23.000000 != 47.000000)
48 errors!

ERROR: ACCL base functionality test failed!

STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	1
                Host writes sent: 	2
                 Card reads sent: 	1
                Card writes sent: 	1
                 Sync reads sent: 	5
                Sync writes sent: 	0
                     Page faults: 	0


NET STATS QSFP0

RX pkgs: 50
TX pkgs: 5
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 3
ROCE TX pkgs: 3
IBV RX pkgs: 6
IBV TX pkgs: 4
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0

Finalizing MPI...
Done. Terminating...
stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 256891
UID: 500207
[Wed May 29 21:24:18 2024 GMT]
HOST: alveo-u55c-04.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 4147289406 at 0x0
CCLO source commit (first 24b): f7329d
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f197f600000, Size: 64
calling offload: 7f197f600000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f197f400000, Size: 64
calling offload: 7f197f400000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f197f000000, Size: 4194304
calling offload: 7f197f000000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f197ec00000, Size: 4194304
calling offload: 7f197ec00000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f197e800000, Size: 4194304
calling offload: 7f197e800000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.80:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.92:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 0 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f197f600000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f197f400000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:96,n_pages:1
Allocation successful! Allocated buffer: 7f197e600000, Size: 96
CoyoteBuffer contructor called! page_size:2097152, buffer_size:192,n_pages:1
Allocation successful! Allocated buffer: 7f197e400000, Size: 192
Gather data from 0...
Free user buffer from cProc cPid:0, buffer_size:96,7f197e600000
Free user buffer from cProc cPid:0, buffer_size:192,7f197e400000
Communicator 0 (0x40):
local rank: 0 	 number of ranks: 2
> rank 0 (ip 10.253.74.80:5005 ; session 0 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.92:5005 ; session 2 ; max segment size 4194304) : <- inbound seq number 1, -> outbound seq number 0

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f197f600000 	 status: ENQUEUED 	 occupancy: 96/64 	 MPI tag: ffffffff 	 seq: 0 	 src: 1
Spare RX Buffer 1:	 address: 0x7f197f400000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7f197f600000
Free user buffer from cProc cPid:0, buffer_size:64,7f197f400000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f197f000000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f197ec00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f197e80000

Rank 1

stdout
Arguments: '../accl_on_coyote' '-d' '-f' '-r' '-z' '1' '-y' '7' '-c' '24' '-l' './accl_log/fpga' '-p' '1' '-n' '1' 
Running ACCL test in coyote...
Initializing MPI...
Reading MPI rank and size values...
Parsing options
Hardware rdma mode
count:24 rxbuf_size:4194304 seg_size:4194304 num_rxbufmem:2
Getting MPI Processor name...
[process 1] rank 1 size 2 alveo-u55c-07.inf.ethz.ch
Testing ACCL base functionality...
10.253.74.80
10.253.74.92
Initializing QP connections...
Exchanging QP...
Local rank 1 receiving remote QP from remote rank 0
Local rank 1 sending local QP to remote rank 0
Queue Pair: id: 0
Local Queue: local: QPN 0x000001, PSN 0x6f7034, VADDR 00007f431ce00000, SIZE 00200000, IP 0x0afd4a5c,
Remote Queue: remote: QPN 0x000002, PSN 0x2aec2a, VADDR 00007f1980200000, SIZE 00200000, IP 0x0afd4a50,
rank: 1 FPGA IP: afd4a5c
Rendezvous Protocol
sw nop time [us]:73.61
hw nop time [ns]:940
Start gather test with root 0...
Repetition 0
Pass accl barrier
host measured durationUs:91.063

ACCL base functionality test completed successfully!

-- STATISTICS - ID: 0
-----------------------------------------------
          Read command FIFO used: 	0
         Write command FIFO used: 	0
                 Host reads sent: 	1
                Host writes sent: 	0
                 Card reads sent: 	0
                Card writes sent: 	0
                 Sync reads sent: 	5
                Sync writes sent: 	0
                     Page faults: 	0


 -- NET STATS QSFP0

RX pkgs: 48
TX pkgs: 5
ARP RX pkgs: 2
ARP TX pkgs: 2
ICMP RX pkgs: 0
ICMP TX pkgs: 0
TCP RX pkgs: 0
TCP TX pkgs: 0
ROCE RX pkgs: 3
ROCE TX pkgs: 3
IBV RX pkgs: 4
IBV TX pkgs: 6
PSN drop cnt: 0
Retrans cnt: 0
TCP session cnt: 0
STRM down: 0

Finalizing MPI...
Done. Terminating...
stderr
XRT build version: 2.13.466
Build hash: f5505e402c2ca1ffe45eb6d3a9399b23a0dc8776
Build date: 2022-04-14 17:43:11
Git branch: 2022.1
PID: 286334
UID: 500207
[Wed May 29 21:24:18 2024 GMT]
HOST: alveo-u55c-07.inf.ethz.ch
EXE: /pub/scratch/lawirz/XACCL/integrations/pytorch_ddp/accl/test/host/Coyote/accl_on_coyote
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
[XRT] ERROR: No devices found
ACLL DEBUG: aquiring cProc: targetRegion: 0, cPid: 0
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 1
ACLL DEBUG: aquiring qProc: targetRegion: 0, cPid: 2
CCLO HWID: 4147289406 at 0x0
CCLO source commit (first 24b): f7329d
CCLO Capabilities:
Stack type: RDMA
Internal DMA:True
External DMA:False
Reduction:True
Compression:True
Kernel Streams:True
Debug:False
Doing a soft reset
Configuring Eager RX Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f431c000000, Size: 64
calling offload: 7f431c000000, size: 64
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:64,n_pages:1
Allocation successful! Allocated buffer: 7f4317e00000, Size: 64
calling offload: 7f4317e00000, size: 64
Configuring Rendezvous Spare Buffers
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f4317a00000, Size: 4194304
calling offload: 7f4317a00000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f4317600000, Size: 4194304
calling offload: 7f4317600000, size: 4194304
get_device_type: coyote_device
get_device_type: coyote_device
CoyoteBuffer contructor called! page_size:2097152, buffer_size:4194304,n_pages:2
Allocation successful! Allocated buffer: 7f4317200000, Size: 4194304
calling offload: 7f4317200000, size: 4194304
Configuring a communicator
Configuring arithmetic
Configuring collective tuning parameters
CCLO configured
Set timeout
Set max eager size: 64
Set max rendezvous reduce size: 4194304
Accelerator ready!
Communicator 0 (0x40):
local rank: 1 	 number of ranks: 2
> rank 0 (ip 10.253.74.80:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0
> rank 1 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

Rank 1 passed last barrier before test!
CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f431c000000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f4317e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

CoyoteBuffer contructor called! page_size:2097152, buffer_size:96,n_pages:1
Allocation successful! Allocated buffer: 7f4317000000, Size: 96
Gather data from 1...
Free user buffer from cProc cPid:0, buffer_size:96,7f4317000000
Communicator 0 (0x40):
local rank: 1 	 number of ranks: 2
> rank 0 (ip 10.253.74.80:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 1
> rank 1 (ip 10.253.74.92:5005 ; session 1 ; max segment size 4194304) : <- inbound seq number 0, -> outbound seq number 0

CCLO address: 0
rx address: 4
Spare RX Buffer 0:	 address: 0x7f431c000000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0
Spare RX Buffer 1:	 address: 0x7f4317e00000 	 status: ENQUEUED 	 occupancy: 0/64 	 MPI tag: 0 	 seq: 0 	 src: 0

Removing CCLO object at 0
Doing a soft reset
Free user buffer from cProc cPid:0, buffer_size:64,7f431c000000
Free user buffer from cProc cPid:0, buffer_size:64,7f4317e00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317a00000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317600000
Free user buffer from cProc cPid:0, buffer_size:4194304,7f4317200000

Labels: bug
