prov/rxd: optimize TX critical path#11967

Open
miharulidze wants to merge 10 commits into ofiwg:main from miharulidze:rxd-opt

@miharulidze
Contributor

This PR introduces several optimizations to the RxD provider datapath to improve throughput and latency:

  1. Avoid payload memory copy on the TX path:
  • use the fi_sendv primitive of the DGRAM provider to post the header and payload as two iov entries
  • register the user send buffer in the DGRAM provider domain
  2. Batched CQ polling:
  • replace spinning on the DGRAM EP CQ in a loop with batched CQ polling
  • split the TX and RX CQs
  3. Don't generate completions for control path packets
  4. Avoid the ofi_gettime_ms call when the timestamp is not used
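
To illustrate item 1, here is a minimal, self-contained sketch of describing a header and a user payload as two iov entries so the payload is never copied into a staging buffer. It is not the code from this PR: `demo_hdr`, `demo_build_iov`, and `demo_gather` are hypothetical names, the header layout is invented, and `demo_gather` only stands in for the gather that the DGRAM provider would perform after a call such as `fi_sendv(ep, iov, desc, 2, dest_addr, context)`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>	/* struct iovec */

/* Hypothetical packet header; the real RxD header layout differs. */
struct demo_hdr {
	uint32_t seq_no;
	uint32_t payload_len;
};

/* Describe header and user payload as two iov entries without copying
 * the payload. Returns the number of entries filled (always 2). */
size_t demo_build_iov(struct demo_hdr *hdr, const void *payload,
		      size_t len, struct iovec iov[2])
{
	assert(len <= UINT32_MAX);
	hdr->payload_len = (uint32_t) len;

	iov[0].iov_base = hdr;
	iov[0].iov_len = sizeof(*hdr);
	iov[1].iov_base = (void *) payload;	/* points at the user buffer */
	iov[1].iov_len = len;
	return 2;
}

/* Stand-in for the wire: gathers the iov entries into one datagram.
 * Returns the datagram length, or 0 if it does not fit in wire_len. */
size_t demo_gather(const struct iovec *iov, size_t count,
		   uint8_t *wire, size_t wire_len)
{
	size_t off = 0;

	for (size_t i = 0; i < count; i++) {
		if (off + iov[i].iov_len > wire_len)
			return 0;
		memcpy(wire + off, iov[i].iov_base, iov[i].iov_len);
		off += iov[i].iov_len;
	}
	return off;
}
```

In this sketch the only copy happens inside the stand-in gather; the sender side just fills two pointer/length pairs. With a real DGRAM provider the user buffer would additionally need a valid memory registration descriptor in `desc`, which is what the second bullet of item 1 is about.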

Baseline performance on CX-5 100G testbed:

[mkhalilo@slimfly24 libfabric]$ FI_OFI_RXD_MAX_UNACKED=256 fi_rdm_bw_mt -p "verbs;ofi_rxd" -s 148.187.111.34 --pin-core 0 -l -b -S all -n 1
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
1       1k      1000        0.00s      0.33       3.05       0.33
2       1k      1.9k        0.00s      0.67       2.99       0.33
3       1k      2.9k        0.00s      1.03       2.91       0.34
4       1k      3.9k        0.00s      1.39       2.88       0.35
6       1k      5.8k        0.00s      2.05       2.92       0.34
8       1k      7.8k        0.00s      2.77       2.89       0.35
12      1k      11k         0.00s      4.18       2.87       0.35
16      1k      15k         0.00s      5.54       2.89       0.35
24      1k      23k         0.00s      8.31       2.89       0.35
32      1k      31k         0.00s     11.25       2.84       0.35
48      1k      46k         0.00s     16.68       2.88       0.35
64      1k      62k         0.00s     22.09       2.90       0.35
96      1k      93k         0.00s     32.93       2.91       0.34
128     1k      125k        0.00s     43.79       2.92       0.34
192     1k      187k        0.00s     56.72       3.38       0.30
256     1k      250k        0.00s     74.88       3.42       0.29
384     1k      375k        0.00s    113.54       3.38       0.30
512     1k      500k        0.00s    148.45       3.45       0.29
768     1k      750k        0.00s    216.95       3.54       0.28
1k      1k      1000k       0.00s    284.76       3.60       0.28
1.5k    1k      1.4m        0.00s    399.48       3.85       0.26
2k      1k      1.9m        0.00s    514.70       3.98       0.25
3k      1k      2.9m        0.00s    703.46       4.37       0.23
4k      1k      3.9m        0.01s    767.90       5.33       0.19
6k      1k      5.8m        0.01s   1124.66       5.46       0.18
8k      1k      7.8m        0.01s   1277.80       6.41       0.16
12k     1k      11m         0.01s   1640.81       7.49       0.13
16k     1k      15m         0.01s   2013.02       8.14       0.12
24k     1k      23m         0.01s   2671.89       9.20       0.11
32k     1k      31m         0.01s   3055.29      10.73       0.09
48k     1k      46m         0.01s   3818.52      12.87       0.08
64k     1k      62m         0.01s   4395.14      14.91       0.07
96k     1k      93m         0.02s   5157.34      19.06       0.05
128k    1k      125m        0.02s   5633.63      23.27       0.04
192k    1k      187m        0.03s   6289.04      31.26       0.03
256k    1k      250m        0.04s   6557.53      39.98       0.03
384k    1k      375m        0.06s   6688.60      58.79       0.02
512k    1k      500m        0.08s   6533.02      80.25       0.01
768k    1k      750m        0.13s   6239.89     126.03       0.01
1m      1k      1000m       0.23s   4493.25     233.37       0.00
1.5m    1k      1.4g        0.28s   5576.86     282.03       0.00
2m      1k      1.9g        0.41s   5123.24     409.34       0.00
3m      1k      2.9g        0.58s   5390.29     583.59       0.00
4m      1k      3.9g        0.80s   5226.62     802.49       0.00
6m      1k      5.8g        1.10s   5726.00    1098.75       0.00
8m      1k      7.8g        1.50s   5593.55    1499.69       0.00

Performance after this PR:

[mkhalilo@slimfly24 libfabric]$ FI_OFI_RXD_MAX_UNACKED=256 fi_rdm_bw_mt -p "verbs;ofi_rxd" -s 148.187.111.34 --pin-core 0 -l -b -S all -n 1
bytes   iters   total       time     MB/sec    usec/xfer   Mxfers/sec
1       1k      1000        0.00s      0.33       3.02       0.33
2       1k      1.9k        0.00s      0.68       2.95       0.34
3       1k      2.9k        0.00s      1.03       2.92       0.34
4       1k      3.9k        0.00s      1.39       2.88       0.35
6       1k      5.8k        0.00s      2.09       2.88       0.35
8       1k      7.8k        0.00s      2.78       2.88       0.35
12      1k      11k         0.00s      4.19       2.86       0.35
16      1k      15k         0.00s      5.60       2.86       0.35
24      1k      23k         0.00s      8.34       2.88       0.35
32      1k      31k         0.00s     11.23       2.85       0.35
48      1k      46k         0.00s     16.69       2.88       0.35
64      1k      62k         0.00s     22.19       2.88       0.35
96      1k      93k         0.00s     33.06       2.90       0.34
128     1k      125k        0.00s     44.06       2.90       0.34
192     1k      187k        0.00s     56.45       3.40       0.29
256     1k      250k        0.00s     74.51       3.44       0.29
384     1k      375k        0.00s    113.17       3.39       0.29
512     1k      500k        0.00s    148.53       3.45       0.29
768     1k      750k        0.00s    217.50       3.53       0.28
1k      1k      1000k       0.00s    280.39       3.65       0.27
1.5k    1k      1.4m        0.00s    400.73       3.83       0.26
2k      1k      1.9m        0.00s    516.91       3.96       0.25
3k      1k      2.9m        0.00s    706.37       4.35       0.23
4k      1k      3.9m        0.01s    773.12       5.30       0.19
6k      1k      5.8m        0.01s   1114.66       5.51       0.18
8k      1k      7.8m        0.01s   1314.93       6.23       0.16
12k     1k      11m         0.01s   1719.32       7.15       0.14
16k     1k      15m         0.01s   2092.46       7.83       0.13
24k     1k      23m         0.01s   2685.61       9.15       0.11
32k     1k      31m         0.01s   3210.66      10.21       0.10
48k     1k      46m         0.01s   3879.40      12.67       0.08
64k     1k      62m         0.01s   4519.10      14.50       0.07
96k     1k      93m         0.02s   5333.33      18.43       0.05
128k    1k      125m        0.02s   5965.41      21.97       0.05
192k    1k      187m        0.03s   6777.25      29.01       0.03
256k    1k      250m        0.04s   7159.67      36.61       0.03
384k    1k      375m        0.05s   7862.59      50.01       0.02
512k    1k      500m        0.07s   7490.26      70.00       0.01
768k    1k      750m        0.11s   7109.18     110.62       0.01
1m      1k      1000m       0.15s   7129.82     147.07       0.01
1.5m    1k      1.4g        0.22s   7153.02     219.89       0.00
2m      1k      1.9g        0.29s   7153.71     293.16       0.00
3m      1k      2.9g        0.44s   7194.54     437.24       0.00
4m      1k      3.9g        0.59s   7168.30     585.12       0.00
6m      1k      5.8g        0.89s   7104.97     885.50       0.00
8m      1k      7.8g        1.18s   7102.83    1181.02       0.00
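
To read the two tables side by side, the relative bandwidth change for a given message size can be computed from the MB/sec columns. The helper below is only a convenience sketch; `demo_speedup_pct` is a hypothetical name and is not part of fabtests.

```c
#include <assert.h>

/* Relative bandwidth change in percent between two MB/sec readings. */
double demo_speedup_pct(double before_mbps, double after_mbps)
{
	assert(before_mbps > 0.0);
	return (after_mbps / before_mbps - 1.0) * 100.0;
}
```

For example, the 8m row goes from 5593.55 to 7102.83 MB/sec (about 27% higher) and the 1m row from 4493.25 to 7129.82 MB/sec (about 59% higher), while rows up to a few KB are essentially unchanged.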

@miharulidze
Contributor Author

I see that CI failures are related. I'll fix them in the next commit.

@zachdworkin
Contributor

To get you started with fixing the errors:

Please add descriptions of changes in the commits and sign-off messages (should quickly fix your DCO failure)

Here are some of the common errors from Intel CI:

  • fi_multi_recv -e rdm -p "udp" -b -s "node"
    fi_mr_reg(): functional/multi_recv.c:246, ret=-266 (Required key not available)
    double free or corruption (fasttop)

  • fi_rdm_atomic -o all -I 1000 -U -p "verbs;ofi_rxd" -b
    times out

  • fi_ubertest (verbs-rxd)
    times out

Here are appveyor ones:
C:\projects\libfabric\prov\rxd\src\rxd_ep.c(1094,49): error C2057: expected constant expression [C:\projects\libfabric\libfabric.vcxproj]
C:\projects\libfabric\prov\rxd\src\rxd_ep.c(1094,49): error C2466: cannot allocate an array of constant size 0 [C:\projects\libfabric\libfabric.vcxproj]
C:\projects\libfabric\prov\rxd\src\rxd_ep.c(1094,50): error C2133: 'cqes': unknown size
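
The three MSVC diagnostics above are what that compiler typically reports for a C99 variable-length array, which MSVC does not support; the actual declaration of `cqes` at that line is not shown here, so this is an assumption. A minimal sketch of batched CQ polling that sizes the batch array with a compile-time constant instead is shown below. All `demo_*` names and the batch size are hypothetical, and `demo_cq_read` is a mock that loosely mimics `fi_cq_read` (the real call returns `-FI_EAGAIN` rather than 0 when the queue is empty).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_CQ_BATCH 16	/* hypothetical batch size, compile-time constant */

struct demo_cqe {
	int op;
};

/* Mock completion queue backed by a plain array. */
struct demo_cq {
	struct demo_cqe *pending;
	size_t head;
	size_t count;
};

/* Read up to max entries into buf; returns how many were read (0 if empty). */
size_t demo_cq_read(struct demo_cq *cq, struct demo_cqe *buf, size_t max)
{
	size_t n = 0;

	while (n < max && cq->head < cq->count)
		buf[n++] = cq->pending[cq->head++];
	return n;
}

/* Drain the CQ in batches instead of one entry per call. Returns the total
 * number of completions seen and stores the number of read calls in *reads. */
size_t demo_progress(struct demo_cq *cq, size_t *reads)
{
	struct demo_cqe cqes[DEMO_CQ_BATCH];	/* fixed size, not a VLA */
	size_t total = 0, n;

	assert(cq && reads);
	*reads = 0;
	do {
		n = demo_cq_read(cq, cqes, DEMO_CQ_BATCH);
		(*reads)++;
		total += n;	/* real code would process cqes[0..n-1] here */
	} while (n == DEMO_CQ_BATCH);	/* a full batch means more may be queued */
	return total;
}
```

With 40 queued completions and a batch of 16, this loop issues three reads (16, 16, 8) rather than 40 single-entry reads.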

@miharulidze
Contributor Author

@zachdworkin would you please share with me logs of Intel Jenkins? thanks!

@zachdworkin
Contributor

You are only creating failures in fabtests. The other middlewares are reporting passes (for now).

Fabtests UDP failures: 1 (reg, dbg, dl builds all the same)
server: fi_rdm_tagged_peek -p "udp" -b -s n1
client: fi_rdm_tagged_peek -p "udp" -b -s n2 n1
output: hang no output

Fabtests verbs;ofi_rxd failures: 2 (reg, dbg, dl builds all the same)
server: fi_rdm_atomic -o all -I 1000 -p "verbs;ofi_rxd" -b -s n1
client: fi_rdm_atomic -o all -I 1000 -p "verbs;ofi_rxd" -b -s n2 n1
output: hang after "Provider doesn't support FI_BXOR atomic operation on FI_UINT128"

server: fi_rdm_atomic -o all -I 1000 -U -p "verbs;ofi_rxd" -b -s n1
client: fi_rdm_atomic -o all -I 1000 -U -p "verbs;ofi_rxd" -b -s n2 n1
output: hang after "Provider doesn't support FI_BXOR atomic operation on FI_UINT128"

@miharulidze
Contributor Author

Thank you! I can reproduce it locally.

@miharulidze
Contributor Author

@zachdworkin I've been debugging this locally, and the fi_rdm_atomic tests should hopefully be fixed now.

Do I understand correctly that fi_rdm_tagged_peek with UDP still fails? For some reason I can't reproduce it on my M1 Mac or x86/ConnectX-5 testbeds.

Thank you!

@zachdworkin
Contributor

The rdm_tagged_peek test is passing now. I ran it 50 times without failure. The only failing test now is inside ubertest, and it's #14. My log isn't telling me why it's failing. I can look into it more if you need.
