Description
NVIDIA Open GPU Kernel Modules Version
a9284ecf7ab29e599e96de82168484728627eb7e06727467053719b785401e0a /root/xconn/wade/open-gpu-kernel-modules/kernel-open/nvidia.ko
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Description: Ubuntu 22.04.5 LTS
Kernel Release
Linux h3 6.8.0-78-generic #78~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Aug 13 14:32:06 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- I am running on a stable kernel release.
Hardware: GPU
2x NVIDIA A10 (GA102GL). After running "nvidia-smi -L", the terminal hangs.
Describe the bug
I have one AMD Turin server with two NVIDIA A10 cards installed.
The server recognizes both A10 cards.
lspci:
+-[0000:c0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Turin Root Complex
|           +-00.3  Advanced Micro Devices, Inc. [AMD] Turin RCEC
|           +-01.0  Advanced Micro Devices, Inc. [AMD] Turin PCIe Dummy Host Bridge
|           +-01.1-[c1-c4]----00.0-[c2-c4]--+-07.0-[c3]----00.0  NVIDIA Corporation GA102GL [A10]
|           |                               \-0a.0-[c4]----00.0  NVIDIA Corporation GA102GL [A10]
|           \-02.0  Advanced Micro Devices, Inc. [AMD] Turin PCIe Dummy Host Bridge
nvidia-smi shows the following error message:
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
dmesg shows the following error messages:
[ 271.395530] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 271.395537] NVRM: GPU 0000:c3:00.0 is already bound to nouveau.
[ 271.397851] NVRM: GPU 0000:c4:00.0 is already bound to nouveau.
[ 271.397920] NVRM: The NVIDIA probe routine was not called for 2 device(s).
[ 271.397921] NVRM: This can occur when another driver was loaded and
NVRM: obtained ownership of the NVIDIA device(s).
[ 271.397922] NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
[ 271.397922] NVRM: No NVIDIA devices probed.
[ 271.398524] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
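To confirm what the NVRM messages report, the driver bound to each GPU can be checked like this (a quick diagnostic sketch, not part of the original report; the bus addresses are taken from the lspci output above):

lspci -nnk -s c3:00.0              # "Kernel driver in use:" is expected to show nouveau here
lspci -nnk -s c4:00.0
lsmod | grep -E 'nouveau|nvidia'   # nouveau loaded, nvidia modules absent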
To Reproduce
Steps to reproduce:
- Power on the AMD Turin server
- lspci -vt // confirm the server recognizes the two NVIDIA A10 cards
- nvidia-smi
- dmesg | tail -n 50
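Following the NVRM suggestion in the dmesg output above, a possible workaround sketch (untested on this machine; the blacklist file name is illustrative):

sudo modprobe -r nouveau                                                                # unload the conflicting driver first
echo "blacklist nouveau" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf              # keep nouveau from claiming the GPUs
echo "options nouveau modeset=0" | sudo tee -a /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u                                                                # keep nouveau out across reboots
sudo modprobe nvidia                                                                    # load the open kernel module again
nvidia-smi                                                                              # the GPUs should now be visible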
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
I ran the same test on an AMD Turin server with two RTX 5090 cards, and it worked as shown below.
- nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01 Driver Version: 590.44.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:21:00.0 Off | N/A |
| 0% 28C P8 12W / 600W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:C1:00.0 Off | N/A |
| 0% 27C P8 14W / 600W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
- p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 5090, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: c4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1522.95 11.47
1 11.46 1556.27
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 1525.93 57.19
1 57.19 1547.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1527.32 11.57
1 11.46 1540.12
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1528.07 112.29
1 112.34 1538.58
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 2.09 15.43
1 15.42 2.08
CPU 0 1
0 2.27 6.21
1 6.21 2.24
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 2.07 0.37
1 0.45 2.07
CPU 0 1
0 2.28 1.58
1 1.59 2.23