The P2P driver for the 5090 does not work with the NVIDIA A10 graphics card #16

@RogerXconn

Description

NVIDIA Open GPU Kernel Modules Version

a9284ecf7ab29e599e96de82168484728627eb7e06727467053719b785401e0a /root/xconn/wade/open-gpu-kernel-modules/kernel-open/nvidia.ko

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Description: Ubuntu 22.04.5 LTS

Kernel Release

Linux h3 6.8.0-78-generic #78~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Aug 13 14:32:06 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • I am running on a stable kernel release.

Hardware: GPU

After entering "nvidia-smi -L", the terminal hangs.
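
One way to capture the hang without blocking the terminal (a sketch, assuming coreutils timeout is available; the 30-second limit is arbitrary):

    # Run the listing with a time limit so a driver hang does not tie up the shell.
    timeout 30 nvidia-smi -L; echo "exit status: $?"
    # Exit status 124 means timeout killed the command, i.e. it was hanging.
    dmesg | tail -n 50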

Describe the bug

I have one AMD Turin server with 2 NVIDIA A10 cards installed in it.
The server recognizes both A10 cards.
lspci -vt (relevant portion):
+-[0000:c0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Turin Root Complex
|           +-00.3  Advanced Micro Devices, Inc. [AMD] Turin RCEC
|           +-01.0  Advanced Micro Devices, Inc. [AMD] Turin PCIe Dummy Host Bridge
|           +-01.1-[c1-c4]----00.0-[c2-c4]--+-07.0-[c3]----00.0  NVIDIA Corporation GA102GL [A10]
|           |                               \-0a.0-[c4]----00.0  NVIDIA Corporation GA102GL [A10]
|           +-02.0  Advanced Micro Devices, Inc. [AMD] Turin PCIe Dummy Host Bridge
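
To double-check which kernel driver owns each A10, the bindings can be queried directly (a quick sketch; the c3:00.0 / c4:00.0 addresses come from the tree above):

    # List NVIDIA devices (vendor 10de) together with the kernel driver bound to each.
    lspci -nnk -d 10de:
    # Or query the two A10 functions directly by bus address.
    lspci -nnk -s c3:00.0
    lspci -nnk -s c4:00.0
    # "Kernel driver in use: nouveau" here would match the NVRM messages in dmesg below.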

nvidia-smi shows the following error message.
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

dmesg shows the following error messages.
[ 271.395530] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 271.395537] NVRM: GPU 0000:c3:00.0 is already bound to nouveau.
[ 271.397851] NVRM: GPU 0000:c4:00.0 is already bound to nouveau.
[ 271.397920] NVRM: The NVIDIA probe routine was not called for 2 device(s).
[ 271.397921] NVRM: This can occur when another driver was loaded and
NVRM: obtained ownership of the NVIDIA device(s).
[ 271.397922] NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
[ 271.397922] NVRM: No NVIDIA devices probed.
[ 271.398524] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
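
The NVRM messages indicate nouveau claimed both A10s before the NVIDIA module could bind them. A minimal sketch of the usual way to release them on Ubuntu (the standard nouveau blacklist procedure, nothing specific to this driver build):

    # Confirm nouveau is the module holding the GPUs.
    lsmod | grep -E 'nouveau|nvidia'
    # Blacklist nouveau so it is not loaded at boot.
    echo -e 'blacklist nouveau\noptions nouveau modeset=0' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
    # Rebuild the initramfs and reboot so nouveau never grabs the devices.
    sudo update-initramfs -u
    sudo reboot
    # After the reboot, load the open kernel module and re-check.
    sudo modprobe nvidia
    nvidia-smi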

To Reproduce

Steps to reproduce (a combined one-shot check is sketched after this list):

  1. Power on the AMD Turin server
  2. lspci -vt // confirm the server recognizes the 2 NVIDIA A10 cards
  3. nvidia-smi
  4. dmesg | tail -n 50
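
A one-shot version of the same check (a sketch; the grep patterns just pull out the lines quoted above):

    # Steps 2-4 in one pass, keeping only the lines relevant to this report.
    lspci -vt | grep -i nvidia                   # the two GA102GL [A10] endpoints
    nvidia-smi                                   # fails as shown above
    dmesg | grep -E 'NVRM|nouveau' | tail -n 20  # the probe/ownership messages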

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

I ran the same test on an AMD Turin server with 2 RTX 5090 cards and it worked, as shown below. (A topology sanity check for the A10 setup is sketched after these results.)

  1. nvidia-smi
    +-----------------------------------------------------------------------------------------+
    | NVIDIA-SMI 590.44.01              Driver Version: 590.44.01      CUDA Version: 13.1     |
    +-----------------------------------------+------------------------+----------------------+
    | GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
    |                                         |                        |               MIG M. |
    |=========================================+========================+======================|
    |   0  NVIDIA GeForce RTX 5090        Off |   00000000:21:00.0 Off |                  N/A |
    |  0%   28C    P8             12W /  600W |       2MiB /  32607MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
    |   1  NVIDIA GeForce RTX 5090        Off |   00000000:C1:00.0 Off |                  N/A |
    |  0%   27C    P8             14W /  600W |       2MiB /  32607MiB |      0%      Default |
    |                                         |                        |                  N/A |
    +-----------------------------------------+------------------------+----------------------+
  2. p2pBandwidthLatencyTest
    [P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
    Device: 0, NVIDIA GeForce RTX 5090, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
    Device: 1, NVIDIA GeForce RTX 5090, pciBusID: c4, pciDeviceID: 0, pciDomainID:0
    Device=0 CAN Access Peer Device=1
    Device=1 CAN Access Peer Device=0

    ***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
    So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.

    P2P Connectivity Matrix
         D\D     0     1
         0       1     1
         1       1     1
    Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D       0       1
         0 1522.95   11.47
         1   11.46 1556.27
    Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
       D\D       0       1
         0 1525.93   57.19
         1   57.19 1547.03
    Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
       D\D       0       1
         0 1527.32   11.57
         1   11.46 1540.12
    Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
       D\D       0       1
         0 1528.07  112.29
         1  112.34 1538.58
    P2P=Disabled Latency Matrix (us)
       GPU     0      1
         0   2.09  15.43
         1  15.42   2.08

       CPU     0      1
         0   2.27   6.21
         1   6.21   2.24
    P2P=Enabled Latency (P2P Writes) Matrix (us)
       GPU     0      1
         0   2.07   0.37
         1   0.45   2.07

       CPU     0      1
         0   2.28   1.58
         1   1.59   2.23
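
Once the driver loads on the A10 box, the same P2P path can be sanity-checked before rerunning p2pBandwidthLatencyTest (a sketch; both A10s sit behind the same c2 bridge in the lspci tree above):

    # GPU-to-GPU connection matrix (PIX = same PCIe bridge, PHB = through the host bridge, ...).
    nvidia-smi topo -m
    # Peer-to-peer read capability matrix, if this nvidia-smi build supports it.
    nvidia-smi topo -p2p r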
