Hello, thanks for your great work!
As we have discussed, I have very recently noticed an issue with the driver on 5090s.
My setup is:
- AMD Ryzen 9 9900X
- MSI Carbon X670E
- 192GB DDR5 6000 MHz
- RTX 5090x2
- RTX 4090x2
- RTX A6000
- NVIDIA A40
- Fedora 42
I disabled IOMMU in the BIOS and also added the flags to grub as mentioned in the first post.
I installed the driver, then the P2P driver.
P2P between the 4090s, or between the A6000 and the A40, works correctly.
But as soon as a monitor/screen is attached to one of the 5090s, an issue arises with P2P between the 5090s.
Bandwidth seems to "work":
./p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 5090, pciBusID: 1, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: 3, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1743.86 24.85
1 24.91 1771.60
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 1740.04 28.66
1 28.68 1763.54
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1747.67 30.37
1 30.41 1765.44
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1743.77 56.26
1 56.27 1775.51
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 2.07 12.32
1 14.36 2.07
CPU 0 1
0 1.54 3.95
1 3.86 2.42
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 2.07 0.36
1 0.43 2.07
CPU 0 1
0 1.55 1.05
1 1.09 1.53
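As a side note, the connectivity the test reports can also be confirmed straight from the CUDA runtime, independent of any bandwidth measurement. The small check below is only a sketch of my own (it assumes exactly two visible GPUs), built on cudaDeviceCanAccessPeer and cudaDeviceGetP2PAttribute:

```cpp
// peer_check.cu - standalone peer-capability check (sketch; assumes two visible GPUs).
// Build: nvcc peer_check.cu -o peer_check
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    if (count < 2) { printf("Need at least two GPUs\n"); return 1; }

    for (int src = 0; src < 2; ++src) {
        for (int dst = 0; dst < 2; ++dst) {
            if (src == dst) continue;
            int canAccess = 0, nativeAtomics = 0;
            // Can kernels on 'src' access memory allocated on 'dst'?
            cudaDeviceCanAccessPeer(&canAccess, src, dst);
            // Are native atomics supported over this peer link?
            cudaDeviceGetP2PAttribute(&nativeAtomics,
                                      cudaDevP2PAttrNativeAtomicSupported, src, dst);
            printf("GPU%d -> GPU%d : access=%d nativeAtomics=%d\n",
                   src, dst, canAccess, nativeAtomics);
        }
    }
    return 0;
}
```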
But the moment you run simpleP2P, or use P2P through vLLM for example, the former fails and the latter never finishes loading.
./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 5090 (GPU0) -> NVIDIA GeForce RTX 5090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 5090 (GPU1) -> NVIDIA GeForce RTX 5090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 26.60GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Verification error @ element 1: val = 0.000000, ref = 4.000000
Verification error @ element 2: val = 0.000000, ref = 8.000000
Verification error @ element 3: val = 0.000000, ref = 12.000000
Verification error @ element 4: val = 0.000000, ref = 16.000000
Verification error @ element 5: val = 0.000000, ref = 20.000000
Verification error @ element 6: val = 0.000000, ref = 24.000000
Verification error @ element 7: val = 0.000000, ref = 28.000000
Verification error @ element 8: val = 0.000000, ref = 32.000000
Verification error @ element 9: val = 0.000000, ref = 36.000000
Verification error @ element 10: val = 0.000000, ref = 40.000000
Verification error @ element 11: val = 0.000000, ref = 44.000000
Verification error @ element 12: val = 0.000000, ref = 48.000000
Disabling peer access...
Shutting down...
Test failed!
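For reference, the failure pattern (peer reads silently coming back as zeros) can be reproduced with a minimal copy-and-verify program along these lines. This is only a sketch in the spirit of simpleP2P, not the sample's actual source, and it assumes the two 5090s are devices 0 and 1:

```cpp
// p2p_verify.cu - minimal peer-to-peer copy-and-verify between GPU0 and GPU1.
// Error checking is omitted for brevity. Build: nvcc p2p_verify.cu -o p2p_verify
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Runs on GPU1: reads a buffer resident on GPU0 over the peer mapping and
// writes a scaled copy into GPU1's own buffer.
__global__ void scaleCopy(const float* src, float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = src[i] * 4.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Enable peer access in both directions.
    cudaSetDevice(0); cudaDeviceEnablePeerAccess(1, 0);
    cudaSetDevice(1); cudaDeviceEnablePeerAccess(0, 0);

    float *d0 = nullptr, *d1 = nullptr;
    cudaSetDevice(0); cudaMalloc(&d0, bytes);
    cudaSetDevice(1); cudaMalloc(&d1, bytes);

    // Fill GPU0's buffer from the host.
    std::vector<float> host(n);
    for (int i = 0; i < n; ++i) host[i] = float(i);
    cudaSetDevice(0);
    cudaMemcpy(d0, host.data(), bytes, cudaMemcpyHostToDevice);

    // Kernel on GPU1 pulls the data from GPU0 over P2P.
    cudaSetDevice(1);
    scaleCopy<<<(n + 255) / 256, 256>>>(d0, d1, n);
    cudaDeviceSynchronize();

    // Copy GPU1's result back and verify on the host.
    std::vector<float> out(n);
    cudaMemcpy(out.data(), d1, bytes, cudaMemcpyDeviceToHost);
    int errors = 0;
    for (int i = 0; i < n; ++i) {
        if (out[i] != host[i] * 4.0f) {
            if (errors < 5)
                printf("mismatch @ %d: got %f, expected %f\n", i, out[i], host[i] * 4.0f);
            ++errors;
        }
    }
    if (errors) printf("FAILED (%d mismatches)\n", errors);
    else        printf("PASSED\n");

    cudaSetDevice(1); cudaFree(d1);
    cudaSetDevice(0); cudaFree(d0);
    return errors != 0;
}
```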
I had an HDMI dummy plug on one of the 5090s so I could access the machine remotely, and simply removing it made the test pass again.
./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2
Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 5090 (GPU0) -> NVIDIA GeForce RTX 5090 (GPU1) : Yes
> Peer access from NVIDIA GeForce RTX 5090 (GPU1) -> NVIDIA GeForce RTX 5090 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 26.56GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
I also tried a DisplayPort monitor/cable, and the same issue occurs.
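To quickly confirm which GPU has an active display before launching a P2P workload, the display state can be queried through NVML. nvmlDeviceGetDisplayActive is the standard NVML call; the rest of the little program below is my own sketch:

```cpp
// display_check.cpp - report which GPUs currently drive a display (NVML sketch).
// Build: nvcc display_check.cpp -lnvidia-ml -o display_check
// (or g++ with the CUDA toolkit include path; assumes nvml.h and libnvidia-ml are installed)
#include <cstdio>
#include <nvml.h>

int main() {
    if (nvmlInit_v2() != NVML_SUCCESS) { printf("NVML init failed\n"); return 1; }

    unsigned int count = 0;
    nvmlDeviceGetCount_v2(&count);
    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        if (nvmlDeviceGetHandleByIndex_v2(i, &dev) != NVML_SUCCESS) continue;

        char name[NVML_DEVICE_NAME_BUFFER_SIZE] = {0};
        nvmlDeviceGetName(dev, name, sizeof(name));

        // Reports whether a display is initialized/active on this GPU.
        nvmlEnableState_t active = NVML_FEATURE_DISABLED;
        nvmlDeviceGetDisplayActive(dev, &active);
        printf("GPU %u (%s): display %s\n", i, name,
               active == NVML_FEATURE_ENABLED ? "ACTIVE" : "not active");
    }
    nvmlShutdown();
    return 0;
}
```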