…mem_to_mem

When multiple processes trigger KFD BO eviction and subsequent restore simultaneously, the non-interruptible mutex_lock() on gtt_window_lock in amdgpu_ttm_copy_mem_to_mem() can cause a live-lock: Process A holds the lock waiting for Process B to release a BO, while Process B waits for the same lock. Since mutex_lock() is not interruptible, neither process can back off, resulting in a permanent D2H hang.

This was observed on MI300X systems running multi-process RDMA workloads under VRAM pressure, where the hang typically occurs after 30-40 minutes of sustained operation.

Replace mutex_lock() with mutex_lock_interruptible() so the wait can be interrupted by signals, returning -ERESTARTSYS to allow the TTM subsystem to retry or abort gracefully.

Tested on MI300X/MI308X (gfx942) with 8 GPUs under extreme VRAM pressure (192/196 GB utilized) for 40+ minutes with zero hangs.

Signed-off-by: Chun Wan <chun-wan@amd.com>
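The call-site pattern this commit message describes can be illustrated with a small userspace toy model. This is only a sketch, not the driver code: `toy_mutex`, `toy_lock_interruptible()`, `toy_unlock()` and `toy_copy_mem_to_mem()` are hypothetical stand-ins, and the `signal_pending` flag replaces the kernel's real signal check. The point is that the caller tests the return value and leaves before touching the protected resource.

```c
#include <errno.h>
#include <stdbool.h>

/* ERESTARTSYS is kernel-internal (512 in include/linux/errno.h), so it is
 * not provided by the userspace <errno.h>. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Hypothetical toy lock: just enough state to model contention. */
struct toy_mutex {
    bool held;
};

/*
 * Toy model of an interruptible acquire. A real kernel mutex would sleep
 * while contended; here contention plus a pending signal returns
 * -ERESTARTSYS, and contention without a signal returns -EBUSY so that
 * the sketch never blocks.
 */
static int toy_lock_interruptible(struct toy_mutex *m, bool signal_pending)
{
    if (!m->held) {
        m->held = true;
        return 0;
    }
    return signal_pending ? -ERESTARTSYS : -EBUSY;
}

static void toy_unlock(struct toy_mutex *m)
{
    m->held = false;
}

/* Call-site pattern: propagate the error instead of waiting forever. */
static int toy_copy_mem_to_mem(struct toy_mutex *gtt_window_lock,
                               bool signal_pending, int *copies_done)
{
    int r = toy_lock_interruptible(gtt_window_lock, signal_pending);

    if (r)
        return r; /* caller may retry or abort */

    (*copies_done)++; /* stands in for the actual copy work */
    toy_unlock(gtt_window_lock);
    return 0;
}
```

In this model an uncontended call completes the copy and releases the lock, while a contended call with a pending signal returns -ERESTARTSYS without doing any work.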
RDMA PeerDirect operations pin GPU buffer objects in VRAM, making them ineligible for eviction. Without any limit, a misbehaving or compromised RDMA peer can pin all available VRAM, starving other processes and triggering cascading eviction failures that lead to system hangs.

Add a dedicated atomic64_t rdma_pinned_bytes counter in amdgpu_kfd_dev to track RDMA-pinned VRAM independently from the general vram_pinned counter. In amdgpu_amdkfd_gpuvm_pin_bo(), enforce the existing dmabuf_pin_max_mb module parameter using atomic64_add_return() for race-free accounting. If the total would exceed the configured limit, roll back and return -ENOSPC.

Also improve PeerDirect error logging in kfd_peerdirect.c to distinguish quota rejections (-ENOSPC) from other pin errors, and add a rdma_pin_debug module parameter for optional runtime logging.

Tested on MI300X/MI308X with 128 RDMA pin attempts (32 GB total):
- Without limit (dmabuf_pin_max_mb=0): 120 pins succeed (30 GB pinned)
- With limit (dmabuf_pin_max_mb=512): only 2 pins succeed (512 MB), 126 correctly rejected with -ENOSPC in dmesg
- No hangs or GPU resets in either configuration over 40 minutes

Signed-off-by: Chun Wan <chun-wan@amd.com>
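The charge-then-roll-back accounting described above can be sketched in userspace with C11 atomics. This is a minimal illustration under stated assumptions, not the patch itself: `struct pin_quota`, `quota_charge()` and `quota_uncharge()` are hypothetical names, `atomic_fetch_add()` plus the size stands in for the kernel's atomic64_add_return(), and a limit of 0 is treated as "no limit" to mirror the dmabuf_pin_max_mb=0 test case.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the per-device RDMA pin counter. */
struct pin_quota {
    _Atomic int64_t pinned_bytes;
    int64_t limit_bytes; /* 0 means unlimited in this sketch */
};

/*
 * Charge first, then check. Adding before comparing means two concurrent
 * callers cannot both observe "room left" and jointly exceed the limit;
 * the loser undoes its own contribution and reports -ENOSPC.
 */
static int quota_charge(struct pin_quota *q, int64_t size)
{
    /* fetch_add returns the old value, so old + size is the new total. */
    int64_t total = atomic_fetch_add(&q->pinned_bytes, size) + size;

    if (q->limit_bytes > 0 && total > q->limit_bytes) {
        atomic_fetch_sub(&q->pinned_bytes, size); /* roll back */
        return -ENOSPC;
    }
    return 0;
}

/* Release a previously successful charge (the unpin path). */
static void quota_uncharge(struct pin_quota *q, int64_t size)
{
    atomic_fetch_sub(&q->pinned_bytes, size);
}
```

With a 512 MB limit and 256 MB requests (32 GB over 128 attempts, as in the test above), the first two charges succeed and the third is rejected, which matches the "only 2 pins succeed" result reported in the commit message.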
amdgpu_amdkfd_gpuvm_pin_bo() increments rdma_pinned_bytes before amdgpu_bo_pin() when the domain includes VRAM. If pinning succeeds but the buffer ends up outside VRAM, unpin_bo() never subtracts from rdma_pinned_bytes (it only does so for TTM_PL_VRAM), leaking quota. Roll back the pre-accounted bytes in that case.
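A toy model of the leak and its fix, assuming simplified stand-ins for the driver objects: `struct toy_bo`, `enum toy_placement`, `toy_pin_bo()` and `toy_unpin_bo()` are hypothetical, and the `lands_in` argument lets the caller choose where the mock pin step places the buffer. It is a sketch of the accounting logic only.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the TTM placement constants. */
enum toy_placement { PL_SYSTEM, PL_GTT, PL_VRAM };

struct toy_bo {
    int64_t size;
    enum toy_placement placed;
};

/* Stand-in for the rdma_pinned_bytes counter; zero-initialized. */
static _Atomic int64_t toy_rdma_pinned_bytes;

static int toy_pin_bo(struct toy_bo *bo, bool domain_includes_vram,
                      enum toy_placement lands_in)
{
    /* Pre-account before pinning, as the commit message describes. */
    if (domain_includes_vram)
        atomic_fetch_add(&toy_rdma_pinned_bytes, bo->size);

    bo->placed = lands_in; /* mock of the actual pin step */

    /*
     * The fix: the unpin path only subtracts for VRAM placements, so a
     * pin that ended up elsewhere must give its bytes back right here.
     */
    if (domain_includes_vram && bo->placed != PL_VRAM)
        atomic_fetch_sub(&toy_rdma_pinned_bytes, bo->size);

    return 0;
}

static void toy_unpin_bo(struct toy_bo *bo)
{
    if (bo->placed == PL_VRAM)
        atomic_fetch_sub(&toy_rdma_pinned_bytes, bo->size);
}
```

Without the rollback branch in `toy_pin_bo()`, a pin that lands in GTT followed by an unpin would leave the counter permanently inflated by the buffer size.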
Motivation
Fix two critical issues on MI300X/MI308X with multi-process RDMA workloads under VRAM pressure:
- (amdgpu_ttm.c): Replace mutex_lock() with mutex_lock_interruptible() in amdgpu_ttm_copy_mem_to_mem() to prevent permanent D2H hang.
- (amdgpu_amdkfd_gpuvm.c, amdgpu_amdkfd.h): Add dedicated rdma_pinned_bytes counter and enforce dmabuf_pin_max_mb to cap RDMA-pinned VRAM per GPU.
- (kfd_peerdirect.c): Distinguish quota rejections (-ENOSPC) from other pin errors.
- (amdgpu_drv.c, amdgpu.h): Add rdma_pin_debug module parameter.

Technical Details
Root Cause
- amdgpu_ttm_copy_mem_to_mem() uses non-interruptible mutex_lock() on gtt_window_lock. When processes deadlock during eviction restore, neither can back off → permanent hang after 30-40 min.
- amdgpu_amdkfd_gpuvm_pin_bo() has no RDMA-specific accounting or limit → unbounded pin growth starves other processes.

Test Plan
Test Result
Submission Checklist