Fix/ttm deadlock and rdma pin limit by chun-wan · Pull Request #212 · ROCm/amdgpu

chun-wan · 2026-04-12T18:01:56Z

Motivation

Fix two critical issues on MI300X/MI308X with multi-process RDMA workloads under VRAM pressure (the first two items below), along with two supporting changes:

TTM gtt_window_lock live-lock (amdgpu_ttm.c): Replace mutex_lock() with mutex_lock_interruptible() in amdgpu_ttm_copy_mem_to_mem() to prevent permanent D2H hang.
Unbounded RDMA BO pinning (amdgpu_amdkfd_gpuvm.c, amdgpu_amdkfd.h): Add dedicated rdma_pinned_bytes counter and enforce dmabuf_pin_max_mb to cap RDMA-pinned VRAM per GPU.
PeerDirect error propagation (kfd_peerdirect.c): Distinguish quota rejections (-ENOSPC) from other pin errors.
Debug parameter (amdgpu_drv.c, amdgpu.h): Add rdma_pin_debug module parameter.

Technical Details

Root Cause

amdgpu_ttm_copy_mem_to_mem() uses non-interruptible mutex_lock() on gtt_window_lock. When processes deadlock during eviction restore, neither can back off → permanent hang after 30-40 min.
amdgpu_amdkfd_gpuvm_pin_bo() has no RDMA-specific accounting or limit → unbounded pin growth starves other processes.

Test Plan

Test 1 (dmabuf_pin_max_mb=0): 0 hangs, 0 stalls, 15.7M D2H ops
Test 2 (dmabuf_pin_max_mb=512): RDMA pins capped at 512 MB, 126 excess rejected with -ENOSPC, 0 hangs

Test Result

Submission Checklist

Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

…mem_to_mem

When multiple processes trigger KFD BO eviction and subsequent restore simultaneously, the non-interruptible mutex_lock() on gtt_window_lock in amdgpu_ttm_copy_mem_to_mem() can cause a live-lock: Process A holds the lock waiting for Process B to release a BO, while Process B waits for the same lock. Since mutex_lock() is not interruptible, neither process can back off, resulting in a permanent D2H hang.

This was observed on MI300X systems running multi-process RDMA workloads under VRAM pressure, where the hang typically occurs after 30-40 minutes of sustained operation.

Replace mutex_lock() with mutex_lock_interruptible() so the wait can be interrupted by signals, returning -ERESTARTSYS to allow the TTM subsystem to retry or abort gracefully.

Tested on MI300X/MI308X (gfx942) with 8 GPUs under extreme VRAM pressure (192/196 GB utilized) for 40+ minutes with zero hangs.

Signed-off-by: Chun Wan <chun-wan@amd.com>
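The back-off behaviour described in this commit can be sketched in userspace. This is only a model, not the kernel code: `struct window_lock`, `window_lock_interruptible()` and `window_unlock()` are hypothetical names, the lock is a C11 `atomic_flag` rather than a kernel mutex, and a plain atomic boolean stands in for a pending signal. The model returns -EINTR because ERESTARTSYS is a kernel-internal code that userspace `errno.h` does not define.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for gtt_window_lock. */
struct window_lock {
    atomic_flag held;
};

/*
 * Try to take the lock, but give up with -EINTR as soon as the
 * "signal pending" flag is observed while the lock is contended.
 * A non-interruptible variant would simply keep waiting here.
 */
static int window_lock_interruptible(struct window_lock *lock,
                                     atomic_bool *signal_pending)
{
    while (atomic_flag_test_and_set(&lock->held)) {
        if (atomic_load(signal_pending))
            return -EINTR; /* caller can retry or abort the copy */
    }
    return 0; /* lock acquired */
}

static void window_unlock(struct window_lock *lock)
{
    atomic_flag_clear(&lock->held);
}
```

A caller that receives the error returns it up the stack instead of blocking, which is what lets one of the two contending processes release its resources and break the cycle.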

RDMA PeerDirect operations pin GPU buffer objects in VRAM, making them ineligible for eviction. Without any limit, a misbehaving or compromised RDMA peer can pin all available VRAM, starving other processes and triggering cascading eviction failures that lead to system hangs.

Add a dedicated atomic64_t rdma_pinned_bytes counter in amdgpu_kfd_dev to track RDMA-pinned VRAM independently from the general vram_pinned counter. In amdgpu_amdkfd_gpuvm_pin_bo(), enforce the existing dmabuf_pin_max_mb module parameter using atomic64_add_return() for race-free accounting. If the total would exceed the configured limit, roll back and return -ENOSPC.

Also improve PeerDirect error logging in kfd_peerdirect.c to distinguish quota rejections (-ENOSPC) from other pin errors, and add a rdma_pin_debug module parameter for optional runtime logging.

Tested on MI300X/MI308X with 128 RDMA pin attempts (32 GB total):
- Without limit (dmabuf_pin_max_mb=0): 120 pins succeed (30 GB pinned)
- With limit (dmabuf_pin_max_mb=512): only 2 pins succeed (512 MB), 126 correctly rejected with -ENOSPC in dmesg
- No hangs or GPU resets in either configuration over 40 minutes

Signed-off-by: Chun Wan <chun-wan@amd.com>
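The add-then-check accounting described above might look roughly like the following userspace model. It is a sketch under assumptions rather than the patch itself: `struct rdma_pin_quota`, `rdma_quota_reserve()` and `rdma_quota_release()` are hypothetical names, and C11 `atomic_fetch_add()` plus the added size stands in for the kernel's atomic64_add_return(). A `max_mb` of 0 is treated as "no limit", matching the test description.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device RDMA pin accounting. */
struct rdma_pin_quota {
    _Atomic int64_t pinned_bytes; /* models rdma_pinned_bytes */
    int64_t max_mb;               /* models dmabuf_pin_max_mb, 0 = no limit */
};

/*
 * Reserve quota before pinning. The counter is bumped first and checked
 * afterwards, so two racing callers cannot both slip under the limit;
 * the loser undoes its own increment and reports -ENOSPC.
 */
static int rdma_quota_reserve(struct rdma_pin_quota *q, int64_t size)
{
    int64_t limit = q->max_mb * 1024 * 1024;
    int64_t total = atomic_fetch_add(&q->pinned_bytes, size) + size;

    if (q->max_mb > 0 && total > limit) {
        atomic_fetch_sub(&q->pinned_bytes, size); /* roll back */
        return -ENOSPC;
    }
    return 0;
}

/* Give the quota back when the buffer is unpinned. */
static void rdma_quota_release(struct rdma_pin_quota *q, int64_t size)
{
    atomic_fetch_sub(&q->pinned_bytes, size);
}
```

With a 512 MB limit and 128 attempts of 256 MB each, this model admits the first two requests and rejects the remaining 126, which lines up with the limited test case. The model has no notion of VRAM capacity, so it says nothing about the 120-of-128 figure in the unlimited case.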

amdgpu_amdkfd_gpuvm_pin_bo() increments rdma_pinned_bytes before amdgpu_bo_pin() when the domain includes VRAM. If pinning succeeds but the buffer ends up outside VRAM, unpin_bo() never subtracts from rdma_pinned_bytes (it only does so for TTM_PL_VRAM), leaking quota. Roll back the pre-accounted bytes in that case.
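The leak and its fix can be illustrated with a small userspace model. All names here (`struct pin_counter`, `pin_bo_model()`, `unpin_bo_model()`, `enum mem_type`) are hypothetical, and the final placement and pin result are passed in as parameters instead of coming from amdgpu_bo_pin(). Rolling back when the pin itself fails is an assumption of this sketch; the commit text only describes the non-VRAM placement case.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the TTM placement of the buffer after pinning. */
enum mem_type { MEM_VRAM, MEM_GTT };

struct pin_counter {
    _Atomic int64_t rdma_pinned_bytes;
};

/*
 * Model of the pin path: bytes are accounted up front when the requested
 * domain includes VRAM, then rolled back if the buffer did not actually
 * land in VRAM (or, as an assumption of this sketch, if pinning failed).
 */
static int pin_bo_model(struct pin_counter *c, int64_t size,
                        bool domain_has_vram, enum mem_type placed,
                        int pin_ret)
{
    bool accounted = false;

    if (domain_has_vram) {
        atomic_fetch_add(&c->rdma_pinned_bytes, size);
        accounted = true;
    }

    if (accounted && (pin_ret != 0 || placed != MEM_VRAM))
        atomic_fetch_sub(&c->rdma_pinned_bytes, size); /* undo pre-accounting */

    return pin_ret;
}

/* Model of the unpin path: only VRAM-resident buffers are subtracted. */
static void unpin_bo_model(struct pin_counter *c, int64_t size,
                           enum mem_type placed)
{
    if (placed == MEM_VRAM)
        atomic_fetch_sub(&c->rdma_pinned_bytes, size);
}
```

Without the rollback branch, a buffer that is pinned outside VRAM would add to the counter on pin and never subtract on unpin, so the quota would shrink permanently with each such buffer.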

chun-wan added 3 commits April 13, 2026 01:50

chun-wan wants to merge 3 commits into ROCm:master from chun-wan:fix/ttm-deadlock-and-rdma-pin-limit
