Fix/ttm deadlock and rdma pin limit #212

Open
chun-wan wants to merge 3 commits into ROCm:master from chun-wan:fix/ttm-deadlock-and-rdma-pin-limit
@chun-wan
Motivation

Fix two critical issues on MI300X/MI308X with multi-process RDMA workloads under VRAM pressure (items 1 and 2), with two supporting changes (items 3 and 4):

  1. TTM gtt_window_lock live-lock (amdgpu_ttm.c): Replace mutex_lock() with mutex_lock_interruptible() in amdgpu_ttm_copy_mem_to_mem() to prevent permanent D2H hang.
  2. Unbounded RDMA BO pinning (amdgpu_amdkfd_gpuvm.c, amdgpu_amdkfd.h): Add dedicated rdma_pinned_bytes counter and enforce dmabuf_pin_max_mb to cap RDMA-pinned VRAM per GPU.
  3. PeerDirect error propagation (kfd_peerdirect.c): Distinguish quota rejections (-ENOSPC) from other pin errors.
  4. Debug parameter (amdgpu_drv.c, amdgpu.h): Add rdma_pin_debug module parameter.

Technical Details

Root Cause

  • amdgpu_ttm_copy_mem_to_mem() uses non-interruptible mutex_lock() on gtt_window_lock. When processes deadlock during eviction restore, neither can back off → permanent hang after 30-40 min.
  • amdgpu_amdkfd_gpuvm_pin_bo() has no RDMA-specific accounting or limit → unbounded pin growth starves other processes.

Test Plan

  • Test 1 (dmabuf_pin_max_mb=0): 0 hangs, 0 stalls, 15.7M D2H ops
  • Test 2 (dmabuf_pin_max_mb=512): RDMA pins capped at 512 MB, 126 excess rejected with -ENOSPC, 0 hangs

Test Result

Submission Checklist

…mem_to_mem

When multiple processes trigger KFD BO eviction and subsequent restore
simultaneously, the non-interruptible mutex_lock() on gtt_window_lock in
amdgpu_ttm_copy_mem_to_mem() can cause a live-lock: Process A holds the
lock waiting for Process B to release a BO, while Process B waits for
the same lock. Since mutex_lock() is not interruptible, neither process
can back off, resulting in a permanent D2H hang.

This was observed on MI300X systems running multi-process RDMA workloads
under VRAM pressure, where the hang typically occurs after 30-40 minutes
of sustained operation.

Replace mutex_lock() with mutex_lock_interruptible() so the wait can be
interrupted by signals, returning -ERESTARTSYS to allow the TTM
subsystem to retry or abort gracefully.
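The back-off pattern described above can be sketched as a small userspace model. This is an illustration, not the patch itself: `model_mutex`, `model_signal_pending`, and `model_copy_mem_to_mem()` are hypothetical stand-ins for the kernel primitives. The model assumes the common kernel idiom in which a caller maps the `-EINTR` returned by `mutex_lock_interruptible()` to `-ERESTARTSYS`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* ERESTARTSYS is kernel-internal and not exported by userspace <errno.h>;
 * 512 is the value the kernel uses. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Toy stand-in for struct mutex: single-threaded, illustration only. */
struct model_mutex {
    bool locked;
};

/* Hypothetical flag standing in for "a signal is pending for this task". */
static bool model_signal_pending;

/*
 * Models mutex_lock_interruptible(): 0 when the lock is taken, -EINTR when
 * a signal ends the wait. -EBUSY is a model-only result for "still
 * contended, no signal"; the real call would keep sleeping instead.
 */
static int model_mutex_lock_interruptible(struct model_mutex *m)
{
    if (m->locked)
        return model_signal_pending ? -EINTR : -EBUSY;
    m->locked = true;
    return 0;
}

static void model_mutex_unlock(struct model_mutex *m)
{
    assert(m->locked);
    m->locked = false;
}

/* Sketch of a copy path that backs off instead of waiting forever. */
static int model_copy_mem_to_mem(struct model_mutex *gtt_window_lock)
{
    int r = model_mutex_lock_interruptible(gtt_window_lock);

    if (r == -EINTR)
        return -ERESTARTSYS; /* assumed mapping, a common kernel idiom */
    if (r)
        return r;

    /* ... map the GTT windows and perform the copy here ... */

    model_mutex_unlock(gtt_window_lock);
    return 0;
}
```

In the model, an uncontended call returns 0 and releases the lock, while a contended call with a pending signal returns `-ERESTARTSYS` without touching the lock, leaving the caller free to retry or abort.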

Tested on MI300X/MI308X (gfx942) with 8 GPUs under extreme VRAM
pressure (192/196 GB utilized) for 40+ minutes with zero hangs.

Signed-off-by: Chun Wan <chun-wan@amd.com>

RDMA PeerDirect operations pin GPU buffer objects in VRAM, making them
ineligible for eviction. Without any limit, a misbehaving or compromised
RDMA peer can pin all available VRAM, starving other processes and
triggering cascading eviction failures that lead to system hangs.

Add a dedicated atomic64_t rdma_pinned_bytes counter in amdgpu_kfd_dev
to track RDMA-pinned VRAM independently from the general vram_pinned
counter. In amdgpu_amdkfd_gpuvm_pin_bo(), enforce the existing
dmabuf_pin_max_mb module parameter using atomic64_add_return() for
race-free accounting. If the total would exceed the configured limit,
roll back and return -ENOSPC.
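The add-then-check accounting described above can be sketched in userspace with C11 atomics. This is a minimal model, not the driver code: `model_kfd_dev`, `model_rdma_pin_account()`, and `model_rdma_unpin_account()` are hypothetical names. `atomic_fetch_add()` plus the added size stands in for the kernel's `atomic64_add_return()`, and a limit of 0 is treated as "no limit", as in the test plan.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-device state holding the counter. */
struct model_kfd_dev {
    _Atomic int64_t rdma_pinned_bytes;
};

/*
 * Reserve `size` bytes of RDMA pin quota.
 * limit_mb == 0 means "no limit" (assumption based on the test plan).
 * Returns 0 on success or -ENOSPC if the new total would exceed the limit.
 */
static int model_rdma_pin_account(struct model_kfd_dev *kfd, int64_t size,
                                  uint64_t limit_mb)
{
    assert(size >= 0);

    int64_t limit = (int64_t)limit_mb << 20; /* MB to bytes */

    /* fetch_add returns the old value; adding size yields the new total,
     * which is what atomic64_add_return() reports in the kernel. */
    int64_t total = atomic_fetch_add(&kfd->rdma_pinned_bytes, size) + size;

    if (limit_mb && total > limit) {
        /* Over quota: undo our own reservation and reject the pin. */
        atomic_fetch_sub(&kfd->rdma_pinned_bytes, size);
        return -ENOSPC;
    }
    return 0;
}

/* Release quota previously reserved by model_rdma_pin_account(). */
static void model_rdma_unpin_account(struct model_kfd_dev *kfd, int64_t size)
{
    atomic_fetch_sub(&kfd->rdma_pinned_bytes, size);
}
```

Because each caller adds first and inspects only the total returned by its own addition, two concurrent pins cannot both pass a check against a stale value. The cost is that the counter can briefly overshoot the limit before the rollback. Feeding the model 128 requests of 256 MB each (32 GB / 128) with a 512 MB limit accepts 2 and rejects 126, which matches the figures reported for the real driver.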

Also improve PeerDirect error logging in kfd_peerdirect.c to distinguish
quota rejections (-ENOSPC) from other pin errors, and add a
rdma_pin_debug module parameter for optional runtime logging.
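The distinction between quota rejections and other pin failures might look like the following sketch. The enum and `model_classify_pin_result()` are hypothetical names chosen for illustration; the actual log messages and the behavior of `rdma_pin_debug` are not shown here because the description does not spell them out.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcome categories for a PeerDirect pin attempt. */
enum model_pin_outcome {
    MODEL_PIN_OK,             /* pin succeeded */
    MODEL_PIN_QUOTA_REJECTED, /* -ENOSPC: RDMA pin limit reached */
    MODEL_PIN_FAILED          /* any other error */
};

static enum model_pin_outcome model_classify_pin_result(int err)
{
    assert(err <= 0); /* kernel-style results are 0 or a negative errno */

    if (err == 0)
        return MODEL_PIN_OK;
    if (err == -ENOSPC)
        return MODEL_PIN_QUOTA_REJECTED;
    return MODEL_PIN_FAILED;
}
```

Keeping the quota case separate lets an operator tell an expected limit rejection from a genuine pin failure when reading dmesg.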

Tested on MI300X/MI308X with 128 RDMA pin attempts (32 GB total):
- Without limit (dmabuf_pin_max_mb=0): 120 pins succeed (30 GB pinned)
- With limit (dmabuf_pin_max_mb=512): only 2 pins succeed (512 MB),
  126 correctly rejected with -ENOSPC in dmesg
- No hangs or GPU resets in either configuration over 40 minutes

Signed-off-by: Chun Wan <chun-wan@amd.com>

amdgpu_amdkfd_gpuvm_pin_bo() increments rdma_pinned_bytes before
amdgpu_bo_pin() when the domain includes VRAM. If pinning succeeds but
the buffer ends up outside VRAM, unpin_bo() never subtracts from
rdma_pinned_bytes (it only does so for TTM_PL_VRAM), leaking quota.

Roll back the pre-accounted bytes in that case.
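The leak and its fix can be illustrated with a userspace model. All names here (`model_bo`, `model_pin_bo()`, `model_unpin_bo()`, the `MODEL_PL_*` placements) are hypothetical. The `landed` argument stands in for wherever a successful `amdgpu_bo_pin()` actually placed the buffer, and limit enforcement is left out to keep the focus on the rollback.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical placements, loosely mirroring TTM_PL_SYSTEM/TT/VRAM. */
enum model_placement { MODEL_PL_SYSTEM, MODEL_PL_GTT, MODEL_PL_VRAM };

struct model_bo {
    int64_t size;
    enum model_placement placement;
};

/* Model of the rdma_pinned_bytes counter (zero-initialized). */
static _Atomic int64_t model_rdma_pinned_bytes;

/*
 * domain_has_vram: the requested pin domain includes VRAM.
 * landed: where the (stubbed) successful pin placed the buffer.
 */
static void model_pin_bo(struct model_bo *bo, bool domain_has_vram,
                         enum model_placement landed)
{
    assert(bo->size >= 0);

    if (domain_has_vram) /* pre-account before pinning */
        atomic_fetch_add(&model_rdma_pinned_bytes, bo->size);

    bo->placement = landed; /* stand-in for a successful pin */

    /* The fix: the buffer did not end up in VRAM, and unpin will never
     * subtract for it, so return the pre-accounted bytes now. */
    if (domain_has_vram && bo->placement != MODEL_PL_VRAM)
        atomic_fetch_sub(&model_rdma_pinned_bytes, bo->size);
}

/* Unpin only subtracts for buffers that are actually in VRAM. */
static void model_unpin_bo(struct model_bo *bo)
{
    if (bo->placement == MODEL_PL_VRAM)
        atomic_fetch_sub(&model_rdma_pinned_bytes, bo->size);
}
```

With the rollback in place, the counter returns to zero after a pin and unpin cycle regardless of where the buffer landed. Without it, every pin that landed outside VRAM would permanently consume quota.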