[release/2.12] Fix reentrant deadlock in torch.cuda._lazy_call by iupaikov-amd · Pull Request #3212 · ROCm/pytorch

iupaikov-amd · 2026-05-11T15:09:46Z

This fixes a deadlock with torch.compile when running larger models in the MAD engine.

This is a cherry-pick of an upstream PR; do not merge before it has landed: pytorch#182948


…3180)

<h2>Fix MIOpen CTC loss access violation on Windows discrete GPUs</h2>

<h3>Problem</h3>
<p>A unit test on Windows started failing a couple of weeks ago. A missing <code>#include</code> was added in pytorch#178284, but CI on TheRock kept failing. That fix was tested on gfx1151 (APU), where the test passed, but CI showed failures on gfx1100.</p>
<p><code>test_CTCLoss_no_batch_dim</code> (and any code path hitting <code>miopen_ctc_loss</code>) crashes with a fatal access violation on Windows systems with discrete AMD GPUs:</p>
<pre><code>Windows fatal exception: access violation
Exception Code: 0xC0000005
#0 miopen::CTCLossDescriptor::GetCTCLossWorkspaceSize (MIOpen.dll+0x14fde4)
#1 miopenGetCTCLossWorkspaceSize (MIOpen.dll+0x150912)
#2 at::native::miopen_ctc_loss (torch_hip.dll)
</code></pre>

<h3>Root Cause</h3>
<p><code>miopenGetCTCLossWorkspaceSize</code> and <code>miopenCTCLoss</code> read the <code>labels</code>, <code>label_lengths</code>, and <code>input_lengths</code> arrays <strong>on the host side</strong> to plan the computation and calculate workspace requirements. The existing code copies these arrays to GPU memory and passes device pointers:</p>
<pre><code>Tensor labels_gpu = targets_t.to(Device(at::kCUDA), at::kInt);
// ... hipMemcpy to GPU ...
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_gpu.data_ptr&lt;int&gt;(),        // device pointer
    label_lengths_gpu.data_ptr&lt;int&gt;(), // device pointer
    input_lengths_gpu.data_ptr&lt;int&gt;()  // device pointer
));
</code></pre>
<p>This works on:</p>
<ul>
<li><strong>Linux</strong>: HSA (Heterogeneous System Architecture) maps GPU allocations into the process virtual address space, making device pointers host-readable</li>
<li><strong>Windows APUs</strong>: CPU and iGPU share system RAM, so device pointers point to host-accessible memory</li>
</ul>
<p>This crashes on:</p>
<ul>
<li><strong>Windows dGPUs</strong>: the GPU has dedicated VRAM across PCIe; device pointers are opaque handles that cannot be dereferenced from host code</li>
</ul>

<h3>Verification</h3>
<p>Tested on gfx1201:</p>
<table border="1" cellpadding="6" cellspacing="0">
<tr><th>Check</th><th>Result</th></tr>
<tr><td><code>hipDeviceAttributeIntegrated</code></td><td><code>0</code> (discrete GPU)</td></tr>
<tr><td><code>hipDeviceAttributeCanUseHostPointerForRegisteredMem</code></td><td><code>0</code></td></tr>
<tr><td><code>hipDeviceAttributeManagedMemory</code></td><td><code>0x7FFFFFFF</code> (unsupported)</td></tr>
<tr><td><code>hipDeviceAttributeUnifiedAddressing</code></td><td><code>0x7FFFFFFF</code> (unsupported)</td></tr>
<tr><td>Host read of <code>hipMalloc</code> pointer via <code>ctypes</code></td><td>Access violation</td></tr>
<tr><td>CTC loss with CPU pointers</td><td>Pass (forward + backward)</td></tr>
</table>

<h3>Fix</h3>
<p>Pass host pointers, since that is what MIOpen expects.</p>

<h3>Testing</h3>
<p>Run all existing CTCLoss unit tests.</p>

Pull Request resolved: pytorch#179264
Approved by: https://github.com/jeffdaily
Co-authored-by: Milica Stankovic <mstankov@amd.com>


rocm-repo-management-api · 2026-05-11T15:22:41Z

Jenkins build for 5ce1838b180a53c807b801f2eaae6fc3d70e7a4f commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

jithunnair-amd and others added 8 commits April 19, 2026 21:19

Update version to 2.12.0

476c6a8

Add related commits with just pytorch and torchvision commits; torcha…

291f197

…udio release/2.12 branch is not available upstream atm

Merge branch 'pytorch:release/2.12' into release/2.12

8652556

Add torchaudio commit in related_commits

a7d6c2f

Use latest main commit for now, since no release/2.12 branch exists for torchaudio

Added a guard in _lazy_call to avoid deadlocks

a34c59a

Applied linter suggestions

56e8ae7

Refactored the test

5ce1838

iupaikov-amd changed the title ~~Iupaikov lazy call deadlock fix release2.12~~ [release/2.12] Fix reentrant deadlock in torch.cuda._lazy_call May 11, 2026

jithunnair-amd force-pushed the release/2.12 branch from f0dff63 to 5379cbe Compare May 15, 2026 17:49

iupaikov-amd wants to merge 8 commits into release/2.12 from iupaikov_lazy_call_deadlock_fix_release2.12

3 participants
