[release/2.12] Fix reentrant deadlock in torch.cuda._lazy_call#3212

Draft
iupaikov-amd wants to merge 8 commits into release/2.12 from iupaikov_lazy_call_deadlock_fix_release2.12

iupaikov-amd commented May 11, 2026

This fixes a deadlock with torch.compile when running larger models in the MAD engine.

This is a cherry-pick of an upstream PR; do not merge until it has landed: pytorch#182948
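
For readers unfamiliar with the term, here is a generic illustration of a reentrant deadlock. It is not the code changed by this PR or by the upstream fix, which the description above does not show. The sketch assumes only the standard-library `threading` module; the helper name `reentrant_call` and the 0.1 s timeout are hypothetical and exist only so the example terminates instead of hanging.

```python
import threading


def reentrant_call(lock, timeout=0.1):
    """Hold `lock`, then try to take it again from the same thread.

    Returns True if the nested acquisition succeeded, and False if it
    timed out. Without the timeout, the False case would block forever,
    which is the reentrant deadlock.
    """
    with lock:
        acquired = lock.acquire(timeout=timeout)
        if acquired:
            lock.release()
        return acquired


# A plain Lock cannot be re-acquired by the thread that already holds it.
plain_ok = reentrant_call(threading.Lock())

# An RLock tracks its owner and allows nested acquisition by that owner.
rlock_ok = reentrant_call(threading.RLock())

print(plain_ok, rlock_ok)
```

Switching to a reentrant lock is only one common remedy; another is to restructure the code so the nested call never happens while the lock is held. Which approach the upstream PR takes is not stated here.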

jithunnair-amd and others added 8 commits April 19, 2026 21:19
…udio release/2.12 branch is not available upstream atm
…3180)

<h2>Fix MIOpen CTC loss access violation on Windows discrete GPUs</h2>

<h3>Problem</h3>

<p>A unit test started failing on Windows a couple of weeks ago. A
missing <code>#include</code> was added in pytorch#178284, but CI on
TheRock kept failing. That fix was tested on gfx1151 (APU), where the
test passed, but CI still showed failures on gfx1100.</p>

<p><code>test_CTCLoss_no_batch_dim</code> (and any code path hitting
<code>miopen_ctc_loss</code>) crashes with a fatal access violation on
Windows systems with discrete AMD GPUs:</p>

<pre><code>Windows fatal exception: access violation
Exception Code: 0xC0000005
#0 miopen::CTCLossDescriptor::GetCTCLossWorkspaceSize (MIOpen.dll+0x14fde4)
#1 miopenGetCTCLossWorkspaceSize (MIOpen.dll+0x150912)
#2 at::native::miopen_ctc_loss (torch_hip.dll)
</code></pre>

<h3>Root Cause</h3>

<p><code>miopenGetCTCLossWorkspaceSize</code> and
<code>miopenCTCLoss</code> read the <code>labels</code>,
<code>label_lengths</code>, and <code>input_lengths</code> arrays
<strong>on the host side</strong> to plan the computation and calculate
workspace requirements. The existing code copies these arrays to GPU
memory and passes device pointers:</p>

<pre><code>Tensor labels_gpu = targets_t.to(Device(at::kCUDA), at::kInt);
// ... hipMemcpy to GPU ...
MIOPEN_CHECK(miopenGetCTCLossWorkspaceSize(...,
    labels_gpu.data_ptr&lt;int&gt;(),          // device pointer
    label_lengths_gpu.data_ptr&lt;int&gt;(),   // device pointer
    input_lengths_gpu.data_ptr&lt;int&gt;()    // device pointer
));
</code></pre>

<p>This works on:</p>
<ul>
<li><strong>Linux</strong>: HSA (Heterogeneous System Architecture)
maps GPU allocations into the process virtual address space, making
device pointers host-readable</li>
<li><strong>Windows APUs</strong>: CPU and iGPU share system RAM, so
device pointers point to host-accessible memory</li>
</ul>

<p>This crashes on:</p>
<ul>
<li><strong>Windows dGPUs</strong>: the GPU has dedicated VRAM across
PCIe; device pointers are opaque handles that cannot be dereferenced
from host code</li>
</ul>

<h3>Verification</h3>

<p>Tested on gfx1201:</p>

<table border="1" cellpadding="6" cellspacing="0">
<tr><th>Check</th><th>Result</th></tr>
<tr><td><code>hipDeviceAttributeIntegrated</code></td><td><code>0</code> (discrete GPU)</td></tr>
<tr><td><code>hipDeviceAttributeCanUseHostPointerForRegisteredMem</code></td><td><code>0</code></td></tr>
<tr><td><code>hipDeviceAttributeManagedMemory</code></td><td><code>0x7FFFFFFF</code> (unsupported)</td></tr>
<tr><td><code>hipDeviceAttributeUnifiedAddressing</code></td><td><code>0x7FFFFFFF</code> (unsupported)</td></tr>
<tr><td>Host read of <code>hipMalloc</code> pointer via <code>ctypes</code></td><td>Access violation</td></tr>
<tr><td>CTC loss with CPU pointers</td><td>Pass (forward + backward)</td></tr>
</table>

<h3>Fix</h3>

<p>Pass host pointers for <code>labels</code>,
<code>label_lengths</code>, and <code>input_lengths</code>, since MIOpen
reads these arrays on the host.</p>

<h3>Testing</h3>

<p>Run all existing CTCLoss unit tests.</p>

Pull Request resolved: pytorch#179264
Approved by: https://github.com/jeffdaily

Co-authored-by: Milica Stankovic <mstankov@amd.com>
Use latest main commit for now, since no release/2.12 branch exists for torchaudio
iupaikov-amd changed the title from "Iupaikov lazy call deadlock fix release2.12" to "[release/2.12] Fix reentrant deadlock in torch.cuda._lazy_call" on May 11, 2026

rocm-repo-management-api Bot commented May 11, 2026

Jenkins build for 5ce1838b180a53c807b801f2eaae6fc3d70e7a4f commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
