[Issue]: hipGraphAddMemcpyNode1D dependency not enforced #3882

@kaschau

Problem Description

hipGraphAddMemcpyNode1D (device-to-device) does not enforce its declared dependencies on the first launch after hipGraphInstantiate. Dependent kernel nodes execute before the memcpy completes, reading stale data from the destination buffer. Subsequent graph launches (replays) execute correctly.

The bug is triggered when the graph contains:

  1. A D2D hipGraphAddMemcpyNode1D node with a kernel node dependency
  2. A kernel node that depends on the D2D memcpy node (writes to the same destination buffer)
  3. Concurrent D2H hipGraphAddMemcpyNode1D nodes with hipGraphAddEventRecordNode nodes in the same graph
  4. Large memcpy sizes (~100–200 MB)

Removing condition (3) — the concurrent D2H memcpy + event record chains — prevents the bug from manifesting. Replacing the D2D hipGraphAddMemcpyNode1D with an equivalent kernel-based copy also works around the issue.

Operating System

Red Hat Enterprise Linux 9.7 (Plow)

CPU

AMD EPYC 7282 16-Core Processor

GPU

AMD Instinct MI100 (gfx908, amdgcn-amd-amdhsa--gfx908:sramecc+:xnack-)

ROCm Version

7.0.0

ROCm Component

No response

Steps to Reproduce

Reproducer

Attached: hip_graph_memcpy_bug.cpp

hipcc -O2 -o hip_graph_memcpy_bug hip_graph_memcpy_bug.cpp
./hip_graph_memcpy_bug

Expected output: PASS: all 500 trials correct
Actual output:

FAIL trial   0: 14025072 / 28167360 elements wrong

FAILED: 1 / 500 trials had errors

The failure is deterministic on trial 0 (first launch after instantiation). All subsequent replays pass. Running with NTRIALS=1 reproduces the failure 100% of the time.

Graph structure

The reproducer builds a graph that mirrors the gradient computation of a real CFD solver (PyFR):

KERNEL A1 (no deps)     KERNEL A2 (no deps)
    |                       |
    |  +-- PACK_KERNEL -----+---> D2H MEMCPY ---> EVENT_RECORD  (x9)
    |  |
    v  v
  D2D MEMCPY B1           D2D MEMCPY B2
  (112 MB)                (204 MB)
    |                       |
    +-------+-------+-------+
            |       |
            v       v
         KERNEL C (x3, writes to same buffer as B1)
                    |
                    v
              CHECK KERNEL

KERNEL D1 (no deps)     KERNEL D2 (no deps)

  • A1, A2: Fill source buffers (like sparse matrix-vector multiply)
  • B1, B2: D2D memcpy, depending on A1/A2 respectively
  • 9x pack chains: pack kernel → D2H memcpy → event record (depending on A1+A2)
  • C: Overwrites first half of B1's destination buffer (depending on both B1 and B2)
  • D1, D2: Independent heavy kernels with no dependencies
  • Check: Verifies B1's destination — first half should match C's writes, second half should match B1's memcpy
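The dependencies in the list above can be written down as a small host-side DAG to make the required ordering explicit. This is a plain C++ sketch, not HIP code and not part of the attached reproducer. The node names follow the list, a single PACK/D2H/EVENT chain stands in for the nine pack chains, the CHECK → C edge is taken from the diagram, and `build_graph` and `depends_on` are hypothetical helper names.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Each node maps to the list of nodes it directly depends on.
using Graph = std::map<std::string, std::vector<std::string>>;

// Dependency structure as described in the issue (one pack chain shown).
Graph build_graph() {
    return {
        {"A1", {}},
        {"A2", {}},
        {"B1", {"A1"}},            // D2D memcpy, depends on A1
        {"B2", {"A2"}},            // D2D memcpy, depends on A2
        {"PACK", {"A1", "A2"}},    // pack kernel
        {"D2H", {"PACK"}},         // D2H memcpy
        {"EVENT", {"D2H"}},        // event record
        {"C", {"B1", "B2"}},       // overwrites first half of B1's destination
        {"CHECK", {"C"}},          // verifies B1's destination
        {"D1", {}},
        {"D2", {}},
    };
}

// Returns true if `node` transitively depends on `target`.
bool depends_on(const Graph& g, const std::string& node,
                const std::string& target) {
    assert(g.count(node) == 1);  // every referenced node must be declared
    for (const auto& dep : g.at(node)) {
        if (dep == target || depends_on(g, dep, target)) return true;
    }
    return false;
}
```

In this model `depends_on(build_graph(), "C", "B1")` is true, so any valid execution must finish B1 before C starts, while `depends_on(build_graph(), "C", "D2H")` is false, so the D2H chains are free to run concurrently with B1 and C.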

The failure shows ~50% of elements wrong, which is consistent with the D2D memcpy (B1) executing after kernel C: the memcpy overwrites C's output in the first half of the destination buffer, violating the declared dependency of C on B1.
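The link between execution order and the error count can be illustrated with a small host-side model. This is a sketch under stated assumptions, not the check used in the attached reproducer: the marker values, the `float` element type, and all function names are hypothetical, and the memcpy and kernel are replaced by simple loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical marker values, chosen only so the two writers are distinguishable.
constexpr float SRC_VALUE = 1.0f;  // what B1 copies into the destination
constexpr float C_VALUE = 2.0f;    // what kernel C writes

// Stand-in for D2D memcpy B1: fills the whole destination buffer.
void run_b1(std::vector<float>& dst) {
    for (float& x : dst) x = SRC_VALUE;
}

// Stand-in for kernel C: overwrites the first half of the destination.
void run_c(std::vector<float>& dst) {
    for (std::size_t i = 0; i < dst.size() / 2; ++i) dst[i] = C_VALUE;
}

// Stand-in for the check kernel: first half must hold C's writes,
// second half must hold B1's copy. Returns the number of mismatches.
std::size_t count_wrong(const std::vector<float>& dst) {
    std::size_t wrong = 0;
    for (std::size_t i = 0; i < dst.size(); ++i) {
        const float expected = (i < dst.size() / 2) ? C_VALUE : SRC_VALUE;
        if (dst[i] != expected) ++wrong;
    }
    return wrong;
}

// Runs B1 and C in the given order on an n-element buffer and checks it.
std::size_t wrong_after(bool b1_first, std::size_t n) {
    assert(n > 0);
    std::vector<float> dst(n, 0.0f);
    if (b1_first) { run_b1(dst); run_c(dst); }  // declared order
    else          { run_c(dst); run_b1(dst); }  // order violating the dependency
    return count_wrong(dst);
}
```

With the declared order, `wrong_after(true, 1000)` returns 0; with the order reversed, `wrong_after(false, 1000)` returns 500, exactly half. The reported count (14025072 of 28167360, about 49.8%) is slightly below an exact half, a difference this idealized model does not account for.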

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module version 6.12.12 is loaded

HSA System Attributes

Runtime Version: 1.18
Runtime Ext Version: 1.11
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
XNACK enabled: NO
DMAbuf Support: YES
VMM Support: YES

Additional Information

hip_graph_memcpy_bug.cpp
