[Bug] Mode B (kernelType=AICPU_CUSTOM): cust_aicpu_sd subprocess cache stale on AICore HBM writes → AICPU handshake deadlock (507018) #822

@hw-native-sys-bot

Platform

a2a3 (Ascend 910B/C hardware)

Runtime Variant

tensormap_and_ringbuffer

Description

PR #537 migrates the AICPU dispatcher from CANN 6.x RuntimeAicpuKernelLaunchExWithArgs (path A) to CANN 7.0+ rtsBinaryLoadFromFile + rtsLaunchCpuKernel (path B). The motivation is to enable a single host process binding both host_build_graph and tensormap_and_ringbuffer runtimes — path A is blocked by a process-wide one-shot firstCreatSo_ latch inside CANN preinstalled libaicpu_processer.so::BackendServerHandleManager::SaveSoFile, which makes loading a second runtime's inner SO a silent no-op.

In path B, the JSON descriptor uses opKernelLib=AICPUKernel + userDefined="True", which CANN routes to KERNEL_TYPE_AICPU_CUSTOM (4) (cann/runtime/src/runtime/core/src/kernel/program_common.cc). Per ae_so_manager.cc::GetSoPath, KERNEL_TYPE_AICPU_CUSTOM is the only path that searches /home/CustAiCpuUser/cust_aicpu_<dev>_<vf>_<pid>/ (where our uploaded SO actually lands); all other types search /usr/lib64/aicpu_kernels/... which is root-owned and unwritable from a user process. A gate at ae_so_manager.cc:514 (IsCustAicpuSd()) also enforces that KERNEL_TYPE_AICPU_CUSTOM MUST execute inside the aicpu_custom_scheduler subprocess.
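The routing described above can be summarized in a small model. This is only a sketch of the behavior as we read it from the CANN sources: the helper name `so_search_root` and its parameters are hypothetical, and only the kernel type value 4 and the two directory prefixes come from this issue.

```cpp
#include <string>

// Value quoted in this issue for KERNEL_TYPE_AICPU_CUSTOM.
constexpr int kKernelTypeAicpuCustom = 4;

// Hypothetical helper (not a CANN API): returns the directory that, per our
// reading of ae_so_manager.cc::GetSoPath, would be searched for the kernel SO.
std::string so_search_root(int kernel_type, int dev, int vf, long pid) {
    if (kernel_type == kKernelTypeAicpuCustom) {
        // Per-process cust directory where the uploaded SO lands.
        return "/home/CustAiCpuUser/cust_aicpu_" + std::to_string(dev) + "_" +
               std::to_string(vf) + "_" + std::to_string(pid) + "/";
    }
    // All other kernel types: root-owned prefix. The remainder of the path
    // is elided in this issue, so only the prefix is returned here.
    return "/usr/lib64/aicpu_kernels/";
}
```

Only the custom type resolves to a location a user process can populate, which is why path B has to go through the cust subprocess.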

Everything routes through correctly: CANN dispatches our Dyn* exports to the cust subprocess, our libsimpler_aicpu_<runtime>.so is dlopen'd, three phases (Null/Init/Run) all reach our code, and SchedulerContext::handshake_all_cores step 1 writes complete to all 9 cores' Handshake slots in shared HBM. AICPU writes are visible to host (verified via aclrtMemcpy DEVICE_TO_HOST readback). AICore stream dispatches, runs past its phase 1, writes aicore_regs_ready=1 back to HBM (also confirmed via host D2H).

The bug: the cust AICPU's L1 cache holds a stale 0 for the aicore_regs_ready field, even though HBM and host both see 1. SchedulerContext::handshake_all_cores step 2 spin loop never observes the update:

// src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
while (hank->aicore_regs_ready == 0) {}   // ← cust AICPU stuck here, HBM has 1

After 2 s the host aclrtSynchronizeStreamWithTimeout(stream_aicpu_) reports ACL_ERROR_RT_AICPU_EXCEPTION (507018). Mode A (path A) does not exhibit the bug because the main aicpu_scheduler shares a cache coherency domain with AICore; the cust subprocess gets bound (SetAffinity) to a different AICPU cluster whose L1 is not snooped by AICore HBM writes.
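For debugging, the unbounded spin can be given a deadline so the hang is reported from the device side instead of only surfacing as a host timeout. The sketch below is an assumption-laden illustration: `spin_until_set` is a hypothetical helper, not code from this repo, and it does not fix the coherency problem (a stale L1 line still reads 0 until the deadline).

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical diagnostic helper: spin until *flag becomes non-zero or the
// timeout elapses. Returns true if the flag was observed, false on timeout.
inline bool spin_until_set(const uint32_t* flag,
                           std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    // Acquire load: orders later reads after the flag read, but can still be
    // served from the local cache.
    while (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 0) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // caller can log a marker and bail out
        }
    }
    return true;
}
```

A caller could log the core index and phase when this returns false, which would make the stuck handshake step visible in the device log.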

User-space workarounds attempted (all fail):

| Attempt | Result |
| --- | --- |
| volatile uint32_t field qualifier | No effect: prevents register reuse, not L1 caching |
| __atomic_load_n(..., __ATOMIC_ACQUIRE) (→ ldar) | No effect: only an ordering instruction, still reads L1 |
| dc civac (clean + invalidate) in spin loop | Worse: the same cache line co-hosts AICPU-written aicpu_ready/task and AICore-written aicore_regs_ready; civac writes back AICPU's dirty stale view, clobbering AICore's HBM writes |
| dc ivac (invalidate-only) in spin loop | Silently NOP'd from EL0 (SCTLR_EL1.UCI=0 in kernel) |
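For reference, the first three attempts look roughly like the following. This is a reconstruction for illustration, not the exact code we ran: the helper names are hypothetical, the inline assembly is compiled only on AArch64, and the dc ivac variant is omitted.

```cpp
#include <cstdint>

// Attempt 1: volatile access. Forces a load instruction on every iteration,
// but that load can still be satisfied from the core's L1.
inline uint32_t load_volatile(const uint32_t* p) {
    return *static_cast<const volatile uint32_t*>(p);
}

// Attempt 2: acquire load (ldar on AArch64). Adds ordering only; it does not
// bypass or refresh the cache.
inline uint32_t load_acquire(const uint32_t* p) {
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}

// Attempt 3: clean + invalidate the line holding *p, then reload.
// If the line is dirty locally, the clean step writes the local copy back
// first, which is how AICore's update got clobbered in our case.
inline uint32_t load_after_civac(const uint32_t* p) {
#if defined(__aarch64__)
    asm volatile("dc civac, %0\n\tdsb sy" : : "r"(p) : "memory");
#endif
    return load_volatile(p);
}
```

On a coherent system all three return the same value; the difference only shows up when the reader's cache is outside the writer's coherency domain, as described above.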

Steps to Reproduce

  1. Apply the mode B refactor from PR #537 (Feat: AICPU launch via dispatcher bootstrap and per-task rtsLaunchCpuKernel): a single SO libsimpler_aicpu_<runtime>.so with merged outer dispatcher + inner runtime kernels; JSON opKernelLib=AICPUKernel + userDefined=True for all three phases (Null/Init/Run); orch SO candidate_dirs[] adds /home/CustAiCpuUser as first entry.
  2. Run the simplest a2a3 onboard example:
    python3 -m venv --system-site-packages .venv
    source .venv/bin/activate
    pip install --no-build-isolation -e .
    python examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py -p a2a3
  3. Test fails with timeout. Optional D2H diagnostic in device_runner.cpp::run after the timeout shows HBM has the right values:
    if (rc == ACL_ERROR_RT_STREAM_SYNC_TIMEOUT || rc == 507018) {
        Handshake h0 = {};
        aclrtMemcpy(&h0, sizeof(h0),
                    reinterpret_cast<const uint8_t*>(kernel_args_.args.runtime_args) + offsetof(Runtime, workers),
                    sizeof(h0), ACL_MEMCPY_DEVICE_TO_HOST);
        LOG_ERROR("workers[0] readback: aicpu_ready=%u task=0x%lx aicore_regs_ready=%u",
                  h0.aicpu_ready, (uint64_t)h0.task, h0.aicore_regs_ready);
    }
    Output: workers[0] readback: aicpu_ready=1 task=0x... aicore_regs_ready=1 — both handshake bits set in HBM, both visible to host, but cust AICPU never sees the AICore-written aicore_regs_ready=1.

Expected Behavior

tensormap_and_ringbuffer vector example passes end-to-end with kernelType=AICPU_CUSTOM (4) routing. Multi-runtime in a single host process becomes possible (one ChipWorker process binds both host_build_graph and tensormap_and_ringbuffer), which is the entire motivation for path B over path A.

Actual Behavior

=== Runtime: tensormap_and_ringbuffer  Level: 2 ===
  TestVectorExample::default ... [ERROR] run: [device_runner.cpp:797] aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
FAILED: run_prepared failed with code 507018

Device log shows our dispatcher and inner kernel running cleanly in the cust subprocess; the deadlock comes purely from the cust AICPU reading a stale cached copy of AICore-written HBM data. The bug is not in user-space code and cannot be fixed there: all four user-space workarounds listed above fail.

Possible fix directions

(Listed by where the change has to live — none can be done purely in this repo's user-space.)

| # | Where | Change |
| --- | --- | --- |
| A | CANN device kernel / driver | Enable EL0 dc ivac (set SCTLR_EL1.UCI=1 for the aicpu_custom_scheduler process). Smallest change; user-side spin loops can then explicitly invalidate. |
| B | CANN runtime / driver | Allocate Handshake HBM (runtime->workers) with non-cacheable / write-through attribute when called from aicpu_custom_scheduler context. Slight per-access HBM latency cost. |
| C | CANN cust scheduler | Pin aicpu_custom_scheduler worker threads to the same AICPU cluster as AICore's snoop domain (today aicpusd_worker.cpp::SetAffinity binds them to a separate cpuId=0). |
| D | simpler runtime (this repo) | Split Handshake so AICPU-written and AICore-written fields live on disjoint cache lines, then make AICPU spin loops dc civac only the AICore-written line. EL0 invalidate semantics still apply, so this only works in combination with A or B. Alternatively, replace the spin-wait protocol with a device event/notify primitive that bypasses shared-memory polling entirely (substantial runtime refactor). |

A/B/C are CANN-side; D is user-side but on its own is insufficient. We'd appreciate guidance from the CANN team on whether A or B is feasible for the cust subprocess in a near-term release.
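The layout half of direction D could look like the sketch below. It is a minimal illustration under stated assumptions: a 64-byte cache line, field types guessed from the readback log format (`%u` for the flags, a 64-bit `task`), and hypothetical struct names. The real Handshake has more fields, and as noted this layout alone does not resolve the stale read.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte lines on AICPU

// Fields written only by AICPU (types are assumptions).
struct alignas(kCacheLine) AicpuWritten {
    uint32_t aicpu_ready;
    uint64_t task;
};

// Fields written only by AICore, on their own cache line.
struct alignas(kCacheLine) AicoreWritten {
    uint32_t aicore_regs_ready;
};

// Hypothetical split layout: each writer owns a disjoint line, so a
// clean + invalidate on from_aicore never writes back AICPU-dirty data.
struct SplitHandshake {
    AicpuWritten  from_aicpu;
    AicoreWritten from_aicore;
};

static_assert(offsetof(SplitHandshake, from_aicore) % kCacheLine == 0,
              "AICore-written fields must start on a cache-line boundary");
static_assert(sizeof(SplitHandshake) == 2 * kCacheLine,
              "each writer occupies exactly one cache line");
```

With this split, the AICPU never dirties the AICore-written line, so the clobbering seen with dc civac should not occur; whether the invalidate itself takes effect from EL0 still depends on A or B.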

Git Commit ID

ec7363a2eb6fbed4d71f848e1532dbcef7adc6c8

CANN Version

26.0.rc1 (V100R001C10SPC001B257)

Driver Version

26.0.rc1 (ascendhal 7.35.23)

Host Platform

Linux (aarch64)

Additional Context

Related open issues (different surface, may share root once A/B/C lands):

Concrete D2H diagnostic + per-iteration markers + cust subprocess routing analysis lived in the long debug session that produced this issue; the architecture diagram and CANN source pointers are in .docs/ISSUE-mode-b-cache-coherency.md of the PR #537 worktree. CANN open-source references:

  • cann/runtime/src/runtime/core/src/kernel/program_common.cc: opKernelLib to kernelType table
  • cann/runtime/src/aicpu_sched/aicpu_processer/ae_so_manager.cc: GetSoPath, LoadSo cust-vs-inner routing and the IsCustAicpuSd gate
  • cann/runtime/src/aicpu_sched/aicpu_schedule/core/aicpusd_worker.cpp: SetAffinity binding worker threads to specific AICPU cores
  • cann/runtime/src/aicpu_sched/aicpu_schedule/core/aicpusd_cust_so_manager.cpp: cust SO upload to /home/CustAiCpuUser/cust_aicpu_*/
