PR #537 migrates the AICPU dispatcher from the CANN 6.x RuntimeAicpuKernelLaunchExWithArgs (path A) to the CANN 7.0+ rtsBinaryLoadFromFile + rtsLaunchCpuKernel (path B). The motivation is to let a single host process bind both the host_build_graph and tensormap_and_ringbuffer runtimes. Path A cannot do this: it is blocked by a process-wide one-shot firstCreatSo_ latch inside CANN's preinstalled libaicpu_processer.so::BackendServerHandleManager::SaveSoFile, which makes loading a second runtime's inner SO a silent no-op.
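To make the latch behaviour concrete, here is a minimal sketch of how a one-shot flag turns a second load into a silent no-op. This is an illustration only, not CANN's code: the class `SoStore` and its `save` method are hypothetical, and only the role of `firstCreatSo_` comes from the description above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a process-wide SO store guarded by a one-shot latch.
// NOT CANN's implementation; it only mirrors the behaviour described above.
class SoStore {
public:
    // Returns true if the SO was saved, false if the call was silently ignored.
    bool save(const std::string& so_name) {
        if (!first_create_) {
            return false;  // latch already tripped: later SOs are dropped
        }
        first_create_ = false;
        saved_.push_back(so_name);
        return true;
    }

    const std::vector<std::string>& saved() const { return saved_; }

private:
    bool first_create_ = true;  // plays the role of firstCreatSo_
    std::vector<std::string> saved_;
};
```

With one such store per process, the first runtime's inner SO is saved and the second runtime's call returns without doing anything, which matches the symptom that blocks path A.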
In path B, the JSON descriptor uses opKernelLib=AICPUKernel + userDefined="True", which CANN routes to KERNEL_TYPE_AICPU_CUSTOM (4) (cann/runtime/src/runtime/core/src/kernel/program_common.cc). Per ae_so_manager.cc::GetSoPath, KERNEL_TYPE_AICPU_CUSTOM is the only path that searches /home/CustAiCpuUser/cust_aicpu_<dev>_<vf>_<pid>/ (where our uploaded SO actually lands); all other types search /usr/lib64/aicpu_kernels/... which is root-owned and unwritable from a user process. A gate at ae_so_manager.cc:514 (IsCustAicpuSd()) also enforces that KERNEL_TYPE_AICPU_CUSTOM MUST execute inside the aicpu_custom_scheduler subprocess.
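As a rough illustration of the routing rule above, the sketch below builds the per-process custom directory and checks whether a kernel type would search it. The function names are hypothetical and the logic is a simplification of what the text attributes to GetSoPath; only the directory pattern and the value 4 come from the description.

```cpp
#include <string>

// Value quoted in the text for KERNEL_TYPE_AICPU_CUSTOM.
constexpr int kKernelTypeAicpuCustom = 4;

// Hypothetical helper: only the custom kernel type searches the cust directory.
bool searches_cust_dir(int kernel_type) {
    return kernel_type == kKernelTypeAicpuCustom;
}

// Hypothetical helper: builds /home/CustAiCpuUser/cust_aicpu_<dev>_<vf>_<pid>/.
std::string cust_so_dir(int dev, int vf, long pid) {
    return "/home/CustAiCpuUser/cust_aicpu_" + std::to_string(dev) + "_" +
           std::to_string(vf) + "_" + std::to_string(pid) + "/";
}
```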
Routing itself works correctly: CANN dispatches our Dyn* exports to the cust subprocess, our libsimpler_aicpu_<runtime>.so is dlopen'd, all three phases (Null/Init/Run) reach our code, and SchedulerContext::handshake_all_cores step 1 completes its writes to all 9 cores' Handshake slots in shared HBM. The AICPU writes are visible to the host (verified via aclrtMemcpy DEVICE_TO_HOST readback). The AICore stream dispatches, runs past its phase 1, and writes aicore_regs_ready=1 back to HBM (also confirmed via host D2H).
The bug: the cust AICPU's L1 cache holds a stale 0 for the aicore_regs_ready field, even though HBM and host both see 1. SchedulerContext::handshake_all_cores step 2 spin loop never observes the update:
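The loop itself is not reproduced in this text. A minimal sketch of that kind of polling loop, assuming a simplified `Handshake` layout, could look like the following; only the three field names come from this issue, while the field types, the function name `wait_regs_ready`, and the spin budget are assumptions.

```cpp
#include <cstdint>

// Assumed layout: only the field names are taken from the issue text.
struct Handshake {
    volatile uint32_t aicpu_ready;        // written by AICPU
    volatile uint64_t task;               // written by AICPU
    volatile uint32_t aicore_regs_ready;  // written by AICore
};

// Polls until AICore sets aicore_regs_ready or the spin budget runs out.
// Returns true if the flag was observed, false on timeout.
bool wait_regs_ready(const Handshake* hs, uint64_t max_spins) {
    for (uint64_t i = 0; i < max_spins; ++i) {
        if (hs->aicore_regs_ready != 0) {
            return true;
        }
    }
    return false;
}
```

On a coherent system this loop exits as soon as the writer's store becomes visible. In the failure described here, every read is served from the stale L1 line, so the loop runs until the timeout.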
After 2 s the host aclrtSynchronizeStreamWithTimeout(stream_aicpu_) reports ACL_ERROR_RT_AICPU_EXCEPTION (507018). Path A does not exhibit the bug because the main aicpu_scheduler shares a cache coherency domain with AICore; the cust subprocess gets bound (SetAffinity) to a different AICPU cluster whose L1 is not snooped by AICore HBM writes.
User-space workarounds attempted (all fail):
| Attempt | Result |
| --- | --- |
| volatile uint32_t field qualifier | No effect: prevents register reuse, not L1 caching |
| __atomic_load_n(..., __ATOMIC_ACQUIRE) (→ ldar) | No effect: only an ordering instruction, still reads L1 |
| dc civac (clean + invalidate) in spin loop | Worse: the same cache line co-hosts AICPU-written aicpu_ready/task and AICore-written aicore_regs_ready; civac writes back AICPU's dirty stale view, clobbering AICore's HBM writes |
| dc ivac (invalidate-only) in spin loop | Silently NOP'd from EL0 (SCTLR_EL1.UCI=0 in kernel) |
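The dc civac failure depends on the writer-owned and reader-owned fields sharing one cache line. A small sketch of that check, assuming a 64-byte line, a cache-line-aligned slot, and a hypothetical packed layout (only the field names are from this issue), is shown below together with the acquire-load variant from the table.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size

// Assumed packed layout; the real Handshake struct may differ.
struct PackedHandshake {
    uint32_t aicpu_ready;        // written by AICPU
    uint64_t task;               // written by AICPU
    uint32_t aicore_regs_ready;  // written by AICore
};

// True if two byte offsets fall within the same cache line of an aligned slot.
constexpr bool same_line(std::size_t a, std::size_t b) {
    return a / kCacheLine == b / kCacheLine;
}

// Both writers touch one line, so clean+invalidate can push stale data over
// the other side's update.
constexpr bool kFieldsShareLine =
    same_line(offsetof(PackedHandshake, aicpu_ready),
              offsetof(PackedHandshake, aicore_regs_ready));

// Acquire load via the GCC/Clang builtin: orders later accesses after this
// read, but does not force a refetch from memory.
inline uint32_t load_acquire(const uint32_t* p) {
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}
```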
Steps to Reproduce
Apply the PR #537 ("Feat: AICPU launch via dispatcher bootstrap and per-task rtsLaunchCpuKernel") mode B refactor: a single SO libsimpler_aicpu_<runtime>.so with merged outer dispatcher + inner runtime kernels; JSON opKernelLib=AICPUKernel + userDefined=True for all three phases (Null/Init/Run); orch SO candidate_dirs[] adds /home/CustAiCpuUser as first entry.
A readback in device_runner.cpp::run after the timeout shows HBM has the right values. Output: workers[0] readback: aicpu_ready=1 task=0x... aicore_regs_ready=1. Both handshake bits are set in HBM and visible to the host, but the cust AICPU never sees the AICore-written aicore_regs_ready=1.
Expected Behavior
tensormap_and_ringbuffer vector example passes end-to-end with kernelType=AICPU_CUSTOM (4) routing. Multi-runtime in a single host process becomes possible (one ChipWorker process binds both host_build_graph and tensormap_and_ringbuffer), which is the entire motivation for path B over path A.
Actual Behavior
The device log shows our dispatcher and inner kernel running cleanly in the cust subprocess; the deadlock is purely an AICPU-side stale cache read of AICore-written HBM data. The bug is not in user-space code and cannot be fixed there: all four user-space workarounds listed above fail.
Possible fix directions
(Listed by where the change has to live — none can be done purely in this repo's user-space.)
| # | Where | Change |
| --- | --- | --- |
| A | CANN device kernel / driver | Enable EL0 dc ivac (set SCTLR_EL1.UCI=1 for the aicpu_custom_scheduler process). Smallest change; user-side spin loops can then explicitly invalidate. |
| B | CANN runtime / driver | Allocate Handshake HBM (runtime->workers) with a non-cacheable / write-through attribute when called from aicpu_custom_scheduler context. Slight per-access HBM latency cost. |
| C | CANN cust scheduler | Pin aicpu_custom_scheduler worker threads to the same AICPU cluster as AICore's snoop domain (today aicpusd_worker.cpp::SetAffinity binds them to a separate cpuId=0). |
| D | simpler runtime (this repo) | Split Handshake so AICPU-written and AICore-written fields live on disjoint cache lines, then make AICPU spin loops dc civac only the AICore-written line. EL0 invalidate semantics still apply, so this only works in combination with A or B. Alternatively, replace the spin-wait protocol with a device event/notify primitive that bypasses shared-memory polling entirely (substantial runtime refactor). |
A/B/C are CANN-side; D is user-side but on its own is insufficient. We'd appreciate guidance from the CANN team on whether A or B is feasible for the cust subprocess in a near-term release.
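For direction D, the layout split could be sketched as follows. This assumes a 64-byte cache line and hypothetical struct names; it only shows the field separation and does nothing about the invalidate problem, which still needs A or B.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineBytes = 64;  // assumed cache line size

// Fields written only by AICPU, padded out to their own cache line.
struct alignas(kLineBytes) AicpuWritten {
    uint32_t aicpu_ready;
    uint64_t task;
};

// Fields written only by AICore, on a separate cache line.
struct alignas(kLineBytes) AicoreWritten {
    uint32_t aicore_regs_ready;
};

// Hypothetical split layout: each writer owns exactly one line.
struct SplitHandshake {
    AicpuWritten cpu;
    AicoreWritten core;
};

static_assert(offsetof(SplitHandshake, core) % kLineBytes == 0,
              "AICore-written fields must start on their own cache line");
static_assert(sizeof(SplitHandshake) == 2 * kLineBytes,
              "each writer owns exactly one cache line");
```

With this layout a clean+invalidate on the `core` line never writes back AICPU-owned data, because AICPU never dirties that line.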
Git Commit ID
ec7363a2eb6fbed4d71f848e1532dbcef7adc6c8
CANN Version
26.0.rc1 (V100R001C10SPC001B257)
Driver Version
26.0.rc1 (ascendhal 7.35.23)
Host Platform
Linux (aarch64)
Additional Context
Related open issues (different surface, may share root once A/B/C lands):
Concrete D2H diagnostics, per-iteration markers, and the cust subprocess routing analysis come from the long debug session that produced this issue; the architecture diagram and CANN source pointers are in .docs/ISSUE-mode-b-cache-coherency.md of the PR #537 worktree. CANN open-source references:
- cann/runtime/src/runtime/core/src/kernel/program_common.cc: opKernelLib → kernelType table
- cann/runtime/src/aicpu_sched/aicpu_processer/ae_so_manager.cc: GetSoPath, LoadSo cust-vs-inner routing and the IsCustAicpuSd gate
- cann/runtime/src/aicpu_sched/aicpu_schedule/core/aicpusd_worker.cpp: SetAffinity binding worker threads to specific AICPU cores
- cann/runtime/src/aicpu_sched/aicpu_schedule/core/aicpusd_cust_so_manager.cpp: cust SO upload to /home/CustAiCpuUser/cust_aicpu_*/
Platform
a2a3 (Ascend 910B/C hardware)
Runtime Variant
tensormap_and_ringbuffer