[Bug] Mode B (kernelType=AICPU_CUSTOM): cust_aicpu_sd subprocess cache stale on AICore HBM writes → AICPU handshake deadlock (507018)

### Platform

a2a3 (Ascend 910B/C hardware)

### Runtime Variant

tensormap_and_ringbuffer

### Description

PR #537 migrates the AICPU dispatcher from the CANN 6.x `RuntimeAicpuKernelLaunchExWithArgs` path (path A) to the CANN 7.0+ `rtsBinaryLoadFromFile` + `rtsLaunchCpuKernel` path (path B). The motivation is to let **a single host process bind both the `host_build_graph` and `tensormap_and_ringbuffer` runtimes**. Path A cannot do this: a process-wide one-shot `firstCreatSo_` latch inside CANN's preinstalled `libaicpu_processer.so::BackendServerHandleManager::SaveSoFile` turns loading a second runtime's inner SO into a silent no-op.

In path B, the JSON descriptor uses `opKernelLib=AICPUKernel + userDefined="True"`, which CANN routes to `KERNEL_TYPE_AICPU_CUSTOM (4)` (`cann/runtime/src/runtime/core/src/kernel/program_common.cc`). Per `ae_so_manager.cc::GetSoPath`, `KERNEL_TYPE_AICPU_CUSTOM` is the only path that searches `/home/CustAiCpuUser/cust_aicpu_<dev>_<vf>_<pid>/` (where our uploaded SO actually lands); all other types search `/usr/lib64/aicpu_kernels/...` which is root-owned and unwritable from a user process. A gate at `ae_so_manager.cc:514` (`IsCustAicpuSd()`) also enforces that `KERNEL_TYPE_AICPU_CUSTOM` MUST execute inside the `aicpu_custom_scheduler` subprocess.

The routing itself works: CANN dispatches our `Dyn*` exports to the cust subprocess, our `libsimpler_aicpu_<runtime>.so` is dlopen'd, all three phases (Null/Init/Run) reach our code, and step 1 of `SchedulerContext::handshake_all_cores` completes its writes to all 9 cores' `Handshake` slots in shared HBM. **AICPU writes are visible to the host (verified via `aclrtMemcpy DEVICE_TO_HOST` readback). The AICore stream dispatches, runs past its phase 1, and writes `aicore_regs_ready=1` back to HBM (also confirmed via host D2H).**

**The bug: the cust AICPU's L1 cache holds a stale 0 for the `aicore_regs_ready` field**, even though HBM and host both see 1. `SchedulerContext::handshake_all_cores` step 2 spin loop never observes the update:

```cpp
// src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/scheduler_cold_path.cpp
while (hank->aicore_regs_ready == 0) {}   // ← cust AICPU stuck here, HBM has 1
```

After 2 s, the host-side `aclrtSynchronizeStreamWithTimeout(stream_aicpu_)` call returns `ACL_ERROR_RT_AICPU_EXCEPTION (507018)`. Mode A (path A) does not exhibit the bug because the main `aicpu_scheduler` shares a cache coherency domain with AICore, whereas the cust subprocess is bound (`SetAffinity`) to a different AICPU cluster whose L1 is not kept coherent with AICore HBM writes.

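For debugging, the unbounded spin can be given a deadline so the cust AICPU reports which slot stalled instead of hanging until the host timeout fires. The following is a minimal sketch, not code from the repo: `wait_for_flag` is a hypothetical helper, it assumes the field can be treated as a `std::atomic<uint32_t>`, and it assumes `std::chrono::steady_clock` is usable in the AICPU environment. It does not fix the stale read (the acquire load still hits L1, as the table below records); it only turns the deadlock into a diagnosable failure.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of the repo): polls `flag` until it becomes
// non-zero or `timeout` elapses. Returns true if the flag was observed set,
// false if the deadline passed first.
inline bool wait_for_flag(const std::atomic<std::uint32_t>& flag,
                          std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);  // a negative timeout is a caller bug
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (flag.load(std::memory_order_acquire) == 0) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // caller can log the core index and bail out
        }
    }
    return true;
}
```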
User-space workarounds attempted (all fail):

| Attempt | Result |
| --- | --- |
| `volatile uint32_t` field qualifier | No effect: it stops the compiler from keeping the value in a register, but each load still hits L1 |
| `__atomic_load_n(..., __ATOMIC_ACQUIRE)` (→ `ldar`) | No effect — only an ordering instruction, still reads L1 |
| `dc civac` (clean + invalidate) in spin loop | Worse — same cache line co-hosts AICPU-written `aicpu_ready/task` and AICore-written `aicore_regs_ready`; civac writes back AICPU's dirty stale view, clobbering AICore's HBM writes |
| `dc ivac` (invalidate-only) in spin loop | Silently NOP'd from EL0 (SCTLR_EL1.UCI=0 in kernel) |

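The `dc civac` row can be illustrated with a toy host-side model. This is a sketch under simplifying assumptions, not a description of the real hardware: one line of "HBM", one private write-back copy for the AICPU, no snooping, and hypothetical names (`ToyCache`, `Line`, `run_shared_line_scenario`). It shows how cleaning a dirty line that also holds an AICore-written field overwrites that field with the AICPU's stale value.

```cpp
#include <cassert>
#include <cstdint>

// One cache line shared by both writers (simplified to two fields).
struct Line {
    std::uint32_t aicpu_ready = 0;        // written by AICPU
    std::uint32_t aicore_regs_ready = 0;  // written by AICore
};

// Toy model: backing memory plus a single private write-back copy.
struct ToyCache {
    Line hbm;            // what AICore and the host see
    Line cached;         // AICPU's private copy of the line
    bool valid = false;
    bool dirty = false;

    void fill() {
        if (!valid) { cached = hbm; valid = true; dirty = false; }
    }
    // AICPU store: lands in the cached copy and marks the whole line dirty.
    void aicpu_store_ready(std::uint32_t v) { fill(); cached.aicpu_ready = v; dirty = true; }
    // AICPU load: served from the cached copy once the line is resident.
    std::uint32_t aicpu_load_regs_ready() { fill(); return cached.aicore_regs_ready; }
    // AICore store: goes straight to HBM; the cached copy is not updated.
    void aicore_store_regs_ready(std::uint32_t v) { hbm.aicore_regs_ready = v; }
    // Clean + invalidate: a dirty line is written back as a whole, then dropped.
    void clean_and_invalidate() {
        if (valid && dirty) { hbm = cached; }
        valid = false;
        dirty = false;
    }
};

// Replays the sequence from the table and returns the final HBM contents.
inline Line run_shared_line_scenario() {
    ToyCache c;
    c.aicpu_store_ready(1);                  // line becomes dirty in the cache
    c.aicore_store_regs_ready(1);            // HBM now holds 1
    assert(c.aicpu_load_regs_ready() == 0);  // AICPU still reads the stale 0
    c.clean_and_invalidate();                // write-back clobbers AICore's 1
    return c.hbm;
}
```

In this model the final HBM line has `aicpu_ready == 1` and `aicore_regs_ready == 0`, which is the clobbering described in the table.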
### Steps to Reproduce

1. Apply the PR #537 mode B refactor:
   - single SO `libsimpler_aicpu_<runtime>.so` with merged outer dispatcher + inner runtime kernels;
   - JSON `opKernelLib=AICPUKernel + userDefined=True` for all three phases (Null/Init/Run);
   - orch SO `candidate_dirs[]` adds `/home/CustAiCpuUser` as first entry.
2. Run the simplest a2a3 onboard example:
   ```bash
   python3 -m venv --system-site-packages .venv
   source .venv/bin/activate
   pip install --no-build-isolation -e .
   python examples/a2a3/tensormap_and_ringbuffer/vector_example/test_vector_example.py -p a2a3
   ```
3. The test fails with a timeout. An optional D2H diagnostic in `device_runner.cpp::run`, placed after the timeout, shows that HBM holds the right values:
   ```cpp
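   // Host side, after the stream sync fails: read workers[0] straight from HBM.
   if (rc == ACL_ERROR_RT_STREAM_SYNC_TIMEOUT || rc == ACL_ERROR_RT_AICPU_EXCEPTION /* 507018 */) {
       Handshake h0 = {};
       // Device address of runtime->workers[0].
       const auto* src = reinterpret_cast<const uint8_t*>(kernel_args_.args.runtime_args)
                         + offsetof(Runtime, workers);
       aclError cp = aclrtMemcpy(&h0, sizeof(h0), src, sizeof(h0), ACL_MEMCPY_DEVICE_TO_HOST);
       if (cp != ACL_SUCCESS) {
           LOG_ERROR("workers[0] readback failed: %d", cp);
       } else {
           LOG_ERROR("workers[0] readback: aicpu_ready=%u task=0x%lx aicore_regs_ready=%u",
                     h0.aicpu_ready, (uint64_t)h0.task, h0.aicore_regs_ready);
       }
   }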
   ```
   Output: `workers[0] readback: aicpu_ready=1 task=0x... aicore_regs_ready=1` — both handshake bits set in HBM, both visible to host, but cust AICPU never sees the AICore-written `aicore_regs_ready=1`.

### Expected Behavior

`tensormap_and_ringbuffer` vector example passes end-to-end with `kernelType=AICPU_CUSTOM (4)` routing. Multi-runtime in a single host process becomes possible (one ChipWorker process binds both `host_build_graph` and `tensormap_and_ringbuffer`), which is the entire motivation for path B over path A.

### Actual Behavior

```
=== Runtime: tensormap_and_ringbuffer  Level: 2 ===
  TestVectorExample::default ... [ERROR] run: [device_runner.cpp:797] aclrtSynchronizeStreamWithTimeout (AICPU) failed: 507018
FAILED: run_prepared failed with code 507018
```

The device log shows our dispatcher and inner kernel running cleanly in the cust subprocess; the deadlock is purely an AICPU-side stale L1 read of AICore-written HBM data. The bug is **not** in user-space code and cannot be fixed there: all four user-space approaches listed in the table above fail.

### Possible fix directions

(Listed by where the change has to live — none can be done purely in this repo's user-space.)

| # | Where | Change |
| --- | --- | --- |
| **A** | CANN device kernel / driver | Enable EL0 `dc ivac` (set `SCTLR_EL1.UCI=1` for the `aicpu_custom_scheduler` process). Smallest change, user-side spin loops can then explicitly invalidate. |
| **B** | CANN runtime / driver | Allocate `Handshake` HBM (`runtime->workers`) with non-cacheable / write-through attribute when called from `aicpu_custom_scheduler` context. Slight per-access HBM latency cost. |
| **C** | CANN cust scheduler | Pin `aicpu_custom_scheduler` worker threads to the same AICPU cluster as AICore's snoop domain (today `aicpusd_worker.cpp::SetAffinity` binds them to a separate `cpuId=0`). |
| **D** | simpler runtime (this repo) | Split `Handshake` so AICPU-written and AICore-written fields live on disjoint cache lines, then make AICPU spin loops `dc civac` only the AICore-written line — but EL0 invalidate semantics still apply, so this only works in combination with A or B. Alternatively, replace the spin-wait protocol with a device event/notify primitive that bypasses shared-memory polling entirely (substantial runtime refactor). |

A/B/C are CANN-side; D is user-side but on its own is insufficient. We'd appreciate guidance from the CANN team on whether A or B is feasible for the cust subprocess in a near-term release.

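The layout half of direction D could look like the following sketch. The field names come from the diagnostic above; the field types, the 64-byte line size (`kCacheLine`), and the `SplitHandshake` / `same_cache_line` names are assumptions for illustration, and the real `Handshake` may carry more fields.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: AICPU L1 line size in bytes

// Fields written by the AICPU, alone on their own line.
struct alignas(kCacheLine) AicpuWrittenLine {
    std::uint32_t aicpu_ready;
    std::uint64_t task;
};

// Fields written by AICore, alone on their own line.
struct alignas(kCacheLine) AicoreWrittenLine {
    std::uint32_t aicore_regs_ready;
};

// Hypothetical replacement layout: the two writers never share a line.
struct SplitHandshake {
    AicpuWrittenLine  aicpu;
    AicoreWrittenLine aicore;
};

static_assert(sizeof(AicpuWrittenLine) == kCacheLine,
              "AICPU-written fields must fit in exactly one line");
static_assert(offsetof(SplitHandshake, aicore) % kCacheLine == 0,
              "AICore-written fields must start on a line boundary");

// Returns true when both addresses fall inside the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

With this layout, cache maintenance on the line holding `aicore_regs_ready` never touches a line the AICPU has dirtied. As noted above, that alone does not make the invalidate effective.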
### Git Commit ID

ec7363a2eb6fbed4d71f848e1532dbcef7adc6c8

### CANN Version

26.0.rc1 (V100R001C10SPC001B257)

### Driver Version

26.0.rc1 (ascendhal 7.35.23)

### Host Platform

Linux (aarch64)

### Additional Context

Related open issues (different surface; they may share a root cause, so they are worth retesting once A/B/C lands):
- #84 — 507018 in tensormap_and_ringbuffer (different reproducer)
- #266 — cache coherency in handshake on sim
- #480 — handshake failure on 910B3
- #759 — stream timeout on multi-cid dispatch

The concrete D2H diagnostic, per-iteration markers, and cust subprocess routing analysis come from the debug session that produced this issue; the architecture diagram and CANN source pointers are in `.docs/ISSUE-mode-b-cache-coherency.md` of the PR #537 worktree. CANN open-source references:
- `cann/runtime/src/runtime/core/src/kernel/program_common.cc` — `opKernelLib` → `kernelType` table
- `cann/runtime/src/aicpu_sched/aicpu_processer/ae_so_manager.cc` — `GetSoPath`, `LoadSo` cust-vs-inner routing and the `IsCustAicpuSd` gate
- `cann/runtime/src/aicpu_sched/aicpu_schedule/core/aicpusd_worker.cpp` — `SetAffinity` binding worker threads to specific AICPU cores
- `cann/runtime/src/aicpu_sched/aicpu_schedule/core/aicpusd_cust_so_manager.cpp` — cust SO upload to `/home/CustAiCpuUser/cust_aicpu_*/`
