Feat: AICPU dispatcher bootstrap + cached AICore rtRegisterAllKernel handle (re-apply #537) #870

Merged: ChaoWao merged 1 commit into hw-native-sys:main from
ChaoWao:feat/aicpu-dispatcher-with-aicore-handle-cache on May 27, 2026

@ChaoWao ChaoWao commented May 27, 2026

Summary

Re-applies #537 (reverted in #867 because the prepared_callable
TestPreparedCallableHbgA5 suite OOM'd at first AICore launch on a5/onboard)
on top of a fix for the underlying leak that #537 exposed.

The bug that caused #867

launch_aicore_kernel called rtRegisterAllKernel on every run and bound the
returned bin_handle to a stack-local that vanished at function exit. CANN has
no public rtUnregisterAllKernel, so each registration pinned another
device-side copy of the AICore ELF (~365 KB on a5/hbg): the kernel binary
leaked once per run(). The leak predates #537 but was masked by lower
steady-state HBM use. #537 made rtsBinaryLoadFromFile keep the AICPU SO
loaded for the DeviceRunner lifetime, which added enough resident HBM that
the very first AICore launch on a5/hbg failed with 207001
(ACL_ERROR_RT_MEMORY_ALLOCATION), and the broken driver state cascaded into
507899 at the next rtStreamCreate.

a2a3 stayed lucky because its AICore ELF is ~5x smaller: 78 KB vs 365 KB on
a5, which ships a MIX-mode binary with heavier debug info (.text alone is
2.7 KB on a2a3 vs 10.8 KB on a5).
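
To make the growth concrete, here is a minimal sketch of the leaky pattern
described above. FakeDevice and leaky_launch are hypothetical stand-ins, not
CANN APIs; the only assumptions taken from this PR are that each registration
pins one more copy of the ELF, that nothing ever unpins it, and the
approximate ELF sizes quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the device runtime (not a CANN type): every
// registration pins one more copy of the kernel ELF and nothing releases it.
struct FakeDevice {
    std::size_t pinned_bytes = 0;
    int registrations = 0;

    void* register_all_kernel(std::size_t elf_bytes) {
        pinned_bytes += elf_bytes;
        ++registrations;
        // Return an opaque, non-null handle; it is never dereferenced.
        return reinterpret_cast<void*>(
            static_cast<std::uintptr_t>(registrations));
    }
};

// The leaky shape: the handle lives in a stack-local and is dropped on return.
void leaky_launch(FakeDevice& dev, std::size_t elf_bytes) {
    void* bin_handle = dev.register_all_kernel(elf_bytes);
    (void)bin_handle;  // a real launch would use the handle here
}  // handle is gone; the device-side copy stays pinned

// Total bytes pinned after `runs` launches with the leaky pattern.
std::size_t pinned_after_runs(std::size_t elf_bytes, int runs) {
    FakeDevice dev;
    for (int i = 0; i < runs; ++i) {
        leaky_launch(dev, elf_bytes);
    }
    return dev.pinned_bytes;
}
```

Under these assumptions, 100 runs with a ~365 KB ELF pin about 35.6 MiB,
while the same 100 runs with a ~78 KB ELF pin about 7.6 MiB, which is
consistent with a2a3 surviving longer on the same leak.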

Fix

Cache the AICore rtRegisterAllKernel handle in aicore_bin_handle_ and
register lazily on the first launch_aicore_kernel call. Reset it to nullptr
in finalize(); CANN releases the device-side state implicitly when the device
context tears down. Applied symmetrically to a2a3 and a5, because a2a3 had
the same latent leak.
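
The caching pattern can be sketched as follows. This is an illustration under
stated assumptions, not the code in this PR: FakeRuntime and
DeviceRunnerSketch are hypothetical names, and the mock registration call
stands in for rtRegisterAllKernel, whose real signature is not reproduced
here. Only the member name aicore_bin_handle_ and the method names
launch_aicore_kernel and finalize come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the runtime; counts how often we register.
struct FakeRuntime {
    int registrations = 0;

    void* register_all_kernel() {
        ++registrations;
        // Opaque, non-null handle; never dereferenced in this sketch.
        return reinterpret_cast<void*>(
            static_cast<std::uintptr_t>(registrations));
    }
};

class DeviceRunnerSketch {
public:
    explicit DeviceRunnerSketch(FakeRuntime& rt) : rt_(rt) {}

    void launch_aicore_kernel() {
        // Register lazily: only the first launch pays for registration.
        if (aicore_bin_handle_ == nullptr) {
            aicore_bin_handle_ = rt_.register_all_kernel();
        }
        // A real launch would pass aicore_bin_handle_ to the runtime here.
    }

    void finalize() {
        // Drop the cached handle. Per the PR text, device-side state is
        // released implicitly when the device context tears down.
        aicore_bin_handle_ = nullptr;
    }

    bool has_handle() const { return aicore_bin_handle_ != nullptr; }

private:
    FakeRuntime& rt_;
    void* aicore_bin_handle_ = nullptr;  // cached for the runner's lifetime
};
```

With this shape, any number of launches between initialization and
finalize() results in exactly one registration, so the pinned footprint is
one ELF copy instead of one per run.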

What's the same as #537

Everything else: dispatcher SO build (libsimpler_aicpu_dispatcher.so
per-arch), LoadAicpuOp bootstrap + per-task rtsLaunchCpuKernel,
content-fingerprinted simpler_inner_<fp>.so preinstall write, process-level
fingerprint cache, RuntimeBinaries.dispatcher_path threading.
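
As a rough illustration of content-fingerprinted naming with a process-level
cache, consider the sketch below. The fingerprint algorithm used by the
project is not stated here, so 64-bit FNV-1a is swapped in purely as an
assumption; inner_so_name and first_time_in_process are hypothetical helper
names, and only the simpler_inner_<fp>.so naming scheme comes from the text
above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <mutex>
#include <string>
#include <unordered_set>

// 64-bit FNV-1a over the raw bytes. Assumed stand-in for the real
// fingerprint function, which this PR description does not specify.
std::uint64_t fnv1a64(const std::string& bytes) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV prime
    }
    return h;
}

// Build a file name of the form simpler_inner_<fp>.so from the SO contents.
std::string inner_so_name(const std::string& so_bytes) {
    char hex[17];
    std::snprintf(hex, sizeof(hex), "%016llx",
                  static_cast<unsigned long long>(fnv1a64(so_bytes)));
    return std::string("simpler_inner_") + hex + ".so";
}

// Process-level cache: returns true only the first time a name is seen, so
// the caller can skip rewriting a file it already wrote in this process.
bool first_time_in_process(const std::string& name) {
    static std::mutex mu;
    static std::unordered_set<std::string> seen;
    std::lock_guard<std::mutex> lock(mu);
    return seen.insert(name).second;
}
```

Identical contents map to the same file name, so a repeated preinstall of
the same SO can be skipped, while changed contents produce a new name rather
than overwriting a file that may still be in use.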

Comments were also audited to drop "Mode A / Mode B" wording (development
shorthand that wasn't useful for readers) and to correct the "load-bearing"
provenance note on AicpuSoInfo. The previous comment attributed 207001 /
507899 to dropping aicpu_so_bin/len; the real cause was the AICore handle
leak. The H2D allocation still appears independently load-bearing, so it
stays.

Testing

  • tests/st/a5/host_build_graph: 7 passed (incl. all 5
    TestPreparedCallableHbgA5 cases that originally failed)
  • tests/st/a5/tensormap_and_ringbuffer + examples: 22 passed (2 pre-existing
    g++-15 sim-only failures unrelated to this change)
  • CI st-onboard-a5 (this PR)
  • CI st-onboard-a2a3 (this PR)

Fixes #356 (closes the gap that caused #867).

…handle

Re-applies PR hw-native-sys#537 (reverted in PR hw-native-sys#867 because the prepared_callable
TestPreparedCallableHbgA5 suite OOM'd at first AICore launch on a5/onboard)
on top of a fix for the underlying leak that PR hw-native-sys#537 exposed.

## The bug PR hw-native-sys#537 surfaced (and this PR fixes)

`launch_aicore_kernel` was calling `rtRegisterAllKernel` on every run,
binding the returned `bin_handle` to a stack-local that vanished at
function exit. CANN has no public `rtUnregisterAllKernel`, so each
register pinned another device-side copy of the AICore ELF (~365 KB on
a5/hbg) and there was no path to ever release it. The leak predates
PR hw-native-sys#537 but was masked by lower steady-state HBM use. PR hw-native-sys#537 made
`rtsBinaryLoadFromFile` keep the AICPU SO loaded for the DeviceRunner
lifetime — enough extra resident HBM that the very first AICore launch
on a5/hbg tipped into 207001 (ACL_ERROR_RT_MEMORY_ALLOCATION) and the
broken driver state cascaded into 507899 at the next `rtStreamCreate`.

a2a3 stayed lucky because its AICore ELF is ~5x smaller: 78 KB vs 365 KB
on a5, which ships a MIX-mode binary with heavier debug info (`.text` is
2.7 KB on a2a3 vs 10.8 KB on a5).

## Fix

Cache the AICore `rtRegisterAllKernel` handle in `aicore_bin_handle_` and
register lazily on first `launch_aicore_kernel`. Reset to nullptr in
`finalize()`; CANN releases the device-side state implicitly when the
device context tears down. Applied symmetrically to a2a3 and a5: a2a3
had the same latent leak, and fixing only a5 would leave it as a time-bomb
for the next time HBM headroom shrinks elsewhere.

## What's the same as PR hw-native-sys#537

Everything else: dispatcher SO build (libsimpler_aicpu_dispatcher.so
per-arch), LoadAicpuOp bootstrap + per-task rtsLaunchCpuKernel,
content-fingerprinted simpler_inner_<fp>.so preinstall write,
process-level fingerprint cache, RuntimeBinaries.dispatcher_path
threading.

## Verification

Built locally on a5, ran on device 2:
  - tests/st/a5/host_build_graph: 7 passed (incl. all 5
    TestPreparedCallableHbgA5 cases that originally failed)
  - tests/st/a5/tensormap_and_ringbuffer + examples: 22 passed (the 2
    sim-only failures are pre-existing g++-15 env issues unrelated to
    this change)

Fixes hw-native-sys#356 (closes the gap that caused hw-native-sys#867).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@ChaoWao ChaoWao merged commit f8b7285 into hw-native-sys:main May 27, 2026
49 of 90 checks passed
@ChaoWao ChaoWao deleted the feat/aicpu-dispatcher-with-aicore-handle-cache branch May 27, 2026 07:34

Development

Successfully merging this pull request may close these issues.

[Feature] Migrate AICPU launch to new rtsLaunchCpuKernel interface (BUILD_WITH_NEW_CANN)
