Feat: AICPU dispatcher bootstrap + cached AICore rtRegisterAllKernel handle (re-apply #537) #870
Merged
ChaoWao merged 1 commit on May 27, 2026
…handle

Re-applies PR hw-native-sys#537 (reverted in PR hw-native-sys#867 because the prepared_callable TestPreparedCallableHbgA5 suite OOM'd at first AICore launch on a5/onboard) on top of a fix for the underlying leak that PR hw-native-sys#537 exposed.

## The bug PR hw-native-sys#537 surfaced (and this PR fixes)

`launch_aicore_kernel` was calling `rtRegisterAllKernel` on every run, binding the returned `bin_handle` to a stack-local that vanished at function exit. CANN has no public `rtUnregisterAllKernel`, so each register pinned another device-side copy of the AICore ELF (~365 KB on a5/hbg) and there was no path to ever release it. The leak was pre-PR-537 too but masked by lower steady-state HBM use. PR hw-native-sys#537 made `rtsBinaryLoadFromFile` keep the AICPU SO loaded for the DeviceRunner lifetime, enough extra resident HBM that the very first AICore launch on a5/hbg tipped into 207001 (ACL_ERROR_RT_MEMORY_ALLOCATION) and the broken driver state cascaded into 507899 at the next `rtStreamCreate`.

a2a3 stayed lucky because its AICore ELF is ~5x smaller (78 KB vs 365 KB on a5; MIX-mode binary + heavier debug info on a5; `.text` is 10.8 KB vs 2.7 KB).

## Fix

Cache the AICore `rtRegisterAllKernel` handle in `aicore_bin_handle_` and register lazily on first `launch_aicore_kernel`. Reset to nullptr in `finalize()`; CANN releases the device-side state implicitly when the device context tears down. Applied symmetrically to a2a3 and a5: a2a3 had the same latent leak, and fixing only a5 would leave it as a time-bomb the next time HBM headroom shrinks elsewhere.

## What's the same as PR hw-native-sys#537

Everything else: dispatcher SO build (libsimpler_aicpu_dispatcher.so per-arch), LoadAicpuOp bootstrap + per-task rtsLaunchCpuKernel, content-fingerprinted simpler_inner_<fp>.so preinstall write, process-level fingerprint cache, RuntimeBinaries.dispatcher_path threading.

## Verification

Built locally on a5, ran on device 2:

- tests/st/a5/host_build_graph: 7 passed (incl. all 5 TestPreparedCallableHbgA5 cases that originally failed)
- tests/st/a5/tensormap_and_ringbuffer + examples: 22 passed (the 2 sim-only failures are pre-existing g++-15 env issues unrelated to this change)

Fixes hw-native-sys#356 (closes the gap that caused hw-native-sys#867).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Re-applies #537 (reverted in #867 because the prepared_callable `TestPreparedCallableHbgA5` suite OOM'd at first AICore launch on a5/onboard) on top of a fix for the underlying leak that #537 exposed.
The bug that caused #867
`launch_aicore_kernel` was calling `rtRegisterAllKernel` on every run, binding the returned `bin_handle` to a stack-local that vanished at function exit. CANN has no public `rtUnregisterAllKernel`, so each register pinned another device-side copy of the AICore ELF (~365 KB on a5/hbg); the kernel binary leaked once per `run()`. The leak was pre-#537 too, but masked by lower steady-state HBM use. #537 made `rtsBinaryLoadFromFile` keep the AICPU SO loaded for the DeviceRunner lifetime, enough extra resident HBM that the very first AICore launch on a5/hbg tipped into `207001` (`ACL_ERROR_RT_MEMORY_ALLOCATION`), and the broken driver state cascaded into `507899` at the next `rtStreamCreate`.

a2a3 stayed lucky because its AICore ELF is ~5x smaller (78 KB vs 365 KB on a5; MIX-mode binary + heavier debug info on a5; `.text` alone is 10.8 KB vs 2.7 KB).
Fix
Cache the AICore `rtRegisterAllKernel` handle in `aicore_bin_handle_` and register lazily on the first `launch_aicore_kernel`. Reset to `nullptr` in `finalize()`; CANN releases the device-side state implicitly when the device context tears down. Applied symmetrically to a2a3 and a5, since a2a3 had the same latent leak.
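As a rough illustration of the before/after pattern described above, here is a minimal sketch that runs without CANN. `FakeRuntime`, `LeakyRunner`, and `CachedRunner` are hypothetical stand-ins invented for this example; only `launch_aicore_kernel`, `finalize()`, `bin_handle`, and `aicore_bin_handle_` mirror names from this PR, and the real `rtRegisterAllKernel` signature is deliberately not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for the runtime: each register call pins one more
// device-side copy of the ELF, and there is no matching unregister.
struct FakeRuntime {
    std::size_t elf_bytes;         // size of one AICore ELF copy
    std::size_t pinned_bytes = 0;  // device memory held by registrations
    int register_calls = 0;
    std::deque<int> handles;       // stable storage backing returned handles

    explicit FakeRuntime(std::size_t elf) : elf_bytes(elf) {}

    void* register_all_kernel() {
        ++register_calls;
        pinned_bytes += elf_bytes;
        handles.push_back(register_calls);
        return &handles.back();    // deque keeps element addresses stable
    }
};

// Pre-fix pattern: the handle lives in a stack local and is lost on return,
// so every launch registers (and pins) the binary again.
struct LeakyRunner {
    FakeRuntime& rt;
    explicit LeakyRunner(FakeRuntime& r) : rt(r) {}

    void launch_aicore_kernel() {
        void* bin_handle = rt.register_all_kernel();
        assert(bin_handle != nullptr);
        (void)bin_handle;          // ... the launch would use bin_handle here ...
    }
};

// Post-fix pattern: register lazily on the first launch, cache in a member.
struct CachedRunner {
    FakeRuntime& rt;
    void* aicore_bin_handle_ = nullptr;
    explicit CachedRunner(FakeRuntime& r) : rt(r) {}

    void launch_aicore_kernel() {
        if (aicore_bin_handle_ == nullptr) {
            aicore_bin_handle_ = rt.register_all_kernel();
        }
        assert(aicore_bin_handle_ != nullptr);
        // ... the launch would use aicore_bin_handle_ here ...
    }

    // Only forgets the handle; in the PR the device-side state is released
    // implicitly by CANN when the device context tears down.
    void finalize() { aicore_bin_handle_ = nullptr; }
};
```

In this sketch, ten launches through `LeakyRunner` pin ten copies of the ELF, while ten launches through `CachedRunner` pin exactly one. Note that `finalize()` on its own frees nothing; the sketch, like the PR, relies on context teardown for the actual release.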
What's the same as #537
Everything else: dispatcher SO build (`libsimpler_aicpu_dispatcher.so` per-arch), `LoadAicpuOp` bootstrap + per-task `rtsLaunchCpuKernel`, content-fingerprinted `simpler_inner_<fp>.so` preinstall write, process-level fingerprint cache, `RuntimeBinaries.dispatcher_path` threading.

Comments were also audited to drop "Mode A / Mode B" wording (development shorthand that wasn't useful for readers) and to correct the "load-bearing" provenance note on `AicpuSoInfo` (the previous comment attributed `207001`/`507899` to dropping `aicpu_so_bin`/`len`; the real cause was the AICore handle leak, but the H2D allocation does still appear independently load-bearing, so it stays).
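For readers unfamiliar with the content-fingerprinted preinstall write and process-level fingerprint cache listed above, the following is one possible shape of the idea, not the PR's implementation. The PR does not say which hash is used or exactly what the cache skips; FNV-1a, the 16-hex-digit name format, and the helper names `fingerprint`, `inner_so_name`, and `PreinstallCache` are all assumptions made for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>
#include <unordered_set>

// 64-bit FNV-1a over the SO bytes (assumed hash; the PR does not name one).
inline std::uint64_t fingerprint(const std::string& bytes) {
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    for (unsigned char c : bytes) {
        h ^= c;
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}

// Builds a file name of the form simpler_inner_<fp>.so from the content.
inline std::string inner_so_name(const std::string& bytes) {
    char buf[17];
    std::snprintf(buf, sizeof(buf), "%016llx",
                  static_cast<unsigned long long>(fingerprint(bytes)));
    return "simpler_inner_" + std::string(buf) + ".so";
}

// Hypothetical cache: remembers which fingerprints were already written,
// so identical content is written at most once. A single instance shared
// across the process would give the "process-level" behaviour.
class PreinstallCache {
public:
    // Returns true the first time a given content is seen (caller should
    // write the file), false on every later call with the same content.
    bool needs_write(const std::string& bytes) {
        return seen_.insert(fingerprint(bytes)).second;
    }

private:
    std::unordered_set<std::uint64_t> seen_;
};
```

Because the name is derived from the bytes, identical binaries map to the same file and a changed binary gets a new name, which is what lets a cache like this skip repeated writes.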
Testing
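- `tests/st/a5/host_build_graph`: 7 passed (incl. all 5 `TestPreparedCallableHbgA5` cases that originally failed)
- `tests/st/a5/tensormap_and_ringbuffer` + examples: 22 passed (2 pre-existing g++-15 sim-only failures unrelated to this change)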
Fixes #356 (closes the gap that caused #867).