Revert "Feat: AICPU launch via dispatcher bootstrap and per-task rtsLaunchCpuKernel"#867
Conversation
…aunchCpu…" This reverts commit 27479f2.
There was a problem hiding this comment.
Code Review
This pull request simplifies the AICPU kernel loading and launch path by removing the transient bootstrap-only dispatcher (libsimpler_aicpu_dispatcher.so) and the host-side LoadAicpuOp loader. The simpler_init API has been updated across all platforms to remove dispatcher-related parameters, and AICPU kernels are now launched directly using rtAicpuKernelLaunchExWithArgs. Feedback on these changes highlights several opportunities to add defensive null-pointer checks for input arguments in launch_aicpu_kernel and simpler_init to prevent potential segmentation faults.
| int DeviceRunner::launch_aicpu_kernel(rtStream_t stream, KernelArgs *k_args, const char *kernel_name, int aicpu_num) { | ||
| // kernel_name is host::KernelNames::InitName / RunName — the runtime SO's | ||
| // actual exported symbol (simpler_aicpu_init / simpler_aicpu_exec). The | ||
| // Mode A type 2 launch in LaunchBuiltInOp embeds it in the args struct | ||
| // for the main aicpu_scheduler to dlsym. | ||
| return load_aicpu_op_.LaunchBuiltInOp(stream, k_args, aicpu_num, kernel_name); | ||
| struct Args { | ||
| KernelArgs k_args; | ||
| char kernel_name[32]; | ||
| const char so_name[32] = {"libaicpu_extend_kernels.so"}; | ||
| const char op_name[32] = {""}; | ||
| } args; | ||
|
|
||
| args.k_args = *k_args; | ||
| std::strncpy(args.kernel_name, kernel_name, sizeof(args.kernel_name) - 1); | ||
| args.kernel_name[sizeof(args.kernel_name) - 1] = '\0'; |
There was a problem hiding this comment.
Defensively validate k_args and kernel_name pointers before dereferencing them to prevent potential null pointer dereferences and segmentation faults.
int DeviceRunner::launch_aicpu_kernel(rtStream_t stream, KernelArgs *k_args, const char *kernel_name, int aicpu_num) {
if (k_args == nullptr || kernel_name == nullptr) {
LOG_ERROR("%s", "Invalid arguments: k_args or kernel_name is null");
return -1;
}
struct Args {
KernelArgs k_args;
char kernel_name[32];
const char so_name[32] = {"libaicpu_extend_kernels.so"};
const char op_name[32] = {""};
} args;
args.k_args = *k_args;
std::strncpy(args.kernel_name, kernel_name, sizeof(args.kernel_name) - 1);
args.kernel_name[sizeof(args.kernel_name) - 1] = '\\0';| int DeviceRunner::launch_aicpu_kernel(rtStream_t stream, KernelArgs *k_args, const char *kernel_name, int aicpu_num) { | ||
| // kernel_name is host::KernelNames::InitName / RunName — the runtime SO's | ||
| // actual exported symbol (simpler_aicpu_init / simpler_aicpu_exec). The | ||
| // Mode A type 2 launch in LaunchBuiltInOp embeds it in the args struct | ||
| // for the main aicpu_scheduler to dlsym. | ||
| return load_aicpu_op_.LaunchBuiltInOp(stream, k_args, aicpu_num, kernel_name); | ||
| struct Args { | ||
| KernelArgs k_args; | ||
| char kernel_name[32]; | ||
| const char so_name[32] = {"libaicpu_extend_kernels.so"}; | ||
| const char op_name[32] = {""}; | ||
| } args; | ||
|
|
||
| args.k_args = *k_args; | ||
| std::strncpy(args.kernel_name, kernel_name, sizeof(args.kernel_name) - 1); | ||
| args.kernel_name[sizeof(args.kernel_name) - 1] = '\0'; |
There was a problem hiding this comment.
Defensively validate k_args and kernel_name pointers before dereferencing them to prevent potential null pointer dereferences and segmentation faults.
int DeviceRunner::launch_aicpu_kernel(rtStream_t stream, KernelArgs *k_args, const char *kernel_name, int aicpu_num) {
if (k_args == nullptr || kernel_name == nullptr) {
LOG_ERROR("%s", "Invalid arguments: k_args or kernel_name is null");
return -1;
}
struct Args {
KernelArgs k_args;
char kernel_name[32];
const char so_name[32] = {"libaicpu_extend_kernels.so"};
const char op_name[32] = {""};
} args;
args.k_args = *k_args;
std::strncpy(args.kernel_name, kernel_name, sizeof(args.kernel_name) - 1);
args.kernel_name[sizeof(args.kernel_name) - 1] = '\\0';| int simpler_init( | ||
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | ||
| const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size | ||
| const uint8_t *aicore_binary, size_t aicore_size | ||
| ) { | ||
| if (ctx == NULL) return -1; |
There was a problem hiding this comment.
Defensively validate aicpu_binary and aicore_binary pointers before constructing vectors from them to prevent potential null pointer dereferences.
| int simpler_init( | |
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | |
| const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size | |
| const uint8_t *aicore_binary, size_t aicore_size | |
| ) { | |
| if (ctx == NULL) return -1; | |
| int simpler_init( | |
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | |
| const uint8_t *aicore_binary, size_t aicore_size | |
| ) { | |
| if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1; |
| int simpler_init( | ||
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | ||
| const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size | ||
| const uint8_t *aicore_binary, size_t aicore_size | ||
| ) { | ||
| if (ctx == NULL) return -1; |
There was a problem hiding this comment.
Defensively validate aicpu_binary and aicore_binary pointers before constructing vectors from them to prevent potential null pointer dereferences.
| int simpler_init( | |
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | |
| const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size | |
| const uint8_t *aicore_binary, size_t aicore_size | |
| ) { | |
| if (ctx == NULL) return -1; | |
| int simpler_init( | |
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | |
| const uint8_t *aicore_binary, size_t aicore_size | |
| ) { | |
| if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1; |
| int simpler_init( | ||
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | ||
| const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size | ||
| const uint8_t *aicore_binary, size_t aicore_size | ||
| ) { | ||
| // Sim has no AICPU dispatcher (the simulator runs AICPU in-process). Accept | ||
| // the parameters for ABI parity with the onboard implementation and ignore | ||
| // them — callers that pass dispatcher bytes get the same shape as onboard, | ||
| // and Mode B path on sim isn't taken anyway. | ||
| (void)dispatcher_binary; | ||
| (void)dispatcher_size; | ||
|
|
||
| if (ctx == NULL) return -1; |
There was a problem hiding this comment.
Defensively validate aicpu_binary and aicore_binary pointers before constructing vectors from them to prevent potential null pointer dereferences.
| int simpler_init( | |
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | |
| const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size | |
| const uint8_t *aicore_binary, size_t aicore_size | |
| ) { | |
| // Sim has no AICPU dispatcher (the simulator runs AICPU in-process). Accept | |
| // the parameters for ABI parity with the onboard implementation and ignore | |
| // them — callers that pass dispatcher bytes get the same shape as onboard, | |
| // and Mode B path on sim isn't taken anyway. | |
| (void)dispatcher_binary; | |
| (void)dispatcher_size; | |
| if (ctx == NULL) return -1; | |
| int simpler_init( | |
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | |
| const uint8_t *aicore_binary, size_t aicore_size | |
| ) { | |
| if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1; |
| int simpler_init( | ||
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | ||
| const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size | ||
| const uint8_t *aicore_binary, size_t aicore_size | ||
| ) { | ||
| // Sim has no AICPU dispatcher (the simulator runs AICPU in-process). See | ||
| // a2a3 sim sibling for rationale; parameters accepted for ABI parity. | ||
| (void)dispatcher_binary; | ||
| (void)dispatcher_size; | ||
|
|
||
| if (ctx == NULL) return -1; |
There was a problem hiding this comment.
Defensively validate aicpu_binary and aicore_binary pointers before constructing vectors from them to prevent potential null pointer dereferences.
| int simpler_init( | |
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | |
| const uint8_t *aicore_binary, size_t aicore_size, const uint8_t *dispatcher_binary, size_t dispatcher_size | |
| const uint8_t *aicore_binary, size_t aicore_size | |
| ) { | |
| // Sim has no AICPU dispatcher (the simulator runs AICPU in-process). See | |
| // a2a3 sim sibling for rationale; parameters accepted for ABI parity. | |
| (void)dispatcher_binary; | |
| (void)dispatcher_size; | |
| if (ctx == NULL) return -1; | |
| int simpler_init( | |
| DeviceContextHandle ctx, int device_id, const uint8_t *aicpu_binary, size_t aicpu_size, | |
| const uint8_t *aicore_binary, size_t aicore_size | |
| ) { | |
| if (ctx == nullptr || aicpu_binary == nullptr || aicore_binary == nullptr) return -1; |
…handle (#870) Re-applies PR #537 (reverted in PR #867 because the prepared_callable TestPreparedCallableHbgA5 suite OOM'd at first AICore launch on a5/onboard) on top of a fix for the underlying leak that PR #537 exposed. ## The bug PR #537 surfaced (and this PR fixes) `launch_aicore_kernel` was calling `rtRegisterAllKernel` on every run, binding the returned `bin_handle` to a stack-local that vanished at function exit. CANN has no public `rtUnregisterAllKernel`, so each register pinned another device-side copy of the AICore ELF (~365 KB on a5/hbg) and there was no path to ever release it. The leak was pre-PR-537 too but masked by lower steady-state HBM use. PR #537 made `rtsBinaryLoadFromFile` keep the AICPU SO loaded for the DeviceRunner lifetime — enough extra resident HBM that the very first AICore launch on a5/hbg tipped into 207001 (ACL_ERROR_RT_MEMORY_ALLOCATION) and the broken driver state cascaded into 507899 at the next `rtStreamCreate`. a2a3 stayed lucky because its AICore ELF is ~5x smaller (78 KB vs 365 KB on a5 — MIX-mode binary + heavier debug info on a5 — `.text` is 10.8 KB vs 2.7 KB). ## Fix Cache the AICore `rtRegisterAllKernel` handle in `aicore_bin_handle_` and register lazily on first `launch_aicore_kernel`. Reset to nullptr in `finalize()`; CANN releases the device-side state implicitly when the device context tears down. Applied symmetrically to a2a3 and a5 — a2a3 had the same latent leak, fixing only a5 would leave it as a time-bomb the next time HBM headroom shrinks elsewhere. ## What's the same as PR #537 Everything else: dispatcher SO build (libsimpler_aicpu_dispatcher.so per-arch), LoadAicpuOp bootstrap + per-task rtsLaunchCpuKernel, content-fingerprinted simpler_inner_<fp>.so preinstall write, process-level fingerprint cache, RuntimeBinaries.dispatcher_path threading. ## Verification Built locally on a5, ran on device 2: - tests/st/a5/host_build_graph: 7 passed (incl. all 5 TestPreparedCallableHbgA5 cases that originally failed) - tests/st/a5/tensormap_and_ringbuffer + examples: 22 passed (the 2 sim-only failures are pre-existing g++-15 env issues unrelated to this change) Fixes #356 (closes the gap that caused #867).
Reverts #537